Language Model by Mistral AI
Mistral-7B-Instruct-v0.3 achieves 62.6% on MMLU while running at $0.20/$0.23 per million tokens (input/output), making it roughly 190x cheaper than GPT-4 despite a 25.7 percentage point performance gap. The model introduces Mistral's v3 tokenizer with an extended 32,768-token vocabulary and native function calling support, positioning it as a cost-effective alternative for applications requiring moderate reasoning capabilities. With grouped-query attention (GQA) and sliding window attention (SWA) optimizations, it delivers inference speeds suitable for edge deployment while maintaining Apache 2.0 licensing for commercial use.
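For a quick feel of the model in practice, here is a minimal chat-inference sketch using the Hugging Face transformers API against the public `mistralai/Mistral-7B-Instruct-v0.3` repository; the dtype and device settings are illustrative and depend on your hardware:

```python
# Minimal chat inference with Mistral-7B-Instruct-v0.3 via Hugging Face transformers.
# Requires `transformers`, `torch`, and `accelerate` (for device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain grouped-query attention in one sentence."}]
# The chat template inserts the [INST] ... [/INST] control tokens for us.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```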
| Benchmark | Mistral-7B-Instruct-v0.3 | Comparison score | Comparison model |
|---|---|---|---|
| MMLU | 62.6% | 88.3% | GPT-4 |
| HumanEval | 30.5% | 62.2% | Llama 3 8B |
| GSM8K | 42.2% | 79.6% | Llama 3 8B |
| ARC-Challenge | 63.9% | 96.7% | Claude 3.5 |
| HellaSwag | 84.8% | 95.3% | GPT-4 |
| TruthfulQA | 59.5% | 59% | GPT-4 |
| WinoGrande | 78.4% | 87.5% | - |
| BBH | 25.57% | - | - |
| MMLU Pro | 23.06% | - | - |
| GPQA | 3.91% | - | - |
| IFEval | 54.65% | 88% | Claude 3.5 |
| Artificial Analysis Intelligence Index | 7% | 12% | - |
- Mistral 7B base model released (September 2023)
- Mistral-7B-Instruct-v0.3 released (May 2024)
Mistral-7B-Instruct-v0.3 scores 30.5% on HumanEval, significantly trailing Llama 3 8B's 62.2% - a 31.7 percentage point deficit despite similar parameter counts. The model also underperforms on mathematical reasoning with 42.2% on GSM8K versus Llama 3 8B's 79.6%. These benchmarks indicate the model is better suited for general text tasks than specialized coding or mathematical applications.
At $0.20 per million input tokens and $0.23 per million output tokens, Mistral-7B-Instruct-v0.3 costs roughly 190x less than GPT-4 ($30/$60 per million input/output tokens). For a typical chatbot processing 10 million tokens daily (7M input, 3M output), you'd pay $2.09/day with Mistral versus $390/day with GPT-4. This roughly 187x cost reduction comes with a 25.7 percentage point drop in MMLU performance (62.6% vs 88.3%).
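As a sanity check on these figures, a few lines of Python reproduce the daily totals, using the traffic split and prices stated above:

```python
# Reproduce the daily cost comparison at the stated per-million-token rates.
def daily_cost(input_m, output_m, price_in, price_out):
    """USD per day; token volumes in millions, prices in $ per million tokens."""
    return input_m * price_in + output_m * price_out

mistral = daily_cost(7, 3, 0.20, 0.23)  # 7M input + 3M output tokens/day
gpt4 = daily_cost(7, 3, 30.00, 60.00)

print(f"Mistral: ${mistral:.2f}/day")        # $2.09/day
print(f"GPT-4:   ${gpt4:.2f}/day")           # $390.00/day
print(f"Cost ratio: {gpt4 / mistral:.0f}x")  # ~187x
```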
The v0.3 update extends the vocabulary from 32,000 to 32,768 tokens and introduces native function calling support through special tokens. With a 32K context window, the model can process approximately 24,000 words in a single prompt - equivalent to 40-50 pages of text. The grouped-query attention (GQA) architecture shares 8 key-value heads across 32 query heads, cutting KV-cache memory by 4x during inference compared to standard multi-head attention and enabling deployment on consumer GPUs with 16GB VRAM.
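To make the memory claim concrete, here is a back-of-the-envelope KV-cache estimate; the layer count, head counts, and head dimension are the published Mistral 7B configuration, and fp16 storage (2 bytes per value) is assumed:

```python
# Back-of-the-envelope KV-cache size for Mistral 7B at the full 32K context.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor of 2 covers both the key and the value tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

SEQ_LEN = 32_768
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
# Hypothetical baseline: full multi-head attention with one KV head per query head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=SEQ_LEN)

print(f"GQA KV cache: {gqa / 2**30:.1f} GiB")  # 4.0 GiB
print(f"MHA KV cache: {mha / 2**30:.1f} GiB")  # 16.0 GiB
print(f"Reduction: {mha / gqa:.0f}x")          # 4x
```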
Key performance gaps include a 54.65% score on instruction-following (IFEval) versus Claude 3.5's 88%, indicating 33.35 percentage points lower accuracy in complex multi-step instructions. The model scores only 3.91% on GPQA (graduate-level reasoning) and 25.57% on BBH (hard reasoning tasks). Applications requiring strong logical reasoning, complex code generation, or nuanced instruction following will likely need prompt engineering adjustments or should consider larger models.
Mistral-7B-Instruct-v0.3 achieves 59.5% on TruthfulQA, slightly outperforming GPT-4's 59% - one of the few benchmarks where it matches frontier models. However, it scores 63.9% on ARC-Challenge (scientific reasoning) versus Claude 3.5's 96.7%, showing a 32.8 percentage point gap. The model performs best on common-sense reasoning with 84.8% on HellaSwag, trailing GPT-4 by only 10.5 percentage points.