Language Model by Mistral AI
Mistral-7B-v0.1 achieves 64.1% on MMLU, outperforming Llama 2 13B (59%) despite having 46% fewer parameters, while costing $0.20 per million tokens for both input and output. The model's Sliding Window Attention mechanism with a 4K window enables efficient 8K context handling and delivers 2x faster inference than Llama 2 13B on A100 GPUs. Released September 27, 2023 under the Apache 2.0 license, it shows particular strength in mathematical reasoning, scoring 47.5% on GSM8K versus Llama 2 7B's 26.1%, a 21.4 percentage point improvement.
| Benchmark | Mistral-7B-v0.1 | Comparison |
|---|---|---|
| MMLU | 64.1% | Llama 2 13B: 59% |
| HellaSwag | 78.9% | 78.3% |
| ARC-Challenge | 83.6% | Llama 2 13B: 79.8% |
| WinoGrande | 78.4% | 75.4% |
| GSM8K | 47.5% | Llama 2 7B: 26.1% |
| HumanEval | 31.1% | Llama 2 7B: 11.6% |
| MBPP | 52.5% | Llama 2 7B: 26.1% |
| TruthfulQA | 42.2% | GPT-4: 59% |
| MT-Bench | 6.84 (out of 10) | Llama 2 13B Chat: 6.65 |
| BBH | 22.02% | Claude Sonnet 3.5: 88% |
- Model weights released
- Technical paper published on arXiv
Mistral-7B-v0.1 is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
Mistral-7B scores 64.1% on MMLU, beating Llama 2 13B's 59% by 5.1 percentage points despite being nearly half the size. On coding benchmarks, it achieves 31.1% on HumanEval (versus Llama 2 7B's 11.6%) and 52.5% on MBPP (versus 26.1%), representing 2.7x and 2x improvements respectively. However, it scores only 42.2% on TruthfulQA compared to GPT-4's 59%, indicating limitations in factual accuracy.
Mistral-7B's Sliding Window Attention (SWA) processes sequences using a 4K token window that slides across the full 8K context, reducing computational complexity from O(n²) to O(n×w) where w is the window size. Combined with Grouped Query Attention (GQA) which shares key-value heads across multiple query heads, this architecture enables 2x faster inference than Llama 2 13B on A100 GPUs. The sliding mechanism allows the model to theoretically attend to information beyond the 4K window through layer stacking, though empirical attention patterns show most focus remains within 2K tokens.
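The shape of the sliding-window constraint can be sketched as a boolean attention mask (a minimal illustration, not Mistral's actual implementation): each position attends causally, but only to the most recent `window` tokens, which is what caps the per-query work at O(w).

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: position i may attend to positions j with
    i - window < j <= i (causal, limited to the last `window` tokens)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Each query sees at most `window` keys, so attention cost is
# O(n * w) rather than O(n^2) for the full sequence.
mask = sliding_window_mask(seq_len=8, window=4)
print(mask.sum(axis=1))  # visible tokens per position: [1 2 3 4 4 4 4 4]
```

Because each layer's window is offset by the layer below it, information can still propagate beyond 4K tokens after several layers, which is the stacking effect described above.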
At $0.20 per million tokens for both input and output, Mistral-7B undercuts GPT-3.5-Turbo ($0.50 input / $1.50 output) by 60-87% depending on the input/output mix, and GPT-4 ($30 input / $60 output) by over 99%. For a typical 1,000-token prompt with a 500-token response, Mistral-7B costs $0.0003 versus GPT-3.5's $0.00125 or GPT-4's $0.06. This pricing positions it competitively against open models while matching the performance of models twice its size on tasks like ARC-Challenge (83.6% vs Llama 2 13B's 79.8%).
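The per-request arithmetic above is simple to reproduce; this snippet just plugs the quoted per-million-token prices into a cost formula:

```python
# Per-request cost for a 1,000-token prompt and 500-token response,
# using the per-million-token prices quoted above.
PRICES = {                      # (input $/M tokens, output $/M tokens)
    "Mistral-7B":    (0.20, 0.20),
    "GPT-3.5-Turbo": (0.50, 1.50),
    "GPT-4":         (30.00, 60.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

for model in PRICES:
    print(f"{model}: ${request_cost(model, 1000, 500):.5f}")
# Mistral-7B: $0.00030, GPT-3.5-Turbo: $0.00125, GPT-4: $0.06000
```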
Mistral-7B excels at mathematical reasoning (47.5% GSM8K), code generation (31.1% HumanEval), and logical reasoning (83.6% ARC-Challenge), making it suitable for educational tools, code assistants, and analytical applications. Its MT-Bench score of 6.84 beats Llama 2 13B Chat's 6.65, indicating strong instruction-following capabilities. However, with only 22.02% on BBH versus Claude Sonnet 3.5's 88%, it struggles with complex multi-step reasoning tasks requiring extensive world knowledge.
With 7B parameters requiring approximately 14GB in FP16 or 7GB with INT8 quantization, Mistral-7B fits on consumer GPUs like the RTX 3090 (24GB) in FP16, or the RTX 4070 Ti (12GB) once quantized to INT8. Grouped Query Attention shrinks the KV cache 4x relative to standard multi-head attention (8 KV heads serving 32 query heads), allowing batch sizes of 8-16 on 24GB GPUs at the full 8K context. Inference reaches 94 tokens/second on an A100 80GB, compared to 47 tokens/second for Llama 2 13B.
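A back-of-envelope sizing of the KV cache makes the GQA savings concrete. This sketch assumes Mistral-7B's published configuration (32 layers, 8 KV heads, head dimension 128, 4K sliding window) and an FP16 cache:

```python
# Rough KV-cache sizing for Mistral-7B (assumed config: 32 layers,
# 8 KV heads, head dim 128, 4K sliding window, FP16 cache).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
WINDOW = 4096          # rolling KV cache is capped at the window size
BYTES_FP16 = 2

def kv_cache_bytes(cached_tokens: int, batch: int = 1) -> int:
    # 2x for keys and values, per layer, per KV head.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16
    return per_token * cached_tokens * batch

# GQA (8 KV heads) vs. full MHA (32 KV heads): a 4x smaller cache.
gqa = kv_cache_bytes(WINDOW, batch=16)
mha = gqa * (32 // KV_HEADS)
print(f"GQA: {gqa / 2**30:.1f} GiB, MHA: {mha / 2**30:.1f} GiB")
# GQA: 8.0 GiB, MHA: 32.0 GiB
```

At batch 16, roughly 8 GiB of cache plus ~14GB of FP16 weights lands near the 24GB budget, which is consistent with the batch sizes quoted above.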