Language Model by Mistral AI
Mistral-7B-v0.1 achieves 64.1% on MMLU, outperforming Llama 2 13B (59%) despite having 46% fewer parameters, while costing $0.20 per million tokens for both input and output. The model's Sliding Window Attention mechanism with a 4K window enables efficient 8K context handling and delivers 2x faster inference than Llama 2 13B on A100 GPUs. Released September 27, 2023 under the Apache 2.0 license, it shows particular strength in mathematical reasoning, scoring 47.5% on GSM8K versus Llama 2 7B's 26.1%, a 21.4 percentage point improvement.
| Benchmark | Mistral-7B-v0.1 | Comparison |
|---|---|---|
| MMLU | 64.1% | Llama 2 13B: 59% |
| HellaSwag | 78.9% | 78.3% |
| ARC-Challenge | 83.6% | Llama 2 13B: 79.8% |
| WinoGrande | 78.4% | 75.4% |
| GSM8K | 47.5% | Llama 2 7B: 26.1% |
| HumanEval | 31.1% | Llama 2 7B: 11.6% |
| MBPP | 52.5% | Llama 2 7B: 26.1% |
| TruthfulQA | 42.2% | GPT-4: 59% |
| MT-Bench | 6.84 (out of 10) | Llama 2 13B Chat: 6.65 |
| BBH | 22.02% | Claude Sonnet 3.5: 88% |
- Model weights released
- Technical paper published on arXiv
Mistral-7B-v0.1 is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
Mistral-7B scores 64.1% on MMLU, beating Llama 2 13B's 59% by 5.1 percentage points despite being nearly half the size. On coding benchmarks, it achieves 31.1% on HumanEval (versus Llama 2 7B's 11.6%) and 52.5% on MBPP (versus 26.1%), representing 2.7x and 2x improvements respectively. However, it scores only 42.2% on TruthfulQA compared to GPT-4's 59%, indicating limitations in factual accuracy.
Mistral-7B's Sliding Window Attention (SWA) processes sequences using a 4K token window that slides across the full 8K context, reducing computational complexity from O(n²) to O(n×w) where w is the window size. Combined with Grouped Query Attention (GQA) which shares key-value heads across multiple query heads, this architecture enables 2x faster inference than Llama 2 13B on A100 GPUs. The sliding mechanism allows the model to theoretically attend to information beyond the 4K window through layer stacking, though empirical attention patterns show most focus remains within 2K tokens.
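The shape of the sliding-window constraint can be sketched as a boolean attention mask (a minimal illustration, not Mistral's actual implementation): each position attends causally, but only to the most recent `window` tokens, which is what caps the per-query work at O(w).

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: position i may attend to positions j with
    i - window < j <= i (causal, limited to the last `window` tokens)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Each query sees at most `window` keys, so attention cost is
# O(n * w) rather than O(n^2) for the full sequence.
mask = sliding_window_mask(seq_len=8, window=4)
print(mask.sum(axis=1))  # visible tokens per position: [1 2 3 4 4 4 4 4]
```

Because each layer's window is offset by the layer below it, information can still propagate beyond 4K tokens after several layers, which is the stacking effect described above.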
At $0.20 per million tokens for both input and output, Mistral-7B undercuts GPT-3.5-Turbo ($0.50 input / $1.50 output) by 60-87% depending on the input/output mix, and GPT-4 ($30 input / $60 output) by over 99%. For a typical 1,000-token prompt with a 500-token response, Mistral-7B costs $0.0003 versus GPT-3.5's $0.00125 or GPT-4's $0.06. This pricing positions it competitively against open models while matching the performance of models twice its size on tasks like ARC-Challenge (83.6% vs Llama 2 13B's 79.8%).
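The per-request arithmetic above is simple to reproduce; this snippet just plugs the quoted per-million-token prices into a cost formula:

```python
# Per-request cost for a 1,000-token prompt and 500-token response,
# using the per-million-token prices quoted above.
PRICES = {                      # (input $/M tokens, output $/M tokens)
    "Mistral-7B":    (0.20, 0.20),
    "GPT-3.5-Turbo": (0.50, 1.50),
    "GPT-4":         (30.00, 60.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

for model in PRICES:
    print(f"{model}: ${request_cost(model, 1000, 500):.5f}")
# Mistral-7B: $0.00030, GPT-3.5-Turbo: $0.00125, GPT-4: $0.06000
```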
Mistral-7B excels at mathematical reasoning (47.5% GSM8K), code generation (31.1% HumanEval), and logical reasoning (83.6% ARC-Challenge), making it suitable for educational tools, code assistants, and analytical applications. Its MT-Bench score of 6.84 beats Llama 2 13B Chat's 6.65, indicating strong instruction-following capabilities. However, with only 22.02% on BBH versus Claude Sonnet 3.5's 88%, it struggles with complex multi-step reasoning tasks requiring extensive world knowledge.
With 7B parameters requiring approximately 14GB in FP16 or 7GB with INT8 quantization, Mistral-7B fits on consumer GPUs like the RTX 3090 (24GB) in FP16, or the RTX 4070 Ti (12GB) once quantized to INT8. Grouped Query Attention shrinks the KV cache 4x relative to standard multi-head attention (8 KV heads serving 32 query heads), allowing batch sizes of 8-16 on 24GB GPUs at the full 8K context. Inference reaches 94 tokens/second on an A100 80GB, compared to 47 tokens/second for Llama 2 13B.
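A back-of-envelope sizing of the KV cache makes the GQA savings concrete. This sketch assumes Mistral-7B's published configuration (32 layers, 8 KV heads, head dimension 128, 4K sliding window) and an FP16 cache:

```python
# Rough KV-cache sizing for Mistral-7B (assumed config: 32 layers,
# 8 KV heads, head dim 128, 4K sliding window, FP16 cache).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
WINDOW = 4096          # rolling KV cache is capped at the window size
BYTES_FP16 = 2

def kv_cache_bytes(cached_tokens: int, batch: int = 1) -> int:
    # 2x for keys and values, per layer, per KV head.
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16
    return per_token * cached_tokens * batch

# GQA (8 KV heads) vs. full MHA (32 KV heads): a 4x smaller cache.
gqa = kv_cache_bytes(WINDOW, batch=16)
mha = gqa * (32 // KV_HEADS)
print(f"GQA: {gqa / 2**30:.1f} GiB, MHA: {mha / 2**30:.1f} GiB")
# GQA: 8.0 GiB, MHA: 32.0 GiB
```

At batch 16, roughly 8 GiB of cache plus ~14GB of FP16 weights lands near the 24GB budget, which is consistent with the batch sizes quoted above.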