Language Model by Mistral AI
Mistral-7B-Instruct-v0.3 achieves 62.6% on MMLU while running at $0.20/$0.23 per million tokens (input/output), making it roughly 190x cheaper than GPT-4 despite a 25.7 percentage point performance gap. The model introduces Mistral's v3 tokenizer with an extended 32,768-token vocabulary and native function calling support, positioning it as a cost-effective alternative for applications requiring moderate reasoning capabilities. With grouped-query attention (GQA) and sliding window attention (SWA) optimizations, it delivers inference speeds suitable for edge deployment while maintaining Apache 2.0 licensing for commercial use.
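For a quick feel of the model in practice, here is a minimal chat-inference sketch using the Hugging Face transformers API against the public `mistralai/Mistral-7B-Instruct-v0.3` repository; the dtype and device settings are illustrative and depend on your hardware:

```python
# Minimal chat inference with Mistral-7B-Instruct-v0.3 via Hugging Face transformers.
# Requires `transformers`, `torch`, and `accelerate` (for device_map="auto").
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain grouped-query attention in one sentence."}]
# The chat template inserts the [INST] ... [/INST] control tokens for us.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```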
| Benchmark | Mistral-7B-Instruct-v0.3 | Comparison score | Comparison model |
|---|---|---|---|
| MMLU | 62.6% | 88.3% | GPT-4 |
| HumanEval | 30.5% | 62.2% | Llama 3 8B |
| GSM8K | 42.2% | 79.6% | Llama 3 8B |
| ARC-Challenge | 63.9% | 96.7% | Claude 3.5 |
| HellaSwag | 84.8% | 95.3% | GPT-4 |
| TruthfulQA | 59.5% | 59% | GPT-4 |
| WinoGrande | 78.4% | 87.5% | - |
| BBH | 25.57% | - | - |
| MMLU Pro | 23.06% | - | - |
| GPQA | 3.91% | - | - |
| IFEval | 54.65% | 88% | Claude 3.5 |
| Artificial Analysis Intelligence Index | 7% | 12% | - |
- Mistral 7B base model released (September 2023)
- Mistral-7B-Instruct-v0.3 released (May 2024)
Mistral-7B-Instruct-v0.3 scores 30.5% on HumanEval, significantly trailing Llama 3 8B's 62.2% - a 31.7 percentage point deficit despite similar parameter counts. The model also underperforms on mathematical reasoning with 42.2% on GSM8K versus Llama 3 8B's 79.6%. These benchmarks indicate the model is better suited for general text tasks than specialized coding or mathematical applications.
At $0.20 per million input tokens and $0.23 per million output tokens, Mistral-7B-Instruct-v0.3 costs roughly 190x less than GPT-4 ($30/$60 per million input/output tokens). For a typical chatbot processing 10 million tokens daily (7M input, 3M output), you'd pay $2.09/day with Mistral versus $390/day with GPT-4. This roughly 187x cost reduction comes with a 25.7 percentage point drop in MMLU performance (62.6% vs 88.3%).
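As a sanity check on these figures, a few lines of Python reproduce the daily totals, using the traffic split and prices stated above:

```python
# Reproduce the daily cost comparison at the stated per-million-token rates.
def daily_cost(input_m, output_m, price_in, price_out):
    """USD per day; token volumes in millions, prices in $ per million tokens."""
    return input_m * price_in + output_m * price_out

mistral = daily_cost(7, 3, 0.20, 0.23)  # 7M input + 3M output tokens/day
gpt4 = daily_cost(7, 3, 30.00, 60.00)

print(f"Mistral: ${mistral:.2f}/day")        # $2.09/day
print(f"GPT-4:   ${gpt4:.2f}/day")           # $390.00/day
print(f"Cost ratio: {gpt4 / mistral:.0f}x")  # ~187x
```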
The v0.3 update extends the vocabulary from 32,000 to 32,768 tokens and introduces native function calling support through special tokens. With a 32K context window, the model can process approximately 24,000 words in a single prompt - equivalent to 40-50 pages of text. The grouped-query attention (GQA) architecture shares 8 key-value heads across 32 query heads, cutting KV-cache memory by 4x during inference compared to standard multi-head attention and enabling deployment on consumer GPUs with 16GB VRAM.
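To make the memory claim concrete, here is a back-of-the-envelope KV-cache estimate; the layer count, head counts, and head dimension are the published Mistral 7B configuration, and fp16 storage (2 bytes per value) is assumed:

```python
# Back-of-the-envelope KV-cache size for Mistral 7B at the full 32K context.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor of 2 covers both the key and the value tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

SEQ_LEN = 32_768
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
# Hypothetical baseline: full multi-head attention with one KV head per query head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=SEQ_LEN)

print(f"GQA KV cache: {gqa / 2**30:.1f} GiB")  # 4.0 GiB
print(f"MHA KV cache: {mha / 2**30:.1f} GiB")  # 16.0 GiB
print(f"Reduction: {mha / gqa:.0f}x")          # 4x
```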
Key performance gaps include a 54.65% score on instruction-following (IFEval) versus Claude 3.5's 88%, indicating 33.35 percentage points lower accuracy in complex multi-step instructions. The model scores only 3.91% on GPQA (graduate-level reasoning) and 25.57% on BBH (hard reasoning tasks). Applications requiring strong logical reasoning, complex code generation, or nuanced instruction following will likely need prompt engineering adjustments or should consider larger models.
Mistral-7B-Instruct-v0.3 achieves 59.5% on TruthfulQA, slightly outperforming GPT-4's 59% - one of the few benchmarks where it matches frontier models. However, it scores 63.9% on ARC-Challenge (scientific reasoning) versus Claude 3.5's 96.7%, showing a 32.8 percentage point gap. The model performs best on common-sense reasoning with 84.8% on HellaSwag, trailing GPT-4 by only 10.5 percentage points.