Mistral Small 3.2: Language Model by Mistral AI
Mistral Small 3.2 delivers a 120% improvement on Arena Hard v2 (43.1 vs 19.56) compared to version 3.1, marking the largest single-version performance jump in Mistral's history. This 24B parameter dense transformer achieves 92.9% on HumanEval Plus while maintaining deployment on a single A100 80GB GPU, positioning it as the most capable model under 30B parameters for production workloads. Released June 2025 with Apache 2.0 licensing, the model reduces infinite generation loops by 39% (1.29% vs 2.11%) through architectural refinements to attention mechanisms, while adding multimodal vision capabilities without increasing inference costs.
| Benchmark | Mistral Small 3.2 | Mistral Small 3.1 |
|---|---|---|
| MMLU | 80.5% | 80.62% |
| HumanEval Plus | 92.9% | 88.99% |
| MBPP Pass@5 | 78.33% | 74.63% |
| Arena Hard v2 | 43.1% | 19.56% |
| Wildbench v2 | 65.33% | 55.6% |
| Instruction Following Accuracy | 84.78% | 82.75% |
| MMLU Pro | 69.06% | 66.76% |
| GPQA Diamond | 46.13% | 45.96% |
| ChartQA | 87.4% | 86.24% |
| DocVQA | 94.86% | 94.08% |
| Artificial Analysis Intelligence Index | 15% | 11% |
- Mistral Small 3.1 released with multimodal capabilities
- Mistral Small 3.2 released as a stability-focused update
- Made available via the Mistral API (`mistral-small-2506`)
Mistral Small 3.2 achieves 92.9% on HumanEval Plus (up from 88.99% in v3.1) and 78.33% on MBPP Pass@5 (up from 74.63%), representing 3.91 and 3.7 percentage point improvements respectively. These scores place it ahead of GPT-3.5 and competitive with early GPT-4 variants on coding tasks, while requiring only 24B parameters. The model particularly excels at multi-step reasoning required for complex programming problems, as evidenced by its 43.1 score on Arena Hard v2.
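Pass@k metrics like the MBPP Pass@5 score above are usually computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass, and estimate the probability that at least one of k drawn samples is correct. A minimal sketch (the 20/10 sample counts in the example are illustrative, not Mistral's evaluation settings):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 generations per problem, 10 correct -> pass@5
print(round(pass_at_k(20, 10, 5), 4))  # -> 0.9837
```

The subtraction form avoids the numerical instability of multiplying many near-one probabilities when n is large.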
At $0.075-$0.10 per million input tokens and $0.20-$0.30 per million output tokens, Mistral Small 3.2 costs a small fraction of GPT-4o's rates while offering comparable performance on many benchmarks. For a typical RAG application processing 10M tokens daily (8M input, 2M output), monthly costs range from roughly $30 to $42 at these rates, versus well over $1,000 for GPT-4o. The model supports a 128K context window without additional pricing tiers, unlike many competitors that charge premium rates above 32K tokens.
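The monthly figure follows directly from the per-token rates; a quick sanity check of the arithmetic:

```python
def monthly_cost(input_mtok_per_day: float, output_mtok_per_day: float,
                 price_in: float, price_out: float, days: int = 30) -> float:
    """Monthly cost in dollars; token volumes in millions per day,
    prices in dollars per million tokens."""
    return days * (input_mtok_per_day * price_in + output_mtok_per_day * price_out)

# 8M input + 2M output tokens per day, at the low and high ends of the rate card
low = monthly_cost(8, 2, 0.075, 0.20)   # -> 30.0
high = monthly_cost(8, 2, 0.10, 0.30)   # -> 42.0
print(low, high)
```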
Version 3.2 reduces the infinite-generation-loop rate from 2.11% to 1.29% through modifications to the attention mechanism's positional encoding and enhanced dropout patterns during fine-tuning. Instruction-following accuracy also rises to 84.78% (from 82.75%), aided by adversarial training examples that specifically target repetition patterns. These improvements are most visible in long-form generation tasks, where v3.1 would occasionally enter loops after 2,000+ tokens.
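Degenerate loops of the kind measured here can be flagged at inference time with a simple n-gram repetition check. The sketch below is purely illustrative (not Mistral's internal metric); the window size and repeat threshold are arbitrary assumptions:

```python
def has_repetition_loop(token_ids: list[int], ngram: int = 8, repeats: int = 4) -> bool:
    """Flag a degenerate loop: the trailing `ngram`-token window
    occurs `repeats` or more times in the sequence."""
    if len(token_ids) < ngram * repeats:
        return False
    tail = tuple(token_ids[-ngram:])
    count = sum(
        1
        for i in range(len(token_ids) - ngram + 1)
        if tuple(token_ids[i:i + ngram]) == tail
    )
    return count >= repeats

# A sequence stuck repeating the same 8 tokens is flagged; varied text is not.
looping = [1, 2, 3, 4, 5, 6, 7, 8] * 5
print(has_repetition_loop(looping))          # True
print(has_repetition_loop(list(range(64))))  # False
```

A check like this can back a serving-side guard that truncates or resamples when a loop is detected, independent of model-level fixes.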
With scores of 87.4% on ChartQA and 94.86% on DocVQA, Mistral Small 3.2 demonstrates strong document understanding capabilities, though it lacks GPT-4V's general image reasoning abilities. The model excels at structured visual data extraction (charts, tables, forms) but shows limitations on natural image understanding tasks. For document-centric workflows requiring OCR and layout understanding at scale, it offers a cost-effective alternative, processing images at the same token rates as text.
On a single A100 80GB, Mistral Small 3.2 achieves approximately 45-50 tokens/second for batch size 1 inference, with first-token latency around 200ms. H100 deployment improves throughput to 70-75 tokens/second with FP8 quantization. The 24B parameter count allows full model loading without sharding, eliminating inter-GPU communication overhead that impacts larger models. Memory usage peaks at 65GB with KV cache for 16K context, leaving headroom for larger batches.
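The memory budget above can be sanity-checked with a back-of-envelope calculation: 24B bf16 weights occupy roughly 48 GB, and the KV cache adds a few more GB at 16K context. The layer and head counts below are hypothetical placeholders for illustration, not Mistral Small 3.2's published architecture; the remaining gap up to the reported 65 GB peak would come from activations and framework overhead:

```python
def model_memory_gb(params_b: float, bytes_per_param: int = 2) -> float:
    """Weight memory for a dense model in GB (bf16/fp16 by default)."""
    return params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int = 1, bytes_per_val: int = 2) -> float:
    """KV cache size in GB: two tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_val / 1e9

weights = model_memory_gb(24)  # 24B params in bf16 -> 48.0 GB
# Hypothetical GQA config: 40 layers, 8 KV heads of dim 128, 16K context
cache = kv_cache_gb(layers=40, kv_heads=8, head_dim=128, seq_len=16384)
print(round(weights, 1), round(cache, 2))  # -> 48.0 2.68
```

Grouped-query attention keeps the cache small relative to weights, which is why single-GPU deployment leaves headroom for larger batches.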