Phi-4-mini: Language Model by Microsoft
Phi-4-mini achieves 88.6% on GSM8K (4 percentage points above Phi-3.5-Mini) with just 3.8B parameters, making it among the highest-performing models under 5B parameters for mathematical reasoning. The model uses grouped-query attention and a 200K-token vocabulary to support multilingual capability across its 128K context window, with training on synthetic data concluding in December 2024. Released March 3, 2025 on Azure AI Foundry and HuggingFace, it outperforms Phi-3.5-Mini on 8 of 12 benchmarks, including a 7.3 percentage point gain on BBH, though it underperforms on MBPP by 4.7 points and HellaSwag by 3.1 points.
| Benchmark | Phi-4-mini | Baseline (Phi-3.5-Mini; Llama-3.2-3B for BoolQ, OpenBookQA, PIQA) |
|---|---|---|
| GSM8K | 88.6% | 84.6% |
| ARC-Challenge | 83.7% | 84.6% |
| HumanEval | 74.4% | 70.1% |
| HumanEval+ | 68.3% | 62.8% |
| MBPP | 65.3% | 70% |
| MBPP+ | 63.8% | 63.8% |
| MMLU | 67.3% | 65.5% |
| MMLU-Pro | 52.8% | 47.4% |
| BBH (BigBench-Hard) | 70.4% | 63.1% |
| GPQA | 30.4% | 25.2% |
| HellaSwag | 69.1% | 72.2% |
| LiveCodeBench | 19.9% | 15.7% |
| BoolQ | 81.2% | 71.4% |
| OpenBookQA | 79.2% | 72.6% |
| PIQA | 77.6% | 68.2% |
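The percentage-point deltas cited throughout this page can be recomputed directly from the table above. A minimal sketch (scores copied from the table; this is just arithmetic):

```python
# Benchmark scores from the table: (Phi-4-mini, baseline), in percent.
scores = {
    "GSM8K": (88.6, 84.6),
    "BBH": (70.4, 63.1),
    "MBPP": (65.3, 70.0),
    "HellaSwag": (69.1, 72.2),
    "BoolQ": (81.2, 71.4),
}

# Percentage-point difference (positive = Phi-4-mini ahead of the baseline).
deltas = {name: round(ours - base, 1) for name, (ours, base) in scores.items()}
print(deltas)
# GSM8K +4.0, BBH +7.3, MBPP -4.7, HellaSwag -3.1, BoolQ +9.8
```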
- November 2024: Training began
- December 2024: Training completed
- March 3, 2025: Official release announced on Azure AI Foundry and HuggingFace
Phi-4-mini is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
Phi-4-mini shows mixed coding performance: it improves on HumanEval (74.4% vs 70.1%) and HumanEval+ (68.3% vs 62.8%), but drops on MBPP (65.3% vs 70%) while matching MBPP+ at 63.8%. Its largest relative coding gain is on LiveCodeBench, at 19.9% versus Phi-3.5-Mini's 15.7%, suggesting better performance on contemporary coding challenges than on traditional benchmarks.
Phi-4-mini implements grouped-query attention within its dense decoder-only transformer architecture, reducing memory overhead compared to full multi-head attention. The model uses a shared embedding architecture across its 200K-token vocabulary, improving parameter efficiency. This design allows processing a 128K-token context window, 32x the typical 4K windows of similar-sized models, while keeping the parameter count at 3.8B.
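The memory benefit of grouped-query attention comes from caching keys and values for only a small number of KV heads instead of one per query head. The sketch below illustrates the KV-cache arithmetic at a 128K context; the layer, head, and group counts are illustrative assumptions, not Phi-4-mini's published internals:

```python
# Hypothetical sketch: KV-cache size under full multi-head attention vs
# grouped-query attention (GQA). All shapes below are assumed for
# illustration (32 layers, 24 query heads of dim 128, fp16 cache).

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Two cached tensors (K and V) per layer,
    # each of shape [seq_len, n_kv_heads * head_dim].
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

# Full MHA: one KV head per query head (24).
full_mha = kv_cache_bytes(seq_len=128_000, n_layers=32, n_kv_heads=24, head_dim=128)
# GQA: query heads share 8 KV groups.
gqa = kv_cache_bytes(seq_len=128_000, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"full MHA KV cache: {full_mha / 2**30:.1f} GiB")   # 46.9 GiB
print(f"GQA KV cache:      {gqa / 2**30:.1f} GiB")        # 15.6 GiB, 3x smaller
```

With 24 query heads sharing 8 KV groups, the cache shrinks by 3x; the attention computation itself is unchanged apart from each KV head being broadcast to its group of query heads.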
Phi-4-mini demonstrates substantial reasoning gains over Llama-3.2-3B: BoolQ improves by 9.8 percentage points (81.2% vs 71.4%), OpenBookQA by 6.6 points (79.2% vs 72.6%), and PIQA by 9.4 points (77.6% vs 68.2%). These improvements suggest Phi-4-mini's synthetic training data effectively targets common-sense reasoning tasks despite having only 0.6B more parameters.
Phi-4-mini struggles with graduate-level reasoning (GPQA: 30.4%) and shows regression on MBPP (-4.7 points) and HellaSwag (-3.1 points) versus Phi-3.5-Mini. The ARC-Challenge score also drops slightly to 83.7% from 84.6%. These patterns suggest the model trades some general language understanding for improved mathematical and coding capabilities.
Training began in November 2024 and completed in December 2024, with a knowledge cutoff of June 2024. The model was officially released on March 3, 2025, making it one of the newest entries in the sub-5B parameter class. The five-month gap between data cutoff and training start follows Microsoft's pattern of extensive data curation for the Phi series.