AI models ranked by mathematical reasoning using MATH-500, GSM8K, and AIME 2024 benchmark scores.
Top-ranked model: o3, with a weighted score of 88.2. Average score across all ranked models: 71.8. Models with benchmark data: 39.
[Chart: Top 15 math models by weighted score]
[Chart: Benchmark breakdown, per-benchmark scores for the top 10 models]
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | o3 | OpenAI | 88.2 |
| 2 | o4 Mini | OpenAI | 86.1 |
| 3 | Gemini 2.5 Pro | Google | 84.4 |
| 4 | o3 Mini | OpenAI | 84.0 |
| 5 | Grok 4 | xAI | 82.8 |
| 6 | R1 0528 | DeepSeek | 82.7 |
| 7 | Gemini 2.0 Flash | Google | 82.3 |
| 8 | o1 | OpenAI | 81.7 |
| 9 | R1 | DeepSeek | 80.8 |
| 10 | Gemini 2.5 Flash | Google | 78.1 |
| 11 | DeepSeek V3 | DeepSeek | 76.9 |
| 12 | Claude Opus 4.6 | Anthropic | 76.6 |
| 13 | GPT-5.4 | OpenAI | 76.4 |
| 14 | GPT-4o | OpenAI | 76.3 |
| 15 | GPT-5.2 | OpenAI | 75.2 |
| 16 | GPT-5.1 | OpenAI | 74.8 |
| 17 | GPT-5 | OpenAI | 74.0 |
| 18 | GPT-4 Turbo | OpenAI | 73.7 |
| 19 | DeepSeek V3 0324 | DeepSeek | 73.6 |
| 20 | Claude Opus 4.5 | Anthropic | 73.1 |
| 21 | GPT-4o-mini | OpenAI | 72.1 |
| 22 | Llama 3.1 70B Instruct | Meta | 71.7 |
| 23 | Gemma 4 31B | Google | 71.4 |
| 24 | Claude Opus 4 | Anthropic | 70.5 |
| 25 | Gemini 3 Flash Preview | Google | 70.4 |
| 26 | Claude Sonnet 4.6 | Anthropic | 68.9 |
| 27 | Gemma 2 27B | Google | 68.2 |
| 28 | Claude Sonnet 4.5 | Anthropic | 65.8 |
| 29 | Grok 3 | xAI | 64.9 |
| 30 | Llama 4 Maverick | Meta | 64.8 |
| 31 | Phi 4 | Microsoft | 64.3 |
| 32 | Claude Sonnet 4 | Anthropic | 64.2 |
| 33 | Claude 3.7 Sonnet | Anthropic | 63.5 |
| 34 | GPT-4.1 | OpenAI | 62.8 |
| 35 | Llama 3.3 70B Instruct | Meta | 61.6 |
| 36 | Mistral Large | Mistral AI | 60.8 |
| 37 | Claude Haiku 4.5 | Anthropic | 58.0 |
| 38 | Claude 3.5 Haiku | Anthropic | 55.4 |
| 39 | Llama 4 Scout | Meta | 40.2 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
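The re-normalization described above can be sketched in Python. The specific benchmark weights below are illustrative assumptions, since the page does not publish the exact weights it uses:

```python
# Hypothetical weights for illustration only; the actual weights are not
# published on this page.
BENCHMARK_WEIGHTS = {"MATH-500": 0.4, "GSM8K": 0.3, "AIME 2024": 0.3}

def weighted_score(results: dict[str, float]) -> float:
    """Weighted average of a model's available benchmark scores (0-100).

    Benchmarks missing from `results` are dropped, and the remaining
    weights are re-normalized so they still sum to 1.
    """
    available = {b: w for b, w in BENCHMARK_WEIGHTS.items() if b in results}
    if not available:
        raise ValueError("no benchmark data for this model")
    total_weight = sum(available.values())
    return sum(results[b] * w for b, w in available.items()) / total_weight
```

For example, a model scoring 90 on MATH-500 and 80 on GSM8K but missing AIME 2024 would be averaged with weights 0.4/0.7 and 0.3/0.7, not penalized for the missing benchmark.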
Based on our benchmark analysis, o3 by OpenAI is currently the #1 ranked model for math, with a weighted score of 88.2/100.
Models are ranked by a weighted average of their MATH-500, GSM8K, and AIME 2024 benchmark scores, with all scores normalized to a 0-100 scale.
We currently rank 39 models that have relevant benchmark data for math tasks.