AI models ranked by mathematical reasoning using MATH-500, GSM8K, and AIME 2024 benchmark scores.
Top-ranked model: o3, with a weighted score of 88.2. Average score across all ranked models: 71.8. Models with benchmark data: 39.
[Chart: Top 15 math models by weighted score]
[Chart: Benchmark breakdown, per-benchmark scores for the top 10 models]
| # | Model | Provider | Score |
|---|---|---|---|
| 1 | o3 | OpenAI | 88.2 |
| 2 | o4 Mini | OpenAI | 86.1 |
| 3 | Gemini 2.5 Pro | Google | 84.4 |
| 4 | o3 Mini | OpenAI | 84.0 |
| 5 | Grok 4 | xAI | 82.8 |
| 6 | R1 0528 | DeepSeek | 82.7 |
| 7 | Gemini 2.0 Flash | Google | 82.3 |
| 8 | o1 | OpenAI | 81.7 |
| 9 | R1 | DeepSeek | 80.8 |
| 10 | Gemini 2.5 Flash | Google | 78.1 |
| 11 | DeepSeek V3 | DeepSeek | 76.9 |
| 12 | Claude Opus 4.6 | Anthropic | 76.6 |
| 13 | GPT-5.4 | OpenAI | 76.4 |
| 14 | GPT-4o | OpenAI | 76.3 |
| 15 | GPT-5.2 | OpenAI | 75.2 |
| 16 | GPT-5.1 | OpenAI | 74.8 |
| 17 | GPT-5 | OpenAI | 74.0 |
| 18 | GPT-4 Turbo | OpenAI | 73.7 |
| 19 | DeepSeek V3 0324 | DeepSeek | 73.6 |
| 20 | Claude Opus 4.5 | Anthropic | 73.1 |
| 21 | GPT-4o-mini | OpenAI | 72.1 |
| 22 | Llama 3.1 70B Instruct | Meta | 71.7 |
| 23 | Gemma 4 31B | Google | 71.4 |
| 24 | Claude Opus 4 | Anthropic | 70.5 |
| 25 | Gemini 3 Flash Preview | Google | 70.4 |
| 26 | Claude Sonnet 4.6 | Anthropic | 68.9 |
| 27 | Gemma 2 27B | Google | 68.2 |
| 28 | Claude Sonnet 4.5 | Anthropic | 65.8 |
| 29 | Grok 3 | xAI | 64.9 |
| 30 | Llama 4 Maverick | Meta | 64.8 |
| 31 | Phi 4 | Microsoft | 64.3 |
| 32 | Claude Sonnet 4 | Anthropic | 64.2 |
| 33 | Claude 3.7 Sonnet | Anthropic | 63.5 |
| 34 | GPT-4.1 | OpenAI | 62.8 |
| 35 | Llama 3.3 70B Instruct | Meta | 61.6 |
| 36 | Mistral Large | Mistral AI | 60.8 |
| 37 | Claude Haiku 4.5 | Anthropic | 58.0 |
| 38 | Claude 3.5 Haiku | Anthropic | 55.4 |
| 39 | Llama 4 Scout | Meta | 40.2 |
Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. All scores are on a 0-100 scale. Data sourced from official model cards, published papers, and third-party evaluation platforms.
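The re-normalization described above can be sketched in Python. The specific benchmark weights below are illustrative assumptions, since the page does not publish the exact weights it uses:

```python
# Hypothetical weights for illustration only; the actual weights are not
# published on this page.
BENCHMARK_WEIGHTS = {"MATH-500": 0.4, "GSM8K": 0.3, "AIME 2024": 0.3}

def weighted_score(results: dict[str, float]) -> float:
    """Weighted average of a model's available benchmark scores (0-100).

    Benchmarks missing from `results` are dropped, and the remaining
    weights are re-normalized so they still sum to 1.
    """
    available = {b: w for b, w in BENCHMARK_WEIGHTS.items() if b in results}
    if not available:
        raise ValueError("no benchmark data for this model")
    total_weight = sum(available.values())
    return sum(results[b] * w for b, w in available.items()) / total_weight
```

For example, a model scoring 90 on MATH-500 and 80 on GSM8K but missing AIME 2024 would be averaged with weights 0.4/0.7 and 0.3/0.7, not penalized for the missing benchmark.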
Based on our benchmark analysis, o3 by OpenAI is currently the #1 ranked model for math, with a weighted score of 88.2/100.
Models are ranked by a weighted average of their MATH-500, GSM8K, and AIME 2024 benchmark scores, with all scores normalized to a 0-100 scale.
We currently rank 39 models that have relevant benchmark data for math tasks.