Best AI Models for Reasoning

AI models ranked by reasoning ability using GPQA, ARC-Challenge, BIG-Bench Hard, and Humanity's Last Exam scores.


#1 Model: GPT-4o (score: 76.5)

Average Score: 45.9 (across all ranked models)

Models Ranked: 46 (with benchmark data)

Weights: GPQA (40%), ARC-Challenge (20%), BIG-Bench Hard (20%), Humanity's Last Exam (20%)

[Chart: Top Models for Reasoning by Weighted Score. Top 15 models by weighted score.]

[Chart: Benchmark Breakdown. Per-benchmark scores for top 10 models: GPQA, ARC-Challenge, BIG-Bench Hard, Humanity's Last Exam.]

#	Model	Provider	Score	GPQA	ARC-Challenge	BIG-Bench Hard	Humanity's Last Exam
1	GPT-4o	OpenAI	76.5	--	96.4	83.7	--
2	GPT-4o-mini	OpenAI	75.1	--	96.4	80.4	--
3	Llama 3.1 70B Instruct	Meta	74.8	--	94.8	81.2	--
4	Gemma 2 27B	Google	72.2	--	93.2	--	--
5	DeepSeek V3 0324	DeepSeek	68.2	--	--	88	--
6	DeepSeek V3	DeepSeek	67.8	--	--	87.5	--
7	R1 0528	DeepSeek	67	--	--	86.5	--
8	R1	DeepSeek	65.9	--	--	85	--
9	GPT-4 Turbo	OpenAI	64.4	--	--	83.1	--
10	Llama 3.3 70B Instruct	Meta	64.2	--	--	82.8	--
11	Mistral Large	Mistral AI	62	--	--	80	--
12	Claude Haiku 4.5	Anthropic	60.8	--	--	78.5	--
13	Llama 4 Scout	Meta	58.9	--	--	76	--
14	GPT-5.4	OpenAI	55.7	--	--	92	39
15	Claude Opus 4.6	Anthropic	55.1	--	--	91.5	38.2
16	GPT-5.2	OpenAI	54.6	--	--	91.5	37
17	GPT-5	OpenAI	53.3	--	--	90.5	35
18	Gemini 2.5 Pro	Google	52.4	--	--	88	35.2
19	o3	OpenAI	52.3	--	--	93	30.1
20	Gemini 3 Flash Preview	Google	52.1	--	--	89	33.7
21	Claude Opus 4.5	Anthropic	51.9	--	--	90	32.1
22	Claude Sonnet 4.6	Anthropic	51.1	--	--	89.8	30.5
23	Claude Opus 4	Anthropic	49.9	--	--	89	28.5
24	Phi 4	Microsoft	49.7	20.8	95.5	78	--
25	GPT-5.1	OpenAI	48.7	--	--	91	23.7
26	o3 Mini	OpenAI	46.2	--	--	88.5	20.3
27	o4 Mini	OpenAI	46.1	--	--	90.5	18.1
28	Claude Fable 5	Anthropic	45.7	--	--	--	59
29	Claude Sonnet 4.5	Anthropic	43.4	--	--	88.5	13.7
30	Gemini 2.5 Flash	Google	41.3	--	--	85	12.1
31	o1	OpenAI	41.3	--	--	89	8.1
32	Gemma 4 31B	Google	39.9	--	--	74.4	19.5
33	Claude Sonnet 4	Anthropic	39.3	--	--	87	5.5
34	Claude Opus 4.8	Anthropic	38.6	--	--	--	49.8
35	Llama 4 Maverick	Meta	38.3	--	--	84.5	5.7
36	GPT-4.1	OpenAI	38	--	--	84	5.4
37	Claude Opus 4.7	Anthropic	36.3	--	--	--	46.9
38	GPT-5.5	OpenAI	32.1	--	--	--	41.4
39	GPT-5 Mini	OpenAI	15.1	--	--	--	19.4
40	Command R7B (12-2024)	Cohere	14.6	7.8	--	36	--
41	Qwen2.5 7B Instruct	Alibaba	13	5.5	--	34.9	--
42	Llama 3.1 8B Instruct	Meta	12.9	7.4	--	30.9	--
43	Llama 3.2 3B Instruct	Meta	10.3	6.2	--	24.2	--
44	Gemini 3.1 Flash Lite	Google	6.7	--	--	--	8.6
45	Llama 3 8B Instruct	Meta	6.4	2.1	--	18.4	--
46	Mistral Medium 3	Mistral AI	3.5	--	--	--	4.5

How scores are calculated

Each model's score is a weighted average of its available benchmark results. When a model is missing some benchmarks, the weights are re-normalized across the benchmarks that are available. All scores are on a 0-100 scale. Data is sourced from official model cards, published papers, and third-party evaluation platforms.
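The re-normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation: the function name `weighted_score` and the dictionary layout are assumptions, and only the weights and benchmark values come from this page. Missing benchmarks (shown as `--` in the table) are represented as `None`.

```python
from typing import Mapping, Optional

# Benchmark weights as listed on this page.
WEIGHTS = {
    "GPQA": 0.40,
    "ARC-Challenge": 0.20,
    "BIG-Bench Hard": 0.20,
    "Humanity's Last Exam": 0.20,
}


def weighted_score(results: Mapping[str, Optional[float]]) -> Optional[float]:
    """Weighted average over the benchmarks a model actually has.

    Hypothetical helper: the weights of the available benchmarks are
    re-normalized so that they sum to 1 before averaging.
    """
    available = {
        name: value
        for name, value in results.items()
        if name in WEIGHTS and value is not None
    }
    if not available:
        return None  # no benchmark data, so the model cannot be scored
    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_sum = sum(WEIGHTS[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Benchmark values taken from the table above; None marks a missing result.
gpt_4o = {
    "GPQA": None,
    "ARC-Challenge": 96.4,
    "BIG-Bench Hard": 83.7,
    "Humanity's Last Exam": None,
}
phi_4 = {
    "GPQA": 20.8,
    "ARC-Challenge": 95.5,
    "BIG-Bench Hard": 78.0,
    "Humanity's Last Exam": None,
}

print(round(weighted_score(gpt_4o), 2))
print(round(weighted_score(phi_4), 2))
```

Note that this sketch covers only the re-normalization step as described. It yields roughly 90.05 for GPT-4o and 53.78 for Phi 4, whereas the table lists 76.5 and 49.7, so the published scores presumably include a further adjustment that this page does not describe.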

Other Specialty Leaderboards

Best for Coding, Best for Math, Best for Writing, Best for Instructions, Best for Data Analysis, Best for Roleplay, Best for Multilingual

Frequently Asked Questions

Based on our benchmark analysis, GPT-4o by OpenAI is currently the #1 ranked model for reasoning, with a weighted score of 76.5/100.

Models are ranked using a weighted average of their GPQA, ARC-Challenge, BIG-Bench Hard, and Humanity's Last Exam benchmark scores. All scores are normalized to a 0-100 scale.

We currently rank 46 models that have relevant benchmark data for reasoning tasks.