Best Reasoning AI Models

Ranks AI models by reasoning ability using their GPQA, ARC-Challenge, BIG-Bench Hard, and Humanity's Last Exam scores.

Last updated: 27m ago

Top-ranked model: GPT-4o (score: 76.5)

Average score: 47.4


Weights: GPQA (40%), ARC-Challenge (20%), BIG-Bench Hard (20%), Humanity's Last Exam (20%)

[Chart: Top 15 reasoning models by weighted score. Source: LMMarketCap.com]

[Chart: Benchmark breakdown. Per-benchmark scores for the top 10 models across GPQA, ARC-Challenge, BIG-Bench Hard, and Humanity's Last Exam.]

#	Model	Provider	Score	GPQA	ARC-Challenge	BIG-Bench Hard	Humanity's Last Exam
1	GPT-4o	OpenAI	76.5	--	96.4	83.7	--
2	GPT-4o-mini	OpenAI	75.1	--	96.4	80.4	--
3	Llama 3.1 70B Instruct	Meta	74.8	--	94.8	81.2	--
4	Gemma 2 27B	Google	72.2	--	93.2	--	--
5	Grok 4	xAI	69.7	--	--	90	--
6	DeepSeek V3 0324	DeepSeek	68.2	--	--	88	--
7	DeepSeek V3	DeepSeek	67.8	--	--	87.5	--
8	R1 0528	DeepSeek	67	--	--	86.5	--
9	Grok 3	xAI	66.6	--	--	86	--
10	R1	DeepSeek	65.9	--	--	85	--
11	GPT-4 Turbo	OpenAI	64.4	--	--	83.1	--
12	Llama 3.3 70B Instruct	Meta	64.2	--	--	82.8	--
13	Gemini 2.0 Flash	Google	62.4	--	--	80.5	--
14	Mistral Large	Mistral AI	62	--	--	80	--
15	Claude Haiku 4.5	Anthropic	60.8	--	--	78.5	--
16	Llama 4 Scout	Meta	58.9	--	--	76	--
17	GPT-5.4	OpenAI	55.7	--	--	92	39
18	Claude Opus 4.6	Anthropic	55.1	--	--	91.5	38.2
19	GPT-5.2	OpenAI	54.6	--	--	91.5	37
20	GPT-5	OpenAI	53.3	--	--	90.5	35
21	Gemini 2.5 Pro	Google	52.4	--	--	88	35.2
22	o3	OpenAI	52.3	--	--	93	30.1
23	Gemini 3 Flash Preview	Google	52.1	--	--	89	33.7
24	Claude Opus 4.5	Anthropic	51.9	--	--	90	32.1
25	Claude Sonnet 4.6	Anthropic	51.1	--	--	89.8	30.5
26	Claude Opus 4	Anthropic	49.9	--	--	89	28.5
27	Phi 4	Microsoft	49.7	20.8	95.5	78	--
28	GPT-5.1	OpenAI	48.7	--	--	91	23.7
29	o3 Mini	OpenAI	46.2	--	--	88.5	20.3
30	o4 Mini	OpenAI	46.1	--	--	90.5	18.1
31	Claude Sonnet 4.5	Anthropic	43.4	--	--	88.5	13.7
32	Claude 3.7 Sonnet	Anthropic	41.5	--	--	89.5	8
33	Gemini 2.5 Flash	Google	41.3	--	--	85	12.1
34	o1	OpenAI	41.3	--	--	89	8.1
35	Gemma 4 31B	Google	39.9	--	--	74.4	19.5
36	Claude Sonnet 4	Anthropic	39.3	--	--	87	5.5
37	Llama 4 Maverick	Meta	38.3	--	--	84.5	5.7
38	GPT-4.1	OpenAI	38	--	--	84	5.4
39	Claude Opus 4.7	Anthropic	28.1	--	--	--	36.2
40	GPT-5 Mini	OpenAI	15.1	--	--	--	19.4
41	Command R7B (12-2024)	Cohere	14.6	7.8	--	36	--
42	Qwen2.5 7B Instruct	Alibaba	13	5.5	--	34.9	--
43	Llama 3.1 8B Instruct	Meta	12.9	7.4	--	30.9	--
44	Llama 3.2 3B Instruct	Meta	10.3	6.2	--	24.2	--
45	Gemini 3.1 Flash Lite	Google	6.7	--	--	--	8.6
46	Llama 3 8B Instruct	Meta	6.4	2.1	--	18.4	--
47	Mistral Medium 3	Mistral AI	3.5	--	--	--	4.5
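
The page states the benchmark weights but not how a model with missing benchmarks ("--") is scored. The Python sketch below shows one rule that appears consistent with the rows above. It is an assumption inferred from the table, not a method documented by the site: the stated weights are renormalized over the benchmarks a model actually has, and the result is multiplied by a coverage factor of 0.7 + 0.075 × (number of benchmarks present). The function name `weighted_score` and the coverage factor are hypothetical. Under this rule most listed scores are reproduced to within about 0.1; a few rows (for example Llama 3.2 3B Instruct and GPT-5 Mini) come out roughly 0.1 off, plausibly because the displayed inputs are rounded.

```python
# Weights as stated on the page.
WEIGHTS = {
    "GPQA": 0.40,
    "ARC-Challenge": 0.20,
    "BIG-Bench Hard": 0.20,
    "Humanity's Last Exam": 0.20,
}


def weighted_score(scores):
    """Estimate the leaderboard score from the benchmarks a model has.

    ASSUMPTION (inferred from the table, not documented by the site):
    weights are renormalized over the available benchmarks, then the
    weighted mean is scaled by a coverage factor of 0.7 + 0.075 * n,
    where n is the number of benchmarks with a reported value.
    """
    # Drop benchmarks with no reported value ("--" in the table).
    available = {name: value for name, value in scores.items() if value is not None}
    if not available:
        return None

    total_weight = sum(WEIGHTS[name] for name in available)
    weighted_mean = (
        sum(WEIGHTS[name] * value for name, value in available.items()) / total_weight
    )
    coverage_factor = 0.7 + 0.075 * len(available)
    return weighted_mean * coverage_factor


# Inputs copied from the table rows for GPT-4o, Phi 4, and Command R7B (12-2024).
gpt_4o = weighted_score({"ARC-Challenge": 96.4, "BIG-Bench Hard": 83.7})
phi_4 = weighted_score({"GPQA": 20.8, "ARC-Challenge": 95.5, "BIG-Bench Hard": 78})
command_r7b = weighted_score({"GPQA": 7.8, "BIG-Bench Hard": 36})

# Listed scores for these rows are 76.5, 49.7, and 14.6.
print(round(gpt_4o, 1), round(phi_4, 1), round(command_r7b, 1))
```

If this reading is right, it would explain why models with only ARC-Challenge and BIG-Bench Hard results rank above models that also report a low Humanity's Last Exam score: a missing benchmark costs a fixed fraction of the score, while a hard benchmark with a low value pulls the weighted mean down much further.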