Real benchmark scores from official model cards and third-party evaluations. Compare 49 models across 20 benchmarks — from MMLU and GPQA Diamond to SWE-bench and Arena Elo. Filter by category and model type, and switch between chart and matrix views.
Jump straight to the strongest benchmark clusters instead of starting from the full matrix.
3 benchmarks
Shows how well a model has absorbed factual knowledge during training. Saturating above 90%, so less useful for differentiating frontier models.
4 benchmarks
One of the best discriminators between models. Scores range widely (40-85%), making it highly informative for comparing reasoning ability.
3 benchmarks
Tests genuine mathematical reasoning, not just pattern matching. Reasoning models (o1, R1) dramatically outperform standard models here.
6 benchmarks
The most recognized coding benchmark, though becoming saturated above 90%. Evidence of training data contamination in some models.
2 benchmarks
Measures instruction-following precision, critical for production applications. Models that score well here are more reliable in structured tasks.
2 benchmarks
The most trusted 'vibes-based' benchmark — reflects real human preferences, not just academic metrics. Widely considered the most meaningful overall ranking.
Tests broad knowledge across 57 academic subjects (STEM, humanities, social sciences) with roughly 16,000 multiple-choice questions. The most widely cited LLM benchmark.
Why it matters
Shows how well a model has absorbed factual knowledge during training. Saturating above 90%, so less useful for differentiating frontier models.
| # | Model | Score |
|---|---|---|
| 1 | 🥇GPT-5.4 | 94.0% |
| 2 | 🥈GPT-5.2 | 93.5% |
| 3 | 🥉GPT-5 | 93.0% |
| 4 | Gemini 3 Pro | 92.5% |
| 5 | o3 | 92.3% |
| 6 | Claude Opus 4.6 | 92.1% |
| 7 | o1 | 91.8% |
| 8 | DeepSeek R1-0528 | 91.5% |
| 9 | Grok 4 | 91.5% |
| 10 | Claude Opus 4.5 | 91.4% |
| 11 | Claude Sonnet 4.6 | 91.2% |
| 12 | Claude Opus 4 | 91.0% |
| 13 | Gemini 2.5 Pro | 90.8% |
| 14 | DeepSeek R1 | 90.8% |
| 15 | Claude Sonnet 4.5 | 90.8% |
| 16 | Claude 3.7 Sonnet | 90.2% |
| 17 | Claude Sonnet 4 | 89.5% |
| 18 | GPT-4.1 | 89.2% |
| 19 | DeepSeek V3 (March 2025) | 89.2% |
| 20 | GPT-4o | 88.7% |
| 21 | Claude 3.5 Sonnet | 88.7% |
| 22 | Llama 3.1 405B | 88.6% |
| 23 | DeepSeek V3 | 88.5% |
| 24 | Grok 3 | 88.5% |
| 25 | Llama 4 Maverick | 88.0% |
| 26 | Gemini 3 Flash | 88.0% |
| 27 | Grok 2 | 87.5% |
| 28 | o3-mini | 86.9% |
| 29 | Claude 3 Opus | 86.8% |
| 30 | GPT-4 Turbo | 86.5% |
| 31 | Llama 3.3 70B | 86.3% |
| 32 | Qwen 2.5 72B | 86.1% |
| 33 | Llama 3.1 70B | 86.0% |
| 34 | Gemini 1.5 Pro | 85.9% |
| 35 | Gemini 2.5 Flash | 85.8% |
| 36 | o1-mini | 85.2% |
| 37 | Phi-4 | 84.8% |
| 38 | Mistral Large 2 | 84.7% |
| 39 | Claude Haiku 4.5 | 84.5% |
| 40 | Mistral Large 2 | 84.0% |
| 41 | GPT-4o mini | 82.0% |
| 42 | Claude 3.5 Haiku | 80.9% |
| 43 | Mixtral 8x22B | 77.3% |
| 44 | Gemini 2.0 Flash | 76.4% |
| 45 | Command R+ | 75.7% |
| 46 | Gemma 2 27B | 75.2% |
Performance Tiers
Model Types
Saturated benchmarks have top models clustered above 90%, making them less useful for comparison.
Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.
AI benchmarks are standardized tests that measure how AI models perform on specific tasks. Common benchmarks include MMLU (general knowledge), SWE-bench (coding), GPQA (scientific reasoning), MATH-500 (mathematics), Arena Elo (human preference), and HumanEval (code generation).
No single benchmark captures the full picture. MMLU tests breadth of knowledge, SWE-bench tests real-world coding ability, and Arena Elo reflects human preference. We recommend looking at several benchmarks together, which is why our composite score weighs multiple dimensions.
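To make "weighs multiple dimensions" concrete, here is a minimal sketch of how a composite score could be computed: normalize each benchmark to a 0–1 range, then take a weighted average. The weights, normalization ranges, and function name below are hypothetical placeholders for illustration, not the actual formula used on this page.

```python
# Illustrative sketch only: the weights and normalization ranges are
# hypothetical, not the site's published methodology.

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores, each normalized to 0-1."""
    # Percent-based benchmarks normalize over 0-100; Elo needs an explicit
    # range because it is not a percentage (range below is an assumption).
    norms = {
        "MMLU": (0.0, 100.0),
        "GPQA Diamond": (0.0, 100.0),
        "SWE-bench": (0.0, 100.0),
        "Arena Elo": (1000.0, 1500.0),
    }
    total, weight_sum = 0.0, 0.0
    for name, raw in scores.items():
        lo, hi = norms[name]
        normalized = (raw - lo) / (hi - lo)
        w = weights.get(name, 0.0)
        total += w * normalized
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


example = composite_score(
    {"MMLU": 92.5, "GPQA Diamond": 80.0, "SWE-bench": 65.0, "Arena Elo": 1350.0},
    {"MMLU": 0.2, "GPQA Diamond": 0.3, "SWE-bench": 0.3, "Arena Elo": 0.2},
)
print(f"composite: {example:.3f}")  # -> composite: 0.760
```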
Our benchmark data is refreshed hourly from provider APIs and community evaluations. New benchmarks are added once they become industry standards. Arena Elo ratings update continuously based on user votes.
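For readers unfamiliar with vote-driven ratings, the sketch below shows a textbook Elo update for a single pairwise vote. Chatbot Arena's actual ranking pipeline is more involved (it reportedly fits a Bradley–Terry style model rather than applying online Elo updates), so treat this only as an illustration of the underlying idea; the K-factor and function name are assumptions.

```python
# Minimal sketch of an Elo-style pairwise update with a fixed K-factor.
# Not the Arena's actual methodology; illustrative only.

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b


print(elo_update(1300.0, 1250.0, a_won=True))  # winner gains ~14 points, loser drops ~14
```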
Benchmarks are useful indicators but not perfect predictors. A model that scores high on MMLU is not necessarily the best fit for creative writing, and a high SWE-bench score does not guarantee faster coding assistance. Real-world performance depends on your specific use case, prompt engineering, and integration approach.