Real benchmark scores from official model cards and third-party evaluations. Compare 49 models across 20 benchmarks — from MMLU and GPQA Diamond to SWE-bench and Arena Elo. Filter by category and model type, and switch between chart and matrix views.
Jump straight to the strongest benchmark clusters instead of starting from the full matrix.
3 benchmarks
Shows how well a model has absorbed factual knowledge during training. Saturating above 90%, so less useful for differentiating frontier models.
4 benchmarks
One of the best discriminators between models. Scores range widely (40-85%), making it highly informative for comparing reasoning ability.
3 benchmarks
Tests genuine mathematical reasoning, not just pattern matching. Reasoning models (o1, R1) dramatically outperform standard models here.
6 benchmarks
The most recognized coding benchmark, though becoming saturated above 90%. Evidence of training data contamination in some models.
2 benchmarks
Measures instruction-following precision, critical for production applications. Models that score well here are more reliable in structured tasks.
2 benchmarks
The most trusted 'vibes-based' benchmark — reflects real human preferences, not just academic metrics. Widely considered the most meaningful overall ranking.
Tests broad knowledge across 57 academic subjects (STEM, humanities, social sciences) with roughly 16,000 multiple-choice questions. The most widely cited LLM benchmark.
Why it matters
Shows how well a model has absorbed factual knowledge during training. Saturating above 90%, so less useful for differentiating frontier models.
| # | Model | Score |
|---|---|---|
| 1 | 🥇GPT-5.4 | 94.0% |
| 2 | 🥈GPT-5.2 | 93.5% |
| 3 | 🥉GPT-5 | 93.0% |
| 4 | Gemini 3 Pro | 92.5% |
| 5 | o3 | 92.3% |
| 6 | Claude Opus 4.6 | 92.1% |
| 7 | o1 | 91.8% |
| 8 | DeepSeek R1-0528 | 91.5% |
| 9 | Grok 4 | 91.5% |
| 10 | Claude Opus 4.5 | 91.4% |
| 11 | Claude Sonnet 4.6 | 91.2% |
| 12 | Claude Opus 4 | 91.0% |
| 13 | Gemini 2.5 Pro | 90.8% |
| 14 | DeepSeek R1 | 90.8% |
| 15 | Claude Sonnet 4.5 | 90.8% |
| 16 | Claude 3.7 Sonnet | 90.2% |
| 17 | Claude Sonnet 4 | 89.5% |
| 18 | GPT-4.1 | 89.2% |
| 19 | DeepSeek V3 (March 2025) | 89.2% |
| 20 | GPT-4o | 88.7% |
| 21 | Claude 3.5 Sonnet | 88.7% |
| 22 | Llama 3.1 405B | 88.6% |
| 23 | DeepSeek V3 | 88.5% |
| 24 | Grok 3 | 88.5% |
| 25 | Llama 4 Maverick | 88.0% |
| 26 | Gemini 3 Flash | 88.0% |
| 27 | Grok 2 | 87.5% |
| 28 | o3-mini | 86.9% |
| 29 | Claude 3 Opus | 86.8% |
| 30 | GPT-4 Turbo | 86.5% |
| 31 | Llama 3.3 70B | 86.3% |
| 32 | Qwen 2.5 72B | 86.1% |
| 33 | Llama 3.1 70B | 86.0% |
| 34 | Gemini 1.5 Pro | 85.9% |
| 35 | Gemini 2.5 Flash | 85.8% |
| 36 | o1-mini | 85.2% |
| 37 | Phi-4 | 84.8% |
| 38 | Mistral Large 2 | 84.7% |
| 39 | Claude Haiku 4.5 | 84.5% |
| 40 | Mistral Large 2 | 84.0% |
| 41 | GPT-4o mini | 82.0% |
| 42 | Claude 3.5 Haiku | 80.9% |
| 43 | Mixtral 8x22B | 77.3% |
| 44 | Gemini 2.0 Flash | 76.4% |
| 45 | Command R+ | 75.7% |
| 46 | Gemma 2 27B | 75.2% |
Performance Tiers
Model Types
Saturated benchmarks have top models clustered above 90%, making them less useful for comparison.
Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-03-07. Some scores are approximate.
AI benchmarks are standardized tests that measure how AI models perform on specific tasks. Common benchmarks include MMLU (general knowledge), SWE-bench (coding), GPQA (scientific reasoning), MATH-500 (mathematics), Arena Elo (human preference), and HumanEval (code generation).
No single benchmark captures the full picture. MMLU tests breadth of knowledge, SWE-bench tests real-world coding ability, and Arena Elo reflects human preference. We recommend looking at several benchmarks together, which is why our composite score weighs multiple dimensions.
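To make "weighs multiple dimensions" concrete, here is a minimal sketch of how a composite score could be computed: normalize each benchmark to a 0–1 range, then take a weighted average. The weights, normalization ranges, and function name below are hypothetical placeholders for illustration, not the actual formula used on this page.

```python
# Illustrative sketch only: the weights and normalization ranges are
# hypothetical, not the site's published methodology.

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of benchmark scores, each normalized to 0-1."""
    # Percent-based benchmarks normalize over 0-100; Elo needs an explicit
    # range because it is not a percentage (range below is an assumption).
    norms = {
        "MMLU": (0.0, 100.0),
        "GPQA Diamond": (0.0, 100.0),
        "SWE-bench": (0.0, 100.0),
        "Arena Elo": (1000.0, 1500.0),
    }
    total, weight_sum = 0.0, 0.0
    for name, raw in scores.items():
        lo, hi = norms[name]
        normalized = (raw - lo) / (hi - lo)
        w = weights.get(name, 0.0)
        total += w * normalized
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


example = composite_score(
    {"MMLU": 92.5, "GPQA Diamond": 80.0, "SWE-bench": 65.0, "Arena Elo": 1350.0},
    {"MMLU": 0.2, "GPQA Diamond": 0.3, "SWE-bench": 0.3, "Arena Elo": 0.2},
)
print(f"composite: {example:.3f}")  # -> composite: 0.760
```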
Our benchmark data is refreshed hourly from provider APIs and community evaluations. New benchmarks are added once they become industry standards. Arena Elo ratings update continuously based on user votes.
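For readers unfamiliar with vote-driven ratings, the sketch below shows a textbook Elo update for a single pairwise vote. Chatbot Arena's actual ranking pipeline is more involved (it reportedly fits a Bradley–Terry style model rather than applying online Elo updates), so treat this only as an illustration of the underlying idea; the K-factor and function name are assumptions.

```python
# Minimal sketch of an Elo-style pairwise update with a fixed K-factor.
# Not the Arena's actual methodology; illustrative only.

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one head-to-head vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b


print(elo_update(1300.0, 1250.0, a_won=True))  # winner gains ~14 points, loser drops ~14
```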
Benchmarks are useful indicators but not perfect predictors. A model that scores high on MMLU is not necessarily the best fit for creative writing, and a high SWE-bench score does not guarantee faster coding assistance. Real-world performance depends on your specific use case, prompt engineering, and integration approach.