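Real benchmark scores from official model cards and third-party evaluations. Compare 162 models across 21 benchmarks, from MMLU and GPQA Diamond to SWE-bench and Arena Elo. Filter by category and model type, and switch between chart and matrix views.
Jump straight to the strongest benchmark clusters instead of starting from the full matrix.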
3 benchmarks
Shows how well a model has absorbed factual knowledge during training. Top scores now saturate above 90%, so it is less useful for differentiating frontier models.
5 benchmarks
One of the best discriminators between models. Scores range widely (40-85%), making it highly informative for comparing reasoning ability.
3 benchmarks
Tests genuine mathematical reasoning, not just pattern matching. Reasoning models (o1, R1) dramatically outperform standard models here.
6 benchmarks
The most recognized coding benchmark, though it is becoming saturated above 90%. There is evidence of training data contamination in some models.
2 benchmarks
Measures instruction-following precision, critical for production applications. Models that score well here are more reliable in structured tasks.
2 benchmarks
The most trusted 'vibes-based' benchmark: it reflects real human preferences, not just academic metrics. Widely considered the most meaningful overall ranking.
MMLU tests broad knowledge across 57 academic subjects (STEM, humanities, social sciences) with roughly 16,000 multiple-choice questions. It is the most widely cited LLM benchmark.
Why it matters
Shows how well a model has absorbed factual knowledge during training. Top scores now saturate above 90%, so it is less useful for differentiating frontier models.
Saturated benchmarks have top models clustered above 90%, making them less useful for comparison.
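As a rough illustration of the saturation rule above, the check below flags a benchmark when several of its top scores sit above 90%. It is only a sketch: the function name `is_saturated`, the `min_models` cutoff of three, and the sample scores are assumptions made for this example, not values taken from the site.

```python
def is_saturated(top_scores, threshold=90.0, min_models=3):
    """Return True when at least `min_models` scores exceed `threshold`.

    `top_scores` is a list of percentage scores for one benchmark.
    Both `threshold` and `min_models` are illustrative defaults.
    """
    above = [score for score in top_scores if score > threshold]
    return len(above) >= min_models


# Hypothetical score lists, for illustration only.
clustered = [92.1, 91.4, 90.8, 88.0]   # three models above 90%
spread_out = [85.0, 72.0, 64.0, 40.0]  # wide range, nothing above 90%

print(is_saturated(clustered))
print(is_saturated(spread_out))
```

A stricter variant could also look at the gap between the best and the median score, since a benchmark can have high top scores and still separate the rest of the field.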
Scores sourced from official model cards, technical reports, and third-party evaluations (Artificial Analysis, LMSYS Arena). Last updated: 2026-05-12T12:30:07.524Z. Some scores are approximate.
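AI benchmarks are standardized tests that measure how AI models perform on specific tasks. Common benchmarks include MMLU (general knowledge), SWE-bench (coding), GPQA (scientific reasoning), MATH-500 (mathematics), Arena Elo (human preference), and HumanEval (code generation).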
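No single benchmark captures the full picture. MMLU tests breadth of knowledge, SWE-bench tests real-world coding ability, and Arena Elo reflects human preference. We recommend looking at several benchmarks together, which is why our composite score weighs multiple dimensions.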
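The idea of a composite score that weighs several dimensions can be sketched as a weighted average. This is a minimal illustration, not the site's actual formula: the function name `composite_score`, the weights, and the sample scores are all hypothetical, and the sketch assumes every input is already on a 0-100 scale (a rating such as Arena Elo would need to be normalized first).

```python
def composite_score(scores, weights):
    """Weighted average over the benchmarks a model has scores for.

    `scores` maps benchmark name -> percentage score (0-100).
    `weights` maps benchmark name -> relative weight.
    Benchmarks missing from `scores` are skipped and the remaining
    weights are renormalized, so a missing score is not counted as zero.
    """
    shared = [name for name in weights if name in scores]
    total_weight = sum(weights[name] for name in shared)
    if total_weight == 0:
        raise ValueError("no weighted benchmarks available for this model")
    weighted_sum = sum(scores[name] * weights[name] for name in shared)
    return weighted_sum / total_weight


# Hypothetical weights and scores, for illustration only.
weights = {"MMLU": 0.2, "GPQA Diamond": 0.3, "SWE-bench": 0.3, "MATH-500": 0.2}
model_scores = {"MMLU": 90.0, "GPQA Diamond": 60.0, "SWE-bench": 50.0}

print(round(composite_score(model_scores, weights), 2))
```

Renormalizing over the available benchmarks is one design choice among several. It avoids penalizing models that were never evaluated on a benchmark, at the cost of making scores less comparable between models with different coverage.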
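Our benchmark data is refreshed hourly from provider APIs and community evaluations. New benchmarks are added once they become industry standards. Arena Elo ratings are updated continuously based on user votes.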
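Benchmarks are useful indicators but not perfect predictors. A model that scores highly on MMLU is not necessarily the best at creative writing, and a high SWE-bench score does not guarantee faster coding assistance. Real-world performance depends on your specific use case, prompt engineering, and integration approach.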