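This page analyzes how confident we are in each model's rank. The rank range shows the span of positions a model could plausibly hold, the confidence level indicates how precise the rank is, and the stability status reflects how consistent it has been over time.
Ranking confidence overview for all 300 models.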
High confidence: 300 (100.0% of models)
Medium confidence: 0 (0.0% of models)
Low confidence: 0 (0.0% of models)
Average rank spread: 4.0 positions of uncertainty
Breakdown of models by confidence level, including averages for score, range, and rank.
| Confidence level | Count | % of models |
|---|---|---|
| High | 300 | 100.0% |
| Medium | 0 | 0.0% |
| Low | 0 | 0.0% |
Models with the narrowest rank ranges. These are the rankings we are most confident in.
| # | Model | Score | Rank | Spread |
|---|---|---|---|---|
| 1 | GPT-5.4 Pro | 94.0 | 1 | ±2 |
| 2 | GPT-5.4 | 94.0 | 2 | ±3 |
| 3 | GPT-5.4 Mini | 93.3 | 3 | ±4 |
| 4 | GPT-5.2 Pro | 92.7 | 4 | ±4 |
| 5 | GPT-5.2 | 92.7 | 5 | ±4 |
| 6 | Claude Opus 4.6 | 92.1 | 6 | ±4 |
| 7 | GPT-5 Pro | 91.9 | 7 | ±4 |
| 8 | o3 Deep Research | 91.5 | 8 | ±4 |
| 9 | Claude Opus 4.5 | 90.4 | 9 | ±4 |
| 10 | GPT-5 | 90.0 | 10 | ±4 |
| 11 | Gemini 3 Flash Preview | 89.4 | 11 | ±4 |
| 12 | Claude Sonnet 4.6 | 89.2 | 12 | ±4 |
| 13 | Claude Sonnet 4.5 | 89.0 | 13 | ±4 |
| 14 | o3 Pro | 87.5 | 14 | ±4 |
| 15 | Grok 4.1 Fast | 86.9 | 15 | ±4 |
| 16 | Grok 4.20 Beta | 85.7 | 16 | ±4 |
| 17 | Grok 4 | 85.7 | 17 | ±4 |
| 18 | Gemini 3.1 Pro Preview | 85.5 | 18 | ±4 |
| 19 | o3 | 85.5 | 19 | ±4 |
| 20 | GPT-5.1 | 85.2 | 20 | ±4 |
Models with the widest rank ranges. These models could rank very differently under small changes.
| # | Model | Score | Rank | Spread |
|---|---|---|---|---|
| 1 | GPT-5.4 Mini | 93.3 | 3 | ±4 |
| 2 | GPT-5.2 Pro | 92.7 | 4 | ±4 |
| 3 | GPT-5.2 | 92.7 | 5 | ±4 |
| 4 | Claude Opus 4.6 | 92.1 | 6 | ±4 |
| 5 | GPT-5 Pro | 91.9 | 7 | ±4 |
| 6 | o3 Deep Research | 91.5 | 8 | ±4 |
| 7 | Claude Opus 4.5 | 90.4 | 9 | ±4 |
| 8 | GPT-5 | 90.0 | 10 | ±4 |
| 9 | Gemini 3 Flash Preview | 89.4 | 11 | ±4 |
| 10 | Claude Sonnet 4.6 | 89.2 | 12 | ±4 |
| 11 | Claude Sonnet 4.5 | 89.0 | 13 | ±4 |
| 12 | o3 Pro | 87.5 | 14 | ±4 |
| 13 | Grok 4.1 Fast | 86.9 | 15 | ±4 |
| 14 | Grok 4.20 Beta | 85.7 | 16 | ±4 |
| 15 | Grok 4 | 85.7 | 17 | ±4 |
| 16 | Gemini 3.1 Pro Preview | 85.5 | 18 | ±4 |
| 17 | o3 | 85.5 | 19 | ±4 |
| 18 | GPT-5.1 | 85.2 | 20 | ±4 |
| 19 | MiMo-V2-Omni | 85.0 | 21 | ±4 |
| 20 | MiMo-V2-Pro | 85.0 | 22 | ±4 |
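Cross-tabulation of confidence level against stability status. The best combination is high confidence + Stable; the worst is low confidence + Fragile.
| Confidence | Stable | Held | Fragile | Preliminary |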
|---|---|---|---|---|
| High | 139 | 0 | 160 | 1 |
| Medium | 0 | 0 | 0 | 0 |
| Low | 0 | 0 | 0 | 0 |
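A cross-tabulation like the one above can be produced by counting (confidence, stability) pairs. The sketch below is illustrative only: the record format and the `cross_tabulate` helper are assumptions, and the per-model records are reconstructed from the published counts rather than taken from the site's data.

```python
from collections import Counter

CONFIDENCE = ["High", "Medium", "Low"]
STABILITY = ["Stable", "Held", "Fragile", "Preliminary"]

def cross_tabulate(records):
    """Count models for every (confidence, stability) combination.

    Each record is a dict with "confidence" and "stability" keys
    (a hypothetical schema chosen for this example).
    """
    counts = Counter((r["confidence"], r["stability"]) for r in records)
    # Counter returns 0 for pairs that never occur, so empty cells are filled in.
    return {c: {s: counts[(c, s)] for s in STABILITY} for c in CONFIDENCE}

# Records rebuilt from the counts in the table: 139 Stable, 160 Fragile, 1 Preliminary.
records = (
    [{"confidence": "High", "stability": "Stable"}] * 139
    + [{"confidence": "High", "stability": "Fragile"}] * 160
    + [{"confidence": "High", "stability": "Preliminary"}] * 1
)

table = cross_tabulate(records)
for level in CONFIDENCE:
    print(level, [table[level][s] for s in STABILITY])
```

The row totals give a quick sanity check: the High row sums to the 300 models reported in the overview.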
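Visualization of rank uncertainty for the top 30 models. Bars show the likely rank range at 90% confidence; markers show the actual rank.
How ranking confidence is determined and what each metric means.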
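The rank range is computed by bootstrap resampling of the scoring pipeline. By running thousands of simulations with small variations, we determine the range of ranks each model could plausibly hold. The range represents a 90% confidence interval: nine times out of ten, the model's true rank falls within it.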
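The idea can be sketched with a percentile bootstrap over per-task scores. This is a minimal illustration under stated assumptions, not the site's actual pipeline: the input format (a list of per-task scores per model), the mean-score ranking rule, and the function name `bootstrap_rank_intervals` are all hypothetical.

```python
import random

def bootstrap_rank_intervals(scores_by_model, n_boot=2000, level=0.90, seed=0):
    """Return {model: (low_rank, high_rank)} from a percentile bootstrap.

    scores_by_model maps each model name to a list of per-task scores,
    all of the same length (an assumed input format).
    """
    rng = random.Random(seed)
    models = list(scores_by_model)
    n_tasks = len(scores_by_model[models[0]])
    ranks = {m: [] for m in models}

    for _ in range(n_boot):
        # Resample task indices with replacement: the "small variation".
        idx = [rng.randrange(n_tasks) for _ in range(n_tasks)]
        means = {m: sum(scores_by_model[m][i] for i in idx) / n_tasks for m in models}
        # Rank 1 is the highest mean; ties keep the original model order.
        ordered = sorted(models, key=lambda m: -means[m])
        for position, m in enumerate(ordered, start=1):
            ranks[m].append(position)

    # Cut the same number of simulated ranks from each tail (5% each for level=0.90).
    cut = int((1.0 - level) / 2.0 * (n_boot - 1))
    intervals = {}
    for m in models:
        ordered_ranks = sorted(ranks[m])
        intervals[m] = (ordered_ranks[cut], ordered_ranks[n_boot - 1 - cut])
    return intervals

# Toy data: "A" is clearly ahead, while "B" and "C" are nearly tied.
toy_scores = {
    "A": [0.9] * 20,
    "B": [0.5, 0.6] * 10,
    "C": [0.6, 0.5] * 10,
}
for model, (low, high) in bootstrap_rank_intervals(toy_scores).items():
    print(f"{model}: rank range {low}-{high}")
```

In this toy setup the clear leader keeps a single-position range, while the two closely matched models can trade places depending on which tasks are drawn.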
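The confidence level is derived from the width of the rank range. Models with a narrow range (little uncertainty) receive high confidence, meaning their rank position is reliable. Wider ranges indicate medium or low confidence: the model's position could shift significantly under different weightings or data updates.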
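A width-to-label mapping of this kind might look like the sketch below. The cutoffs (`high_max_width`, `medium_max_width`) are placeholders chosen for illustration; the source does not state which widths separate high, medium, and low confidence.

```python
def confidence_level(low_rank, high_rank, high_max_width=5, medium_max_width=15):
    """Map a rank range to a confidence label.

    The default cutoffs are illustrative assumptions, not published values.
    """
    if low_rank > high_rank:
        raise ValueError("low_rank must not exceed high_rank")
    width = high_rank - low_rank
    if width <= high_max_width:
        return "High"
    if width <= medium_max_width:
        return "Medium"
    return "Low"

print(confidence_level(1, 3))    # narrow range
print(confidence_level(10, 22))  # moderate range
print(confidence_level(40, 80))  # wide range
```

Keeping the cutoffs as parameters makes it easy to see how sensitive the High/Medium/Low split is to the chosen thresholds.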
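The stability status is a classification based on how consistent performance metrics are over time. "Stable" models show consistent rankings, "Held" models keep their position despite some fluctuation, "Fragile" models are prone to rank changes, and "Preliminary" models lack enough data history to assess stability.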
Ranking confidence is calculated using bootstrap resampling, a statistical technique that re-runs the ranking process thousands of times with slight variations to see how stable each model's position is. Models with narrow rank spreads have high confidence, while those with wide spreads have uncertain rankings.
Rank spread is the range between a model's best and worst possible rank across bootstrap simulations. A rank spread of 2 means the model might move 1 position up or down, while a spread of 20 means its true ranking is quite uncertain.
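Taking that definition literally (best and worst rank over all simulations, rather than a 90% interval), the spread for one model could be computed as below. The `rank_spread` helper and the sample ranks are made up for illustration.

```python
def rank_spread(simulated_ranks):
    """Return (spread, half_spread) for one model's simulated ranks.

    spread is worst rank minus best rank; half_spread is the movement
    up or down around the middle of that range.
    """
    if not simulated_ranks:
        raise ValueError("simulated_ranks must not be empty")
    best, worst = min(simulated_ranks), max(simulated_ranks)
    spread = worst - best
    return spread, spread / 2

# A model that lands on rank 4, 5, or 6 across simulations.
spread, half = rank_spread([4, 5, 5, 6, 5])
print(f"spread={spread}, about {half:g} position(s) up or down")
```

With these sample ranks the spread is 2, which matches the "1 position up or down" reading in the text.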
Low confidence usually means a model's score sits in a tight cluster with many competitors, so the exact ordering is sensitive to small measurement differences. Models in the middle of the leaderboard tend to have wider rank spreads than those at the very top or bottom.