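Which AI models are the most consistent over time? This report analyzes rank changes, state classifications, and sparkline volatility for 300 tracked models to produce a stability score from 0 to 100.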
Rock Solid: 55
Consistent: 68
Variable: 37
Volatile: 140
The 20 models with the highest stability scores. These models hold consistent rankings with minimal volatility.
| # | Model | Provider | Score | Stability | 24h rank change | 7d rank change |
|---|---|---|---|---|---|---|
| 1 | GPT-5.4 | OpenAI | 94.0 | 100 | -1 | -1 |
| 2 | o3 Deep Research | OpenAI | 91.5 | 100 | -1 | -1 |
| 3 | GPT-5 | OpenAI | 90.0 | 100 | 0 | +1 |
| 4 | Qwen Plus 0728 | Alibaba | 76.8 | 100 | 0 | +1 |
| 5 | GPT Audio Mini | OpenAI | 68.4 | 100 | -1 | -1 |
| 6 | Sonar | Perplexity | 53.5 | 100 | 0 | -1 |
| 7 | LFM2-8B-A1B | Liquid AI | 53.2 | 100 | -1 | -1 |
| 8 | Nova Micro 1.0 | Amazon | 51.0 | 100 | 0 | -1 |
| 9 | Olmo 2 32B Instruct | Allen AI | 44.3 | 100 | -1 | 0 |
| 10 | GPT-4 Turbo (older v1106) | OpenAI | 42.7 | 100 | 0 | -1 |
| 11 | GPT-4 | OpenAI | 39.0 | 100 | 0 | +1 |
| 12 | Llama 3.2 3B Instruct | Meta | 35.8 | 100 | -1 | -1 |
| 13 | Llama 3.2 3B Instruct (free) | Meta | 35.0 | 100 | 0 | -1 |
| 14 | GPT-3.5 Turbo Instruct | OpenAI | 32.2 | 100 | 0 | +2 |
| 15 | WizardLM-2 8x22B | Microsoft | 32.0 | 100 | +1 | 0 |
| 16 | Gemma 2 9B | Google | 30.1 | 100 | -1 | 0 |
| 17 | Mistral 7B Instruct v0.1 | Mistral AI | 19.8 | 100 | 0 | 0 |
| 18 | o3 Pro | OpenAI | 87.5 | 99 | +1 | +1 |
| 19 | GPT-5.4 Mini | OpenAI | 93.3 | 99 | -1 | +1 |
| 20 | Llemma 7b | eleutherai | 47.3 | 98 | 0 | +3 |
The 20 models with the lowest stability scores. These models show significant rank volatility or inconsistent states.
| # | Model | Provider | Score | Stability | 24h rank change | 7d rank change |
|---|---|---|---|---|---|---|
| 1 | Ministral 3 8B 2512 | Mistral AI | 73.5 | 35 | -18 | +17 |
| 2 | Ministral 3 14B 2512 | Mistral AI | 73.5 | 35 | -20 | +16 |
| 3 | Devstral 2 2512 | Mistral AI | 67.7 | 35 | -10 | +14 |
| 4 | Composer 2 | Cursor | 76.4 | 35 | +10 | +15 |
| 5 | Mistral Small Creative | Mistral AI | 59.0 | 35 | -14 | +7 |
| 6 | Mistral Small 3.2 24B | Mistral AI | 67.2 | 35 | +15 | +9 |
| 7 | Mistral Medium 3.1 | Mistral AI | 70.1 | 35 | -10 | +9 |
| 8 | Olmo 3.1 32B Instruct | Allen AI | 64.9 | 35 | +8 | +22 |
| 9 | MiMo-V2-Omni | Xiaomi | 85.0 | 35 | +22 | +16 |
| 10 | GPT-4o Search Preview | OpenAI | 63.4 | 35 | +7 | +7 |
| 11 | gpt-oss-20b (free) | OpenAI | 73.7 | 35 | +25 | +20 |
| 12 | Llama 3.2 11B Vision Instruct | Meta | 54.3 | 36 | -8 | +7 |
| 13 | o4 Mini Deep Research | OpenAI | 85.0 | 36 | -7 | +11 |
| 14 | GPT-5 Nano | OpenAI | 75.5 | 36 | -8 | +18 |
| 15 | Seed 1.6 | ByteDance | 85.0 | 36 | -6 | +12 |
| 16 | GPT-4.1 Nano | OpenAI | 80.5 | 36 | +12 | +12 |
| 17 | GPT-5.3 Chat | OpenAI | 85.0 | 36 | +31 | +16 |
| 18 | gpt-oss-120b (free) | OpenAI | 73.7 | 36 | -6 | +28 |
| 19 | Qwen Plus 0728 (thinking) | Alibaba | 82.7 | 36 | -9 | +7 |
| 20 | Claude Opus 4.1 | Anthropic | 81.9 | 36 | -25 | +8 |
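Aggregate stability metrics by provider. Providers are ranked by the average stability score across all of their models.
| Provider | Models | Avg. stability |
|---|---|---|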
| eleutherai | 1 | 98.0 |
| Windsurf | 1 | 93.7 |
| Inflection | 2 | 91.4 |
| Microsoft | 2 | 89.7 |
| Vercel | 1 | 87.5 |
| JetBrains | 1 | 82.9 |
| Cohere | 4 | 75.0 |
| Meituan | 1 | 74.6 |
| Meta | 14 | 71.5 |
| Anthropic | 13 | 68.3 |
| OpenAI | 60 | 63.9 |
| StepFun | 2 | 63.9 |
| Liquid AI | 5 | 63.8 |
| xAI | 10 | 63.7 |
| Allen AI | 4 | 62.1 |
| Mistral AI | 25 | 59.7 |
| DeepSeek | 11 | 59.6 |
| Upstage | 1 | 58.2 |
| Amazon | 5 | 58.0 |
| Perplexity | 5 | 57.8 |
| Alibaba | 50 | 57.0 |
|  | 23 | 56.7 |
| Cursor | 2 | 56.5 |
| aion-labs | 3 | 56.3 |
| reka | 1 | 55.9 |
| arcee-ai | 7 | 55.0 |
| NVIDIA | 11 | 52.8 |
| MiniMax | 8 | 52.1 |
| Xiaomi | 3 | 50.8 |
| Inception | 3 | 49.9 |
| ByteDance | 5 | 48.9 |
| Baidu | 5 | 47.8 |
| AI21 Labs | 1 | 45.7 |
| IBM | 1 | 44.1 |
| essentialai | 1 | 40.8 |
| Moonshot AI | 4 | 40.1 |
| Writer | 1 | 37.7 |
| Tencent | 1 | 37.5 |
| Kuaishou | 1 | 37.4 |
| deepcogito | 1 | 36.0 |
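The provider ranking above is described as the mean stability score across each provider's models. A minimal sketch of that aggregation follows; the function name `provider_averages`, the input shape, and the sample values are hypothetical and are not taken from the report's data.

```python
from collections import defaultdict
from statistics import mean


def provider_averages(models):
    """Return (provider, model_count, mean_stability) rows, highest mean first.

    `models` is an iterable of (provider, stability_score) pairs.
    This input shape is an assumption made for illustration.
    """
    by_provider = defaultdict(list)
    for provider, stability in models:
        by_provider[provider].append(stability)

    rows = [
        (provider, len(scores), round(mean(scores), 1))
        for provider, scores in by_provider.items()
    ]
    # Rank providers by their average stability, descending.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical sample, not the report's underlying data.
sample = [
    ("A", 100.0), ("A", 80.0),
    ("B", 95.0),
    ("C", 40.0), ("C", 50.0), ("C", 60.0),
]
rows = provider_averages(sample)
```

One consequence of a plain mean is visible in the table itself: providers with a single tracked model can land at either extreme, while providers with dozens of models tend toward the middle.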
Distribution of stability scores across all 300 tracked models.
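Our stability scoring system uses three key signals to measure how consistently a model performs over time.
The most direct measure of stability. A model loses up to 25 points for large rank changes over 24 hours (5 points per rank position moved) and up to 21 points for 7-day changes (3 points per position). Models whose rank holds steady score higher.
Every model has a state that reflects its overall reliability. Models in the "stable" state receive a 10-point bonus, while "fragile" models lose 15 points. This captures systemic reliability beyond simple rank movement.
The 14-day sparkline data reveals hidden volatility. We compute the standard deviation of the sparkline and subtract up to 20 points. Even a model that ends up back where it started is penalized if it swung sharply in between.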
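To illustrate why a model that returns to its starting rank can still be penalized, here is a rough sketch of a sparkline penalty. The report states only that the standard deviation is used and that the deduction is capped at 20 points. The scaling factor `points_per_unit`, the choice of population standard deviation, and the sample series are all assumptions.

```python
from statistics import pstdev

MAX_VOLATILITY_PENALTY = 20  # cap stated in the report


def volatility_penalty(sparkline, points_per_unit=1.0):
    """Penalty derived from the spread of a rank sparkline, capped at 20.

    `points_per_unit` (points deducted per unit of standard deviation) is an
    assumption; the report does not say how the deviation maps to points.
    """
    if len(sparkline) < 2:
        return 0.0
    return min(MAX_VOLATILITY_PENALTY, pstdev(sparkline) * points_per_unit)


# Hypothetical 14-day rank series.
steady = [10] * 14
round_trip = [10, 10, 30, 45, 50, 45, 30, 10, 10, 10, 10, 10, 10, 10]

steady_penalty = volatility_penalty(steady)
round_trip_penalty = volatility_penalty(round_trip)
```

Both series start and end at rank 10, so their net change is zero, yet only `round_trip` incurs a penalty because its intermediate values spread widely around the mean.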
The stability score starts at 100 and is reduced based on three factors: 24-hour rank changes (up to -25 points, at 5 per position moved), 7-day rank changes (up to -21 points, at 3 per position), and sparkline volatility measured by standard deviation (up to -20 points). Models in a "stable" state get a +10 bonus, while "fragile" models lose 15 points.
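The rule described above can be sketched as a small function. This is an illustration under stated assumptions, not the site's actual implementation: the function and parameter names are hypothetical, rank changes are assumed to count by magnitude regardless of direction, the volatility penalty is passed in precomputed because its scaling is not specified, and the result is assumed to be clamped to the 0 to 100 range.

```python
def stability_score(change_24h, change_7d, volatility_penalty=0.0, state=None):
    """Approximate the stability score from the three described signals.

    Assumptions: absolute rank change is used, `volatility_penalty` is
    computed elsewhere, and the final score is clamped to [0, 100].
    """
    score = 100.0
    score -= min(25, 5 * abs(change_24h))          # 24-hour rank change
    score -= min(21, 3 * abs(change_7d))           # 7-day rank change
    score -= min(20, max(0.0, volatility_penalty)) # sparkline volatility

    if state == "stable":
        score += 10
    elif state == "fragile":
        score -= 15

    return max(0.0, min(100.0, score))


# Hypothetical inputs chosen to exercise each branch.
small_move_stable = stability_score(-1, -1, 0.0, "stable")  # 100 - 5 - 3 + 10, clamped
worst_case = stability_score(-18, 17, 20.0, "fragile")      # every cap reached
neutral = stability_score(2, 1, 4.0)                        # no state adjustment
```

Under these assumptions the lowest reachable score is 19 (100 - 25 - 21 - 20 - 15), and a "stable" model with one-position moves still scores 100 because the bonus offsets the small deductions.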
Models are classified into four tiers based on their stability score: "Rock Solid" (85-100) means extremely consistent performance with minimal fluctuation. "Consistent" (70-84) means generally reliable with minor variations. "Variable" (50-69) shows noticeable ranking fluctuations. "Volatile" (below 50) indicates significant instability and unpredictable performance.
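The tier boundaries translate directly into a lookup. In this sketch the function name is hypothetical, and fractional scores between the listed integer ranges (for example 84.5) are assumed to fall into the lower tier.

```python
def stability_tier(score):
    """Map a 0-100 stability score to the tier names used in the report.

    Assumption: each tier's lower bound is an inclusive threshold, so a
    fractional score such as 84.5 is treated as "Consistent".
    """
    if score >= 85:
        return "Rock Solid"
    if score >= 70:
        return "Consistent"
    if score >= 50:
        return "Variable"
    return "Volatile"


tiers = [stability_tier(s) for s in (100, 84, 50, 35)]
```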
Stability indicates how predictably a model will perform over time. A highly rated but volatile model may deliver inconsistent results, which is problematic for production applications requiring reliable output quality. Stable models provide more predictable performance, making them safer choices for mission-critical workloads even if they do not always hold the top rank.