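Which AI models are the most consistent over time? This report analyzes rank changes, state classifications, and sparkline volatility for 300 tracked models to produce a stability score from 0 to 100.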
| Tier | Models |
|---|---|
| Rock Solid | 273 |
| Consistent | 24 |
| Variable | 1 |
| Volatile | 2 |
The 20 models with the highest stability scores. These models hold consistent rankings with minimal volatility.
| # | Model | Provider | Score | Stability | 24h | 7d |
|---|---|---|---|---|---|---|
| 1 | Claude Fable 5 | Anthropic | 96.6 | 100 | 0 | 0 |
| 2 | Claude Opus 4.7 (Fast) | Anthropic | 94.7 | 100 | 0 | 0 |
| 3 | Claude Opus 4.8 (Fast) | Anthropic | 94.2 | 100 | 0 | 0 |
| 4 | Claude Opus 4.8 | Anthropic | 94.2 | 100 | 0 | 0 |
| 5 | GPT-5.5 | OpenAI | 92.2 | 100 | 0 | 0 |
| 6 | GPT-5.5 Pro | OpenAI | 90.3 | 100 | 0 | 0 |
| 7 | Claude Opus 4.6 (Fast) | Anthropic | 90.0 | 100 | 0 | 0 |
| 8 | Grok 4.20 | xAI | 88.3 | 100 | 0 | 0 |
| 9 | Grok 4.20 Multi-Agent | xAI | 87.4 | 100 | 0 | 0 |
| 10 | DeepSeek V4 Pro | DeepSeek | 86.2 | 100 | 0 | 0 |
| 11 | Claude Sonnet 4.6 | Anthropic | 84.7 | 100 | 0 | 0 |
| 12 | Grok 4.3 | xAI | 80.5 | 100 | 0 | 0 |
| 13 | Gemma 4 31B (free) | Google | 80.0 | 100 | 0 | 0 |
| 14 | GPT-5.4 Nano | OpenAI | 78.8 | 100 | 0 | 0 |
| 15 | GPT-5.4 Mini | OpenAI | 78.8 | 100 | 0 | 0 |
| 16 | Gemini 3.5 Flash | Google | 78.5 | 100 | 0 | 0 |
| 17 | GLM 5.2 | Zhipu AI | 78.1 | 100 | +1 | 0 |
| 18 | DeepSeek V4 Flash | DeepSeek | 77.2 | 100 | +1 | 0 |
| 19 | GLM 5.1 | Zhipu AI | 76.1 | 100 | +1 | 0 |
| 20 | Kimi K2.6 | Moonshot AI | 75.2 | 100 | +1 | 0 |
The 20 models with the lowest stability scores. These models show significant rank swings or inconsistent states.
| # | Model | Provider | Score | Stability | 24h | 7d |
|---|---|---|---|---|---|---|
| 1 | Mistral Nemo | Mistral AI | 39.9 | 23 | -11 | -11 |
| 2 | GLM 5V Turbo | Zhipu AI | 40.0 | 34 | -146 | +115 |
| 3 | Fugu Ultra | sakana | 40.0 | 54 | +147 | +147 |
| 4 | Trinity Large Thinking | arcee-ai | 62.7 | 73 | +1 | -4 |
| 5 | Command R+ (08-2024) | Cohere | 48.3 | 74 | +2 | +2 |
| 6 | Coder Large | arcee-ai | 39.3 | 82 | -1 | -1 |
| 7 | Qwen3.5 Plus 2026-02-15 | Alibaba | 40.0 | 82 | -1 | -1 |
| 8 | Seed-2.0-Mini | ByteDance | 40.0 | 82 | -1 | -1 |
| 9 | Command R (08-2024) | Cohere | 48.3 | 82 | +1 | +1 |
| 10 | Command A | Cohere | 50.4 | 82 | +1 | +1 |
| 11 | Claude 3 Haiku | Anthropic | 50.9 | 82 | +1 | +1 |
| 12 | Kimi K2 0711 | Moonshot AI | 51.0 | 82 | +1 | +1 |
| 13 | Qwen3 235B A22B | Alibaba | 53.5 | 82 | +1 | +1 |
| 14 | Llama 4 Scout | Meta | 54.9 | 82 | +1 | +1 |
| 15 | Mistral Large 2407 | Mistral AI | 55.8 | 82 | +1 | +1 |
| 16 | GPT-4o-mini (2024-07-18) | OpenAI | 56.1 | 82 | +1 | +1 |
| 17 | gpt-oss-20b (free) | OpenAI | 57.1 | 82 | +1 | +1 |
| 18 | Mixtral 8x22B Instruct | Mistral AI | 63.0 | 82 | +1 | +1 |
| 19 | o3 Mini High | OpenAI | 63.5 | 82 | +1 | +1 |
| 20 | Llama 3.1 8B Instruct | Meta | 44.1 | 82 | +1 | +1 |
Aggregate stability metrics by provider. Providers are ranked by the average stability score across all of their models.
| Provider | Models | Avg. stability |
|---|---|---|
| xAI | 4 | 100.0 |
| Tencent | 2 | 100.0 |
| ~anthropic | 4 | 100.0 |
| perceptron | 1 | 100.0 |
| inclusionai | 3 | 100.0 |
| poolside | 4 | 100.0 |
| ~openai | 2 | 100.0 |
|  | 2 | 100.0 |
| ~moonshotai | 1 | 100.0 |
| deepcogito | 1 | 100.0 |
| AI21 Labs | 1 | 100.0 |
| HUMAIN | 3 | 100.0 |
| TII | 6 | 100.0 |
| Baidu | 1 | 98.6 |
| Kuaishou | 1 | 97.7 |
| Perplexity | 5 | 96.6 |
| Amazon | 5 | 96.5 |
| NVIDIA | 11 | 96.4 |
| rekaai | 2 | 96.2 |
| Writer | 1 | 96.1 |
| Inception | 1 | 95.6 |
| Upstage | 1 | 94.1 |
| Anthropic | 15 | 94.1 |
| Alibaba | 48 | 93.7 |
|  | 22 | 93.6 |
| Windsurf | 1 | 93.5 |
| Moonshot AI | 6 | 93.0 |
| StepFun | 2 | 92.8 |
| Microsoft | 2 | 92.5 |
| aion-labs | 3 | 92.4 |
| OpenAI | 58 | 89.8 |
| Liquid AI | 3 | 89.7 |
| MiniMax | 8 | 89.7 |
| IBM | 2 | 88.9 |
| Mistral AI | 18 | 88.6 |
| DeepSeek | 11 | 88.5 |
| Meta | 8 | 86.9 |
| ByteDance | 5 | 86.6 |
| arcee-ai | 4 | 86.3 |
| Zhipu AI | 12 | 85.9 |
| Xiaomi | 2 | 85.0 |
| Cursor | 2 | 85.0 |
| Allen AI | 1 | 84.7 |
| Cohere | 4 | 84.5 |
| sakana | 1 | 54.0 |
Distribution of stability scores across all 300 tracked models.
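Our stability scoring system uses three key signals to measure how consistently a model performs over time.
Rank changes are the most direct measure of stability. A model loses up to 25 points for large rank changes over 24 hours (5 points per position moved) and up to 21 points for 7-day changes (3 points per position). Models whose rank holds steady score higher.
Each model has a state that reflects its overall reliability. Models in the "stable" state receive a 10-point bonus, while "fragile" models lose 15 points. This captures systemic reliability beyond simple rank movement.
The 14-day sparkline data reveals hidden volatility. We compute the standard deviation of the sparkline and subtract up to 20 points. Even a model that ends up back where it started is penalized if it swung sharply along the way.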
The stability score starts at 100 and is reduced based on three factors: 24-hour rank changes (up to -25 points, at 5 per position moved), 7-day rank changes (up to -21 points, at 3 per position), and sparkline volatility measured by standard deviation (up to -20 points). Models in a "stable" state get a +10 bonus, while "fragile" models lose 15 points.
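The formula described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the site's actual implementation: the function name `stability_score`, the `volatility_weight` parameter, the one-point-per-unit mapping from standard deviation to penalty, and the final clamp to the 0 to 100 range are all assumptions, since the report gives only the caps and per-position rates.

```python
from statistics import pstdev


def stability_score(change_24h, change_7d, state, sparkline, volatility_weight=1.0):
    """Hypothetical sketch of the stability score described in the report.

    change_24h, change_7d: signed rank changes (positions moved).
    state: "stable", "fragile", or anything else (no adjustment).
    sparkline: sequence of numeric values (the report uses 14 days of data).
    volatility_weight: ASSUMPTION, points deducted per unit of standard
        deviation; the report only states the 20-point cap.
    """
    score = 100.0

    # 24-hour rank change: 5 points per position, capped at 25.
    score -= min(25, 5 * abs(change_24h))

    # 7-day rank change: 3 points per position, capped at 21.
    score -= min(21, 3 * abs(change_7d))

    # State adjustment: bonus for "stable", penalty for "fragile".
    if state == "stable":
        score += 10
    elif state == "fragile":
        score -= 15

    # Sparkline volatility: population standard deviation, capped at 20 points.
    spread = pstdev(sparkline) if len(sparkline) > 1 else 0.0
    score -= min(20.0, volatility_weight * spread)

    # ASSUMPTION: clamp to the 0-100 range the report advertises.
    return max(0.0, min(100.0, score))
```

Under these assumptions, a "fragile" model that moved 11 positions in both windows with a flat sparkline would hit both rank caps and land at 100 - 25 - 21 - 15 = 39, while a "stable" model that moved one position in 24 hours would still reach 100 once the bonus is applied and the result is clamped.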
Models are classified into four tiers based on their stability score: "Rock Solid" (85-100) means extremely consistent performance with minimal fluctuation. "Consistent" (70-84) means generally reliable with minor variations. "Variable" (50-69) shows noticeable ranking fluctuations. "Volatile" (below 50) indicates significant instability and unpredictable performance.
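The tier boundaries above map to a simple threshold lookup. The sketch below is illustrative only: the function name `stability_tier` is hypothetical, and treating each lower bound as inclusive (so that a fractional score such as 84.5 falls into "Consistent") is an assumption, since the report lists integer ranges.

```python
def stability_tier(score):
    """Map a 0-100 stability score to the report's four tiers.

    ASSUMPTION: lower bounds are inclusive thresholds, so any score
    between two listed integer ranges falls into the lower tier.
    """
    if score >= 85:
        return "Rock Solid"
    if score >= 70:
        return "Consistent"
    if score >= 50:
        return "Variable"
    return "Volatile"


# Stability values taken from the lowest-scoring rows in the report.
samples = {"Mistral Nemo": 23, "GLM 5V Turbo": 34, "Fugu Ultra": 54}
tiers = {name: stability_tier(value) for name, value in samples.items()}
```

With these thresholds, the two lowest-scoring models classify as "Volatile" and Fugu Ultra as "Variable", which matches the tier counts reported at the top of the page.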
Stability indicates how predictably a model will perform over time. A highly rated but volatile model may deliver inconsistent results, which is problematic for production applications requiring reliable output quality. Stable models provide more predictable performance, making them safer choices for mission-critical workloads even if they do not always hold the top rank.