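Not every ranking change is meaningful; some are just random noise. This page uses statistical analysis to separate model score movements that reflect a real trend from normal fluctuation.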
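| Metric | Count |
|---|---|
| Models analyzed | 300 |
| Significant changes | 35 |
| Noise (not significant) | 265 |
| Dual timeframe | 130 |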
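
35 models have recent scores that deviate far enough from their historical average to be treated as real changes rather than noise. They are sorted by how extreme the change is. The z-score measures how unusual a change is: if a model's score were only fluctuating randomly around its baseline (assuming roughly normal fluctuations), a value beyond ±1.96 would occur less than 5% of the time.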
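A rank change within 24 hours may be only temporary. If the rank has also moved over 7 days, that points to a genuine trend. Models flagged in both timeframes deserve the most attention: they represent confirmed, sustained performance changes.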
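Some models have naturally stable scores, so even a small rank change is meaningful. Others fluctuate more and need a larger change before it is worth attention. CV% (coefficient of variation) tells you how much each model fluctuates: the higher the value, the noisier the model.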

Most volatile models (highest CV%):

| Model | Provider | Score | CV% |
|---|---|---|---|
| Mistral 7B Instruct v0.1 | Mistral AI | 19.8 | 6.8% |
| Llama 3.2 1B Instruct | Meta | 31.8 | 5.9% |
| GPT-3.5 Turbo Instruct | OpenAI | 32.2 | 5.1% |
| autofixer-01 | Vercel | 38.8 | 4.8% |
| GPT-4 | OpenAI | 39.0 | 4.8% |
| Inflection 3 Productivity | Inflection | 36.6 | 4.5% |
| Saba | Mistral AI | 52.7 | 4.1% |
| Coder Large | arcee-ai | 45.3 | 4.1% |
| GPT-3.5 Turbo 16k | OpenAI | 39.9 | 4.1% |
| Llama 3.3 70B Instruct (free) | Meta | 43.9 | 4.1% |
| GPT-3.5 Turbo | OpenAI | 39.9 | 4.0% |
| Qwen2.5 Coder 7B Instruct | Alibaba | 42.8 | 3.8% |
| Mistral Large 2407 | Mistral AI | 52.9 | 3.8% |
| Qwen2.5 7B Instruct | Alibaba | 45.5 | 3.7% |
| Pixtral Large 2411 | Mistral AI | 55.5 | 3.6% |
| GPT-4o (2024-08-06) | OpenAI | 55.4 | 3.6% |
| SWE-1.5 | Windsurf | 49.1 | 3.6% |
| Mistral Large | Mistral AI | 54.1 | 3.5% |
| Command R+ (08-2024) | Cohere | 47.6 | 3.5% |
| Command R (08-2024) | Cohere | 47.6 | 3.5% |

Most stable models (lowest CV%):

| Model | Provider | Score | CV% |
|---|---|---|---|
| Grok 4 Fast | xAI | 83.2 | 0.4% |
| Sonar | Perplexity | 53.5 | 0.5% |
| Sonar Pro Search | Perplexity | 85.0 | 0.5% |
| Gemini 3.1 Pro Preview Custom Tools | Google | 85.0 | 0.5% |
| Gemini 3.1 Flash Lite Preview | Google | 81.9 | 0.5% |
| Grok 4.20 Beta | xAI | 85.7 | 0.5% |
| Gemini 2.5 Pro Preview 06-05 | Google | 84.1 | 0.6% |
| Gemini 2.5 Flash Lite Preview 09-2025 | Google | 83.6 | 0.6% |
| Gemini 2.5 Pro Preview 05-06 | Google | 82.5 | 0.6% |
| Solar Pro 3 | Upstage | 72.5 | 0.6% |
| GPT-5.4 | OpenAI | 94.0 | 0.6% |
| Gemini 2.5 Flash | Google | 80.0 | 0.6% |
| GPT-5.2 | OpenAI | 92.7 | 0.6% |
| Nemotron 3 Nano 30B A3B (free) | NVIDIA | 67.7 | 0.6% |
| GPT-5 Pro | OpenAI | 91.9 | 0.6% |
| Kimi K2.5 | Moonshot AI | 85.0 | 0.6% |
| Qwen3 Coder Plus | Alibaba | 78.4 | 0.6% |
| Qwen3.5 Plus 2026-02-15 | Alibaba | 85.0 | 0.6% |
| Qwen3 Coder Flash | Alibaba | 78.1 | 0.6% |
| Qwen3 Coder Next | Alibaba | 76.7 | 0.6% |

Learn the statistical methods behind our significance analysis so you can tell real performance changes from random fluctuation.
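We use z-scores with a 95% confidence threshold (|z| > 1.96). A z-score measures how many standard deviations a model's current score lies from its historical baseline. Only changes of more than 1.96 standard deviations are flagged as statistically significant.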
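
The z-score test can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual implementation: the function names are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the page does not say whether a sample or population estimate is used.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 1.96  # 95% two-sided threshold quoted in the text


def z_score(current: float, history: list[float]) -> float:
    """Number of standard deviations between `current` and the mean of `history`."""
    baseline = mean(history)
    spread = pstdev(history)  # assumption: population standard deviation
    if spread == 0:
        raise ValueError("history has zero variance; z-score is undefined")
    return (current - baseline) / spread


def is_significant(current: float, history: list[float],
                   threshold: float = Z_THRESHOLD) -> bool:
    """True when the deviation exceeds the threshold in either direction."""
    return abs(z_score(current, history)) > threshold
```

With a history of `[50, 52, 48, 51, 49]` (mean 50, standard deviation about 1.41), a current score of 53 gives z ≈ 2.12 and is flagged, while 52 gives z ≈ 1.41 and is not.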
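The baseline is the arithmetic mean of each model's 14-day score-history (sparkline) data. This rolling average smooths out daily fluctuations and provides a stable reference point for detecting meaningful deviations.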
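
As a rough sketch of the baseline step, the snippet below averages the most recent 14 entries of a score series. It assumes one score per day, ordered oldest to newest; the name `baseline` and the handling of shorter histories are illustrative choices, not details confirmed by the page.

```python
from statistics import mean

WINDOW_DAYS = 14  # window length stated in the text


def baseline(daily_scores: list[float], window: int = WINDOW_DAYS) -> float:
    """Arithmetic mean of the most recent `window` daily scores.

    Assumption: `daily_scores` holds one score per day, oldest first.
    If fewer than `window` scores exist, all of them are averaged.
    """
    if not daily_scores:
        raise ValueError("need at least one score")
    recent = daily_scores[-window:]  # scores older than the window drop out
    return mean(recent)
```

Because the window rolls forward, a single outlier stops affecting the baseline once it is more than 14 days old.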
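Each model's 95% confidence interval is computed as baseline ± 1.96 × standard deviation. A score outside this range indicates a statistically meaningful change. The "Confidence" column shows the ± threshold.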
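
The interval formula translates directly into code. The sketch below returns the lower bound, the upper bound, and the ± margin (the value the "Confidence" column is described as showing). Function names are hypothetical, and the population standard deviation is again an assumption.

```python
from statistics import mean, pstdev


def confidence_band(history: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Return (low, high, margin) for baseline +/- z * standard deviation."""
    center = mean(history)
    margin = z * pstdev(history)  # assumption: population standard deviation
    return center - margin, center + margin, margin


def outside_band(score: float, history: list[float]) -> bool:
    """True when `score` falls outside the 95% band built from `history`."""
    low, high, _ = confidence_band(history)
    return score < low or score > high
```

For the history `[50, 52, 48, 51, 49]` the margin is about 2.77, so the band runs from roughly 47.23 to 52.77.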
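Daily (24-hour) and weekly (7-day) rank changes are analyzed separately. Daily significance requires a rank move of more than 3 places; weekly significance requires more than 5. Models significant in both timeframes represent the strongest, most reliable signal.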
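
The two-timeframe rule might look like the following sketch. The `RankChange` type and the label strings are hypothetical. The code also assumes that only the size of the move matters, so it does not check that the daily and weekly moves point in the same direction; the page does not say whether the real analysis does.

```python
from dataclasses import dataclass

DAILY_MIN_SHIFT = 3   # daily change must exceed 3 places
WEEKLY_MIN_SHIFT = 5  # weekly change must exceed 5 places


@dataclass
class RankChange:
    model: str
    delta_24h: int  # places moved over 24 hours (sign = direction)
    delta_7d: int   # places moved over 7 days (sign = direction)


def classify(change: RankChange) -> str:
    """Label a rank change as 'both', 'daily', 'weekly', or 'none'."""
    daily = abs(change.delta_24h) > DAILY_MIN_SHIFT
    weekly = abs(change.delta_7d) > WEEKLY_MIN_SHIFT
    if daily and weekly:
        return "both"  # strongest signal: flagged in both timeframes
    if daily:
        return "daily"
    if weekly:
        return "weekly"
    return "none"
```

Note that the thresholds are strict: a move of exactly 3 places in 24 hours or exactly 5 places in 7 days is not flagged.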
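The coefficient of variation (CV%) measures relative volatility. High-CV models have naturally noisy scores and need a larger absolute change to reach significance. Low-CV models are more predictable, so even a small deviation may represent a real change.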
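
To make the link between CV% and the size of a significant change concrete, here is a small sketch. It uses the standard definition CV% = 100 × standard deviation / mean; the function names are hypothetical, and `min_shift_from_cv` assumes the published score is a fair stand-in for the model's baseline mean.

```python
from statistics import mean, pstdev


def cv_percent(history: list[float]) -> float:
    """Coefficient of variation as a percentage: 100 * std / |mean|."""
    center = mean(history)
    if center == 0:
        raise ValueError("mean is zero; CV% is undefined")
    return 100 * pstdev(history) / abs(center)  # assumption: population std


def min_shift_from_cv(mean_score: float, cv_pct: float, z: float = 1.96) -> float:
    """Smallest absolute score change that exceeds the z threshold.

    Recovers the standard deviation from CV% and the mean, then scales by z.
    """
    std = (cv_pct / 100) * mean_score
    return z * std
```

Plugging in figures from the tables on this page, a model scoring 19.8 with a CV% of 6.8 would need to move by about 2.6 points to be flagged, whereas a model scoring 83.2 with a CV% of 0.4 would need only about 0.65 points.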
Statistical significance indicates whether a model's rank change represents a real performance shift or is just random noise. We use z-scores with a 95% confidence threshold (|z| > 1.96), meaning a change is flagged as significant only if random variation alone would produce a deviation that large less than 5% of the time.
A z-score measures how many standard deviations a model's current score deviates from its historical baseline. It is calculated as (current score - baseline mean) / standard deviation. Values above +1.96 indicate significant improvement, while values below -1.96 indicate significant decline.
The CV% measures a model's relative score volatility. A high CV% means the model's performance fluctuates a lot, requiring larger changes to be statistically significant. A low CV% means the model is very consistent, so even small deviations may represent meaningful shifts. This helps distinguish inherently noisy models from truly changing ones.