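Not every ranking change is meaningful; some are just random noise. This page uses statistical analysis to determine which model score movements are real trends and which are normal fluctuation.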
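
| Metric | Count |
|---|---|
| Models analyzed | 295 |
| Significant changes | 0 |
| Noise (not significant) | 295 |
| Dual timeframe | 2 |
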
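0 models have recent scores that deviate from their historical average by enough to count as a real change rather than noise. Models are sorted by how extreme the change is. The z-score measures how unusual a change is: a value beyond ±1.96 means a deviation that large would be expected less than 5% of the time from random fluctuation alone.
No significant changes
All model score changes fall within normal statistical variance.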
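A ranking change within 24 hours may be only temporary. If the ranking is also moving over 7 days, it is a genuine trend. Models flagged in both timeframes deserve the most attention: they represent confirmed, sustained performance changes.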
| Model | Provider | Score | 24h change | 7d change |
|---|---|---|---|---|
| Mistral Nemo | Mistral AI | 39.9 | -11 | -11 |
| GLM 5V Turbo | Zhipu AI | 40.0 | -146 | +115 |
No models are significant in the daily timeframe only.
No models are significant in the weekly timeframe only.
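Some models have naturally stable scores, so even a small ranking change is meaningful. Others have more volatile scores and need a larger change to be noteworthy. CV% (coefficient of variation) tells you how much each model fluctuates. Higher means noisier.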
| Model | Provider | Score | CV% |
|---|---|---|---|
| Gemma 2 27B | Google | 77.1 | 57.6% |
| Qwen2.5 Coder 32B Instruct | Alibaba | 40.0 | 55.8% |
| Coder Large | arcee-ai | 39.3 | 52.6% |
| Gemma 3n 4B | Google | 40.0 | 51.8% |
| Llama 3.3 70B Instruct (free) | Meta | 65.5 | 51.6% |
| Phi 4 | Microsoft | 59.9 | 51.5% |
| R1 | DeepSeek | 73.7 | 45.6% |
| GPT-4o Search Preview | OpenAI | 70.0 | 43.6% |
| Command A | Cohere | 50.4 | 42.1% |
| GPT-4 | OpenAI | 64.5 | 41.8% |
| Mixtral 8x22B Instruct | Mistral AI | 63.0 | 41.2% |
| Llama 3.1 70B Instruct | Meta | 64.9 | 41.1% |
| Llama 3.3 70B Instruct | Meta | 66.4 | 40.7% |
| Mistral Large | Mistral AI | 65.5 | 40.6% |
| MiniMax M2-her | MiniMax | 68.8 | 40.5% |
| o3 Mini | OpenAI | 74.9 | 39.2% |
| DeepSeek V3 | DeepSeek | 69.0 | 38.7% |
| GPT-4o-mini Search Preview | OpenAI | 60.4 | 38.7% |
| Nova Micro 1.0 | Amazon | 40.0 | 38.0% |
| GPT-4 Turbo Preview | OpenAI | 59.4 | 37.4% |

| Model | Provider | Score | CV% |
|---|---|---|---|
| Claude Fable 5 | Anthropic | 96.6 | 0.0% |
| Claude Opus 4.7 (Fast) | Anthropic | 94.7 | 0.0% |
| Kimi K2.7 Code | Moonshot AI | 53.7 | 0.0% |
| Claude Fable Latest | ~anthropic | 40.0 | 0.0% |
| Nemotron 3.5 Content Safety (free) | NVIDIA | 40.0 | 0.0% |
| Nemotron 3 Ultra (free) | NVIDIA | 40.0 | 0.0% |
| Nemotron 3 Ultra | NVIDIA | 40.0 | 0.0% |
| Qwen3.7 Plus | Alibaba | 40.0 | 0.0% |
| Step 3.7 Flash | StepFun | 40.0 | 0.0% |
| Qwen3.7 Max | Alibaba | 40.0 | 0.0% |
| Grok Build 0.1 | xAI | 40.0 | 0.0% |
| Perceptron Mk1 | perceptron | 40.0 | 0.0% |
| Ring-2.6-1T | inclusionai | 40.0 | 0.0% |
| GPT Chat Latest | OpenAI | 40.0 | 0.0% |
| Nemotron 3 Nano Omni (free) | NVIDIA | 40.0 | 0.0% |
| Laguna XS.2 (free) | poolside | 40.0 | 0.0% |
| Laguna M.1 (free) | poolside | 40.0 | 0.0% |
| Anthropic Claude Haiku Latest | ~anthropic | 40.0 | 0.0% |
| OpenAI GPT Mini Latest | ~openai | 40.0 | 0.0% |
| Google Gemini Pro Latest | ~google | 40.0 | 0.0% |

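Learn about the statistical methods behind our significance analysis, which help you tell real performance changes from random fluctuation.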
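We use z-scores with a 95% confidence threshold (|z| > 1.96). The z-score measures how many standard deviations a model's current score deviates from its historical baseline. Only changes of more than 1.96 standard deviations are flagged as statistically significant.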
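
The z-score test described above can be sketched in a few lines of Python. This is only an illustration, not the site's actual implementation: the function names `z_score` and `is_significant` are hypothetical, and the use of the sample standard deviation and the handling of a perfectly flat history are assumptions.

```python
import math
from statistics import mean, stdev

Z_THRESHOLD = 1.96  # 95% two-sided threshold quoted in the text


def z_score(current: float, history: list[float]) -> float:
    """Number of standard deviations between `current` and the mean of `history`."""
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation (an assumption)
    if spread == 0:
        # Assumed convention for a flat history: no deviation gives 0,
        # any deviation at all gives an infinite z-score.
        if current == baseline:
            return 0.0
        return math.copysign(math.inf, current - baseline)
    return (current - baseline) / spread


def is_significant(current: float, history: list[float]) -> bool:
    """True when |z| exceeds the 1.96 threshold."""
    return abs(z_score(current, history)) > Z_THRESHOLD


history = [70.0, 72.0, 68.0, 71.0, 69.0]  # made-up scores for illustration
print(round(z_score(74.0, history), 2), is_significant(74.0, history))
```

With these made-up scores the mean is 70 and the sample standard deviation is about 1.58, so a current score of 74 sits roughly 2.53 standard deviations above the baseline and would be flagged.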
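The baseline is computed as the arithmetic mean of each model's 14-day volatility curve data. This rolling average smooths out daily fluctuations and provides a stable reference point for detecting meaningful deviations.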
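
A minimal sketch of the baseline calculation, assuming the 14-day curve is stored as one score per day in chronological order. The function name `baseline` and the data layout are hypothetical; the page does not say how the curve is sampled.

```python
from statistics import fmean

WINDOW_DAYS = 14  # window length stated in the text


def baseline(daily_scores: list[float], window: int = WINDOW_DAYS) -> float:
    """Arithmetic mean of the most recent `window` daily scores."""
    if not daily_scores:
        raise ValueError("need at least one daily score")
    recent = daily_scores[-window:]  # shorter histories use whatever is available
    return fmean(recent)


scores = [float(day) for day in range(1, 21)]  # 20 made-up daily scores: 1.0 .. 20.0
print(baseline(scores))  # mean of the last 14 values, 7.0 .. 20.0
```

Falling back to a shorter window when fewer than 14 days exist is a design choice made here for illustration; a stricter version could refuse to compute a baseline until the full window is available.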
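The 95% confidence interval for each model is computed as baseline ± 1.96 × standard deviation. Scores that fall outside this range indicate a statistically meaningful change. The "Confidence" column shows the ± threshold.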
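
The interval formula can be illustrated as follows. The names `confidence_band` and `outside_band` are hypothetical, and the input numbers are made up; only the formula itself comes from the text.

```python
def confidence_band(baseline: float, std_dev: float, z: float = 1.96) -> tuple[float, float, float]:
    """Return (lower bound, upper bound, margin) for baseline +/- z * std_dev."""
    margin = z * std_dev  # the +/- value the "Confidence" column is described as showing
    return baseline - margin, baseline + margin, margin


def outside_band(score: float, baseline: float, std_dev: float) -> bool:
    """True when the score falls outside the 95% band around the baseline."""
    low, high, _ = confidence_band(baseline, std_dev)
    return score < low or score > high


low, high, margin = confidence_band(70.0, 2.0)  # made-up baseline and spread
print(f"{low:.2f} to {high:.2f} (+/- {margin:.2f})")
print(outside_band(74.0, 70.0, 2.0))
```

For a baseline of 70 and a standard deviation of 2, the margin is 3.92, so the band runs from 66.08 to 73.92 and a score of 74 falls outside it.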
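Daily (24-hour) and weekly (7-day) ranking changes are analyzed separately. Daily significance requires a rank movement of more than 3 positions; weekly significance requires more than 5. Models significant in both timeframes represent the strongest, most reliable signal.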
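
The two rank thresholds can be combined into a simple classifier, sketched below. The function name `classify` and its return labels are hypothetical, and treating upward and downward moves alike (via the absolute value) is an assumption; the thresholds of 3 and 5 positions come from the text.

```python
DAILY_RANK_THRESHOLD = 3   # daily change must exceed 3 positions
WEEKLY_RANK_THRESHOLD = 5  # weekly change must exceed 5 positions


def classify(daily_change: int, weekly_change: int) -> str:
    """Label a model by the timeframes in which its rank change is significant."""
    daily = abs(daily_change) > DAILY_RANK_THRESHOLD
    weekly = abs(weekly_change) > WEEKLY_RANK_THRESHOLD
    if daily and weekly:
        return "both"
    if daily:
        return "daily_only"
    if weekly:
        return "weekly_only"
    return "none"


# Rank changes taken from the dual-timeframe table on this page.
print(classify(-11, -11))   # Mistral Nemo
print(classify(-146, 115))  # GLM 5V Turbo
```

Both models in the dual-timeframe table exceed both thresholds under this rule, which matches their listing there. Note that a move of exactly 3 (daily) or exactly 5 (weekly) positions does not count, because the text says "more than".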
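The coefficient of variation (CV%) measures relative volatility. High-CV models have naturally noisy scores and need a larger absolute change to reach significance. Low-CV models are more predictable, so even a small deviation may represent a real change.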
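
CV% is conventionally the standard deviation divided by the mean, expressed as a percentage. The sketch below follows that standard definition; the function name `cv_percent` is hypothetical, and whether the site uses the sample or population standard deviation is not stated, so the sample version here is an assumption.

```python
from statistics import fmean, stdev


def cv_percent(scores: list[float]) -> float:
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    avg = fmean(scores)
    if avg == 0:
        raise ValueError("CV% is undefined when the mean score is zero")
    return stdev(scores) / abs(avg) * 100  # sample standard deviation (an assumption)


print(cv_percent([40.0, 40.0, 40.0]))    # a perfectly flat history
print(cv_percent([50.0, 100.0, 150.0]))  # a widely swinging history
```

A history that never moves has a CV% of 0, which is how the rows showing 0.0% above would arise under this definition, while the made-up swinging history has a CV% of 50.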
Statistical significance indicates whether a model's rank change represents a real performance shift or just random noise. We use z-scores with a 95% confidence threshold (|z| > 1.96), meaning a change is flagged as significant only if a deviation that large would occur less than 5% of the time under random variation alone.
A z-score measures how many standard deviations a model's current score deviates from its historical baseline. It is calculated as (current score - baseline mean) / standard deviation. Values above +1.96 indicate significant improvement, while values below -1.96 indicate significant decline.
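
Written as a formula, with $x$ the current score, $\mu$ the baseline mean, and $\sigma$ the standard deviation (the symbols are chosen here for illustration):

$$
z = \frac{x - \mu}{\sigma}
$$

As a worked example with made-up numbers, a model with $\mu = 70$ and $\sigma = 2$ that currently scores $x = 74$ has

$$
z = \frac{74 - 70}{2} = 2.0 > 1.96,
$$

so the change would be flagged as a significant improvement, whereas a current score of 73 gives $z = 1.5$ and would be treated as noise.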
The CV% measures a model's relative score volatility. A high CV% means the model's performance fluctuates a lot, requiring larger changes to be statistically significant. A low CV% means the model is very consistent, so even small deviations may represent meaningful shifts. This helps distinguish inherently noisy models from truly changing ones.