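Not every rank change is meaningful; some are just random noise. This page uses statistical analysis to determine which model score movements are real trends and which are normal fluctuation.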
Models analyzed: 276
Significant changes: 0
Noise (not significant): 276
Dual timeframe: 0
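0 models have recent scores that deviate far enough from their historical average to count as a real change rather than noise. Models are sorted by how extreme the change is. The z-score measures how unusual a change is: assuming roughly normally distributed scores, a value beyond ±1.96 would arise from random variation alone less than 5% of the time.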
No significant changes
All model score changes fall within normal statistical variance.
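A rank change within 24 hours may be only temporary. If the rank is also moving over 7 days, that points to a real trend. Models flagged in both timeframes deserve the most attention: they represent confirmed, sustained performance changes.
No models are currently significant in both the daily and weekly timeframes.
No models are significant in the daily timeframe only.
No models are significant in the weekly timeframe only.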
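Some models have inherently stable scores, so even a small rank change is meaningful. Others fluctuate more and need a larger change before it deserves attention. CV% (coefficient of variation) tells you how volatile each model is. Higher = noisier.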
| Model | Provider | Score | CV% |
|---|---|---|---|
| Gemma 2 27B | Google | 77.4 | 80.4% |
| Coder Large | arcee-ai | 40.0 | 74.2% |
| Llama Guard 3 8B | Meta | 40.0 | 72.7% |
| Llama 3 70B Instruct | Meta | 57.0 | 69.9% |
| Gemma 3n 4B | Google | 40.0 | 69.3% |
| Llama 3.3 70B Instruct (free) | Meta | 65.7 | 69.1% |
| Phi 4 | Microsoft | 60.2 | 68.9% |
| R1 | DeepSeek | 73.0 | 58.3% |
| GPT-4o Search Preview | OpenAI | 70.4 | 55.4% |
| ERNIE 4.5 300B A47B | Baidu | 59.6 | 54.2% |
| Command A | Cohere | 50.8 | 53.1% |
| GPT-4 | OpenAI | 64.8 | 52.5% |
| GPT-4 (older v0314) | OpenAI | 64.8 | 52.3% |
| Mixtral 8x22B Instruct | Mistral AI | 63.4 | 51.5% |
| Llama 3.1 70B Instruct | Meta | 65.3 | 51.4% |
| Llama 3.3 70B Instruct | Meta | 66.8 | 50.8% |
| Mistral Large | Mistral AI | 65.9 | 50.6% |
| MiniMax M2-her | MiniMax | 69.1 | 50.4% |
| o3 Mini | OpenAI | 74.5 | 48.3% |
| DeepSeek V3 | DeepSeek | 69.5 | 47.8% |
| Model | Provider | Score | CV% |
|---|---|---|---|
| Claude Opus 4.6 (Fast) | Anthropic | 90.4 | 0.0% |
| Gemma 4 31B (free) | Google | 80.5 | 0.0% |
| DeepSeek V4 Pro | DeepSeek | 75.7 | 0.0% |
| Trinity Large Preview | arcee-ai | 63.6 | 0.0% |
| Claude Opus Latest | ~anthropic | 40.0 | 0.0% |
| Qianfan-OCR-Fast (free) | Baidu | 40.0 | 0.0% |
| Grok 4.3 | xAI | 76.4 | 0.1% |
| DeepSeek V4 Flash | DeepSeek | 72.1 | 0.1% |
| Gemma 4 26B A4B (free) | Google | 73.0 | 0.1% |
| GLM 5.1 | Zhipu AI | 76.1 | 0.1% |
| Kimi K2.6 | Moonshot AI | 75.9 | 0.3% |
| Grok 4.1 Fast | xAI | 78.0 | 1.1% |
| Claude Opus 4.7 | Anthropic | 79.3 | 1.3% |
| MiniMax-01 | MiniMax | 40.0 | 1.5% |
| Qwen3.5-Flash | Alibaba | 68.7 | 1.6% |
| Nemotron 3 Nano 30B A3B (free) | NVIDIA | 40.0 | 1.7% |
| Grok 4.20 | xAI | 88.8 | 1.7% |
| Kimi K2.5 | Moonshot AI | 59.1 | 2.0% |
| Grok 4 Fast | xAI | 72.5 | 2.0% |
| Nemotron Nano 9B V2 | NVIDIA | 40.0 | 2.1% |
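Learn the statistical methods behind our significance analysis, which help you distinguish real performance changes from random fluctuation.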
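We use z-scores with a 95% confidence threshold (|z| > 1.96). A z-score measures how many standard deviations a model's current score lies from its historical baseline. Only changes of more than 1.96 standard deviations are flagged as statistically significant.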
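The baseline is the arithmetic mean of each model's 14-day volatility-curve data. This rolling average smooths out daily fluctuations and provides a stable reference point for detecting meaningful deviations.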
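The 95% confidence interval for each model is computed as: baseline ± 1.96 × standard deviation. A score that falls outside this range indicates a statistically meaningful change. The "Confidence" column shows the ± threshold.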
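As a rough illustration of the baseline and interval described here, the sketch below computes a mean, a ± margin, and the resulting bounds from a list of daily scores. The function names, the sample data, and the use of the sample standard deviation (`statistics.stdev`) are assumptions for illustration; the page does not say which standard deviation estimator it uses.

```python
from statistics import mean, stdev

Z_CRIT = 1.96  # 95% threshold stated on this page


def confidence_interval(history, z_crit=Z_CRIT):
    """Return (baseline, margin, low, high) for a window of daily scores.

    Assumption: sample standard deviation; the page does not specify this.
    """
    if len(history) < 2:
        raise ValueError("need at least two observations")
    baseline = mean(history)           # arithmetic mean of the window
    margin = z_crit * stdev(history)   # the +/- value a "Confidence" column could show
    return baseline, margin, baseline - margin, baseline + margin


def is_outside(score, history, z_crit=Z_CRIT):
    """True if score falls outside baseline +/- z_crit * standard deviation."""
    _, _, low, high = confidence_interval(history, z_crit)
    return score < low or score > high


# Hypothetical 14-day history, for illustration only.
history = [70.0, 72.0, 71.0, 69.0, 70.0, 71.0, 73.0,
           70.0, 69.0, 71.0, 72.0, 70.0, 71.0, 71.0]
baseline, margin, low, high = confidence_interval(history)
print(f"baseline={baseline:.2f} +/- {margin:.2f} -> [{low:.2f}, {high:.2f}]")
```

A 14-element list stands in for the 14-day window; a production version would also need a policy for missing days, which this sketch does not attempt.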
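Daily (24-hour) and weekly (7-day) rank changes are analyzed separately. Daily significance requires a rank move of more than 3 positions; weekly significance requires more than 5. Models that are significant in both timeframes represent the strongest, most reliable signals.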
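The dual-timeframe rule can be sketched as a small classifier. The function name and return labels are hypothetical, and treating moves in either direction the same way (via the absolute value) is an assumption; the page only states the "more than 3" and "more than 5" thresholds.

```python
DAILY_RANK_THRESHOLD = 3   # daily move must exceed 3 positions
WEEKLY_RANK_THRESHOLD = 5  # weekly move must exceed 5 positions


def classify_timeframes(daily_rank_change, weekly_rank_change):
    """Label a model by the timeframes in which its rank move is significant.

    Assumption: upward and downward moves count equally, so we compare
    absolute values against the thresholds.
    """
    daily = abs(daily_rank_change) > DAILY_RANK_THRESHOLD
    weekly = abs(weekly_rank_change) > WEEKLY_RANK_THRESHOLD
    if daily and weekly:
        return "both"
    if daily:
        return "daily_only"
    if weekly:
        return "weekly_only"
    return "none"


# Hypothetical rank changes (positions moved in 24 hours, in 7 days).
for change in [(4, -6), (-4, 2), (1, 6), (3, 5)]:
    print(change, classify_timeframes(*change))
```

Note that a move of exactly 3 (daily) or 5 (weekly) positions is not flagged, because the stated rule is "more than".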
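The coefficient of variation (CV%) measures relative volatility. High-CV models have inherently noisy scores and need a larger absolute change to reach significance. Low-CV models are more predictable, so even a small deviation may represent a real change.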
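For reference, the coefficient of variation is the standard deviation divided by the mean, expressed as a percentage. The sketch below is one way to compute it; the function name and the choice of the sample standard deviation are assumptions, since the page does not state its estimator.

```python
from statistics import mean, stdev


def cv_percent(scores):
    """Coefficient of variation as a percentage: 100 * stdev / |mean|.

    Assumption: sample standard deviation; the page does not specify this.
    """
    if len(scores) < 2:
        raise ValueError("need at least two observations")
    m = mean(scores)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(scores) / abs(m)


# Hypothetical score histories, for illustration only.
stable = [76.0, 76.1, 76.0, 75.9, 76.0]
noisy = [40.0, 90.0, 55.0, 100.0, 60.0]
print(f"stable CV% = {cv_percent(stable):.1f}")
print(f"noisy  CV% = {cv_percent(noisy):.1f}")
```

Because the standard deviation also sits in the denominator of the z-score, a noisy history needs a larger absolute move to cross the same threshold.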
Statistical significance indicates whether a model's rank change represents a real performance shift or is just random noise. We use z-scores with a 95% confidence threshold (|z| > 1.96), meaning a change is flagged as significant only if a deviation that large would arise from random variation alone less than 5% of the time.
A z-score measures how many standard deviations a model's current score deviates from its historical baseline. It is calculated as (current score - baseline mean) / standard deviation. Values above +1.96 indicate significant improvement, while values below -1.96 indicate significant decline.
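The formula above translates directly into code. This is a minimal sketch, not the site's implementation: the function names and labels are hypothetical, the sample standard deviation is an assumption, and raising an error for a zero-variance history is just one possible policy.

```python
from statistics import mean, stdev


def z_score(current, history):
    """(current score - baseline mean) / standard deviation."""
    sd = stdev(history)  # assumption: sample standard deviation
    if sd == 0:
        # Assumption: a flat history has no usable scale, so we refuse to guess.
        raise ValueError("z-score is undefined for a zero-variance history")
    return (current - mean(history)) / sd


def classify(z, threshold=1.96):
    """Map a z-score to the three outcomes described above."""
    if z > threshold:
        return "significant improvement"
    if z < -threshold:
        return "significant decline"
    return "noise"


# Hypothetical history and current scores, for illustration only.
history = [1, 2, 3, 4, 5]
for current in (6, 7, -1):
    z = z_score(current, history)
    print(current, round(z, 2), classify(z))
```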
The CV% measures a model's relative score volatility. A high CV% means the model's performance fluctuates a lot, requiring larger changes to be statistically significant. A low CV% means the model is very consistent, so even small deviations may represent meaningful shifts. This helps distinguish inherently noisy models from truly changing ones.