Not all rank changes are meaningful. Some are random noise. This page uses statistical analysis to tell you which model score movements are real trends vs. normal fluctuation, so you know which changes to pay attention to.
Models Analyzed: 276
Significant Changes: 0
Noise (Not Significant): 276
Both Timeframes: 0
0 models whose recent scores deviate enough from their historical average to be considered a real change rather than noise, sorted by how extreme the change is. The z-score measures how unusual a change is: values beyond ±1.96 mean there is less than a 5% chance the change is due to normal fluctuation (95% confidence).
No significant changes
All model score changes fall within normal statistical variance.
A model changing rank in 24 hours could be a blip. But if it's also moving over 7 days, that's far more likely a real trend. Models flagged on both timeframes are the most important to watch - they represent confirmed, sustained performance shifts.
No models currently significant on both daily and weekly timeframes.
No models with daily-only significance.
No models with weekly-only significance.
Some models have naturally stable scores - even a small rank change for these models is meaningful. Others have volatile scores that bounce around - they need a bigger shift before you should care. CV% (coefficient of variation) tells you how volatile each model is. Higher = noisier.
Most volatile models (highest CV%):

| Model | Provider | Score | CV% |
|---|---|---|---|
| Gemma 2 27B | Google | 77.4 | 80.4% |
| Coder Large | arcee-ai | 40.0 | 74.2% |
| Llama Guard 3 8B | Meta | 40.0 | 72.7% |
| Llama 3 70B Instruct | Meta | 57.0 | 69.9% |
| Gemma 3n 4B | Google | 40.0 | 69.3% |
| Llama 3.3 70B Instruct (free) | Meta | 65.7 | 69.1% |
| Phi 4 | Microsoft | 60.2 | 68.9% |
| R1 | DeepSeek | 73.0 | 58.3% |
| GPT-4o Search Preview | OpenAI | 70.4 | 55.4% |
| ERNIE 4.5 300B A47B | Baidu | 59.6 | 54.2% |
| Command A | Cohere | 50.8 | 53.1% |
| GPT-4 | OpenAI | 64.8 | 52.5% |
| GPT-4 (older v0314) | OpenAI | 64.8 | 52.3% |
| Mixtral 8x22B Instruct | Mistral AI | 63.4 | 51.5% |
| Llama 3.1 70B Instruct | Meta | 65.3 | 51.4% |
| Llama 3.3 70B Instruct | Meta | 66.8 | 50.8% |
| Mistral Large | Mistral AI | 65.9 | 50.6% |
| MiniMax M2-her | MiniMax | 69.1 | 50.4% |
| o3 Mini | OpenAI | 74.5 | 48.3% |
| DeepSeek V3 | DeepSeek | 69.5 | 47.8% |
Most stable models (lowest CV%):

| Model | Provider | Score | CV% |
|---|---|---|---|
| Claude Opus 4.6 (Fast) | Anthropic | 90.4 | 0.0% |
| Gemma 4 31B (free) | Google | 80.5 | 0.0% |
| DeepSeek V4 Pro | DeepSeek | 75.7 | 0.0% |
| Trinity Large Preview | arcee-ai | 63.6 | 0.0% |
| Claude Opus Latest | anthropic | 40.0 | 0.0% |
| Qianfan-OCR-Fast (free) | Baidu | 40.0 | 0.0% |
| Grok 4.3 | xAI | 76.4 | 0.1% |
| DeepSeek V4 Flash | DeepSeek | 72.1 | 0.1% |
| Gemma 4 26B A4B (free) | Google | 73.0 | 0.1% |
| GLM 5.1 | Zhipu AI | 76.1 | 0.1% |
| Kimi K2.6 | Moonshot AI | 75.9 | 0.3% |
| Grok 4.1 Fast | xAI | 78.0 | 1.1% |
| Claude Opus 4.7 | Anthropic | 79.3 | 1.3% |
| MiniMax-01 | MiniMax | 40.0 | 1.5% |
| Qwen3.5-Flash | Alibaba | 68.7 | 1.6% |
| Nemotron 3 Nano 30B A3B (free) | NVIDIA | 40.0 | 1.7% |
| Grok 4.20 | xAI | 88.8 | 1.7% |
| Kimi K2.5 | Moonshot AI | 59.1 | 2.0% |
| Grok 4 Fast | xAI | 72.5 | 2.0% |
| Nemotron Nano 9B V2 | NVIDIA | 40.0 | 2.1% |
Understanding the statistical methodology behind our significance analysis helps you distinguish real performance shifts from random fluctuations.
We use z-scores with a 95% confidence threshold (|z| > 1.96). A z-score measures how many standard deviations a model's current score is from its historical baseline. Only changes exceeding 1.96 standard deviations are flagged as statistically significant.
The baseline is computed as the arithmetic mean of each model's 14-day sparkline data. This rolling average smooths out daily fluctuations and provides a stable reference point for detecting meaningful deviations.
Each model's 95% confidence interval is calculated as baseline ± 1.96 × standard deviation. Scores falling outside this range indicate a statistically meaningful change. The "Confidence" column shows the ± threshold value.
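As a sketch of how the baseline, confidence interval, and z-score fit together (the sparkline values below are hypothetical; the real data comes from the leaderboard):

```python
import statistics

# Hypothetical 14-day sparkline for one model (illustrative only).
sparkline = [72.1, 71.8, 72.4, 72.0, 71.9, 72.3, 72.2,
             71.7, 72.5, 72.1, 72.0, 71.8, 72.2, 72.1]

baseline = statistics.mean(sparkline)        # arithmetic mean = baseline
stdev = statistics.stdev(sparkline)          # sample standard deviation

# 95% confidence interval: baseline +/- 1.96 * stdev.
margin = 1.96 * stdev                        # the "Confidence" +/- threshold
low, high = baseline - margin, baseline + margin

current = 73.5                               # today's score (hypothetical)
z = (current - baseline) / stdev             # z-score of today's score
significant = not (low <= current <= high)   # equivalent to abs(z) > 1.96
```

A score falling outside `[low, high]` is exactly the same condition as `abs(z) > 1.96`, so the interval view and the z-score view always agree.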
Daily (24h) and weekly (7d) rank changes are analyzed separately. Daily significance requires a rank shift of more than 3 positions; weekly requires more than 5. Models significant on both timeframes represent the strongest, most reliable signals.
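A minimal sketch of that classification rule (the function name and return labels are illustrative, not taken from the site's code):

```python
def classify(daily_shift: int, weekly_shift: int) -> str:
    """Classify a rank movement by the thresholds described above:
    daily significance needs |shift| > 3 positions, weekly needs > 5."""
    daily = abs(daily_shift) > 3
    weekly = abs(weekly_shift) > 5
    if daily and weekly:
        return "both timeframes"   # strongest, most reliable signal
    if daily:
        return "daily only"
    if weekly:
        return "weekly only"
    return "not significant"
```

For example, a model that moved 4 positions in 24 hours and 7 positions over the week lands in "both timeframes", the category the page treats as a confirmed shift.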
The coefficient of variation (CV%) measures relative volatility. High-CV models have naturally noisy scores and require larger absolute changes to be significant. Low-CV models are more predictable, so even small deviations may represent real shifts.
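The CV% itself is straightforward to compute; a sketch using made-up score series to show a stable model versus a noisy one:

```python
import statistics

def cv_percent(scores: list[float]) -> float:
    """Coefficient of variation: stdev as a percentage of the mean.
    Higher values mean noisier, less predictable scores."""
    mean = statistics.mean(scores)
    return 100.0 * statistics.stdev(scores) / mean

stable = [72.0, 72.1, 71.9, 72.0, 72.1]   # hypothetical, low volatility
noisy = [40.0, 75.0, 55.0, 80.0, 45.0]    # hypothetical, high volatility
```

Here the stable series comes out well under 1% while the noisy one exceeds 20%, which is why the same absolute score change means very different things for the two models.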
Statistical significance indicates whether a model's rank change represents a real performance shift or is just random noise. We use z-scores with a 95% confidence threshold (|z| > 1.96), meaning a change is only flagged as significant if there is less than a 5% chance it occurred by random variation.
A z-score measures how many standard deviations a model's current score deviates from its historical baseline. It is calculated as (current score - baseline mean) / standard deviation. Values above +1.96 indicate significant improvement, while values below -1.96 indicate significant decline.
The CV% measures a model's relative score volatility. A high CV% means the model's performance fluctuates a lot, requiring larger changes to be statistically significant. A low CV% means the model is very consistent, so even small deviations may represent meaningful shifts. This helps distinguish inherently noisy models from truly changing ones.