Not all rank changes are meaningful. Some are random noise. This page uses statistical analysis to tell you which model score movements are real trends vs. normal fluctuation, so you know which changes to pay attention to.
| Metric | Count |
|---|---|
| Models Analyzed | 295 |
| Significant Changes | 0 |
| Noise (Not Significant) | 295 |
| Both Timeframes | 2 |
Significant changes: 0 models. These are models whose recent scores deviate enough from their historical average to count as a real change rather than noise, sorted by how extreme the change is. The z-score measures how unusual the change is: if a model's score were only fluctuating normally, values beyond ±1.96 would occur less than 5% of the time.
No significant changes
All model score changes fall within normal statistical variance.
A model changing rank in 24 hours could be a blip. But if it's also moving over 7 days, that's a real trend. Models flagged on both timeframes are the most important to watch - they represent confirmed, sustained performance shifts.
| Model | Provider | Score | 24h Change | 7d Change |
|---|---|---|---|---|
| Mistral Nemo | Mistral AI | 39.9 | -11 | -11 |
| GLM 5V Turbo | Zhipu AI | 40.0 | -146 | +115 |
No models with daily-only significance.
No models with weekly-only significance.
Some models have naturally stable scores - even a small rank change for these models is meaningful. Others have volatile scores that bounce around - they need a bigger shift before you should care. CV% (coefficient of variation) tells you how volatile each model is. Higher = noisier.

Most volatile models (highest CV%):

| Model | Provider | Score | CV% |
|---|---|---|---|
| Gemma 2 27B | Google | 77.1 | 57.6% |
| Qwen2.5 Coder 32B Instruct | Alibaba | 40.0 | 55.8% |
| Coder Large | arcee-ai | 39.3 | 52.6% |
| Gemma 3n 4B | Google | 40.0 | 51.8% |
| Llama 3.3 70B Instruct (free) | Meta | 65.5 | 51.6% |
| Phi 4 | Microsoft | 59.9 | 51.5% |
| R1 | DeepSeek | 73.7 | 45.6% |
| GPT-4o Search Preview | OpenAI | 70.0 | 43.6% |
| Command A | Cohere | 50.4 | 42.1% |
| GPT-4 | OpenAI | 64.5 | 41.8% |
| Mixtral 8x22B Instruct | Mistral AI | 63.0 | 41.2% |
| Llama 3.1 70B Instruct | Meta | 64.9 | 41.1% |
| Llama 3.3 70B Instruct | Meta | 66.4 | 40.7% |
| Mistral Large | Mistral AI | 65.5 | 40.6% |
| MiniMax M2-her | MiniMax | 68.8 | 40.5% |
| o3 Mini | OpenAI | 74.9 | 39.2% |
| DeepSeek V3 | DeepSeek | 69.0 | 38.7% |
| GPT-4o-mini Search Preview | OpenAI | 60.4 | 38.7% |
| Nova Micro 1.0 | Amazon | 40.0 | 38.0% |
| GPT-4 Turbo Preview | OpenAI | 59.4 | 37.4% |

Most stable models (lowest CV%):

| Model | Provider | Score | CV% |
|---|---|---|---|
| Claude Fable 5 | Anthropic | 96.6 | 0.0% |
| Claude Opus 4.7 (Fast) | Anthropic | 94.7 | 0.0% |
| Kimi K2.7 Code | Moonshot AI | 53.7 | 0.0% |
| Claude Fable Latest | ~anthropic | 40.0 | 0.0% |
| Nemotron 3.5 Content Safety (free) | NVIDIA | 40.0 | 0.0% |
| Nemotron 3 Ultra (free) | NVIDIA | 40.0 | 0.0% |
| Nemotron 3 Ultra | NVIDIA | 40.0 | 0.0% |
| Qwen3.7 Plus | Alibaba | 40.0 | 0.0% |
| Step 3.7 Flash | StepFun | 40.0 | 0.0% |
| Qwen3.7 Max | Alibaba | 40.0 | 0.0% |
| Grok Build 0.1 | xAI | 40.0 | 0.0% |
| Perceptron Mk1 | perceptron | 40.0 | 0.0% |
| Ring-2.6-1T | inclusionai | 40.0 | 0.0% |
| GPT Chat Latest | OpenAI | 40.0 | 0.0% |
| Nemotron 3 Nano Omni (free) | NVIDIA | 40.0 | 0.0% |
| Laguna XS.2 (free) | poolside | 40.0 | 0.0% |
| Laguna M.1 (free) | poolside | 40.0 | 0.0% |
| Anthropic Claude Haiku Latest | ~anthropic | 40.0 | 0.0% |
| OpenAI GPT Mini Latest | ~openai | 40.0 | 0.0% |
| Google Gemini Pro Latest | ~google | 40.0 | 0.0% |

Understanding the statistical methodology behind our significance analysis helps you distinguish real performance shifts from random fluctuations.
We use z-scores with a 95% confidence threshold (|z| > 1.96). A z-score measures how many standard deviations a model's current score is from its historical baseline. Only changes exceeding 1.96 standard deviations are flagged as statistically significant.
The baseline is computed as the arithmetic mean of each model's 14-day sparkline data. This rolling average smooths out daily fluctuations and provides a stable reference point for detecting meaningful deviations.
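As a rough illustration of the baseline and z-score calculation, here is a minimal Python sketch. The function names, the sample sparkline, and the choice of population standard deviation are assumptions made for the example; the page does not say which standard deviation variant it uses or how it treats a perfectly flat history.

```python
from statistics import fmean, pstdev

Z_THRESHOLD = 1.96  # 95% threshold stated on this page


def z_score(sparkline, current):
    """Z-score of `current` against the mean of the sparkline.

    Returns None when the history has zero spread (assumed handling),
    because the z-score is undefined in that case.
    """
    baseline = fmean(sparkline)   # arithmetic mean of the 14-day data
    spread = pstdev(sparkline)    # population std dev (an assumption)
    if spread == 0:
        return None
    return (current - baseline) / spread


def is_significant(sparkline, current):
    """True when |z| exceeds the 1.96 threshold."""
    z = z_score(sparkline, current)
    return z is not None and abs(z) > Z_THRESHOLD


# Hypothetical 14-day score history, not taken from the tables above.
history = [70.0, 72.0, 68.0, 71.0, 69.0, 70.0, 73.0,
           67.0, 70.0, 71.0, 69.0, 72.0, 68.0, 70.0]

print(is_significant(history, 74.0))  # far from the baseline of 70
print(is_significant(history, 71.0))  # within normal fluctuation
```

Returning `None` for a flat history is one possible design choice; a model whose 14 scores are identical has no variance to compare against, so any rule for it has to be defined separately.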
Each model's 95% confidence interval is calculated as baseline ± 1.96 × standard deviation. Scores falling outside this range indicate a statistically meaningful change. The "Confidence" column shows the ± threshold value.
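The interval described above can be sketched in a few lines of Python. This is an illustration under assumptions: `confidence_band` is a hypothetical helper, the sample history is invented, and the population standard deviation is used because the page does not specify a variant.

```python
from statistics import fmean, pstdev


def confidence_band(sparkline, z_crit=1.96):
    """Return (lower, upper, margin) for baseline ± z_crit × std dev."""
    baseline = fmean(sparkline)
    margin = z_crit * pstdev(sparkline)  # the "±" threshold value
    return baseline - margin, baseline + margin, margin


# Hypothetical 14-day score history with a mean of 70.
history = [70.0, 72.0, 68.0, 71.0, 69.0, 70.0, 73.0,
           67.0, 70.0, 71.0, 69.0, 72.0, 68.0, 70.0]

lower, upper, margin = confidence_band(history)
print(f"70.0 ± {margin:.2f} -> [{lower:.2f}, {upper:.2f}]")

# A score outside the band would count as a meaningful change.
current = 74.0
print(not (lower <= current <= upper))
```

Checking whether a score falls outside this band is equivalent to checking whether its |z| exceeds 1.96, so the two views should always agree.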
Daily (24h) and weekly (7d) rank changes are analyzed separately. Daily significance requires a rank shift of more than 3 positions; weekly requires more than 5. Models significant on both timeframes represent the strongest, most reliable signals.
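The timeframe rule can be expressed as a small classifier. The sketch below is hypothetical: the function name and labels are invented for illustration, and it assumes the thresholds apply to the absolute size of the rank shift regardless of direction, which the page does not state explicitly.

```python
def classify_timeframes(change_24h, change_7d, daily_min=3, weekly_min=5):
    """Label a model by which timeframes show a large enough rank shift.

    A shift must be strictly greater than the threshold:
    more than 3 positions daily, more than 5 weekly.
    """
    daily = abs(change_24h) > daily_min
    weekly = abs(change_7d) > weekly_min
    if daily and weekly:
        return "both"
    if daily:
        return "daily-only"
    if weekly:
        return "weekly-only"
    return "none"


# Rank changes from the "both timeframes" table on this page.
print(classify_timeframes(-11, -11))    # Mistral Nemo
print(classify_timeframes(-146, 115))   # GLM 5V Turbo

# Hypothetical values showing the other outcomes.
print(classify_timeframes(4, 2))
print(classify_timeframes(3, 5))        # exactly at the thresholds
```

Under these assumptions both listed models land in the "both" group, which matches where the page places them.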
The coefficient of variation (CV%) measures relative volatility. High-CV models have naturally noisy scores and require larger absolute changes to be significant. Low-CV models are more predictable, so even small deviations may represent real shifts.
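For reference, the coefficient of variation is the standard deviation divided by the mean, expressed as a percentage. The Python sketch below shows one way to compute it; `cv_percent` is a hypothetical helper, the sample series are invented, and the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def cv_percent(scores):
    """Coefficient of variation as a percentage: 100 × std dev / mean.

    Returns None when the mean is zero (assumed handling), since the
    ratio is undefined there.
    """
    mean = fmean(scores)
    if mean == 0:
        return None
    return 100 * pstdev(scores) / mean


# A perfectly flat history has no volatility at all.
flat = [40.0] * 14
print(cv_percent(flat))

# A hypothetical history that swings between two levels is far noisier.
swinging = [50.0, 100.0] * 7
print(round(cv_percent(swinging), 1))
```

Because CV% is relative to the mean, a model with identical scores every day reports 0.0% whatever its score level, which is consistent with the all-zero entries in the stable-models table.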
Statistical significance indicates whether a model's rank change represents a real performance shift or is just random noise. We use z-scores with a 95% confidence threshold (|z| > 1.96), meaning a change is flagged as significant only if a deviation that large would occur less than 5% of the time from random variation alone.
A z-score measures how many standard deviations a model's current score deviates from its historical baseline. It is calculated as (current score - baseline mean) / standard deviation. Values above +1.96 indicate significant improvement, while values below -1.96 indicate significant decline.
The CV% measures a model's relative score volatility. A high CV% means the model's performance fluctuates a lot, requiring larger changes to be statistically significant. A low CV% means the model is very consistent, so even small deviations may represent meaningful shifts. This helps distinguish inherently noisy models from truly changing ones.