Not all rank changes are meaningful. Some are random noise. This page uses statistical analysis to tell you which model score movements are real trends vs. normal fluctuation, so you know which changes to pay attention to.
- Models Analyzed: 300
- Significant Changes: 35
- Noise (Not Significant): 265
- Both Timeframes: 130
35 models whose recent scores deviate enough from their historical average to be considered a real change rather than noise, sorted by how extreme the change is. The z-score measures how unusual the change is: values beyond ±1.96 are significant at the 95% confidence level, meaning there is less than a 5% chance the change is random fluctuation.
A model changing rank in 24 hours could be a blip. But if it's also moving over 7 days, that's a real trend. Models flagged on both timeframes are the most important to watch - they represent confirmed, sustained performance shifts.
Some models have naturally stable scores - even a small rank change for these models is meaningful. Others have volatile scores that bounce around - they need a bigger shift before you should care. CV% (coefficient of variation) tells you how volatile each model is. Higher = noisier.
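The CV% idea above can be sketched in a few lines. This is an illustrative computation, not the site's actual pipeline, and the score series below are made-up numbers chosen to resemble a volatile and a stable model from the tables:

```python
import statistics

def cv_percent(scores):
    """Coefficient of variation: sample stdev as a percentage of the mean.
    Higher values mean noisier day-to-day scores."""
    return statistics.stdev(scores) / statistics.mean(scores) * 100

# Hypothetical recent daily scores for a volatile model vs. a stable one
volatile = [19.8, 21.2, 18.5, 20.9, 18.1]
stable = [83.2, 83.5, 83.0, 83.3, 83.1]

print(f"volatile CV%: {cv_percent(volatile):.1f}")  # several percent
print(f"stable CV%:   {cv_percent(stable):.2f}")    # well under one percent
```

A rank change for the stable series is far more likely to reflect a real shift, because its everyday noise band is so narrow.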
Most volatile models (highest CV%):

| Model | Provider | Score | CV% |
|---|---|---|---|
| Mistral 7B Instruct v0.1 | Mistral AI | 19.8 | 6.8% |
| Llama 3.2 1B Instruct | Meta | 31.8 | 5.9% |
| GPT-3.5 Turbo Instruct | OpenAI | 32.2 | 5.1% |
| autofixer-01 | Vercel | 38.8 | 4.8% |
| GPT-4 | OpenAI | 39.0 | 4.8% |
| Inflection 3 Productivity | Inflection | 36.6 | 4.5% |
| Saba | Mistral AI | 52.7 | 4.1% |
| Coder Large | arcee-ai | 45.3 | 4.1% |
| GPT-3.5 Turbo 16k | OpenAI | 39.9 | 4.1% |
| Llama 3.3 70B Instruct (free) | Meta | 43.9 | 4.1% |
| GPT-3.5 Turbo | OpenAI | 39.9 | 4.0% |
| Qwen2.5 Coder 7B Instruct | Alibaba | 42.8 | 3.8% |
| Mistral Large 2407 | Mistral AI | 52.9 | 3.8% |
| Qwen2.5 7B Instruct | Alibaba | 45.5 | 3.7% |
| GPT-4o (2024-08-06) | OpenAI | 55.4 | 3.6% |
| Pixtral Large 2411 | Mistral AI | 55.6 | 3.6% |
| SWE-1.5 | Windsurf | 49.1 | 3.6% |
| Mistral Large | Mistral AI | 54.1 | 3.5% |
| Command R+ (08-2024) | Cohere | 47.6 | 3.5% |
| Command R (08-2024) | Cohere | 47.6 | 3.5% |
Most stable models (lowest CV%):

| Model | Provider | Score | CV% |
|---|---|---|---|
| Grok 4 Fast | xAI | 83.2 | 0.4% |
| Sonar | Perplexity | 53.5 | 0.5% |
| Sonar Pro Search | Perplexity | 85.0 | 0.5% |
| Gemini 3.1 Pro Preview Custom Tools | Google | 85.0 | 0.5% |
| Gemini 3.1 Flash Lite Preview | Google | 81.9 | 0.5% |
| Grok 4.20 Beta | xAI | 85.7 | 0.5% |
| Gemini 2.5 Pro Preview 06-05 | Google | 84.1 | 0.6% |
| Gemini 2.5 Flash Lite Preview 09-2025 | Google | 83.6 | 0.6% |
| Gemini 2.5 Pro Preview 05-06 | Google | 82.5 | 0.6% |
| Solar Pro 3 | Upstage | 72.5 | 0.6% |
| GPT-5.4 | OpenAI | 94.0 | 0.6% |
| Gemini 2.5 Flash | Google | 80.0 | 0.6% |
| GPT-5.2 | OpenAI | 92.7 | 0.6% |
| Nemotron 3 Nano 30B A3B (free) | NVIDIA | 67.7 | 0.6% |
| GPT-5 Pro | OpenAI | 91.9 | 0.6% |
| Kimi K2.5 | Moonshot AI | 85.0 | 0.6% |
| Qwen3 Coder Plus | Alibaba | 78.4 | 0.6% |
| Qwen3.5 Plus 2026-02-15 | Alibaba | 85.0 | 0.6% |
| Qwen3 Coder Flash | Alibaba | 78.1 | 0.6% |
| Qwen3 Coder Next | Alibaba | 76.7 | 0.6% |
Understanding the statistical methodology behind our significance analysis helps you distinguish real performance shifts from random fluctuations.
We use z-scores with a 95% confidence threshold (|z| > 1.96). A z-score measures how many standard deviations a model's current score is from its historical baseline: z = (current score − baseline mean) / standard deviation. Only changes exceeding 1.96 standard deviations are flagged as statistically significant.
The baseline is computed as the arithmetic mean of each model's 14-day sparkline data. This rolling average smooths out daily fluctuations and provides a stable reference point for detecting meaningful deviations.
Each model's 95% confidence interval is calculated as baseline ± 1.96 × standard deviation. Scores falling outside this range indicate a statistically meaningful change. The "Confidence" column shows the ± threshold value.
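The baseline, z-score, and confidence interval described above fit together as follows. This is a minimal sketch of the methodology, assuming a plain 14-day score history as input; the function name and the sample history are hypothetical:

```python
import statistics

def significance(sparkline_14d, current_score, z_threshold=1.96):
    """Flag a score change as significant at the 95% level.
    Baseline = mean of the 14-day sparkline; z = (current - baseline) / stdev."""
    baseline = statistics.mean(sparkline_14d)
    stdev = statistics.stdev(sparkline_14d)
    z = (current_score - baseline) / stdev
    return {
        "baseline": baseline,
        "z": z,
        "confidence": z_threshold * stdev,  # the ± value of the 95% interval
        "significant": abs(z) > z_threshold,
    }

# Hypothetical 14-day history and today's score
history = [52.0, 52.4, 51.8, 52.1, 52.3, 51.9, 52.2,
           52.0, 52.5, 51.7, 52.1, 52.3, 52.0, 52.2]
print(significance(history, 53.5))  # a jump well outside the noise band
```

A current score inside `baseline ± confidence` stays unflagged; anything outside it is reported as a significant change.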
Daily (24h) and weekly (7d) rank changes are analyzed separately. Daily significance requires a rank shift of more than 3 positions; weekly requires more than 5. Models significant on both timeframes represent the strongest, most reliable signals.
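The dual-timeframe rule reduces to a small classifier. A sketch using the thresholds stated above (the function name is hypothetical):

```python
def timeframe_signal(daily_rank_change, weekly_rank_change):
    """Classify a rank change using the page's thresholds:
    daily significance needs |change| > 3 positions, weekly needs > 5."""
    daily = abs(daily_rank_change) > 3
    weekly = abs(weekly_rank_change) > 5
    if daily and weekly:
        return "both"        # confirmed, sustained shift - the strongest signal
    if daily:
        return "daily only"  # could be a one-day blip
    if weekly:
        return "weekly only"
    return "noise"

print(timeframe_signal(5, 8))  # significant on both timeframes
```

Note that boundary values do not qualify: a daily shift of exactly 3 positions, or a weekly shift of exactly 5, is still treated as noise.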
The coefficient of variation (CV%) measures relative volatility. High-CV models have naturally noisy scores and require larger absolute changes to be significant. Low-CV models are more predictable, so even small deviations may represent real shifts.