Not all rank changes are meaningful. Some are random noise. This page uses statistical analysis to tell you which model score movements are real trends vs. normal fluctuation, so you know which changes to pay attention to.
- Models Analyzed: 300
- Significant Changes: 35
- Noise (Not Significant): 265
- Both Timeframes: 130
35 models whose recent scores deviate enough from their historical average to be considered a real change rather than noise, sorted by how extreme the change is. The z-score measures how unusual the change is: values beyond ±1.96 are significant at the 95% confidence level, meaning there is less than a 5% chance the change is random fluctuation.
A model changing rank in 24 hours could be a blip. But if it's also moving over 7 days, that's a real trend. Models flagged on both timeframes are the most important to watch - they represent confirmed, sustained performance shifts.
Some models have naturally stable scores - even a small rank change for these models is meaningful. Others have volatile scores that bounce around - they need a bigger shift before you should care. CV% (coefficient of variation) tells you how volatile each model is. Higher = noisier.
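The CV% idea above can be sketched in a few lines. This is an illustrative computation, not the site's actual pipeline, and the score series below are made-up numbers chosen to resemble a volatile and a stable model from the tables:

```python
import statistics

def cv_percent(scores):
    """Coefficient of variation: sample stdev as a percentage of the mean.
    Higher values mean noisier day-to-day scores."""
    return statistics.stdev(scores) / statistics.mean(scores) * 100

# Hypothetical recent daily scores for a volatile model vs. a stable one
volatile = [19.8, 21.2, 18.5, 20.9, 18.1]
stable = [83.2, 83.5, 83.0, 83.3, 83.1]

print(f"volatile CV%: {cv_percent(volatile):.1f}")  # several percent
print(f"stable CV%:   {cv_percent(stable):.2f}")    # well under one percent
```

A rank change for the stable series is far more likely to reflect a real shift, because its everyday noise band is so narrow.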
Most volatile models (highest CV%):

| Model | Provider | Score | CV% |
|---|---|---|---|
| Mistral 7B Instruct v0.1 | Mistral AI | 19.8 | 6.8% |
| Llama 3.2 1B Instruct | Meta | 31.8 | 5.9% |
| GPT-3.5 Turbo Instruct | OpenAI | 32.2 | 5.1% |
| autofixer-01 | Vercel | 38.8 | 4.8% |
| GPT-4 | OpenAI | 39.0 | 4.8% |
| Inflection 3 Productivity | Inflection | 36.6 | 4.5% |
| Saba | Mistral AI | 52.7 | 4.1% |
| Coder Large | arcee-ai | 45.3 | 4.1% |
| GPT-3.5 Turbo 16k | OpenAI | 39.9 | 4.1% |
| Llama 3.3 70B Instruct (free) | Meta | 43.9 | 4.1% |
| GPT-3.5 Turbo | OpenAI | 39.9 | 4.0% |
| Qwen2.5 Coder 7B Instruct | Alibaba | 42.8 | 3.8% |
| Mistral Large 2407 | Mistral AI | 52.9 | 3.8% |
| Qwen2.5 7B Instruct | Alibaba | 45.5 | 3.7% |
| GPT-4o (2024-08-06) | OpenAI | 55.4 | 3.6% |
| Pixtral Large 2411 | Mistral AI | 55.6 | 3.6% |
| SWE-1.5 | Windsurf | 49.1 | 3.6% |
| Mistral Large | Mistral AI | 54.1 | 3.5% |
| Command R+ (08-2024) | Cohere | 47.6 | 3.5% |
| Command R (08-2024) | Cohere | 47.6 | 3.5% |
Most stable models (lowest CV%):

| Model | Provider | Score | CV% |
|---|---|---|---|
| Grok 4 Fast | xAI | 83.2 | 0.4% |
| Sonar | Perplexity | 53.5 | 0.5% |
| Sonar Pro Search | Perplexity | 85.0 | 0.5% |
| Gemini 3.1 Pro Preview Custom Tools | Google | 85.0 | 0.5% |
| Gemini 3.1 Flash Lite Preview | Google | 81.9 | 0.5% |
| Grok 4.20 Beta | xAI | 85.7 | 0.5% |
| Gemini 2.5 Pro Preview 06-05 | Google | 84.1 | 0.6% |
| Gemini 2.5 Flash Lite Preview 09-2025 | Google | 83.6 | 0.6% |
| Gemini 2.5 Pro Preview 05-06 | Google | 82.5 | 0.6% |
| Solar Pro 3 | Upstage | 72.5 | 0.6% |
| GPT-5.4 | OpenAI | 94.0 | 0.6% |
| Gemini 2.5 Flash | Google | 80.0 | 0.6% |
| GPT-5.2 | OpenAI | 92.7 | 0.6% |
| Nemotron 3 Nano 30B A3B (free) | NVIDIA | 67.7 | 0.6% |
| GPT-5 Pro | OpenAI | 91.9 | 0.6% |
| Kimi K2.5 | Moonshot AI | 85.0 | 0.6% |
| Qwen3 Coder Plus | Alibaba | 78.4 | 0.6% |
| Qwen3.5 Plus 2026-02-15 | Alibaba | 85.0 | 0.6% |
| Qwen3 Coder Flash | Alibaba | 78.1 | 0.6% |
| Qwen3 Coder Next | Alibaba | 76.7 | 0.6% |
Understanding the statistical methodology behind our significance analysis helps you distinguish real performance shifts from random fluctuations.
We use z-scores with a 95% confidence threshold (|z| > 1.96). A z-score measures how many standard deviations a model's current score is from its historical baseline: z = (current score − baseline mean) / standard deviation. Only changes exceeding 1.96 standard deviations are flagged as statistically significant.
The baseline is computed as the arithmetic mean of each model's 14-day sparkline data. This rolling average smooths out daily fluctuations and provides a stable reference point for detecting meaningful deviations.
Each model's 95% confidence interval is calculated as baseline ± 1.96 × standard deviation. Scores falling outside this range indicate a statistically meaningful change. The "Confidence" column shows the ± threshold value.
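The baseline, z-score, and confidence interval described above fit together as follows. This is a minimal sketch of the methodology, assuming a plain 14-day score history as input; the function name and the sample history are hypothetical:

```python
import statistics

def significance(sparkline_14d, current_score, z_threshold=1.96):
    """Flag a score change as significant at the 95% level.
    Baseline = mean of the 14-day sparkline; z = (current - baseline) / stdev."""
    baseline = statistics.mean(sparkline_14d)
    stdev = statistics.stdev(sparkline_14d)
    z = (current_score - baseline) / stdev
    return {
        "baseline": baseline,
        "z": z,
        "confidence": z_threshold * stdev,  # the ± value of the 95% interval
        "significant": abs(z) > z_threshold,
    }

# Hypothetical 14-day history and today's score
history = [52.0, 52.4, 51.8, 52.1, 52.3, 51.9, 52.2,
           52.0, 52.5, 51.7, 52.1, 52.3, 52.0, 52.2]
print(significance(history, 53.5))  # a jump well outside the noise band
```

A current score inside `baseline ± confidence` stays unflagged; anything outside it is reported as a significant change.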
Daily (24h) and weekly (7d) rank changes are analyzed separately. Daily significance requires a rank shift of more than 3 positions; weekly requires more than 5. Models significant on both timeframes represent the strongest, most reliable signals.
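The dual-timeframe rule reduces to a small classifier. A sketch using the thresholds stated above (the function name is hypothetical):

```python
def timeframe_signal(daily_rank_change, weekly_rank_change):
    """Classify a rank change using the page's thresholds:
    daily significance needs |change| > 3 positions, weekly needs > 5."""
    daily = abs(daily_rank_change) > 3
    weekly = abs(weekly_rank_change) > 5
    if daily and weekly:
        return "both"        # confirmed, sustained shift - the strongest signal
    if daily:
        return "daily only"  # could be a one-day blip
    if weekly:
        return "weekly only"
    return "noise"

print(timeframe_signal(5, 8))  # significant on both timeframes
```

Note that boundary values do not qualify: a daily shift of exactly 3 positions, or a weekly shift of exactly 5, is still treated as noise.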
The coefficient of variation (CV%) measures relative volatility. High-CV models have naturally noisy scores and require larger absolute changes to be significant. Low-CV models are more predictable, so even small deviations may represent real shifts.