Which LLM benchmarks still separate frontier models, and which have quietly become noise. We compute each benchmark's discrimination score from three inputs: the top-5 score spread, the saturation flag, and ceiling behavior, measured across 156 models and 16 standardized benchmarks.
Scores last refreshed 2026-04-14 (auto-updated every 6 hours from HuggingFace Open LLM Leaderboard, LMArena, and vendor benchmark reports).
Top-5 spread is the gap between the best and fifth-best model on each benchmark. A wide gap means the benchmark can still tell frontier models apart.
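For readers who want the exact arithmetic, here is a minimal sketch of that computation in Python; the five example scores are illustrative values chosen so the result matches GPQA Diamond's 4.2-point spread in the table below.

```python
def top5_spread(scores: list[float]) -> float:
    """Gap between the best and fifth-best score on a benchmark.

    Assumes at least five scored models; benchmarks with fewer are
    excluded from ranking (see the methodology notes further down).
    """
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[4]

# Illustrative scores only (the max matches the table's 91.9%; the rest are made up).
print(round(top5_spread([91.9, 90.1, 89.4, 88.6, 87.7]), 1))  # -> 4.2
```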
Benchmarks marked Saturated cluster the top 5 models within a single-digit range; they are no longer useful for ranking frontier LLMs.
Sorted by discrimination score, high to low. Saturated benchmarks are marked.
| Benchmark | Models scored | Max score | Top-5 spread | Status |
|---|---|---|---|---|
| BigCodeBench (Hard) | 53 | 72 | 22.1 | Active |
| LiveBench (Dynamic) | 53 | 87 | 7.3 | Active |
| SWE-bench Verified (Software Engineering Benchmark, Verified) | 44 | 84 | 5.7 | Active |
| GPQA Diamond (Graduate-Level Google-Proof Q&A) | 47 | 91.9% | 4.2 | Active |
| HLE (Humanity's Last Exam) | 28 | 39.0% | 3.8 | Active |
| AIME 2024 (American Invitational Mathematics Examination) | 22 | 96.7% | 7.5 | Active |
| Arena Elo (LMSYS Chatbot Arena Elo Rating) | 118 | 1503 | 22.0 | Active |
| BBH (BIG-Bench Hard) | 53 | 93.1% | 1.6 | Weak |
| MMLU-Pro (MMLU Professional) | 87 | 88.0% | 1.3 | Weak |
| IFEval (Instruction Following Evaluation) | 50 | 94 | 1.0 | Weak |
| MATH-500 (MATH, 500-problem subset) | 49 | 99.0% | 1.7 | Weak |
| MMLU (Massive Multitask Language Understanding) | 49 | 94.0% | 1.5 | Saturated |
| HellaSwag (Commonsense NLI) | 7 | 96.0% | 3.0 | Saturated |
| ARC-Challenge (AI2 Reasoning Challenge, Challenge Set) | 8 | 96.9% | 1.8 | Saturated |
| HumanEval (Code Generation) | 52 | 98 | 1.0 | Saturated |
| GSM8K (Grade School Math 8K) | 15 | 96.8% | 1.0 | Saturated |
Reference notes for the top-ranked benchmarks so you know what you are looking at when a vendor quotes a score.
BigCodeBench (Hard): Practical code generation requiring use of libraries, APIs, and complex program structures. The 'Hard' subset tests non-trivial engineering tasks.
Why it still matters: More realistic than HumanEval — tests practical programming skills including library usage, API calls, and multi-file reasoning.
LiveBench (Dynamic): Comprehensive benchmark across 6 categories (math, coding, reasoning, data analysis, instruction following, language) using contamination-resistant, regularly updated questions.
Why it still matters: Contamination-free by design — uses new questions regularly. Top models still score below 70%, making it highly discriminating.
SWE-bench Verified: Can a model resolve real GitHub issues from popular Python repositories? Human-validated subset ensures accurate evaluation. Tests end-to-end software engineering ability.
Why it still matters: The gold standard for real-world coding ability. Unlike HumanEval, tests understanding of large codebases, debugging, and complex changes. Scores range 20-80%.
GPQA Diamond: Expert-level science reasoning across biology, chemistry, and physics at PhD level. Questions are designed to be 'Google-proof'; even domain experts with web access struggle.
Why it still matters: One of the best discriminators between models. Scores range widely (40-85%), making it highly informative for comparing reasoning ability.
HLE (Humanity's Last Exam): 2,500 expert-level questions spanning mathematics, sciences, and humanities. Designed to be 'the final closed-ended academic evaluation' that even top models fail most of.
Why it still matters: The hardest academic benchmark — top models still fail 60-65% of questions. Shows how far we are from genuine expert-level reasoning.
AIME 2024: Olympiad-level mathematical problem solving from the real 2024 AIME competition. 30 problems testing advanced algebra, geometry, combinatorics, and number theory.
Why it still matters: Tests mathematical reasoning at competition level. Reasoning models achieve 70-90% while standard models struggle below 30%. Best differentiator for math ability.
The discrimination score is a 0-to-1 rating of how well a benchmark still separates frontier models in 2026. Three inputs feed into it: the top-5 score spread, the saturation flag, and ceiling behavior (how tightly the best scores cluster near the benchmark's maximum).
Benchmarks with fewer than 5 scored models are excluded from ranking to avoid spurious top-5 spreads on small samples.
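The exact formula is internal to LMC and not reproduced here. As a hedged illustration, the sketch below shows one plausible way to fold the three inputs into a 0-to-1 rating; the weights, the 20-point spread normalization, and the saturation penalty are assumptions, not the production formula.

```python
def discrimination_score(
    top5_spread: float,
    saturated: bool,
    max_score: float,
    scale_max: float = 100.0,
    n_models: int = 0,
) -> float | None:
    """Illustrative 0-to-1 discrimination rating (not LMC's exact formula).

    Combines the three inputs described above: the top-5 spread, the
    saturation flag, and ceiling behavior (how close the best score
    sits to the benchmark's maximum scale).
    """
    if n_models < 5:
        return None  # too few scored models for a meaningful top-5 spread

    spread_term = min(top5_spread / 20.0, 1.0)              # wide spread -> high
    headroom_term = 1.0 - min(max_score / scale_max, 1.0)   # near ceiling -> low
    penalty = 0.5 if saturated else 0.0                     # saturation drags the score down

    raw = 0.6 * spread_term + 0.4 * headroom_term - penalty
    return max(0.0, min(1.0, raw))

# Hypothetical call with MMLU-like numbers: narrow spread, near ceiling, flagged saturated.
print(discrimination_score(1.5, True, 94.0, n_models=49))  # -> 0.0
```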
MMLU is flagged as saturated in the current LMC data. Across the 49 models we have scored, the maximum is 94.0%, the median is 88.5%, and the top 5 models are clustered within 1.5 percentage points of each other. That spread is narrower than MMLU's own annotation noise floor, so it cannot reliably rank frontier LLMs. Use MMLU-Pro, GPQA Diamond, or LiveBench for ranking instead.
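The rule behind that flag can be written in one line. A hedged sketch, assuming the caller supplies the benchmark's noise-floor estimate (LMC's per-benchmark thresholds are not published on this page):

```python
def is_saturated(top5_spread: float, annotation_noise_floor: float) -> bool:
    """Flag a benchmark when its top-5 spread is narrower than the estimated
    annotation noise floor (label errors and ambiguous items in the test set),
    i.e. when the remaining gaps are indistinguishable from measurement noise."""
    return top5_spread < annotation_noise_floor

# Hypothetical: an MMLU-like 1.5-point spread against an assumed ~2-point noise floor.
print(is_saturated(1.5, 2.0))  # -> True
```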
By top-5 spread, the most discriminating benchmarks in the current LMC data are BigCodeBench (22.1), LiveBench (7.3), and SWE-bench Verified (5.7). A wide spread means a frontier model that improves by a few points still meaningfully moves the ranking, which is the property you want for buyer decisions.
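Mechanically, that ranking is just a filter-and-sort over per-benchmark records, widest spread first. A small sketch, with field names as assumptions and the numbers taken from the table above:

```python
# Field names are illustrative; the values come from the table above.
benchmarks = [
    {"name": "SWE-bench Verified", "top5_spread": 5.7, "n_models": 44},
    {"name": "BigCodeBench (Hard)", "top5_spread": 22.1, "n_models": 53},
    {"name": "LiveBench (Dynamic)", "top5_spread": 7.3, "n_models": 53},
]

ranked = sorted(
    (b for b in benchmarks if b["n_models"] >= 5),  # drop small samples
    key=lambda b: b["top5_spread"],
    reverse=True,
)
print([b["name"] for b in ranked])
# ['BigCodeBench (Hard)', 'LiveBench (Dynamic)', 'SWE-bench Verified']
```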
HumanEval in the current LMC data has a max of 97.5%, a median of 90.2%, and a top-5 spread of 1.0 points across 52 scored models. That kind of ceiling behavior means the remaining errors are largely ambiguous test cases rather than real coding failures. For coding evaluation today use SWE-bench Verified (top-5 spread 5.7), LiveCodeBench, or BigCodeBench instead.
We refresh benchmark scores every 6 hours from the HuggingFace Open LLM Leaderboard, LMArena text, vision, and image leaderboards, and vendor-published benchmark reports. The merged dataset is overlaid on the curated LMC benchmark definitions at render time, so every page load reflects the latest available scores. Benchmarks with fewer than 5 scored models are excluded from ranking to avoid spurious top-5 spreads. The current snapshot was fetched at 2026-04-14T06:30:08.587Z.
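For readers curious how the overlay works mechanically, here is a rough sketch assuming a simple dict-based representation; the function name, field names, and data shapes are illustrative assumptions, not the actual LMC pipeline code.

```python
from datetime import datetime, timezone

def overlay_scores(curated: dict[str, dict], fetched: dict[str, dict]) -> dict[str, dict]:
    """Merge freshly fetched leaderboard scores onto curated benchmark definitions.

    `curated` maps benchmark id -> static definition (name, scale, notes).
    `fetched` maps benchmark id -> {"scores": {model_name: score, ...}}.
    Benchmarks with fewer than 5 scored models are kept for display but
    flagged as unrankable, mirroring the exclusion rule described above.
    """
    merged = {}
    for bench_id, definition in curated.items():
        scores = fetched.get(bench_id, {}).get("scores", {})
        merged[bench_id] = {
            **definition,
            "scores": scores,
            "n_models": len(scores),
            "rankable": len(scores) >= 5,
            "merged_at": datetime.now(timezone.utc).isoformat(),
        }
    return merged
```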