Which LLM benchmarks still separate frontier models, and which have quietly become noise. We compute each benchmark's discrimination score from three inputs: the top-5 score spread, the saturation flag, and ceiling behavior, measured across 156 models and 16 standardized benchmarks.
Scores last refreshed 2026-04-14 (auto-updated every 6 hours from HuggingFace Open LLM Leaderboard, LMArena, and vendor benchmark reports).
Top-5 spread is the gap between the best and fifth-best model on each benchmark. A wide gap means the benchmark can still tell frontier models apart.
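For readers who want the exact arithmetic, here is a minimal sketch of that computation in Python; the five example scores are illustrative values chosen so the result matches GPQA Diamond's 4.2-point spread in the table below.

```python
def top5_spread(scores: list[float]) -> float:
    """Gap between the best and fifth-best score on a benchmark.

    Assumes at least five scored models; benchmarks with fewer are
    excluded from ranking (see the methodology notes further down).
    """
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[4]

# Illustrative scores only (the max matches the table's 91.9%; the rest are made up).
print(round(top5_spread([91.9, 90.1, 89.4, 88.6, 87.7]), 1))  # -> 4.2
```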
Benchmarks marked Saturated cluster the top 5 models within a single-digit range; they are no longer useful for ranking frontier LLMs.
Sorted by discrimination score, high to low. Saturated benchmarks are marked.
| Benchmark | Models scored | Max score | Top-5 spread | Status |
|---|---|---|---|---|
| BigCodeBench (Hard) | 53 | 72 | 22.1 | Active |
| LiveBench (Dynamic) | 53 | 87 | 7.3 | Active |
| SWE-bench Verified (Software Engineering Benchmark, Verified) | 44 | 84 | 5.7 | Active |
| GPQA Diamond (Graduate-Level Google-Proof Q&A) | 47 | 91.9% | 4.2 | Active |
| HLE (Humanity's Last Exam) | 28 | 39.0% | 3.8 | Active |
| AIME 2024 (American Invitational Mathematics Examination) | 22 | 96.7% | 7.5 | Active |
| Arena Elo (LMSYS Chatbot Arena Elo Rating) | 118 | 1503 | 22.0 | Active |
| BBH (BIG-Bench Hard) | 53 | 93.1% | 1.6 | Weak |
| MMLU-Pro (MMLU Professional) | 87 | 88.0% | 1.3 | Weak |
| IFEval (Instruction Following Evaluation) | 50 | 94 | 1.0 | Weak |
| MATH-500 (MATH, 500-problem subset) | 49 | 99.0% | 1.7 | Weak |
| MMLU (Massive Multitask Language Understanding) | 49 | 94.0% | 1.5 | Saturated |
| HellaSwag (Commonsense NLI) | 7 | 96.0% | 3.0 | Saturated |
| ARC-Challenge (AI2 Reasoning Challenge, Challenge Set) | 8 | 96.9% | 1.8 | Saturated |
| HumanEval (Code Generation) | 52 | 98 | 1.0 | Saturated |
| GSM8K (Grade School Math 8K) | 15 | 96.8% | 1.0 | Saturated |
Reference notes for the top-ranked benchmarks so you know what you are looking at when a vendor quotes a score.
BigCodeBench (Hard): Practical code generation requiring use of libraries, APIs, and complex program structures. The 'Hard' subset tests non-trivial engineering tasks.
Why it still matters: More realistic than HumanEval — tests practical programming skills including library usage, API calls, and multi-file reasoning.
LiveBench (Dynamic): Comprehensive benchmark across 6 categories (math, coding, reasoning, data analysis, instruction following, language) using contamination-resistant, regularly updated questions.
Why it still matters: Contamination-free by design — uses new questions regularly. Top models still score below 70%, making it highly discriminating.
SWE-bench Verified: Can a model resolve real GitHub issues from popular Python repositories? Human-validated subset ensures accurate evaluation. Tests end-to-end software engineering ability.
Why it still matters: The gold standard for real-world coding ability. Unlike HumanEval, tests understanding of large codebases, debugging, and complex changes. Scores range 20-80%.
GPQA Diamond: Expert-level science reasoning across biology, chemistry, and physics at PhD level. Questions are designed to be 'Google-proof'; even domain experts with web access struggle.
Why it still matters: One of the best discriminators between models. Scores range widely (40-85%), making it highly informative for comparing reasoning ability.
HLE (Humanity's Last Exam): 2,500 expert-level questions spanning mathematics, sciences, and humanities. Designed to be 'the final closed-ended academic evaluation' that even top models fail most of.
Why it still matters: The hardest academic benchmark — top models still fail 60-65% of questions. Shows how far we are from genuine expert-level reasoning.
AIME 2024: Olympiad-level mathematical problem solving from the real 2024 AIME competition. 30 problems testing advanced algebra, geometry, combinatorics, and number theory.
Why it still matters: Tests mathematical reasoning at competition level. Reasoning models achieve 70-90% while standard models struggle below 30%. Best differentiator for math ability.
The discrimination score is a 0-to-1 rating of how well a benchmark still separates frontier models in 2026. Three inputs feed into it: the top-5 score spread, the saturation flag, and ceiling behavior (how tightly the best scores cluster near the benchmark's maximum).
Benchmarks with fewer than 5 scored models are excluded from ranking to avoid spurious top-5 spreads on small samples.
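The exact formula is internal to LMC and not reproduced here. As a hedged illustration, the sketch below shows one plausible way to fold the three inputs into a 0-to-1 rating; the weights, the 20-point spread normalization, and the saturation penalty are assumptions, not the production formula.

```python
def discrimination_score(
    top5_spread: float,
    saturated: bool,
    max_score: float,
    scale_max: float = 100.0,
    n_models: int = 0,
) -> float | None:
    """Illustrative 0-to-1 discrimination rating (not LMC's exact formula).

    Combines the three inputs described above: the top-5 spread, the
    saturation flag, and ceiling behavior (how close the best score
    sits to the benchmark's maximum scale).
    """
    if n_models < 5:
        return None  # too few scored models for a meaningful top-5 spread

    spread_term = min(top5_spread / 20.0, 1.0)              # wide spread -> high
    headroom_term = 1.0 - min(max_score / scale_max, 1.0)   # near ceiling -> low
    penalty = 0.5 if saturated else 0.0                     # saturation drags the score down

    raw = 0.6 * spread_term + 0.4 * headroom_term - penalty
    return max(0.0, min(1.0, raw))

# Hypothetical call with MMLU-like numbers: narrow spread, near ceiling, flagged saturated.
print(discrimination_score(1.5, True, 94.0, n_models=49))  # -> 0.0
```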
MMLU is flagged as saturated in the current LMC data. Across the 49 models we have scored, the maximum is 94.0%, the median is 88.5%, and the top 5 models are clustered within 1.5 percentage points of each other. That spread is narrower than MMLU's own annotation noise floor, so it cannot reliably rank frontier LLMs. Use MMLU-Pro, GPQA Diamond, or LiveBench for ranking instead.
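The rule behind that flag can be written in one line. A hedged sketch, assuming the caller supplies the benchmark's noise-floor estimate (LMC's per-benchmark thresholds are not published on this page):

```python
def is_saturated(top5_spread: float, annotation_noise_floor: float) -> bool:
    """Flag a benchmark when its top-5 spread is narrower than the estimated
    annotation noise floor (label errors and ambiguous items in the test set),
    i.e. when the remaining gaps are indistinguishable from measurement noise."""
    return top5_spread < annotation_noise_floor

# Hypothetical: an MMLU-like 1.5-point spread against an assumed ~2-point noise floor.
print(is_saturated(1.5, 2.0))  # -> True
```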
By top-5 spread, the most discriminating benchmarks in the current LMC data are BigCodeBench (22.1), LiveBench (7.3), and SWE-bench Verified (5.7). A wide spread means a frontier model that improves by a few points still meaningfully moves the ranking, which is the property you want for buyer decisions.
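Mechanically, that ranking is just a filter-and-sort over per-benchmark records, widest spread first. A small sketch, with field names as assumptions and the numbers taken from the table above:

```python
# Field names are illustrative; the values come from the table above.
benchmarks = [
    {"name": "SWE-bench Verified", "top5_spread": 5.7, "n_models": 44},
    {"name": "BigCodeBench (Hard)", "top5_spread": 22.1, "n_models": 53},
    {"name": "LiveBench (Dynamic)", "top5_spread": 7.3, "n_models": 53},
]

ranked = sorted(
    (b for b in benchmarks if b["n_models"] >= 5),  # drop small samples
    key=lambda b: b["top5_spread"],
    reverse=True,
)
print([b["name"] for b in ranked])
# ['BigCodeBench (Hard)', 'LiveBench (Dynamic)', 'SWE-bench Verified']
```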
HumanEval in the current LMC data has a max of 97.5%, a median of 90.2%, and a top-5 spread of 1.0 points across 52 scored models. That kind of ceiling behavior means the remaining errors are largely ambiguous test cases rather than real coding failures. For coding evaluation today use SWE-bench Verified (top-5 spread 5.7), LiveCodeBench, or BigCodeBench instead.
We refresh benchmark scores every 6 hours from the HuggingFace Open LLM Leaderboard, LMArena text, vision, and image leaderboards, and vendor-published benchmark reports. The merged dataset is overlaid on the curated LMC benchmark definitions at render time, so every page load reflects the latest available scores. Benchmarks with fewer than 5 scored models are excluded from ranking to avoid spurious top-5 spreads. The current snapshot was fetched at 2026-04-14T06:30:08.587Z.
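For readers curious how the overlay works mechanically, here is a rough sketch assuming a simple dict-based representation; the function name, field names, and data shapes are illustrative assumptions, not the actual LMC pipeline code.

```python
from datetime import datetime, timezone

def overlay_scores(curated: dict[str, dict], fetched: dict[str, dict]) -> dict[str, dict]:
    """Merge freshly fetched leaderboard scores onto curated benchmark definitions.

    `curated` maps benchmark id -> static definition (name, scale, notes).
    `fetched` maps benchmark id -> {"scores": {model_name: score, ...}}.
    Benchmarks with fewer than 5 scored models are kept for display but
    flagged as unrankable, mirroring the exclusion rule described above.
    """
    merged = {}
    for bench_id, definition in curated.items():
        scores = fetched.get(bench_id, {}).get("scores", {})
        merged[bench_id] = {
            **definition,
            "scores": scores,
            "n_models": len(scores),
            "rankable": len(scores) >= 5,
            "merged_at": datetime.now(timezone.utc).isoformat(),
        }
    return merged
```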