How we rank AI models. Every step is explained in plain language - no math required. We believe rankings should be transparent, so nothing here is hidden.
Every model on LM Market Cap receives a score from 0 to 100 called the LMC Score. This score is primarily driven by standardized benchmark results (90% weight) from multiple independent sources including HuggingFace Open LLM Leaderboard, LMSYS Chatbot Arena, and curated official evaluations.
Rankings are grounded in empirical benchmark data, not sentiment or hype. Models are evaluated on standardized tests that measure real capabilities across coding, reasoning, math, knowledge, and instruction-following.
How it works, in brief (a short sketch of the arithmetic follows this list):
- Benchmarks (90% of the score): 19 standardized benchmarks plus Arena Elo, normalized and averaged. This is the primary ranking signal - empirical performance on coding, math, reasoning, and knowledge tasks. When Arena Elo is unavailable, task benchmarks carry the full weight of this component; when task benchmarks are unavailable, Arena Elo does.
- Capabilities: supported features (vision, function calling, streaming, JSON mode, reasoning, web search, image output), normalized against the most capable model in the category.
- Context window: log-scaled context window size, normalized against the largest in the category. A 1M-token model scores higher than a 128K model, but the difference is compressed by the log scale.
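For readers who want to see the arithmetic, here is a minimal sketch of how the components above could combine into an LMC Score. Only the 90% benchmark weight is stated explicitly; the 5%/5% split between capabilities and context window is an assumption carried over from the initial-release weights listed in the changelog at the bottom of this page, and the function name is illustrative.

```python
# Illustrative only: composing an LMC Score from pre-normalized component scores.
# The 5%/5% split for capabilities and context window is an assumption taken from
# the initial-release weights; only the 90% benchmark weight is stated explicitly.
def lmc_score(benchmark_score: float, capability_score: float, context_score: float) -> float:
    """All inputs are on a 0-100 scale; the result is also 0-100."""
    return 0.90 * benchmark_score + 0.05 * capability_score + 0.05 * context_score

print(round(lmc_score(benchmark_score=80.0, capability_score=60.0, context_score=40.0), 1))  # 77.0
```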
Each benchmark score passes through this pipeline before contributing to the final LMC Score:
Not all benchmarks start from zero. Multiple-choice benchmarks give models a "free" baseline score from random guessing - for example, a 4-choice MCQ gives 25% for random answers. We subtract this baseline before normalizing, following the same approach used by HuggingFace's Open LLM Leaderboard.
For multiple-choice benchmarks, we subtract the random-guessing baseline and rescale:
Example: MMLU score of 85% becomes (85-25)/(100-25) × 100 = 80.0
Generative benchmarks (coding, math, instruction following) have no random-guessing advantage. Scores are used directly, capped at 100:
Example: HumanEval score of 92.1% stays 92.1
Some benchmarks have top scores far below 100% (like HLE where the best models score 4-39%). We normalize against the observed field maximum:
Example: HLE score of 36% with field max 45% becomes 80.0
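To make the pipeline concrete, here is a minimal sketch of the three normalization rules, using the baselines from the benchmark table below. Function names and the clamping details are illustrative, not taken from our codebase.

```python
# Illustrative sketch of the three normalization rules described above.
def normalize_mcq(score: float, baseline: float) -> float:
    """Multiple-choice: subtract the random-guessing baseline, rescale to 0-100."""
    return max(0.0, (score - baseline) / (100.0 - baseline) * 100.0)

def normalize_generative(score: float) -> float:
    """Generative: no guessing advantage, so the score is used directly, capped at 100."""
    return min(score, 100.0)

def normalize_field_max(score: float, field_max: float) -> float:
    """Frontier benchmarks: scale against the best score observed across the field."""
    return min(score / field_max * 100.0, 100.0)

# The worked examples from above:
print(normalize_mcq(85.0, baseline=25.0))        # MMLU: 80.0
print(normalize_generative(92.1))                # HumanEval: 92.1
print(normalize_field_max(36.0, field_max=45.0)) # HLE: 80.0
```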
Each model is evaluated against all available benchmarks. Not every model has scores for all 19 - we average across whatever is available. Models with fewer than 3 data points receive a small penalty to avoid inflated scores from cherry-picked benchmarks.
| Benchmark | Type | Baseline | Description |
|---|---|---|---|
| MMLU | MCQ | 25% | Massive Multitask Language Understanding - 57 subjects |
| MMLU-Pro | MCQ | 10% | 10-choice harder variant of MMLU |
| GPQA | MCQ | 25% | Graduate-level science questions |
| GPQA Diamond | MCQ | 25% | Hardest subset of GPQA |
| ARC-Challenge | MCQ | 25% | Science reasoning (grade school) |
| HellaSwag | MCQ | 25% | Commonsense reasoning completion |
| MATH-500 | Generative | - | Competition-level math problems |
| GSM8K | Generative | - | Grade school math word problems |
| AIME 2024 | Generative | - | American Invitational Math Exam |
| BBH | Generative | - | Big Bench Hard - diverse reasoning |
| IFEval | Generative | - | Instruction following evaluation |
| HumanEval | Generative | - | Python code generation |
| BigCodeBench | Generative | - | Complex coding tasks |
| SWE-bench Verified | Generative | - | Real GitHub issue resolution |
| SWE-bench Multilingual | Generative | - | Multi-language SWE tasks |
| Cursor Bench | Generative | - | AI coding assistant evaluation |
| Terminal Bench | Generative | - | Terminal/CLI task completion |
| LiveBench | Generative | - | Contamination-free live evaluation |
| HLE | Field-max | 45% max | Humanity's Last Exam - frontier difficulty (4-39% range) |
The LMSYS Chatbot Arena Elo rating is the single best holistic quality signal available: it reflects real human preferences from millions of blind head-to-head comparisons. For that reason, it receives 30% of the total benchmark weight.
An Elo of 900 maps to 0 (floor), an Elo of 1500 maps to 100 (ceiling). Most competitive models fall between 1100-1400 Elo, mapping to roughly 33-83 on our scale.
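Below is a minimal sketch of the Elo mapping and the 30/70 blend, including the fallback when one signal is missing (as noted in the overview above). Function names and the clamping behavior are illustrative assumptions.

```python
from typing import Optional

def normalize_elo(elo: float, floor: float = 900.0, ceiling: float = 1500.0) -> float:
    """Linearly map Arena Elo onto 0-100, clamped at the floor and ceiling."""
    scaled = (elo - floor) / (ceiling - floor) * 100.0
    return max(0.0, min(scaled, 100.0))

def blended_benchmark_score(task_avg: Optional[float], elo: Optional[float]) -> Optional[float]:
    """Blend task benchmarks (70%) with Arena Elo (30%); if one signal is
    missing, the other carries the full weight."""
    elo_score = normalize_elo(elo) if elo is not None else None
    if task_avg is None:
        return elo_score
    if elo_score is None:
        return task_avg
    return 0.7 * task_avg + 0.3 * elo_score

print(round(normalize_elo(1300), 1))                  # 66.7
print(round(blended_benchmark_score(75.0, 1300), 1))  # 72.5
```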
Many models share a base architecture but come in different sizes or configurations (e.g., GPT-5, GPT-5 Pro, GPT-5 Mini). When a variant does not have its own benchmark data, it inherits the base model's scores with an appropriate discount:
| Variant suffixes | Treatment of inherited scores |
|---|---|
| Pro, Plus, Max, Preview, Exp, Latest, Deep Research | Equal to or better than the base model, so no penalty is applied. |
| Turbo | Speed-optimized variants that sacrifice some quality for faster inference; inherited scores are discounted. |
| Mini, Nano, Lite, Flash | Smaller or distilled variants designed for efficiency over raw performance; inherited scores are discounted. |
Once a variant receives its own benchmark evaluations, the inherited scores are replaced with real data and the discount is removed.
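Purely as an illustration, here is a sketch of that inheritance rule. The actual discount factors are not listed on this page, so the numbers below are hypothetical placeholders, not the real values.

```python
# Hypothetical sketch: inheriting a base model's score for an unevaluated variant.
# The discount factors are placeholder assumptions, NOT the real values used by the site.
VARIANT_DISCOUNTS = {
    "pro": 1.00, "plus": 1.00, "max": 1.00, "preview": 1.00,  # no penalty
    "turbo": 0.95,                                            # placeholder discount
    "mini": 0.85, "nano": 0.85, "lite": 0.85, "flash": 0.85,  # placeholder discount
}

def inherited_score(base_score: float, variant_suffix: str) -> float:
    """Apply the variant's discount to the base model's benchmark score."""
    return base_score * VARIANT_DISCOUNTS.get(variant_suffix.lower(), 1.0)

print(round(inherited_score(80.0, "Mini"), 1))  # 68.0 with the placeholder 0.85 discount
```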
Models with only 1-2 benchmark scores can have inflated averages from cherry-picked evaluations. To prevent this, we apply a confidence penalty:
A model with only 1 metric gets a 20% penalty. With 2 metrics, a 10% penalty. With 3 or more metrics, no penalty is applied. This encourages comprehensive evaluation while not completely blocking models with limited data.
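Here is a minimal sketch of the averaging step with the confidence penalty, assuming the penalty is applied as a simple multiplier on the benchmark average; names are illustrative.

```python
from typing import Optional

def benchmark_average(scores: list[float]) -> Optional[float]:
    """Average the available normalized benchmark scores, then apply the
    confidence penalty for models with fewer than 3 data points."""
    if not scores:
        return None  # handled separately by the no-benchmark fallback
    avg = sum(scores) / len(scores)
    if len(scores) == 1:
        return avg * 0.80  # 20% penalty
    if len(scores) == 2:
        return avg * 0.90  # 10% penalty
    return avg             # 3 or more metrics: no penalty

print(round(benchmark_average([92.0]), 1))              # 73.6
print(round(benchmark_average([92.0, 60.0]), 1))        # 68.4
print(round(benchmark_average([92.0, 60.0, 70.0]), 1))  # 74.0
```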
Models with no benchmark scores at all receive a score based on capabilities (35%), context window (25%), max output tokens (15%), and recency (25%) - but the total is capped at 40. This ensures unproven models always rank below empirically evaluated ones.
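A sketch of that fallback, assuming each sub-component has already been normalized to a 0-100 scale before weighting; the function name is illustrative.

```python
def fallback_score(capabilities: float, context: float, max_output: float, recency: float) -> float:
    """Score for models with no benchmark data, capped at 40 so unproven models
    always rank below empirically evaluated ones. Inputs assumed pre-normalized to 0-100."""
    raw = 0.35 * capabilities + 0.25 * context + 0.15 * max_output + 0.25 * recency
    return min(raw, 40.0)

print(fallback_score(capabilities=80, context=70, max_output=50, recency=90))  # 40.0 (capped)
```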
We aggregate data from multiple independent sources, cross-referencing to ensure accuracy:
- HuggingFace Open LLM Leaderboard
- Official model papers and release announcements
- LMSYS Chatbot Arena (Elo ratings)
- LiveBench continuous evaluation
- SWE-bench Verified leaderboard
Model metadata - pricing, capabilities, context windows - comes from the OpenRouter API and is refreshed hourly. 340+ models from 35+ providers are tracked continuously.
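For the curious, here is a minimal sketch of pulling that metadata from OpenRouter's public model list endpoint. The endpoint and field names reflect OpenRouter's public API; the parsing and error handling here are illustrative, not our production pipeline.

```python
import requests

# Illustrative: fetch model metadata (context windows, pricing) from OpenRouter's
# public model list. The production refresh job is more involved than this.
resp = requests.get("https://openrouter.ai/api/v1/models", timeout=30)
resp.raise_for_status()
models = resp.json().get("data", [])

for model in models[:5]:
    print(model.get("id"), model.get("context_length"), model.get("pricing", {}).get("prompt"))
```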
Most AI leaderboards focus on a single dimension - usually one benchmark suite or one arena. LM Market Cap combines them into a unified score:
We average across 19 benchmarks spanning knowledge, coding, math, reasoning, and instruction following. No single benchmark can dominate or distort the ranking.
Random-baseline subtraction and field-max normalization put MCQ benchmarks, generative benchmarks, and frontier-difficulty tests on a comparable 0-100 scale, so no benchmark type is systematically advantaged.
Arena Elo (human preference) and task benchmarks (automated evaluation) are blended 30/70 - combining the reliability of standardized tests with the nuance of human judgment.
Model metadata refreshes hourly. Rankings update within hours of new benchmark results or pricing changes, not weeks or months.
We continuously improve our scoring methodology. Major changes:
- Baseline normalization + expanded benchmarks: added random-baseline subtraction for MCQ benchmarks (HuggingFace approach), field-max normalization for HLE, and expanded from 17 to 19 tracked benchmarks (added HLE and BBH).
- Arena Elo integration: added Arena Elo as 30% of benchmark weight. Previously all benchmarks were weighted equally, which allowed models with extreme math scores to rank above more well-rounded models.
- Benchmark-driven scoring: initial release with benchmark-driven scoring (90%), capabilities (5%), and context window (5%). 17 benchmarks tracked. Models without benchmarks capped at 40.
If something is unclear or you want to suggest improvements to our methodology, we want to hear from you.