How we rank AI models. Every step is explained in plain language - no math required. We believe rankings should be transparent, so nothing here is hidden.
Every model on LM Market Cap receives a score from 0 to 100 called the LMC Score. This score is primarily driven by standardized benchmark results (90% weight) from multiple independent sources including HuggingFace Open LLM Leaderboard, LMSYS Chatbot Arena, and curated official evaluations.
Rankings are grounded in empirical benchmark data, not sentiment or hype. Models are evaluated on standardized tests that measure real capabilities across coding, reasoning, math, knowledge, and instruction-following.
How it works, in brief (a short sketch of the arithmetic follows this list):
- Benchmarks (90% of the score): 19 standardized benchmarks plus Arena Elo, normalized and averaged. This is the primary ranking signal - empirical performance on coding, math, reasoning, and knowledge tasks. When Arena Elo is unavailable, task benchmarks carry the full weight of this component; when task benchmarks are unavailable, Arena Elo does.
- Capabilities: supported features (vision, function calling, streaming, JSON mode, reasoning, web search, image output), normalized against the most capable model in the category.
- Context window: log-scaled context window size, normalized against the largest in the category. A 1M-token model scores higher than a 128K model, but the difference is compressed by the log scale.
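For readers who want to see the arithmetic, here is a minimal sketch of how the components above could combine into an LMC Score. Only the 90% benchmark weight is stated explicitly; the 5%/5% split between capabilities and context window is an assumption carried over from the initial-release weights listed in the changelog at the bottom of this page, and the function name is illustrative.

```python
# Illustrative only: composing an LMC Score from pre-normalized component scores.
# The 5%/5% split for capabilities and context window is an assumption taken from
# the initial-release weights; only the 90% benchmark weight is stated explicitly.
def lmc_score(benchmark_score: float, capability_score: float, context_score: float) -> float:
    """All inputs are on a 0-100 scale; the result is also 0-100."""
    return 0.90 * benchmark_score + 0.05 * capability_score + 0.05 * context_score

print(round(lmc_score(benchmark_score=80.0, capability_score=60.0, context_score=40.0), 1))  # 77.0
```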
Each benchmark score passes through this pipeline before contributing to the final LMC Score:
Not all benchmarks start from zero. Multiple-choice benchmarks give models a "free" baseline score from random guessing - for example, a 4-choice MCQ gives 25% for random answers. We subtract this baseline before normalizing, following the same approach used by HuggingFace's Open LLM Leaderboard.
For multiple-choice benchmarks, we subtract the random-guessing baseline and rescale:
Example: MMLU score of 85% becomes (85-25)/(100-25) × 100 = 80.0
Generative benchmarks (coding, math, instruction following) have no random-guessing advantage. Scores are used directly, capped at 100:
Example: HumanEval score of 92.1% stays 92.1
Some benchmarks have top scores far below 100% (like HLE where the best models score 4-39%). We normalize against the observed field maximum:
Example: HLE score of 36% with field max 45% becomes 80.0
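To make the pipeline concrete, here is a minimal sketch of the three normalization rules, using the baselines from the benchmark table below. Function names and the clamping details are illustrative, not taken from our codebase.

```python
# Illustrative sketch of the three normalization rules described above.
def normalize_mcq(score: float, baseline: float) -> float:
    """Multiple-choice: subtract the random-guessing baseline, rescale to 0-100."""
    return max(0.0, (score - baseline) / (100.0 - baseline) * 100.0)

def normalize_generative(score: float) -> float:
    """Generative: no guessing advantage, so the score is used directly, capped at 100."""
    return min(score, 100.0)

def normalize_field_max(score: float, field_max: float) -> float:
    """Frontier benchmarks: scale against the best score observed across the field."""
    return min(score / field_max * 100.0, 100.0)

# The worked examples from above:
print(normalize_mcq(85.0, baseline=25.0))        # MMLU: 80.0
print(normalize_generative(92.1))                # HumanEval: 92.1
print(normalize_field_max(36.0, field_max=45.0)) # HLE: 80.0
```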
Each model is evaluated against all available benchmarks. Not every model has scores for all 19 - we average across whatever is available. Models with fewer than 3 data points receive a small penalty to avoid inflated scores from cherry-picked benchmarks.
| Benchmark | Type | Baseline | Description |
|---|---|---|---|
| MMLU | MCQ | 25% | Massive Multitask Language Understanding - 57 subjects |
| MMLU-Pro | MCQ | 10% | 10-choice harder variant of MMLU |
| GPQA | MCQ | 25% | Graduate-level science questions |
| GPQA Diamond | MCQ | 25% | Hardest subset of GPQA |
| ARC-Challenge | MCQ | 25% | Science reasoning (grade school) |
| HellaSwag | MCQ | 25% | Commonsense reasoning completion |
| MATH-500 | Generative | - | Competition-level math problems |
| GSM8K | Generative | - | Grade school math word problems |
| AIME 2024 | Generative | - | American Invitational Math Exam |
| BBH | Generative | - | Big Bench Hard - diverse reasoning |
| IFEval | Generative | - | Instruction following evaluation |
| HumanEval | Generative | - | Python code generation |
| BigCodeBench | Generative | - | Complex coding tasks |
| SWE-bench Verified | Generative | - | Real GitHub issue resolution |
| SWE-bench Multilingual | Generative | - | Multi-language SWE tasks |
| Cursor Bench | Generative | - | AI coding assistant evaluation |
| Terminal Bench | Generative | - | Terminal/CLI task completion |
| LiveBench | Generative | - | Contamination-free live evaluation |
| HLE | Field-max | 45% max | Humanity's Last Exam - frontier difficulty (4-39% range) |
The LMSYS Chatbot Arena Elo rating is the single best holistic quality signal available: it reflects real human preferences from millions of blind head-to-head comparisons. For that reason, it receives 30% of the total benchmark weight.
An Elo of 900 maps to 0 (floor), an Elo of 1500 maps to 100 (ceiling). Most competitive models fall between 1100-1400 Elo, mapping to roughly 33-83 on our scale.
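Below is a minimal sketch of the Elo mapping and the 30/70 blend, including the fallback when one signal is missing (as noted in the overview above). Function names and the clamping behavior are illustrative assumptions.

```python
from typing import Optional

def normalize_elo(elo: float, floor: float = 900.0, ceiling: float = 1500.0) -> float:
    """Linearly map Arena Elo onto 0-100, clamped at the floor and ceiling."""
    scaled = (elo - floor) / (ceiling - floor) * 100.0
    return max(0.0, min(scaled, 100.0))

def blended_benchmark_score(task_avg: Optional[float], elo: Optional[float]) -> Optional[float]:
    """Blend task benchmarks (70%) with Arena Elo (30%); if one signal is
    missing, the other carries the full weight."""
    elo_score = normalize_elo(elo) if elo is not None else None
    if task_avg is None:
        return elo_score
    if elo_score is None:
        return task_avg
    return 0.7 * task_avg + 0.3 * elo_score

print(round(normalize_elo(1300), 1))                  # 66.7
print(round(blended_benchmark_score(75.0, 1300), 1))  # 72.5
```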
Many models share a base architecture but come in different sizes or configurations (e.g., GPT-5, GPT-5 Pro, GPT-5 Mini). When a variant does not have its own benchmark data, it inherits the base model's scores with an appropriate discount:
| Variant suffixes | Treatment of inherited scores |
|---|---|
| Pro, Plus, Max, Preview, Exp, Latest, Deep Research | Equal to or better than the base model, so no penalty is applied. |
| Turbo | Speed-optimized variants that sacrifice some quality for faster inference; inherited scores are discounted. |
| Mini, Nano, Lite, Flash | Smaller or distilled variants designed for efficiency over raw performance; inherited scores are discounted. |
Once a variant receives its own benchmark evaluations, the inherited scores are replaced with real data and the discount is removed.
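Purely as an illustration, here is a sketch of that inheritance rule. The actual discount factors are not listed on this page, so the numbers below are hypothetical placeholders, not the real values.

```python
# Hypothetical sketch: inheriting a base model's score for an unevaluated variant.
# The discount factors are placeholder assumptions, NOT the real values used by the site.
VARIANT_DISCOUNTS = {
    "pro": 1.00, "plus": 1.00, "max": 1.00, "preview": 1.00,  # no penalty
    "turbo": 0.95,                                            # placeholder discount
    "mini": 0.85, "nano": 0.85, "lite": 0.85, "flash": 0.85,  # placeholder discount
}

def inherited_score(base_score: float, variant_suffix: str) -> float:
    """Apply the variant's discount to the base model's benchmark score."""
    return base_score * VARIANT_DISCOUNTS.get(variant_suffix.lower(), 1.0)

print(round(inherited_score(80.0, "Mini"), 1))  # 68.0 with the placeholder 0.85 discount
```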
Models with only 1-2 benchmark scores can have inflated averages from cherry-picked evaluations. To prevent this, we apply a confidence penalty:
A model with only 1 metric gets a 20% penalty. With 2 metrics, a 10% penalty. With 3 or more metrics, no penalty is applied. This encourages comprehensive evaluation while not completely blocking models with limited data.
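Here is a minimal sketch of the averaging step with the confidence penalty, assuming the penalty is applied as a simple multiplier on the benchmark average; names are illustrative.

```python
from typing import Optional

def benchmark_average(scores: list[float]) -> Optional[float]:
    """Average the available normalized benchmark scores, then apply the
    confidence penalty for models with fewer than 3 data points."""
    if not scores:
        return None  # handled separately by the no-benchmark fallback
    avg = sum(scores) / len(scores)
    if len(scores) == 1:
        return avg * 0.80  # 20% penalty
    if len(scores) == 2:
        return avg * 0.90  # 10% penalty
    return avg             # 3 or more metrics: no penalty

print(round(benchmark_average([92.0]), 1))              # 73.6
print(round(benchmark_average([92.0, 60.0]), 1))        # 68.4
print(round(benchmark_average([92.0, 60.0, 70.0]), 1))  # 74.0
```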
Models with no benchmark scores at all receive a score based on capabilities (35%), context window (25%), max output tokens (15%), and recency (25%) - but the total is capped at 40. This ensures unproven models always rank below empirically evaluated ones.
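A sketch of that fallback, assuming each sub-component has already been normalized to a 0-100 scale before weighting; the function name is illustrative.

```python
def fallback_score(capabilities: float, context: float, max_output: float, recency: float) -> float:
    """Score for models with no benchmark data, capped at 40 so unproven models
    always rank below empirically evaluated ones. Inputs assumed pre-normalized to 0-100."""
    raw = 0.35 * capabilities + 0.25 * context + 0.15 * max_output + 0.25 * recency
    return min(raw, 40.0)

print(fallback_score(capabilities=80, context=70, max_output=50, recency=90))  # 40.0 (capped)
```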
We aggregate data from multiple independent sources, cross-referencing to ensure accuracy:
- HuggingFace Open LLM Leaderboard
- Official model papers and release announcements
- LMSYS Chatbot Arena (Elo ratings)
- LiveBench continuous evaluation
- SWE-bench Verified leaderboard
Model metadata - pricing, capabilities, context windows - comes from the OpenRouter API and is refreshed hourly. 340+ models from 35+ providers are tracked continuously.
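For the curious, here is a minimal sketch of pulling that metadata from OpenRouter's public model list endpoint. The endpoint and field names reflect OpenRouter's public API; the parsing and error handling here are illustrative, not our production pipeline.

```python
import requests

# Illustrative: fetch model metadata (context windows, pricing) from OpenRouter's
# public model list. The production refresh job is more involved than this.
resp = requests.get("https://openrouter.ai/api/v1/models", timeout=30)
resp.raise_for_status()
models = resp.json().get("data", [])

for model in models[:5]:
    print(model.get("id"), model.get("context_length"), model.get("pricing", {}).get("prompt"))
```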
Most AI leaderboards focus on a single dimension - usually one benchmark suite or one arena. LM Market Cap combines them into a unified score:
We average across 19 benchmarks spanning knowledge, coding, math, reasoning, and instruction following. No single benchmark can dominate or distort the ranking.
Random-baseline subtraction and field-max normalization put MCQ benchmarks, generative benchmarks, and frontier-difficulty tests on a comparable 0-100 scale, so no benchmark type is systematically advantaged.
Arena Elo (human preference) and task benchmarks (automated evaluation) are blended 30/70 - combining the reliability of standardized tests with the nuance of human judgment.
Model metadata refreshes hourly. Rankings update within hours of new benchmark results or pricing changes, not weeks or months.
We continuously improve our scoring methodology. Major changes:
- Baseline normalization + expanded benchmarks: added random-baseline subtraction for MCQ benchmarks (HuggingFace approach), field-max normalization for HLE, and expanded from 17 to 19 tracked benchmarks (added HLE and BBH).
- Arena Elo integration: added Arena Elo as 30% of benchmark weight. Previously all benchmarks were weighted equally, which allowed models with extreme math scores to rank above more well-rounded models.
- Benchmark-driven scoring: initial release with benchmark-driven scoring (90%), capabilities (5%), and context window (5%). 17 benchmarks tracked. Models without benchmarks capped at 40.
If something is unclear or you want to suggest improvements to our methodology, we want to hear from you.