Every model on LM Market Cap receives a score from 0 to 100, driven primarily by standardized benchmark results (90% weight) from multiple independent benchmark sources. Capabilities and context window serve as tiebreakers (10%). No black boxes, no pay-to-rank.
367+ models scored · 59+ providers tracked · 35 free models · hourly refresh cadence
Each model's final score is driven by benchmark performance (90%), with capabilities and context as tiebreakers (10%). Benchmark scores are normalized to 0-100 and averaged across all available evaluations, then multiplied by their weight. Here is the breakdown:
The primary ranking signal. Average percentile across standardized evaluations including Arena Elo ratings, MMLU, GPQA, HumanEval, SWE-bench, MATH, GSM8K, IFEval, and 10+ additional benchmarks. Aggregated from multiple independent benchmark sources and official evaluations.
Tiebreaker signal measuring feature breadth: vision, function calling, streaming, JSON mode, reasoning, web search, and image output. Only used to differentiate models with similar benchmark scores.
Tiebreaker signal scoring context window size relative to the field. Helps differentiate models with equivalent benchmark performance by favoring those that can process longer inputs.
Model data is aggregated from multiple independent API sources covering 59+ providers including OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, and more. Benchmark scores come from standardized evaluation suites and academic leaderboards.
A cron job fetches updated model data every hour. New models, pricing changes, and capability updates are reflected within the next cycle.
Coverage spans coding, image generation, video generation, and multimodal models. 161 are open source and 35 are free to use.
Each model is scored on its canonical capabilities and best available pricing. Duplicate listings across providers are deduplicated to the most competitive offering.
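Conceptually, that cross-provider deduplication can be sketched as below. The field names and the single price-per-million-tokens figure are illustrative assumptions, not the site's actual schema.

```typescript
// Keep one listing per canonical model, preferring the lowest price.
interface Listing {
  modelId: string;          // canonical model identifier, e.g. "gpt-4o"
  provider: string;         // hosting provider offering this listing
  pricePerMTokens: number;  // illustrative blended price per million tokens, USD
}

function dedupeListings(listings: Listing[]): Map<string, Listing> {
  const best = new Map<string, Listing>();
  for (const listing of listings) {
    const current = best.get(listing.modelId);
    if (!current || listing.pricePerMTokens < current.pricePerMTokens) {
      best.set(listing.modelId, listing);
    }
  }
  return best;
}
```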
Rankings are primarily driven by results from established AI benchmarks. Each benchmark tests a different dimension of model intelligence, from general knowledge to specialized coding and reasoning tasks.
MMLU (Massive Multitask Language Understanding) - tests knowledge across 57 subjects from STEM to humanities.
HumanEval - code generation benchmark measuring functional correctness of synthesized programs from docstrings.
SWE-bench - real-world software engineering tasks from GitHub issues, testing end-to-end coding ability.
GPQA - graduate-level questions in physics, biology, and chemistry requiring expert-level reasoning.
GSM8K - grade school math word problems testing multi-step mathematical reasoning.
MATH - competition-level mathematics problems requiring advanced problem-solving.
Benchmark scores are normalized to a 0-100 scale and averaged across all available evaluations. Models with more benchmark coverage receive more stable scores. Sparse coverage (1-2 metrics) incurs a penalty to prevent inflated averages. Explore all benchmarks on the benchmarks page.
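As a rough sketch of that aggregation step: the exact sparse-coverage penalty is not published, so the 0.85 factor below is purely illustrative.

```typescript
interface BenchmarkResult {
  name: string;   // e.g. "MMLU", "HumanEval", "Arena Elo"
  score: number;  // already normalized to 0-100
}

function benchmarkPercentile(results: BenchmarkResult[]): number {
  if (results.length === 0) return 0;
  const mean = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  // Hypothetical sparse-coverage penalty: with only 1-2 results, scale the
  // average down so a single strong score cannot inflate the ranking.
  const penalty = results.length < 3 ? 0.85 : 1.0;
  return mean * penalty;
}
```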
Each model produces six SignalScore objects that represent different facets of quality and value. These signals feed into the composite score and are individually visible on every model page.
Composite of benchmark results (MMLU, HumanEval, SWE-bench, GPQA, GSM8K) weighted by task relevance. Measures raw intelligence and problem-solving ability.
How much capability you get per dollar. Combines pricing tier with performance to identify models that deliver the best value at each price point.
Feature breadth score from vision, function calling, streaming, JSON mode, reasoning, web search, and image output support.
Normalized context window size relative to the maximum in the category. Rewards models that can process more information in a single request.
Time-decayed score based on release date. Recently launched models score higher, reflecting the rapid pace of AI advancement.
Overall value proposition combining all signals. Identifies models that strike the best balance across performance, price, features, and recency.
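A minimal sketch of what these signals might look like as data, assuming illustrative field names and an assumed 180-day half-life for the recency decay (the actual constants are not published):

```typescript
type SignalKind =
  | "benchmark"        // composite of benchmark results
  | "costEfficiency"   // capability per dollar
  | "capabilities"     // feature breadth
  | "contextWindow"    // normalized context size
  | "recency"          // time-decayed release-date score
  | "value";           // overall balance of the other signals

interface SignalScore {
  kind: SignalKind;
  score: number;       // normalized 0-100
}

// Illustrative recency decay: exponential half-life from the release date.
function recencyScore(releaseDate: Date, now: Date = new Date()): number {
  const ageDays = Math.max(0, (now.getTime() - releaseDate.getTime()) / 86_400_000);
  const halfLifeDays = 180; // assumed, not a published constant
  return 100 * Math.pow(0.5, ageDays / halfLifeDays);
}
```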
CompositeScore = (BenchmarkPercentile × 0.90) + (Capabilities × 0.05) + (ContextWindow × 0.05)
BenchmarkPercentile is the average normalized score (0-100) across all available standardized evaluations for each model. Arena Elo ratings are normalized from the 900-1500 range. Models without benchmark data are scored on capabilities and context only, capped at 40 to ensure they rank below empirically evaluated models.
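Putting the formula, the Elo normalization, and the no-benchmark cap together, a sketch might look like this. How the two tiebreaker signals are combined in the fallback branch is an assumption; only the cap at 40 is stated above.

```typescript
// Map Arena Elo ratings from the 900-1500 range onto 0-100 before they are
// averaged with the other normalized benchmark scores.
function normalizeElo(elo: number): number {
  const clamped = Math.min(Math.max(elo, 900), 1500);
  return ((clamped - 900) / 600) * 100;
}

function compositeScore(
  benchmarkPercentile: number | null, // average normalized benchmark score, or null if no data
  capabilities: number,               // tiebreaker signal, 0-100
  contextWindow: number               // tiebreaker signal, 0-100
): number {
  if (benchmarkPercentile === null) {
    // No benchmark data: score on capabilities and context only, capped at 40.
    // Averaging the two tiebreakers here is an illustrative assumption.
    return Math.min((capabilities + contextWindow) / 2, 40);
  }
  return benchmarkPercentile * 0.9 + capabilities * 0.05 + contextWindow * 0.05;
}
```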
Scores are recalculated every hour from a multi-source data pipeline that pulls live pricing, capabilities, and benchmark results from upstream provider APIs and benchmark sources. When a new model is released or pricing changes, the update is reflected within the next refresh cycle.
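An illustrative version of that hourly refresh, assuming a Node.js worker with node-cron rather than the site's actual scheduler:

```typescript
import cron from "node-cron";

// Placeholder for the actual pipeline: pull live pricing, capabilities, and
// benchmark results from upstream APIs, then recompute and persist scores.
async function refreshModelData(): Promise<void> {
  // 1. Fetch provider catalogs and benchmark leaderboards.
  // 2. Recompute benchmark percentiles, signals, and composite scores.
  // 3. Write the new rankings so the next page load reflects them.
}

// Minute 0 of every hour.
cron.schedule("0 * * * *", refreshModelData);
```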
Standardized benchmarks (Arena Elo, MMLU, GPQA, HumanEval, SWE-bench, etc.) are the most rigorous and reproducible measure of model quality. By making benchmarks the dominant signal, our rankings align with peer-reviewed evaluation methodology rather than subjective or hype-driven factors. Capabilities and context window serve as tiebreakers only when benchmark scores are close.
Yes. Pricing is not part of the ranking formula, so a free model that scores well on Arena Elo, MMLU, HumanEval, and other evaluations will outrank expensive models with lower benchmark results. We display pricing separately so users can factor in cost for their specific use case.
Each model is scored based on its canonical capabilities and the best available pricing across providers. We aggregate availability from multiple endpoints, so our data reflects the most competitive offering for each model.