Every model on LM Market Cap receives a score from 0 to 100, driven primarily by standardized benchmark results (90% weight) from multiple independent benchmark sources. Capabilities and context window serve as tiebreakers (10%). No black boxes, no pay-to-rank.
367+ models scored · 59+ providers tracked · 35 free models · hourly refresh cadence
Each model's final score is driven by benchmark performance (90%), with capabilities and context as tiebreakers (10%). Benchmark scores are normalized to 0-100 and averaged across all available evaluations, then multiplied by their weight. Here is the breakdown:
The primary ranking signal. Average percentile across standardized evaluations including Arena Elo ratings, MMLU, GPQA, HumanEval, SWE-bench, MATH, GSM8K, IFEval, and 10+ additional benchmarks. Aggregated from multiple independent benchmark sources and official evaluations.
Tiebreaker signal measuring feature breadth: vision, function calling, streaming, JSON mode, reasoning, web search, and image output. Only used to differentiate models with similar benchmark scores.
Tiebreaker signal scoring context window size relative to the field. Helps differentiate models with equivalent benchmark performance by favoring those that can process longer inputs.
Model data is aggregated from multiple independent API sources covering 59+ providers including OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, and more. Benchmark scores come from standardized evaluation suites and academic leaderboards.
A cron job fetches updated model data every hour. New models, pricing changes, and capability updates are reflected within the next cycle.
Coverage spans coding, image generation, video generation, and multimodal models. 161 are open source and 35 are free to use.
Each model is scored on its canonical capabilities and best available pricing. Duplicate listings across providers are deduplicated to the most competitive offering.
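Conceptually, that cross-provider deduplication can be sketched as below. The field names and the single price-per-million-tokens figure are illustrative assumptions, not the site's actual schema.

```typescript
// Keep one listing per canonical model, preferring the lowest price.
interface Listing {
  modelId: string;          // canonical model identifier, e.g. "gpt-4o"
  provider: string;         // hosting provider offering this listing
  pricePerMTokens: number;  // illustrative blended price per million tokens, USD
}

function dedupeListings(listings: Listing[]): Map<string, Listing> {
  const best = new Map<string, Listing>();
  for (const listing of listings) {
    const current = best.get(listing.modelId);
    if (!current || listing.pricePerMTokens < current.pricePerMTokens) {
      best.set(listing.modelId, listing);
    }
  }
  return best;
}
```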
Rankings are primarily driven by results from established AI benchmarks. Each benchmark tests a different dimension of model intelligence, from general knowledge to specialized coding and reasoning tasks.
MMLU (Massive Multitask Language Understanding) - tests knowledge across 57 subjects from STEM to humanities.
HumanEval - code generation benchmark measuring functional correctness of synthesized programs from docstrings.
SWE-bench - real-world software engineering tasks from GitHub issues, testing end-to-end coding ability.
GPQA - graduate-level questions in physics, biology, and chemistry requiring expert-level reasoning.
GSM8K - grade school math word problems testing multi-step mathematical reasoning.
MATH - competition-level mathematics problems requiring advanced problem-solving.
Benchmark scores are normalized to a 0-100 scale and averaged across all available evaluations. Models with more benchmark coverage receive more stable scores. Sparse coverage (1-2 metrics) incurs a penalty to prevent inflated averages. Explore all benchmarks on the benchmarks page.
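As a rough sketch of that aggregation step: the exact sparse-coverage penalty is not published, so the 0.85 factor below is purely illustrative.

```typescript
interface BenchmarkResult {
  name: string;   // e.g. "MMLU", "HumanEval", "Arena Elo"
  score: number;  // already normalized to 0-100
}

function benchmarkPercentile(results: BenchmarkResult[]): number {
  if (results.length === 0) return 0;
  const mean = results.reduce((sum, r) => sum + r.score, 0) / results.length;
  // Hypothetical sparse-coverage penalty: with only 1-2 results, scale the
  // average down so a single strong score cannot inflate the ranking.
  const penalty = results.length < 3 ? 0.85 : 1.0;
  return mean * penalty;
}
```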
Each model produces six SignalScore objects that represent different facets of quality and value. These signals feed into the composite score and are individually visible on every model page.
Composite of benchmark results (MMLU, HumanEval, SWE-bench, GPQA, GSM8K) weighted by task relevance. Measures raw intelligence and problem-solving ability.
How much capability you get per dollar. Combines pricing tier with performance to identify models that deliver the best value at each price point.
Feature breadth score from vision, function calling, streaming, JSON mode, reasoning, web search, and image output support.
Normalized context window size relative to the maximum in the category. Rewards models that can process more information in a single request.
Time-decayed score based on release date. Recently launched models score higher, reflecting the rapid pace of AI advancement.
Overall value proposition combining all signals. Identifies models that strike the best balance across performance, price, features, and recency.
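A minimal sketch of what these signals might look like as data, assuming illustrative field names and an assumed 180-day half-life for the recency decay (the actual constants are not published):

```typescript
type SignalKind =
  | "benchmark"        // composite of benchmark results
  | "costEfficiency"   // capability per dollar
  | "capabilities"     // feature breadth
  | "contextWindow"    // normalized context size
  | "recency"          // time-decayed release-date score
  | "value";           // overall balance of the other signals

interface SignalScore {
  kind: SignalKind;
  score: number;       // normalized 0-100
}

// Illustrative recency decay: exponential half-life from the release date.
function recencyScore(releaseDate: Date, now: Date = new Date()): number {
  const ageDays = Math.max(0, (now.getTime() - releaseDate.getTime()) / 86_400_000);
  const halfLifeDays = 180; // assumed, not a published constant
  return 100 * Math.pow(0.5, ageDays / halfLifeDays);
}
```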
CompositeScore = (BenchmarkPercentile × 0.90) + (Capabilities × 0.05) + (ContextWindow × 0.05)
BenchmarkPercentile is the average normalized score (0-100) across all available standardized evaluations for each model. Arena Elo ratings are normalized from the 900-1500 range. Models without benchmark data are scored on capabilities and context only, capped at 40 to ensure they rank below empirically evaluated models.
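Putting the formula, the Elo normalization, and the no-benchmark cap together, a sketch might look like this. How the two tiebreaker signals are combined in the fallback branch is an assumption; only the cap at 40 is stated above.

```typescript
// Map Arena Elo ratings from the 900-1500 range onto 0-100 before they are
// averaged with the other normalized benchmark scores.
function normalizeElo(elo: number): number {
  const clamped = Math.min(Math.max(elo, 900), 1500);
  return ((clamped - 900) / 600) * 100;
}

function compositeScore(
  benchmarkPercentile: number | null, // average normalized benchmark score, or null if no data
  capabilities: number,               // tiebreaker signal, 0-100
  contextWindow: number               // tiebreaker signal, 0-100
): number {
  if (benchmarkPercentile === null) {
    // No benchmark data: score on capabilities and context only, capped at 40.
    // Averaging the two tiebreakers here is an illustrative assumption.
    return Math.min((capabilities + contextWindow) / 2, 40);
  }
  return benchmarkPercentile * 0.9 + capabilities * 0.05 + contextWindow * 0.05;
}
```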
Scores are recalculated every hour from a multi-source data pipeline that pulls live pricing, capabilities, and benchmark results from upstream provider APIs and benchmark sources. When a new model is released or pricing changes, the update is reflected within the next refresh cycle.
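An illustrative version of that hourly refresh, assuming a Node.js worker with node-cron rather than the site's actual scheduler:

```typescript
import cron from "node-cron";

// Placeholder for the actual pipeline: pull live pricing, capabilities, and
// benchmark results from upstream APIs, then recompute and persist scores.
async function refreshModelData(): Promise<void> {
  // 1. Fetch provider catalogs and benchmark leaderboards.
  // 2. Recompute benchmark percentiles, signals, and composite scores.
  // 3. Write the new rankings so the next page load reflects them.
}

// Minute 0 of every hour.
cron.schedule("0 * * * *", refreshModelData);
```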
Standardized benchmarks (Arena Elo, MMLU, GPQA, HumanEval, SWE-bench, etc.) are the most rigorous and reproducible measure of model quality. By making benchmarks the dominant signal, our rankings align with peer-reviewed evaluation methodology rather than subjective or hype-driven factors. Capabilities and context window serve as tiebreakers only when benchmark scores are close.
Yes. Pricing is not part of the ranking formula, so a free model that scores well on Arena Elo, MMLU, HumanEval, and other evaluations will outrank expensive models with lower benchmark results. We display pricing separately so users can factor in cost for their specific use case.
Each model is scored based on its canonical capabilities and the best available pricing across providers. We aggregate availability from multiple endpoints, so our data reflects the most competitive offering for each model.