Every model on LM Market Cap receives a score from 0 to 100, based primarily on standardized benchmark results (90% weight) drawn from multiple independent benchmark sources. Capabilities and context window act as supplementary signals (10%). No black boxes, no paid placement.
367+
Models scored
59+
Providers tracked
35
Free models
Hourly
Refresh frequency
Each model's final score is driven by benchmark performance (90%), with capabilities and context as tiebreakers (10%). Benchmark scores are normalized to 0-100 and averaged across all available evaluations, then multiplied by their weight. Here is the breakdown:
The primary ranking signal. Average percentile across standardized evaluations including Arena Elo ratings, MMLU, GPQA, HumanEval, SWE-bench, MATH, GSM8K, IFEval, and 10+ additional benchmarks. Aggregated from multiple independent benchmark sources and official evaluations.
Tiebreaker signal measuring feature breadth: vision, function calling, streaming, JSON mode, reasoning, web search, and image output. Only used to differentiate models with similar benchmark scores.
Tiebreaker signal scoring context window size relative to the field. Helps differentiate models with equivalent benchmark performance by favoring those that can process longer inputs.
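To make the primary signal concrete, here is a minimal sketch of how per-benchmark results might be normalized onto a common 0-100 scale and averaged into a single percentile. The function names and the assumption that raw scores arrive either as 0-1 fractions or 0-100 percentages are illustrative, not the production logic.

```ts
// Illustrative sketch: raw benchmark results may arrive as 0-1 fractions
// (e.g. HumanEval pass@1) or 0-100 percentages (e.g. MMLU accuracy).
type BenchmarkResult = { name: string; raw: number };

// Hypothetical normalizer: map every result onto a 0-100 scale.
function normalize(result: BenchmarkResult): number {
  const value = result.raw <= 1 ? result.raw * 100 : result.raw;
  return Math.min(100, Math.max(0, value));
}

// BenchmarkPercentile: the plain average of all normalized scores.
function benchmarkPercentile(results: BenchmarkResult[]): number | null {
  if (results.length === 0) return null; // no benchmark data at all
  const total = results.reduce((sum, r) => sum + normalize(r), 0);
  return total / results.length;
}
```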
Model data is aggregated from multiple independent API sources covering 59+ providers including OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, and more. Benchmark scores come from standardized evaluation suites and academic leaderboards.
A cron job fetches updated model data every hour. New models, pricing changes, and capability updates are reflected within the next cycle.
Coverage spans coding models, image generation, video generation, and multimodal models. 161 are open source and 35 are free to use.
Each model is scored on its canonical capabilities and best available pricing. Duplicate listings across providers are deduplicated to the most competitive offering.
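As an illustration of the deduplication rule above, the sketch below keeps a single listing per model, preferring the cheapest per-token price. The field names (`modelId`, `pricePerMTok`) and the price-only tie-break are assumptions for the example.

```ts
// Hypothetical listing shape: one entry per provider offering of a model.
interface Listing {
  modelId: string;
  provider: string;
  pricePerMTok: number; // blended price per million tokens (assumed unit)
}

// Keep only the most competitive offering for each model.
function dedupeListings(listings: Listing[]): Listing[] {
  const best = new Map<string, Listing>();
  for (const l of listings) {
    const current = best.get(l.modelId);
    if (!current || l.pricePerMTok < current.pricePerMTok) {
      best.set(l.modelId, l);
    }
  }
  return Array.from(best.values());
}
```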
Rankings are primarily driven by results from established AI benchmarks. Each benchmark tests a different dimension of model intelligence, from general knowledge to specialized coding and reasoning tasks.
Massive Multitask Language Understanding - tests knowledge across 57 subjects from STEM to humanities.
Code generation benchmark measuring functional correctness of synthesized programs from docstrings.
Real-world software engineering tasks from GitHub issues, testing end-to-end coding ability.
Graduate-level questions in physics, biology, and chemistry requiring expert-level reasoning.
Grade school math word problems testing multi-step mathematical reasoning.
Competition-level mathematics problems requiring advanced problem-solving.
Benchmark scores are normalized to a 0-100 scale and averaged across all available evaluations. Models with more benchmark coverage receive more stable scores. Sparse coverage (1-2 metrics) incurs a penalty to prevent inflated averages. Explore all benchmarks on the benchmarks page.
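The exact penalty is not spelled out here, so the sketch below only illustrates the idea: the average is damped when just one or two benchmarks are available. The 0.8 and 0.9 factors are placeholders, not the real coefficients.

```ts
// Illustrative sparse-coverage penalty: models with only 1-2 benchmark
// results have their average damped so a single strong score cannot
// inflate the ranking. The multipliers below are placeholder values.
function penalizedAverage(normalizedScores: number[]): number | null {
  if (normalizedScores.length === 0) return null;
  const mean =
    normalizedScores.reduce((sum, s) => sum + s, 0) / normalizedScores.length;
  if (normalizedScores.length === 1) return mean * 0.8;
  if (normalizedScores.length === 2) return mean * 0.9;
  return mean;
}
```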
Each model produces six SignalScore objects that represent different facets of quality and value. These signals feed into the composite score and are individually visible on every model page.
Composite of benchmark results (MMLU, HumanEval, SWE-bench, GPQA, GSM8K) weighted by task relevance. Measures raw intelligence and problem-solving ability.
How much capability you get per dollar. Combines pricing tier with performance to identify models that deliver the best value at each price point.
Feature breadth score from vision, function calling, streaming, JSON mode, reasoning, web search, and image output support.
Normalized context window size relative to the maximum in the category. Rewards models that can process more information in a single request.
Time-decayed score based on release date. Recently launched models score higher, reflecting the rapid pace of AI advancement.
Overall value proposition combining all signals. Identifies models that strike the best balance across performance, price, features, and recency.
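A rough sketch of what the six signals could look like as a typed structure. The exact SignalScore shape used on model pages may differ, and the recency decay shown is only one example of a "time-decayed" score, not the site's actual curve.

```ts
// Hypothetical shape for the six per-model signals, each on a 0-100 scale.
interface SignalScore {
  performance: number;    // benchmark composite
  costEfficiency: number; // capability per dollar
  capabilities: number;   // feature breadth
  contextWindow: number;  // normalized context size
  recency: number;        // time-decayed freshness
  valueScore: number;     // overall balance of the above
}

// Example recency decay (placeholder half-life of 180 days).
function recencyScore(releasedAt: Date, now = new Date()): number {
  const days = (now.getTime() - releasedAt.getTime()) / 86_400_000;
  return 100 * Math.pow(0.5, Math.max(0, days) / 180);
}
```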
CompositeScore = BenchmarkPercentile × 0.90 + Capabilities × 0.05 + ContextWindow × 0.05
BenchmarkPercentile is the average normalized score (0-100) across all available standardized evaluations for each model. Arena Elo ratings are normalized from the 900-1500 range. Models without benchmark data are scored on capabilities and context only, capped at 40 to ensure they rank below empirically evaluated models.
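Putting the pieces together, here is a minimal sketch of the composite calculation, including the Arena Elo normalization from the 900-1500 range and the 40-point cap for models without benchmark data. Variable names are illustrative, and the 50/50 split between capabilities and context in the fallback is an assumption.

```ts
// Arena Elo ratings are mapped from the 900-1500 range onto 0-100.
function normalizeElo(elo: number): number {
  const scaled = ((elo - 900) / (1500 - 900)) * 100;
  return Math.min(100, Math.max(0, scaled));
}

// Composite score: benchmarks dominate (90%), capabilities and context
// window act as 5% tiebreakers each. Models with no benchmark data fall
// back to capabilities + context only, capped at 40.
function compositeScore(
  benchmarkPercentile: number | null, // 0-100, or null if no benchmarks
  capabilities: number,               // 0-100 feature-breadth score
  contextWindow: number               // 0-100 normalized context score
): number {
  if (benchmarkPercentile === null) {
    // 50/50 weighting in the fallback is assumed, not documented.
    const fallback = capabilities * 0.5 + contextWindow * 0.5;
    return Math.min(40, fallback);
  }
  return (
    benchmarkPercentile * 0.9 + capabilities * 0.05 + contextWindow * 0.05
  );
}
```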
Scores are recomputed every hour using live data. When new models launch or prices change, the updates appear in the next refresh cycle.
Standardized benchmarks (Arena Elo, MMLU, GPQA, HumanEval, SWE-bench, and others) are the most rigorous and reproducible way to measure model quality. By making benchmarks the dominant signal, our rankings align with peer-reviewed evaluation methodology rather than subjective or hype-driven factors. Capabilities and context window only serve as secondary tiebreakers when benchmark scores are close.
Yes. Pricing is not part of the ranking formula; only benchmark performance matters. A free model that performs well on Arena Elo, MMLU, HumanEval, and other evaluations will outrank an expensive model with weaker benchmark results. We display pricing separately so users can weigh cost against their specific use case.
Each model is scored on its canonical capabilities and the best pricing available across providers. We aggregate availability from multiple endpoints, so our data reflects each model's most competitive offering.