Every model on LM Market Cap receives a score from 0 to 100, based primarily on standardized benchmark results (90% weight) drawn from multiple independent benchmark sources. Capabilities and context window act as supplementary signals (10%). No black boxes, no paid placement.
367+
Models scored
59+
Providers tracked
35
Free models
Hourly
Refresh frequency
Each model's final score is driven by benchmark performance (90%), with capabilities and context as tiebreakers (10%). Benchmark scores are normalized to 0-100 and averaged across all available evaluations, then multiplied by their weight. Here is the breakdown:
The primary ranking signal. Average percentile across standardized evaluations including Arena Elo ratings, MMLU, GPQA, HumanEval, SWE-bench, MATH, GSM8K, IFEval, and 10+ additional benchmarks. Aggregated from multiple independent benchmark sources and official evaluations.
Tiebreaker signal measuring feature breadth: vision, function calling, streaming, JSON mode, reasoning, web search, and image output. Only used to differentiate models with similar benchmark scores.
Tiebreaker signal scoring context window size relative to the field. Helps differentiate models with equivalent benchmark performance by favoring those that can process longer inputs.
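To make the primary signal concrete, here is a minimal sketch of how per-benchmark results might be normalized onto a common 0-100 scale and averaged into a single percentile. The function names and the assumption that raw scores arrive either as 0-1 fractions or 0-100 percentages are illustrative, not the production logic.

```ts
// Illustrative sketch: raw benchmark results may arrive as 0-1 fractions
// (e.g. HumanEval pass@1) or 0-100 percentages (e.g. MMLU accuracy).
type BenchmarkResult = { name: string; raw: number };

// Hypothetical normalizer: map every result onto a 0-100 scale.
function normalize(result: BenchmarkResult): number {
  const value = result.raw <= 1 ? result.raw * 100 : result.raw;
  return Math.min(100, Math.max(0, value));
}

// BenchmarkPercentile: the plain average of all normalized scores.
function benchmarkPercentile(results: BenchmarkResult[]): number | null {
  if (results.length === 0) return null; // no benchmark data at all
  const total = results.reduce((sum, r) => sum + normalize(r), 0);
  return total / results.length;
}
```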
Model data is aggregated from multiple independent API sources covering 59+ providers including OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, and more. Benchmark scores come from standardized evaluation suites and academic leaderboards.
A cron job fetches updated model data every hour. New models, pricing changes, and capability updates are reflected within the next cycle.
Coverage spans coding models, image generation, video generation, and multimodal models. 161 are open source and 35 are free to use.
Each model is scored on its canonical capabilities and best available pricing. Duplicate listings across providers are deduplicated to the most competitive offering.
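As an illustration of the deduplication rule above, the sketch below keeps a single listing per model, preferring the cheapest per-token price. The field names (`modelId`, `pricePerMTok`) and the price-only tie-break are assumptions for the example.

```ts
// Hypothetical listing shape: one entry per provider offering of a model.
interface Listing {
  modelId: string;
  provider: string;
  pricePerMTok: number; // blended price per million tokens (assumed unit)
}

// Keep only the most competitive offering for each model.
function dedupeListings(listings: Listing[]): Listing[] {
  const best = new Map<string, Listing>();
  for (const l of listings) {
    const current = best.get(l.modelId);
    if (!current || l.pricePerMTok < current.pricePerMTok) {
      best.set(l.modelId, l);
    }
  }
  return Array.from(best.values());
}
```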
Rankings are primarily driven by results from established AI benchmarks. Each benchmark tests a different dimension of model intelligence, from general knowledge to specialized coding and reasoning tasks.
Massive Multitask Language Understanding - tests knowledge across 57 subjects from STEM to humanities.
Code generation benchmark measuring functional correctness of synthesized programs from docstrings.
Real-world software engineering tasks from GitHub issues, testing end-to-end coding ability.
Graduate-level questions in physics, biology, and chemistry requiring expert-level reasoning.
Grade school math word problems testing multi-step mathematical reasoning.
Competition-level mathematics problems requiring advanced problem-solving.
Benchmark scores are normalized to a 0-100 scale and averaged across all available evaluations. Models with more benchmark coverage receive more stable scores. Sparse coverage (1-2 metrics) incurs a penalty to prevent inflated averages. Explore all benchmarks on the benchmarks page.
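The exact penalty is not spelled out here, so the sketch below only illustrates the idea: the average is damped when just one or two benchmarks are available. The 0.8 and 0.9 factors are placeholders, not the real coefficients.

```ts
// Illustrative sparse-coverage penalty: models with only 1-2 benchmark
// results have their average damped so a single strong score cannot
// inflate the ranking. The multipliers below are placeholder values.
function penalizedAverage(normalizedScores: number[]): number | null {
  if (normalizedScores.length === 0) return null;
  const mean =
    normalizedScores.reduce((sum, s) => sum + s, 0) / normalizedScores.length;
  if (normalizedScores.length === 1) return mean * 0.8;
  if (normalizedScores.length === 2) return mean * 0.9;
  return mean;
}
```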
Each model produces six SignalScore objects that represent different facets of quality and value. These signals feed into the composite score and are individually visible on every model page.
Composite of benchmark results (MMLU, HumanEval, SWE-bench, GPQA, GSM8K) weighted by task relevance. Measures raw intelligence and problem-solving ability.
How much capability you get per dollar. Combines pricing tier with performance to identify models that deliver the best value at each price point.
Feature breadth score from vision, function calling, streaming, JSON mode, reasoning, web search, and image output support.
Normalized context window size relative to the maximum in the category. Rewards models that can process more information in a single request.
Time-decayed score based on release date. Recently launched models score higher, reflecting the rapid pace of AI advancement.
Overall value proposition combining all signals. Identifies models that strike the best balance across performance, price, features, and recency.
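A rough sketch of what the six signals could look like as a typed structure. The exact SignalScore shape used on model pages may differ, and the recency decay shown is only one example of a "time-decayed" score, not the site's actual curve.

```ts
// Hypothetical shape for the six per-model signals, each on a 0-100 scale.
interface SignalScore {
  performance: number;    // benchmark composite
  costEfficiency: number; // capability per dollar
  capabilities: number;   // feature breadth
  contextWindow: number;  // normalized context size
  recency: number;        // time-decayed freshness
  valueScore: number;     // overall balance of the above
}

// Example recency decay (placeholder half-life of 180 days).
function recencyScore(releasedAt: Date, now = new Date()): number {
  const days = (now.getTime() - releasedAt.getTime()) / 86_400_000;
  return 100 * Math.pow(0.5, Math.max(0, days) / 180);
}
```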
CompositeScore = BenchmarkPercentile × 0.90 + Capabilities × 0.05 + ContextWindow × 0.05
BenchmarkPercentile is the average normalized score (0-100) across all available standardized evaluations for each model. Arena Elo ratings are normalized from the 900-1500 range. Models without benchmark data are scored on capabilities and context only, capped at 40 to ensure they rank below empirically evaluated models.
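Putting the pieces together, here is a minimal sketch of the composite calculation, including the Arena Elo normalization from the 900-1500 range and the 40-point cap for models without benchmark data. Variable names are illustrative, and the 50/50 split between capabilities and context in the fallback is an assumption.

```ts
// Arena Elo ratings are mapped from the 900-1500 range onto 0-100.
function normalizeElo(elo: number): number {
  const scaled = ((elo - 900) / (1500 - 900)) * 100;
  return Math.min(100, Math.max(0, scaled));
}

// Composite score: benchmarks dominate (90%), capabilities and context
// window act as 5% tiebreakers each. Models with no benchmark data fall
// back to capabilities + context only, capped at 40.
function compositeScore(
  benchmarkPercentile: number | null, // 0-100, or null if no benchmarks
  capabilities: number,               // 0-100 feature-breadth score
  contextWindow: number               // 0-100 normalized context score
): number {
  if (benchmarkPercentile === null) {
    // 50/50 weighting in the fallback is assumed, not documented.
    const fallback = capabilities * 0.5 + contextWindow * 0.5;
    return Math.min(40, fallback);
  }
  return (
    benchmarkPercentile * 0.9 + capabilities * 0.05 + contextWindow * 0.05
  );
}
```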
Scores are recomputed every hour using live data. When new models launch or prices change, the updates appear in the next refresh cycle.
Standardized benchmarks (Arena Elo, MMLU, GPQA, HumanEval, SWE-bench, and others) are the most rigorous and reproducible way to measure model quality. By making benchmarks the dominant signal, our rankings align with peer-reviewed evaluation methodology rather than subjective or hype-driven factors. Capabilities and context window only serve as secondary tiebreakers when benchmark scores are close.
Yes. Pricing is not part of the ranking formula; only benchmark performance matters. A free model that performs well on Arena Elo, MMLU, HumanEval, and other evaluations will outrank an expensive model with weaker benchmark results. We display pricing separately so users can weigh cost against their specific use case.
Each model is scored on its canonical capabilities and the best pricing available across providers. We aggregate availability from multiple endpoints, so our data reflects each model's most competitive offering.