Every model on LM Market Cap receives a composite score from 0 to 100, computed from six weighted dimensions. The scoring is designed to be transparent, reproducible, and useful for real-world model selection. No black boxes, no paid rankings.
325+
Models scored
52+
Providers tracked
25
Free models
Hourly
Refresh rate
Each model's final score is a weighted sum of six normalized dimensions. Every dimension is scored 0–100 independently, then multiplied by its weight to produce the composite. Here is the breakdown:
Capabilities (weight 0.25). Measures the breadth of a model's feature set: vision, function calling, streaming, JSON mode, reasoning, web search, and image output. Models with more capabilities score higher; a counting sketch follows this list.
Pricing (weight 0.25). Evaluates cost efficiency based on input and output token pricing. Free models score highest; expensive models are penalized. Reflects real API pricing from OpenRouter.
Context (weight 0.15). Scores the model's context window size relative to the field. Larger context windows enable processing of longer documents, entire codebases, and complex multi-turn conversations.
Recency (weight 0.15). Rewards recently released models, which benefit from the latest research and training techniques. The signal decays over time, reflecting the fast pace of AI development.
Output (weight 0.10). Measures maximum output token length. Models that can generate longer responses score higher, which matters for code generation, long-form content, and detailed analysis tasks.
Versatility (weight 0.10). Assesses multimodal flexibility by counting supported input and output modalities. Models that handle text, images, audio, and video in both directions score highest.
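A minimal sketch of the capabilities dimension, assuming a hypothetical ModelFeatures shape; the field names are illustrative, not the real schema:

```typescript
// Hypothetical feature flags; names are assumptions for illustration.
interface ModelFeatures {
  vision: boolean;
  functionCalling: boolean;
  streaming: boolean;
  jsonMode: boolean;
  reasoning: boolean;
  webSearch: boolean;
  imageOutput: boolean;
}

// Capabilities dimension: count supported features and scale to 0-100.
function capabilitiesScore(f: ModelFeatures): number {
  const flags = Object.values(f);
  const supported = flags.filter(Boolean).length;
  return (supported / flags.length) * 100;
}
```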
All model data is sourced from the OpenRouter API, which aggregates models from 52+ providers including OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, and more.
A cron job fetches updated model data every hour. New models, pricing changes, and capability updates are reflected within the next cycle.
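A sketch of what that refresh step might look like. The URL below is OpenRouter's public model-list endpoint, and the response typing reflects its documented shape, but treat both as assumptions to verify rather than guarantees:

```typescript
// Hourly refresh sketch: pull the current model list from OpenRouter.
async function refreshModels(): Promise<void> {
  const res = await fetch("https://openrouter.ai/api/v1/models");
  if (!res.ok) throw new Error(`OpenRouter fetch failed: ${res.status}`);

  // Assumed response shape; check the API docs before relying on it.
  const { data } = (await res.json()) as {
    data: Array<{
      id: string;
      context_length: number;
      pricing: { prompt: string; completion: string };
    }>;
  };

  // Rescore the whole population so min-max normalization stays current.
  console.log(`Fetched ${data.length} models`);
}
```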
Coverage spans coding, image generation, video generation, and multimodal models. Of the 325+ models tracked, 150 are open source and 25 are free to use.
Each model is scored on its canonical capabilities and best available pricing. Duplicate listings across providers are deduplicated to the most competitive offering.
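A deduplication sketch under assumed types: the Listing shape, its fields, and the tie-breaking rule (lowest combined per-token price wins) are all illustrative:

```typescript
// Hypothetical listing shape; one entry per provider offering of a model.
interface Listing {
  slug: string;            // canonical model identifier
  provider: string;
  promptPrice: number;     // USD per input token
  completionPrice: number; // USD per output token
}

// Keep the most competitive (cheapest combined price) listing per model.
function dedupe(listings: Listing[]): Listing[] {
  const best = new Map<string, Listing>();
  for (const l of listings) {
    const cost = l.promptPrice + l.completionPrice;
    const cur = best.get(l.slug);
    if (!cur || cost < cur.promptPrice + cur.completionPrice) {
      best.set(l.slug, l);
    }
  }
  return [...best.values()];
}
```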
Performance scores incorporate results from established AI benchmarks. Each benchmark tests a different dimension of model intelligence, from general knowledge to specialized coding and reasoning tasks.
MMLU (Massive Multitask Language Understanding) - tests knowledge across 57 subjects from STEM to the humanities.
HumanEval - code generation benchmark measuring the functional correctness of programs synthesized from docstrings.
SWE-bench - real-world software engineering tasks drawn from GitHub issues, testing end-to-end coding ability.
GPQA - graduate-level questions in physics, biology, and chemistry requiring expert-level reasoning.
GSM8K - grade school math word problems testing multi-step mathematical reasoning.
MATH - competition-level mathematics problems requiring advanced problem-solving.
Benchmark scores are normalized to a 0–100 scale and aggregated with task-specific weights. Explore all benchmarks on the benchmarks page.
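A sketch of that aggregation. The weights below are illustrative assumptions, not the site's actual task-specific values:

```typescript
// Illustrative weights only; the real values are not published here.
const benchmarkWeights: Record<string, number> = {
  mmlu: 0.25,
  humaneval: 0.2,
  "swe-bench": 0.2,
  gpqa: 0.15,
  gsm8k: 0.1,
  math: 0.1,
};

// Scores arrive already normalized to 0-100. Missing benchmarks are
// skipped and the weights of the present ones are renormalized.
function performanceScore(scores: Record<string, number>): number {
  let total = 0;
  let weightSum = 0;
  for (const [bench, weight] of Object.entries(benchmarkWeights)) {
    const s = scores[bench];
    if (s === undefined) continue;
    total += s * weight;
    weightSum += weight;
  }
  return weightSum > 0 ? total / weightSum : 0;
}
```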
Each model produces six SignalScore objects that represent different facets of quality and value. These signals feed into the composite score and are individually visible on every model page.
Composite of benchmark results (MMLU, HumanEval, SWE-bench, GPQA, GSM8K) weighted by task relevance. Measures raw intelligence and problem-solving ability.
How much capability you get per dollar. Combines pricing tier with performance to identify models that deliver the best value at each price point.
Feature breadth score from vision, function calling, streaming, JSON mode, reasoning, web search, and image output support.
Normalized context window size relative to the maximum in the category. Rewards models that can process more information in a single request.
Time-decayed score based on release date. Recently launched models score higher, reflecting the rapid pace of AI advancement; a decay sketch follows this list.
Overall value proposition combining all signals. Identifies models that strike the best balance across performance, price, features, and recency.
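A sketch of a time-decayed recency signal. Exponential decay with a 180-day half-life is an assumption for illustration; the text states only that the score decays over time:

```typescript
// Recency signal sketch: newest models score near 100, decaying with age.
function recencyScore(releaseDate: Date, now = new Date()): number {
  const msPerDay = 86_400_000;
  const ageDays = Math.max(0, (now.getTime() - releaseDate.getTime()) / msPerDay);
  const halfLifeDays = 180; // assumed half-life, not a documented value
  return 100 * Math.pow(0.5, ageDays / halfLifeDays);
}
```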
CompositeScore = Capabilities × 0.25 + Pricing × 0.25 + Context × 0.15 + Recency × 0.15 + Output × 0.10 + Versatility × 0.10
Each dimension is independently normalized to 0–100 before weighting. The final composite is also clamped to 0–100. All normalization uses min-max scaling against the current model population, meaning scores are relative to the field—not absolute.
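Putting it together: a sketch of min-max normalization against the current population and the weighted composite from the formula above. The helper names are illustrative; the weights and the 0-100 clamp come straight from this section:

```typescript
// Min-max scale a raw value against the current model population.
function minMax(value: number, values: number[]): number {
  const lo = Math.min(...values);
  const hi = Math.max(...values);
  if (hi === lo) return 50; // degenerate field: assumed convention for a tie
  return ((value - lo) / (hi - lo)) * 100;
}

// Weights from the composite formula above.
const weights = {
  capabilities: 0.25,
  pricing: 0.25,
  context: 0.15,
  recency: 0.15,
  output: 0.1,
  versatility: 0.1,
} as const;

type Dimensions = Record<keyof typeof weights, number>; // each already 0-100

// Weighted sum, clamped to 0-100.
function compositeScore(d: Dimensions): number {
  const raw = (Object.keys(weights) as (keyof typeof weights)[]).reduce(
    (sum, k) => sum + d[k] * weights[k],
    0,
  );
  return Math.min(100, Math.max(0, raw));
}
```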
Scores are recomputed hourly using live data from the OpenRouter API. When new models launch or prices change, the update is reflected in the next refresh cycle.
Capabilities and pricing are the two most influential factors in real-world model selection: a model needs strong capabilities to be useful and competitive pricing to be practical. Equal weights ensure neither dominates, so a cheap model with weak capabilities does not outrank a fully featured, reasonably priced one.
Yes. Free models receive the top pricing-tier score (25% of the composite). If they also have strong capabilities, a decent context window, and a recent release date, they can and do outrank more expensive models. This is by design: affordability matters.
Each model is scored on its canonical capabilities and the best price across providers. The OpenRouter API aggregates availability from multiple endpoints, so our data reflects each model's most competitive offering.