Analysis of the score-to-context-token ratio for 297 AI models, identifying which models make the most of their context window, output capacity, and cost.
Key efficiency metrics for all analyzed models.
Average overall efficiency: 7.3% (normalized across all models)
The top 50 models ranked by score per million context tokens.
Efficiency analysis across context-window tiers: do larger context windows correlate with higher scores?
| Tier | Avg. Context | Avg. Score | Avg. Efficiency (score/MToken) |
|---|---|---|---|
| Small | 10K | 45 | 5302.8 |
| Medium | 51K | 56 | 1327.7 |
| Large | 191K | 71 | 413.7 |
| Mega | 1.1M | 81 | 74.4 |
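The tier averages above can be reproduced by bucketing models by context size and averaging their per-model efficiency (score per million context tokens). A minimal sketch; the tier boundaries and the sample data are assumptions, since the page does not publish its exact cut-offs:

```python
from collections import defaultdict

# Assumed tier boundaries (the page does not state its cut-offs).
TIERS = [
    ("Small", 0, 32_000),
    ("Medium", 32_000, 128_000),
    ("Large", 128_000, 512_000),
    ("Mega", 512_000, float("inf")),
]

def tier_of(context_tokens: int) -> str:
    for name, lo, hi in TIERS:
        if lo <= context_tokens < hi:
            return name
    raise ValueError("no tier matched")

def efficiency(score: float, context_tokens: int) -> float:
    # Score per million context tokens.
    return score / context_tokens * 1_000_000

# Hypothetical sample data, not the page's actual dataset.
models = [
    {"name": "model-a", "score": 45, "context": 10_000},
    {"name": "model-b", "score": 60, "context": 128_000},
    {"name": "model-c", "score": 81, "context": 1_100_000},
]

by_tier: dict[str, list[float]] = defaultdict(list)
for m in models:
    by_tier[tier_of(m["context"])].append(efficiency(m["score"], m["context"]))

for tier, effs in by_tier.items():
    print(tier, round(sum(effs) / len(effs), 1))
```

Note the table's tier efficiency is the mean of per-model ratios, not the ratio of the tier's mean score to its mean context (45 / 10K × 1M would give 4500, not 5302.8), which is why the two don't coincide.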
The top 20 models ranked by output efficiency (score per 1K output tokens). Models with 16K+ output tokens are highlighted.
| Model | Provider | Score | Max Output | Output Efficiency |
|---|---|---|---|---|
| Inflection 3 Productivity | Inflection | 37 | 1K | 35.7 |
| Inflection 3 Pi | Inflection | 37 | 1K | 35.7 |
| UI-TARS 7B | ByteDance | 63 | 2K | 30.5 |
| Gemma 2 27B | Google | 60 | 2K | 29.1 |
| MiniMax M2-her | MiniMax | 59 | 2K | 29.0 |
| Gemma 3n 2B (free) | Google | 58 | 2K | 28.3 |
| Gemma 3n 4B (free) | Google | 55 | 2K | 27.0 |
| Jamba Large 1.7 | AI21 Labs | 71 | 4K | 17.3 |
| GPT-4 Turbo | OpenAI | 60 | 4K | 14.7 |
| GPT-4o (2024-05-13) | OpenAI | 53 | 4K | 12.8 |
| Command R+ (08-2024) | Cohere | 48 | 4K | 11.9 |
| Command R (08-2024) | Cohere | 48 | 4K | 11.9 |
| Llemma 7B | EleutherAI | 47 | 4K | 11.5 |
| Nova Lite 1.0 | Amazon | 58 | 5K | 11.3 |
| Nova Pro 1.0 | Amazon | 58 | 5K | 11.3 |
| Command R7B (12-2024) | Cohere | 45 | 4K | 11.2 |
| Sonar Pro Search | Perplexity | 85 | 8K | 10.6 |
| Claude 3 Haiku | Anthropic | 43 | 4K | 10.5 |
| GPT-4 Turbo Preview | OpenAI | 43 | 4K | 10.4 |
| GPT-4 Turbo (older v1106) | OpenAI | 43 | 4K | 10.4 |
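Output efficiency in the table above is the score divided by the maximum output length in thousands of tokens. A minimal sketch; note that the table's "Max Output" column is rounded (e.g. "4K" likely stands for an exact limit such as 4096), so recomputing from the rounded values will not exactly reproduce every table entry:

```python
def output_efficiency(score: float, max_output_tokens: int) -> float:
    """Score per 1K output tokens."""
    return score / (max_output_tokens / 1000)

# Sonar Pro Search from the table: score 85, 8K max output -> 10.625,
# which matches the table's 10.6 after rounding.
print(output_efficiency(85, 8000))
```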
Observations generated automatically from the efficiency data.
Context sweet spot
Small models have the highest average efficiency at 5302.8 score/MToken across 20 models.
Output matters
Models with 16K+ output tokens score 35% higher on average than models with smaller output limits.
Compact high performers
No model achieves a top-20 score with under 128K of context.
Efficiency is measured as the score-per-context-token ratio: how much ranking score a model achieves relative to its context-window size. Models that score highly with smaller context windows are considered more efficient than those that need massive context to reach similar results.
Cost efficiency combines quality (composite score) with pricing. The most cost-efficient models achieve high benchmark scores while maintaining low per-token API costs. Free and budget-tier models that perform well are the most cost-efficient options.
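The page does not publish its exact cost-efficiency formula; one plausible definition is benchmark score per dollar per million tokens. A hypothetical sketch under that assumption:

```python
def cost_efficiency(score: float, price_per_mtok_usd: float) -> float:
    # Hypothetical definition: score per USD per million tokens.
    # The page only says it "combines quality with pricing".
    if price_per_mtok_usd == 0:
        # Free-tier models dominate under this definition, consistent
        # with the text calling them the most cost-efficient options.
        return float("inf")
    return score / price_per_mtok_usd

# Illustrative (made-up) pricing: score 70 at $0.50 per MToken.
print(cost_efficiency(70, 0.5))
```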
Not necessarily. Our efficiency analysis shows diminishing returns beyond certain context sizes. Models with 128K tokens often score similarly to those with 1M+ tokens, meaning the extra context capacity adds cost without proportional quality gains for most use cases.