Analyzes the score-per-context-token ratio across 300 AI models to find those that make the best use of their context window, output capacity, and cost.
Key efficiency metrics across all analyzed models.
Avg Overall Efficiency
6.8%
normalized across all models
Top 50 models ranked by score per million context tokens.
Efficiency breakdown across context window tiers.
Are bigger context windows correlated with higher scores?
Diminishing returns detected: Larger context windows do not always correlate with higher average scores.
| Tier | Avg Context | Avg Score | Avg Efficiency (score/MTok) |
|---|---|---|---|
| Small | 14K | 56 | 5293.9 |
| Medium | 57K | 49 | 1035.3 |
| Large | 203K | 58 | 318.8 |
| Mega | 1.1M | 65 | 60.9 |
Top 20 models by output efficiency (score per 1K output tokens). Models with 16K+ output tokens are highlighted.
| Model | Provider | Score | Max Output | Output Eff. |
|---|---|---|---|---|
| Gemma 2 27B | Google | 77 | 2K | 37.8 |
| MiniMax M2-her | MiniMax | 69 | 2K | 33.7 |
| UI-TARS 7B | ByteDance | 40 | 2K | 19.5 |
| GPT-4o (2024-05-13) | OpenAI | 71 | 4K | 17.4 |
| GPT-4 Turbo | OpenAI | 67 | 4K | 16.3 |
| GPT-4 (older v0314) | OpenAI | 65 | 4K | 15.8 |
| GPT-4 | OpenAI | 65 | 4K | 15.8 |
| GPT-4 Turbo Preview | OpenAI | 60 | 4K | 14.6 |
| GPT-4 Turbo (older v1106) | OpenAI | 60 | 4K | 14.6 |
| Claude 3 Haiku | Anthropic | 50 | 4K | 12.3 |
| Command R+ (08-2024) | Cohere | 49 | 4K | 12.2 |
| Command R (08-2024) | Cohere | 49 | 4K | 12.2 |
| Jamba Large 1.7 | AI21 Labs | 40 | 4K | 9.8 |
| DeepSeek V3.1 | DeepSeek | 69 | 7K | 9.6 |
| MiniMax M2.5 (free) | MiniMax | 78 | 8K | 9.5 |
| Gemini 2.0 Flash | Google | 72 | 8K | 8.8 |
| Nova Lite 1.0 | Amazon | 40 | 5K | 7.8 |
| Nova Micro 1.0 | Amazon | 40 | 5K | 7.8 |
| Qwen3 8B | Alibaba | 61 | 8K | 7.4 |
| Gemini 2.0 Flash Lite | Google | 59 | 8K | 7.2 |
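Output efficiency as used in the table above is simply the benchmark score divided by the maximum output capacity in thousands of tokens. A minimal sketch (the function name and rounding are illustrative, not from the source; table values may use exact rather than rounded token limits):

```python
def output_efficiency(score: float, max_output_tokens: int) -> float:
    """Score earned per 1K tokens of maximum output capacity."""
    return round(score / (max_output_tokens / 1000), 1)

# e.g. a model scoring 60 with a 4,000-token output limit:
print(output_efficiency(60, 4000))  # 15.0
```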
Auto-generated observations from the efficiency data.
Context Sweet Spot
Small models have the highest average efficiency at 5293.9 score/MToken across 8 models.
Output Matters
Models with 16K+ output tokens score 15% higher on average than models with smaller output limits.
Compact High Performers
No models achieve top-20 scores with under 128K context.
Efficiency is measured as the score-per-context-token ratio: how much ranking score a model achieves relative to its context window size. Models that score highly with smaller context windows are considered more efficient than those that require massive context to achieve similar results.
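This ratio can be expressed as score per million context tokens, the unit used in the rankings above. A small sketch (names are illustrative; note that tier averages in the table are averages of per-model ratios, not ratios of tier averages):

```python
def context_efficiency(score: float, context_tokens: int) -> float:
    """Ranking score per million tokens of context window."""
    return score / (context_tokens / 1_000_000)

# A model scoring 65 with a 1M-token window yields 65.0 score/MTok,
# while one scoring 56 with a 14K window yields 4000.0 score/MTok.
print(context_efficiency(65, 1_000_000))  # 65.0
```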
Cost efficiency combines quality (composite score) with pricing. The most cost-efficient models achieve high benchmark scores while maintaining low per-token API costs. Free and budget-tier models that perform well are the most cost-efficient options.
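One plausible way to express this combination is score per dollar per million tokens. The exact formula is not given in the source, so the definition below is an assumption:

```python
def cost_efficiency(score: float, price_per_mtok_usd: float) -> float:
    """Assumed metric: benchmark score per dollar per million tokens.
    The source does not specify the formula; free models are treated
    as maximally cost-efficient to avoid division by zero."""
    if price_per_mtok_usd == 0:
        return float("inf")
    return score / price_per_mtok_usd

print(cost_efficiency(70, 0.5))  # 140.0
```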
Not necessarily. Our efficiency analysis shows diminishing returns beyond certain context sizes. Models with 128K tokens often score similarly to those with 1M+ tokens, meaning the extra context capacity adds cost without proportional quality gains for most use cases.