Analyzes score-per-context-token ratio across 299 AI models to find those that make the best use of their context window, output capacity, and cost.
Key efficiency metrics across all analyzed models.
Avg Overall Efficiency: 6.0% (normalized across all models)
Top 50 models ranked by score per million context tokens.
Efficiency breakdown across context window tiers.
Are bigger context windows correlated with higher scores?
Diminishing returns detected: Larger context windows do not always correlate with higher average scores.
| Tier | Avg Context (tokens) | Avg Score | Avg Efficiency (score/MToken) |
|---|---|---|---|
| Small | 10K | 48 | 6679.4 |
| Medium | 48K | 45 | 1067.6 |
| Large | 208K | 57 | 307.7 |
| Mega | 1.2M | 64 | 60.9 |
Top 20 models by output efficiency (score per 1K output tokens). Models with 16K+ output tokens are highlighted.
| Model | Provider | Score | Max Output | Output Eff. |
|---|---|---|---|---|
| Gemma 2 27B | Google | 77 | 2K | 37.6 |
| MiniMax M2-her | MiniMax | 69 | 2K | 33.6 |
| UI-TARS 7B | ByteDance | 40 | 2K | 19.5 |
| GPT-4o (2024-05-13) | OpenAI | 71 | 4K | 17.3 |
| GPT-4 Turbo | OpenAI | 66 | 4K | 16.2 |
| GPT-4 | OpenAI | 65 | 4K | 15.7 |
| GPT-4 Turbo Preview | OpenAI | 59 | 4K | 14.5 |
| Claude 3 Haiku | Anthropic | 51 | 4K | 12.4 |
| Command R (08-2024) | Cohere | 48 | 4K | 12.1 |
| Command R+ (08-2024) | Cohere | 48 | 4K | 12.1 |
| Gemma 4 31B (free) | Google | 80 | 8K | 9.8 |
| Jamba Large 1.7 | AI21 Labs | 40 | 4K | 9.8 |
| GPT-3.5 Turbo 16k | OpenAI | 40 | 4K | 9.8 |
| GPT-3.5 Turbo | OpenAI | 40 | 4K | 9.8 |
| ALLaM 2 7B Instruct | HUMAIN | 40 | 4K | 9.8 |
| ALLaM 34B | HUMAIN | 40 | 4K | 9.8 |
| GPT-3.5 Turbo (older v0613) | OpenAI | 39 | 4K | 9.6 |
| ALLaM 7B Instruct (preview) | HUMAIN | 38 | 4K | 9.4 |
| Nova Lite 1.0 | Amazon | 40 | 5K | 7.8 |
| Nova Micro 1.0 | Amazon | 40 | 5K | 7.8 |
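The output efficiency column can be approximated with a short calculation. The sketch below assumes that "2K" and "4K" in the table mean 2,048 and 4,096 tokens; that reading is inferred from the published figures and is not stated by the source. The function name `output_efficiency` is hypothetical.

```python
def output_efficiency(score: float, max_output_tokens: int) -> float:
    """Score per 1K output tokens (assumed formula: score / (tokens / 1000))."""
    if max_output_tokens <= 0:
        raise ValueError("max_output_tokens must be positive")
    return score * 1000 / max_output_tokens


# Two rows from the table, with "2K"/"4K" assumed to be 2048/4096 tokens.
gemma_2_27b = round(output_efficiency(77, 2048), 1)  # table lists 37.6
gpt_4o_0513 = round(output_efficiency(71, 4096), 1)  # table lists 17.3
print(gemma_2_27b, gpt_4o_0513)
```

Not every row reproduces exactly this way, most likely because the displayed scores are rounded before publication.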
Auto-generated observations from the efficiency data.
Context Sweet Spot
Models in the Small context tier have the highest average efficiency, at 6679.4 score/MToken across 10 models.
Output Matters
Models with 16K+ output tokens score 27% higher on average than models with smaller output limits.
Compact High Performers
No models achieve top-20 scores with under 128K context.
Efficiency is measured as the score-per-context-token ratio: how much ranking score a model achieves relative to its context window size. Models that score highly with smaller context windows are considered more efficient than those that need a massive context window to achieve similar results.
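As a minimal sketch of this ratio, the function below computes score per million context tokens (score/MToken). The function name and the two sample models are hypothetical and not taken from the dataset. The example also illustrates one possible reason a tier's average efficiency can differ from its average score divided by its average context: if the tier figure is a mean of per-model ratios (an assumption, since the source does not say), the two numbers need not agree.

```python
from statistics import mean


def context_efficiency(score: float, context_tokens: int) -> float:
    """Score per million context tokens (score/MToken)."""
    if context_tokens <= 0:
        raise ValueError("context_tokens must be positive")
    return score * 1_000_000 / context_tokens


# Hypothetical models, not from the dataset: (score, context window in tokens).
models = [(60, 8_000), (36, 12_000)]

# Average of each model's own ratio.
mean_of_ratios = mean(context_efficiency(s, c) for s, c in models)

# Ratio computed from the tier's average score and average context.
ratio_of_means = context_efficiency(
    mean(s for s, _ in models), mean(c for _, c in models)
)

print(mean_of_ratios, ratio_of_means)
```

Here the two aggregates come out at 5250 and 4800 score/MToken for the same pair of models, so the choice of aggregation matters when comparing tiers.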
Cost efficiency combines quality (composite score) with pricing. The most cost-efficient models achieve high benchmark scores while maintaining low per-token API costs. Free and budget-tier models that perform well are the most cost-efficient options.
Bigger context windows are not necessarily better. Our efficiency analysis shows diminishing returns beyond certain context sizes. Models with 128K-token context windows often score similarly to those with 1M+ tokens, meaning the extra context capacity adds cost without proportional quality gains for most use cases.