Claude Opus 4.7 achieves 87.6% on SWE-bench Verified (6.8pp above Opus 4.6) and introduces a breakthrough 'xhigh' effort level between high and max settings that enables hours-long autonomous workflows without the exponential cost scaling of max mode. The model's 3x improvement on Rakuten-SWE-Bench and 21% error reduction on document reasoning tasks position it as the first production-ready model for multi-hour agentic coding sessions at $5/$25 per million tokens. Most notably, Opus 4.7's multi-agent coordination allows parallel AI workstreams with automatic task budgeting - a feature no other commercial model offers - while its 98.5% score on XBOW Visual-Acuity (a 44pp jump from 4.6) enables real-time UI automation that actually works.
| Benchmark | Claude Opus 4.7 | Comparison |
|---|---|---|
| SWE-bench Verified | 87.6% | 80.8% |
| SWE-bench Pro | 64.3% | 57.7% |
| GPQA Diamond | 94.2% | 94.4% |
| MCP-Atlas (Scaled Tool Use) | 77.3% | 75.8% |
| MMMLU (Multilingual Q&A) | 91.5% | 92.6% |
| BrowseComp (Agentic Search) | 79.3% | 89.3% |
| Rakuten-SWE-Bench | 3x improvement | baseline |
| CursorBench (Autonomous Coding) | 70% | 58% |
| XBOW Visual-Acuity Benchmark | 98.5% | 54.5% |
| OfficeQA Pro (Document Reasoning) | 21% fewer errors | baseline |
| Hex 93-task Coding Benchmark | 13% improvement | baseline |
At $5 input/$25 output per million tokens, Opus 4.7 undercuts GPT-5.4 Pro's $12.50/$50 pricing by 2.5x on input and 2x on output while achieving 64.3% on SWE-bench Pro versus GPT-5.4's 57.7%. For a typical 8-hour coding session averaging 2M input/500K output tokens, you'd pay $22.50 on Opus 4.7 versus $50 on GPT-5.4 Pro. The new xhigh effort level provides 85% of max mode's performance at 30% of the compute cost, making extended sessions financially viable.
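To make that arithmetic concrete, here is a minimal cost sketch using the list prices quoted above. The `session_cost` helper and the model keys are illustrative stand-ins, not part of any official SDK:

```python
# Back-of-envelope session cost comparison at the per-million-token list
# prices quoted in this article. Nothing here is an official API; the
# price table and helper are purely illustrative.

PRICES = {  # model: (input $ per M tokens, output $ per M tokens)
    "claude-opus-4-7": (5.00, 25.00),
    "gpt-5.4-pro": (12.50, 50.00),
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one session at list prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# The article's 8-hour session: 2M input tokens, 500K output tokens.
print(session_cost("claude-opus-4-7", 2_000_000, 500_000))  # 22.5
print(session_cost("gpt-5.4-pro", 2_000_000, 500_000))      # 50.0
```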
Opus 4.7 processes images at up to 2,576-pixel resolution (3x higher than 4.6's 859-pixel limit), achieving 98.5% on the XBOW Visual-Acuity benchmark. The model uses a new visual encoder that maintains aspect ratios without center-cropping, which is crucial for UI automation where button positions matter. This explains why CursorBench autonomous coding scores jumped from 58% to 70%: the model can now accurately read IDE interfaces and terminal outputs at native resolutions.
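If you preprocess screenshots client-side, a sketch like the following keeps aspect ratios intact. It assumes the 2,576-pixel figure caps the image's longest edge, which the article does not spell out:

```python
# Aspect-preserving downscale for UI screenshots, assuming the 2,576 px
# figure above is a cap on the longest edge (an assumption; the article
# does not specify). No center-cropping, so relative button positions
# survive the resize.

from PIL import Image

MAX_EDGE = 2576  # stated resolution ceiling for Opus 4.7

def fit_to_limit(path: str) -> Image.Image:
    img = Image.open(path)
    scale = MAX_EDGE / max(img.width, img.height)
    if scale >= 1.0:
        return img  # already within the limit; keep native resolution
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)
```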
The new tokenizer generates 1.0-1.35x as many tokens as Opus 4.6 for identical inputs, meaning your costs could increase by up to 35% without any code changes. Non-English text sees the highest expansion (1.35x for Japanese, 1.28x for Arabic), while English averages 1.08x. Anthropic provides a migration endpoint that accepts 4.6-style tokens until July 2026, giving you time to update token budgets and adopt the new tiktoken-compatible library.
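A simple way to prepare is to inflate existing token budgets by the expansion factors above before switching. The per-language table below reuses the article's numbers; the 1.35x fallback for unlisted languages is a conservative assumption of mine:

```python
# Inflate Opus 4.6-era token budgets for the 4.7 tokenizer using the
# expansion factors reported above. The worst-case fallback of 1.35x
# for unlisted languages is a conservative assumption, not a published
# figure.

import math

EXPANSION = {"en": 1.08, "ja": 1.35, "ar": 1.28}
WORST_CASE = 1.35  # top of the 1.0-1.35x range quoted above

def migrate_budget(old_budget: int, lang: str | None = None) -> int:
    """Scale a 4.6 token budget so 4.7 requests don't overrun max_tokens."""
    factor = EXPANSION.get(lang, WORST_CASE)
    return math.ceil(old_budget * factor)

print(migrate_budget(4096, "en"))  # 4424
print(migrate_budget(4096, "ja"))  # 5530
print(migrate_budget(4096))        # 5530 (unknown language -> worst case)
```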
Multi-agent coordination supports up to 8 parallel workstreams but shows diminishing returns above 4 agents due to coordination overhead - benchmark scores drop 12% at 6 agents and 23% at 8 agents. Each agent maintains separate context (consuming your 1M token window proportionally), and inter-agent communication adds 150-300ms latency per exchange. For optimal performance, limit to 3-4 agents working on clearly separable tasks with minimal interdependencies.
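One way to apply those numbers when sizing a run: cap parallelism at the task count and stop adding agents once the score penalty exceeds your tolerance. This planner is a sketch built on the figures above; the penalty values for 5 and 7 agents are linear interpolations, not published numbers:

```python
# Agent-count planner built on the degradation figures quoted above
# (no reported drop through 4 agents, -12% at 6, -23% at 8). Values
# for 5 and 7 agents are linear interpolations, i.e. assumptions.

CONTEXT_WINDOW = 1_000_000  # shared 1M-token window, split across agents
SCORE_PENALTY = {1: 0.00, 2: 0.00, 3: 0.00, 4: 0.00,
                 5: 0.06, 6: 0.12, 7: 0.17, 8: 0.23}

def plan_agents(separable_tasks: int, max_penalty: float = 0.0) -> int:
    """Largest agent count whose coordination penalty stays tolerable."""
    for n in range(min(separable_tasks, 8), 0, -1):
        if SCORE_PENALTY[n] <= max_penalty:
            return n
    return 1

n = plan_agents(separable_tasks=6)
print(n, CONTEXT_WINDOW // n)  # 4 agents with ~250K tokens of context each
```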
Opus 4.7 scores 79.3% on BrowseComp (Agentic Search) versus GPT-5.4 Pro's 89.3% - a 10pp deficit despite leading on most other benchmarks. The model prioritizes deterministic tool use over exploratory search patterns, making it excel at structured tasks like SWE-bench (87.6%) but struggle with open-ended web navigation. Anthropic's documentation confirms this is intentional: Opus 4.7 is optimized for reliability in production environments rather than creative exploration.