Anthropic's restricted, Capybara-tier Claude Mythos versus OpenAI's publicly available GPT-5.4. Benchmark data from Anthropic's 244-page system card shows Mythos leading on every evaluated task, but GPT-5.4 remains the only one of the two you can actually use today.
Mythos benchmarks come from Anthropic's own system card evaluations, not independent testing. GPT-5.4 is publicly available; Mythos is restricted to Glasswing partners. These may not be apples-to-apples comparisons due to different evaluation configurations.
| Benchmark | Mythos | GPT-5.4 | Winner |
|---|---|---|---|
| SWE-bench Verified | 93.9% | 80% | Mythos +13.9 |
| SWE-bench Pro | 77.8% | 57.7% | Mythos +20.1 |
| Terminal-Bench 2.0 | 82% | 75.1% | Mythos +6.9 |
| Terminal-Bench 2.1 (4hr) | 92.1% | 75.3% | Mythos +16.8 |
| GPQA Diamond | 94.6% | 92.8% | Mythos +1.8 |
| MMMLU | 92.7% | -- | -- |
| Humanity's Last Exam (no tools) | 56.8% | 39.8% | Mythos +17.0 |
| Humanity's Last Exam (with tools) | 64.7% | 52.1% | Mythos +12.6 |
| USAMO 2026 | 97.6% | 95.2% | Mythos +2.4 |
| OSWorld | 79.6% | 75% | Mythos +4.6 |
| GraphWalks BFS (256K-1M) | 80% | 21.4% | Mythos +58.6 |
Mythos scores from Anthropic system card. GPT-5.4 scores from Anthropic's comparative evaluations (not OpenAI's own reporting). Different evaluation harnesses may produce different results.
Mythos posts its largest margins in four areas:
- Long-context retrieval at 256K-1M tokens (GraphWalks BFS)
- Real-world software engineering tasks (SWE-bench Verified and Pro)
- Extended agentic coding with a 4-hour timeout (Terminal-Bench 2.1)
- Expert-level knowledge with no tools (GPQA Diamond, Humanity's Last Exam)
| | Mythos | GPT-5.4 |
|---|---|---|
| Input $/1M tokens | $25.00 * | ~$2.50 |
| Output $/1M tokens | $125.00 * | ~$10.00 |
| Public API | Restricted | Available |
| Provider | Anthropic | OpenAI |
| Cost ratio | Mythos is ~10x (input) to ~12.5x (output) more expensive than GPT-5.4 | |
* Glasswing partner pricing. Public API pricing may differ.
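To put the pricing gap in concrete terms, here is a minimal cost-estimation sketch in Python. The per-million-token prices come from the table above; the workload size is a hypothetical assumption for illustration, not a figure from either provider.

```python
# Cost comparison sketch using the prices quoted in the table above.
# Mythos prices are Glasswing partner pricing; GPT-5.4 prices are the
# approximate public API rates cited in this article. The workload
# numbers further down are made-up assumptions for illustration only.

PRICES_PER_MTOK = {
    "mythos": {"input": 25.00, "output": 125.00},
    "gpt-5.4": {"input": 2.50, "output": 10.00},
}

def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one workload at the quoted per-million-token rates."""
    p = PRICES_PER_MTOK[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

if __name__ == "__main__":
    # Hypothetical agentic-coding month: 40M input tokens, 8M output tokens.
    in_tok, out_tok = 40_000_000, 8_000_000
    for model in PRICES_PER_MTOK:
        print(f"{model}: ${workload_cost(model, in_tok, out_tok):,.2f}")
    # Ratios implied by the quoted prices: 10x on input, 12.5x on output.
    print("input ratio:", 25.00 / 2.50, "output ratio:", 125.00 / 10.00)
```

On this assumed workload the gap works out to roughly 11x overall, since real bills blend input and output tokens somewhere between the 10x and 12.5x extremes.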
On paper, Claude Mythos dominates GPT-5.4 across every benchmark Anthropic tested. The long-context gap (80% vs 21.4% on GraphWalks BFS) and coding gap (77.8% vs 57.7% on SWE-bench Pro) are particularly striking. However, Mythos costs ~10x more, isn't publicly available, and these benchmarks come from Anthropic's own evaluations. For users who need a frontier model today, GPT-5.4 remains one of the best options alongside Claude Opus 4.6.
Based on available benchmarks from Anthropic's system card, Mythos leads GPT-5.4 on every evaluated benchmark where both have scores, including SWE-bench Pro (77.8% vs 57.7%), USAMO 2026 (97.6% vs 95.2%), and GraphWalks BFS (80.0% vs 21.4%). However, these are Anthropic's own evaluations and have not been independently verified. GPT-5.4 may perform differently on benchmarks not covered by the system card.
No. GPT-5.4 is publicly available through OpenAI's API, while Claude Mythos is restricted to Project Glasswing cybersecurity partners. For production use, GPT-5.4 and Claude Opus 4.6 are the top publicly available options from their respective providers.
GPT-5.4 is available at approximately $2.50/$10 per million input/output tokens through OpenAI's API. Mythos is priced at $25/$125 for Glasswing partners, roughly 10x more for input and 12.5x more for output. Mythos is designed as a premium capability tier, not a price-competitive offering.