Language Model by Alibaba
Qwen3-Next-80B-A3B represents a radical departure from traditional transformer scaling, achieving 80B-parameter performance while activating only 3B parameters per token through a hybrid architecture that combines Gated DeltaNet linear attention, standard gated attention, and high-sparsity MoE routing. The model delivers 10x the inference throughput of dense transformers on 32K+ contexts, scores 90.9% on MMLU-Redux (matching much larger models), and offers both thinking and non-thinking modes while keeping input costs as low as $0.09 per million tokens. Most critically, Alibaba achieved this at just 10% of the training compute required for their previous Qwen3-32B-Base, potentially redefining the economics of frontier model development.
| Benchmark | Qwen3-Next-80B-A3B | Comparison |
|---|---|---|
| SWE-Bench Verified | 70% | - |
| MMLU-Redux | 90.9% | - |
| MultiPL-E | 87.8% | - |
| IFEval | 87.6% | 90.4% (Gemma 3 27B) |
| WritingBench | 87.3% | - |
| Creative Writing v3 | 85.3% | - |
| Artificial Analysis Intelligence Index | 27 | 15 (median of similar models) |
| Output Speed | 168.7 tokens/s | 84.2 tokens/s |
- Qwen3 series announced
- Qwen3-Next-80B-A3B released
- FP8 quantized version released
Qwen3-Next-80B-A3B is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
On SWE-Bench Verified, Qwen3-Next-80B-A3B achieves 70% accuracy, positioning it among the top coding models despite using only 3B active parameters. While the model trails larger frontier models such as DeepSeek-V3, it significantly outperforms most models in its parameter class on MultiPL-E with 87.8% accuracy. The combination of strong coding performance with $0.09-0.50 per million input tokens makes it particularly compelling for code generation pipelines where Claude's $3-15 pricing creates budget constraints.
The Hybrid Attention mechanism combines Gated DeltaNet (a linear attention variant) with traditional Gated Attention in an interleaved layer pattern, enabling the model to process 256K token contexts with 10x the throughput of dense transformers at 32K+ sequence lengths. This architecture activates only 3B of the total 80B parameters per token through high-sparsity MoE routing, resulting in 168.7 tokens/second output speed versus 84.2 for similarly sized models. For production deployments handling long documents or multi-turn conversations, this translates to dramatically lower GPU memory requirements and faster response times without sacrificing quality.
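The long-context throughput advantage comes from linear attention's constant-size recurrent state: where softmax attention keeps a KV cache that grows with sequence length, a DeltaNet-style layer updates a fixed d_k × d_v matrix per token. A minimal NumPy sketch of plain (ungated) linear attention — not the actual Gated DeltaNet update rule, which adds gating and a delta correction — showing that the recurrent and parallel forms agree:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 64, 8, 8
q = rng.standard_normal((seq_len, d_k))
k = rng.standard_normal((seq_len, d_k))
v = rng.standard_normal((seq_len, d_v))

# Recurrent form: S_t = S_{t-1} + k_t v_t^T, o_t = q_t S_t.
# The state S never grows, regardless of sequence length.
S = np.zeros((d_k, d_v))
out_recurrent = np.empty((seq_len, d_v))
for t in range(seq_len):
    S += np.outer(k[t], v[t])        # constant-size state update
    out_recurrent[t] = q[t] @ S

# Equivalent causal "parallel" form: o_t = q_t @ sum_{i<=t} k_i v_i^T.
scores = np.tril(q @ k.T)            # causal mask, no softmax
out_parallel = scores @ v

assert np.allclose(out_recurrent, out_parallel)
print("recurrent and parallel linear attention match; state shape:", S.shape)
```

At inference time the recurrent form means each token costs O(d_k × d_v) memory and compute regardless of context length, which is why interleaving such layers with a few full-attention layers pays off most at 32K+ sequence lengths.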
Qwen3-Next-80B-A3B implements dual inference modes: a standard mode for rapid responses and a thinking mode that performs internal chain-of-thought reasoning before generating outputs. Based on the Artificial Analysis Intelligence Index score of 27 (versus 15 median for similar models), the thinking mode appears to provide substantial reasoning improvements, though specific latency penalties aren't disclosed. The model seamlessly switches between modes based on query complexity, making it suitable for mixed workloads without manual prompt engineering.
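With an OpenAI-compatible endpoint, mode selection is typically a per-request flag rather than prompt engineering. A hedged sketch, assuming the provider exposes a Qwen-style `enable_thinking` switch via non-standard request fields — the flag name, model id, and field placement are assumptions to check against your provider's documentation:

```python
import json

def build_request(prompt: str, thinking: bool) -> dict:
    """Build a chat-completions payload. `enable_thinking` is an assumed
    Qwen-style provider extension, not a standard OpenAI field."""
    return {
        "model": "qwen3-next-80b-a3b",  # assumed model id; varies by provider
        "messages": [{"role": "user", "content": prompt}],
        # Thinking mode trades latency for internal chain-of-thought.
        "extra_body": {"enable_thinking": thinking},
    }

fast = build_request("Summarize this paragraph.", thinking=False)
deep = build_request("Prove that sqrt(2) is irrational.", thinking=True)
print(json.dumps(deep, indent=2))
```

In practice you would route short factual queries through the fast path and reserve thinking mode for reasoning-heavy requests, since the hidden chain-of-thought tokens add both latency and output cost.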
Pricing ranges from $0.09-0.50 per million input tokens and $1.10-6.00 per million output tokens depending on the provider and thinking mode usage. For a typical RAG application processing 10K input tokens and generating 500 output tokens per query, costs range from $0.0014-0.0080 per request, compared to GPT-4o's $0.0275 or Claude 3.5 Sonnet's $0.0325. That works out to a roughly 3-23x cost advantage depending on the provider tier, making Qwen3-Next particularly attractive for high-volume applications, though the upper pricing tier approaches GPT-4o-mini territory.
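The per-request figures follow directly from the token counts and per-million-token prices quoted above. A quick check of the arithmetic:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost in dollars for one request at per-million-token prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# Typical RAG query: 10K input tokens, 500 output tokens.
low = request_cost(10_000, 500, 0.09, 1.10)   # cheapest provider tier
high = request_cost(10_000, 500, 0.50, 6.00)  # most expensive tier
print(f"Qwen3-Next per request: ${low:.4f} to ${high:.4f}")
print(f"vs GPT-4o $0.0275, Claude 3.5 Sonnet $0.0325")
```

Note the cheap tier's cost is dominated by input tokens while the expensive tier's output price ($6.00/M) contributes more than a third of the total, so thinking mode (which inflates output tokens) erodes the advantage fastest on the upper tiers.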
The model shows a 2.8 percentage point deficit on IFEval (87.6% vs Gemma 3 27B's 90.4%), suggesting slightly weaker instruction-following capabilities for complex multi-step tasks. Creative Writing v3 scores of 85.3% and WritingBench at 87.3% indicate solid but not frontier-level creative text generation. The Artificial Analysis Intelligence Index of 27 places it well below GPT-4o or Claude 3.5 Sonnet, confirming that despite efficiency gains, there remains a quality gap to top-tier models on reasoning-intensive tasks.