Language Model by Alibaba
Qwen3-1.7B represents Alibaba's aggressive play in the efficiency-optimized LLM segment, trained on 36 trillion tokens across 119 languages while maintaining a deployment footprint roughly 43% smaller than Qwen2.5-3B. The model achieves 50.99% on MMLU (14.63 points behind Qwen2.5-3B) but surprises with only a 3.05 point gap on EvalPlus coding benchmarks, suggesting selective capability preservation through Strong-to-Weak Distillation. At $0.11 per million input tokens, it undercuts most sub-2B models while introducing a hybrid thinking mode that can switch between chain-of-thought and direct response patterns on a per-request basis.
| Benchmark | Qwen3-1.7B | Qwen2.5-3B |
|---|---|---|
| MMLU | 50.99% | 65.62% |
| GSM8K | 43.97% | 79.08% |
| MATH | 26.1% | 42.64% |
| EvalPlus | 43.23% | 46.28% |
| MultiPL-E | 28.06% | 39.65% |
| MBPP | 46.4% | 54.6% |
| GPQA | 24.24% | 26.26% |
| BBH | 51.7% | 56.3% |
| MMLU-Pro | 29.23% | 34.61% |
| MGSM | 33.11% | 47.53% |
| Artificial Analysis Intelligence Index | 7 | 8 |
Qwen3 series officially released including 1.7B variant
Technical report published with comprehensive benchmarks
Qwen3-1.7B is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
Qwen3-1.7B shows predictable degradation in reasoning-heavy tasks: GSM8K drops 35.11 points (43.97% vs 79.08%), MATH loses 16.54 points (26.1% vs 42.64%), and MMLU falls 14.63 points (50.99% vs 65.62%). However, coding benchmarks show surprising resilience with EvalPlus at 43.23% (only 3.05 points behind) and MBPP at 46.4% (8.2 point gap), indicating the Strong-to-Weak Distillation process successfully prioritized code generation capabilities during compression.
At $0.11 per million input tokens and $0.42 per million output tokens, Qwen3-1.7B charges roughly 3.8x more for output than for input. For a typical production scenario processing 100M tokens daily (70% input, 30% output), that works out to about $20.30 per day, or roughly $609 over a 30-day month, compared to roughly $1,260 for GPT-3.5-Turbo at similar volumes. The 32K context window also means fewer prompt truncations than 8K-limited alternatives, potentially reducing total token usage by 15-20% for document-heavy applications.
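A minimal sketch of that cost arithmetic, assuming a flat 30-day month and the 70/30 split above (the daily volume is an illustrative assumption, not measured workload data):

```python
# Rough monthly-cost estimate for Qwen3-1.7B at the listed prices.
# Volume and input/output split are illustrative assumptions.
INPUT_PRICE = 0.11   # USD per million input tokens
OUTPUT_PRICE = 0.42  # USD per million output tokens

def monthly_cost(tokens_per_day_millions: float,
                 input_share: float = 0.70,
                 days: int = 30) -> float:
    """Estimated monthly spend in USD for a steady daily token volume."""
    daily = (tokens_per_day_millions * input_share * INPUT_PRICE
             + tokens_per_day_millions * (1 - input_share) * OUTPUT_PRICE)
    return daily * days

# 100M tokens/day at 70% input -> $20.30/day, about $609/month
print(f"${monthly_cost(100):,.2f}")
```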
The Qwen3 architecture adds QK-Norm to its attention layers and exposes a hybrid thinking mode: reasoning-heavy queries can be routed through internal chain-of-thought generation (similar to o1-preview), while simple factual queries bypass that overhead. The mode is selected per request through the chat template's enable_thinking flag or the /think and /no_think soft switches rather than inferred from the attention mechanism itself. Performance data shows that skipping the thinking stage reduces latency by 40-60% on straightforward queries, while thinking mode maintains the 26.1% accuracy on MATH problems that require multi-step reasoning.
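A minimal sketch of toggling that mode with Hugging Face transformers, assuming the enable_thinking chat-template flag documented for the Qwen3 series (the generation settings below are illustrative defaults, not tuned values):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")

def ask(prompt: str, thinking: bool) -> str:
    """Generate a reply with chain-of-thought enabled or bypassed."""
    messages = [{"role": "user", "content": prompt}]
    # enable_thinking controls whether the <think>...</think> block is produced
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=thinking,
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    return tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                            skip_special_tokens=True)

# Reasoning-heavy query: keep the thinking stage; simple lookup: skip it.
print(ask("A train covers 120 km in 1.5 hours. What is its average speed?", thinking=True))
print(ask("What is the capital of France?", thinking=False))
```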
The 1.7B parameter count creates clear capability ceilings: GPQA scientific reasoning comes in at 24.24% (essentially chance level for a four-option format), multilingual math (MGSM) hits just 33.11% compared to 47.53% for the 3B variant, and the Artificial Analysis Intelligence Index rates it 7 versus 8 for that larger sibling. The model particularly struggles with tasks requiring extensive world knowledge or complex multi-hop reasoning, making it unsuitable for research-grade applications or high-stakes decision support.
The model's strength lies in code completion and simple programming tasks (46.4% MBPP, 43.23% EvalPlus) combined with basic multilingual support across 119 languages. Optimal deployments include IDE code suggestions, API response generation, multilingual customer support automation, and structured data extraction where the 32K context window provides advantages. The 51.7% BBH score indicates competence at logical puzzles and pattern matching, making it suitable for rule-based workflow automation.
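For the structured data extraction case, a minimal sketch against an OpenAI-compatible endpoint serving the model (the base URL, API key, and invoice fields are placeholders, and production use would need JSON validation and retries):

```python
import json
from openai import OpenAI

# Placeholder endpoint and credentials; point these at whichever
# OpenAI-compatible server hosts Qwen3-1.7B in your environment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

INVOICE_TEXT = """Invoice #4821 issued 2025-03-02 by Acme GmbH.
Total due: EUR 1,240.50, payment terms net 30."""

response = client.chat.completions.create(
    model="Qwen/Qwen3-1.7B",
    temperature=0.0,
    messages=[
        {"role": "system",
         "content": ("Extract the invoice as JSON with keys invoice_id, issuer, "
                     "currency, total, net_terms_days. Reply with JSON only.")},
        {"role": "user", "content": INVOICE_TEXT},
    ],
)

# A well-behaved response is a bare JSON object; guard the parse in real use.
record = json.loads(response.choices[0].message.content)
print(record["invoice_id"], record["total"])
```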