Language Model by Alibaba
Qwen3-4B achieves 87.79% on GSM8K and 54.1% on MATH, outperforming Qwen2.5-7B despite having 43% fewer parameters, making it the highest-scoring sub-5B model on mathematical reasoning benchmarks. The model introduces a hybrid thinking mode that switches between chain-of-thought reasoning and direct response generation, while supporting 119 languages natively and extending context to 262K tokens through YaRN. At $0.11 per million input tokens, Qwen3-4B costs 78% less than GPT-4o-mini while matching or exceeding larger open models on 7 of 9 core benchmarks.
| Benchmark | Qwen3-4B | Comparison |
|---|---|---|
| MMLU | 72.99% | 74.16% |
| BBH | 72.59% | 70.4% |
| GSM8K | 87.79% | 85.36% |
| MATH | 54.1% | 49.8% |
| HumanEval (EvalPlus) | 63.53% | 62.18% |
| MBPP | 67% | 63.4% |
| MMLU-Pro | 50.58% | 45% |
| GPQA | 36.87% | 36.36% |
| MultiPL-E | 53.13% | 50.73% |
| Arena-Hard | 13.2% | 53.3% |
| MedQA | 80.8% | - |
| Artificial Analysis Intelligence Index | 12 | 8 |
- Initial Qwen3 series announcement and release
- Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507 release
- Qwen3.5-4B variant release with 262K context
Qwen3-4B is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
Qwen3-4B beats Qwen2.5-7B on 6 of 9 benchmarks despite having 3B fewer parameters: GSM8K (87.79% vs 85.36%), MATH (54.1% vs 49.8%), BBH (72.59% vs 70.4%), MBPP (67% vs 63.4%), MMLU-Pro (50.58% vs 45%), and MultiPL-E (53.13% vs 50.73%). The model achieves this through Grouped Query Attention with 32 query heads mapped to 8 KV heads, reducing KV-cache memory by 75% while maintaining attention quality. On complex reasoning tasks specifically, Qwen3-4B shows a 5.58 percentage point advantage on MMLU-Pro over its larger predecessor.
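The head-grouping arithmetic can be sketched in a few lines; this is an illustrative mapping assuming the head counts quoted above (32 query heads, 8 KV heads), not Qwen's actual implementation:

```python
# Illustrative Grouped Query Attention head mapping; head counts from the
# article (32 query heads, 8 KV heads), everything else is a sketch.

NUM_Q_HEADS = 32   # query heads
NUM_KV_HEADS = 8   # key/value heads shared across query groups

def kv_head_for_query(q_head: int, num_q: int = NUM_Q_HEADS,
                      num_kv: int = NUM_KV_HEADS) -> int:
    """Map a query head to the KV head its group shares."""
    group_size = num_q // num_kv          # 4 query heads per KV head
    return q_head // group_size

def kv_cache_reduction(num_q: int = NUM_Q_HEADS,
                       num_kv: int = NUM_KV_HEADS) -> float:
    """KV-cache memory saved relative to full multi-head attention."""
    return 1 - num_kv / num_q             # 1 - 8/32 = 0.75

print(kv_head_for_query(0), kv_head_for_query(31))   # 0 7
print(f"{kv_cache_reduction():.0%} smaller KV cache")  # 75% smaller KV cache
```

Because only the 8 KV heads are cached during generation, the cache shrinks to a quarter of the full multi-head size while all 32 query heads still attend normally.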
For a workload processing 10M input tokens and generating 2.5M output tokens daily, Qwen3-4B costs about $64.50/month ($0.11/M input + $0.42/M output, 30 days) compared to GPT-4o-mini's $300/month ($0.50/M + $2.00/M), a 78% reduction. Against Claude 3.5 Haiku at $1.00/M input and $5.00/M output ($675/month), the monthly savings reach roughly $610. The model maintains this pricing advantage while scoring 12 on the Artificial Analysis Intelligence Index, 50% higher than the 8-point average for similar-size open models.
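The comparison reduces to simple arithmetic over the per-token prices quoted above; a minimal calculator, assuming a 30-day month:

```python
# Back-of-the-envelope monthly cost from daily token volumes (in millions)
# and per-million-token prices; assumes a 30-day month.

def monthly_cost(in_m_per_day: float, out_m_per_day: float,
                 in_price_per_m: float, out_price_per_m: float,
                 days: int = 30) -> float:
    """Monthly USD cost for a fixed daily workload."""
    daily = in_m_per_day * in_price_per_m + out_m_per_day * out_price_per_m
    return daily * days

qwen = monthly_cost(10, 2.5, 0.11, 0.42)        # ≈ 64.50
gpt4o_mini = monthly_cost(10, 2.5, 0.50, 2.00)  # ≈ 300.00
haiku = monthly_cost(10, 2.5, 1.00, 5.00)       # ≈ 675.00

print(f"Qwen3-4B ${qwen:.2f} vs GPT-4o-mini ${gpt4o_mini:.2f} "
      f"(saves {1 - qwen / gpt4o_mini:.0%})")
```

Plugging in other providers' list prices lets the same function rank any workload mix.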
Qwen3-4B-Thinking-2507 implements automatic mode switching through a specialized attention mechanism that detects when chain-of-thought reasoning improves output quality. The model allocates up to 40% of its context window (12.8K tokens at a 32K base) for internal reasoning steps before generating the final response. Benchmarks show thinking mode adds 8-12 percentage points on complex reasoning tasks like MATH at the cost of 0.3-0.8 seconds of added latency. Developers should enable thinking mode for mathematical proofs, multi-step coding problems, and logical deduction tasks where the performance gain justifies the 3x token usage increase.
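A hypothetical budgeting helper makes the trade-off concrete; the function names and routing heuristic below are illustrative, not Qwen's API, and only the 40% budget and 3x token multiplier come from the figures above:

```python
# Hypothetical thinking-mode budget helper. The 40% reasoning share and 3x
# token multiplier are from the article; the routing heuristic is invented
# for illustration.

THINKING_FRACTION = 0.40   # share of context reserved for reasoning
TOKEN_MULTIPLIER = 3       # rough token-cost increase with thinking on

def thinking_budget(context_window: int) -> int:
    """Tokens reserved for internal reasoning before the final answer."""
    return int(context_window * THINKING_FRACTION)

def should_think(task: str) -> bool:
    """Crude router: enable thinking only where the gain justifies 3x tokens."""
    return task in {"math_proof", "multi_step_coding", "logical_deduction"}

print(thinking_budget(32_000))      # 12800 tokens at a 32K base window
print(should_think("math_proof"))   # True
print(should_think("chitchat"))     # False
```

In practice the router would key off prompt features rather than a task label, but the budget arithmetic is the same.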
Qwen3-4B combines Grouped Query Attention (32 query heads, 8 KV heads) with QK-Norm stabilization and SwiGLU activation functions, achieving 2.3x better parameter efficiency than standard transformers. The model uses RoPE with YaRN extension to scale from 32K native context to 262K tokens while maintaining 94% accuracy on needle-in-haystack tests. The architecture allocates 64% of parameters to FFN layers (14,336 hidden units) optimized for reasoning, compared to 50% in GPT-style models. Training on 15T tokens with curriculum learning focused 30% of compute on mathematical and coding datasets.
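The YaRN extension factor falls straight out of the two context lengths; the sketch below uses simple position interpolation (positions divided by the scale) as an approximation, not the full YaRN frequency-dependent formula, and the head dimension and RoPE base are assumed values:

```python
# Simplified sketch of RoPE context extension. The /scale position
# interpolation is an approximation of YaRN, not the exact formula;
# head_dim=128 and base=10000 are assumptions, not confirmed config.

NATIVE_CTX = 32_768
EXTENDED_CTX = 262_144
SCALE = EXTENDED_CTX / NATIVE_CTX      # 8.0x extension factor

def rope_angle(position: int, dim_pair: int, head_dim: int = 128,
               base: float = 10_000.0, scale: float = SCALE) -> float:
    """Rotation angle for one RoPE dimension pair, with positions
    compressed by the context-extension scale."""
    inv_freq = base ** (-2 * dim_pair / head_dim)
    return (position / scale) * inv_freq

print(SCALE)  # 8.0 — extended positions are squeezed into the trained range
```

Compressing positions by 8x keeps every rotation angle inside the range seen during 32K-token pretraining, which is why long-context accuracy degrades only modestly.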
Arena-Hard testing shows Qwen3-4B scores 13.2 compared to Qwen3-32B's 53.3, indicating a 40.1 point performance gap on adversarial prompts and edge cases. The model struggles with tasks requiring world knowledge after March 2025 (training cutoff) and shows 15-20% accuracy drops on non-English languages outside the top 30 most represented in training data. Memory requirements reach 16GB VRAM for full 32K context inference and 64GB for 262K extended context. Function calling accuracy drops from 82% to 67% when handling more than 5 simultaneous tool definitions.
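The VRAM figures are dominated by the KV cache at long context; a rough estimator, where the layer count and head dimension are illustrative assumptions (only the 8 KV heads come from the architecture described above):

```python
# Rough KV-cache size estimate for long-context inference. layers=36 and
# head_dim=128 are illustrative assumptions, not confirmed Qwen3-4B config;
# kv_heads=8 is from the article. Model weights are extra on top of this.

def kv_cache_bytes(seq_len: int, layers: int = 36, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes held in the K and V caches for one sequence (fp16 by default)."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes  # 2 = K + V

gib = 1024 ** 3
print(f"32K context:  {kv_cache_bytes(32_768) / gib:.1f} GiB")   # 4.5 GiB
print(f"262K context: {kv_cache_bytes(262_144) / gib:.1f} GiB")  # 36.0 GiB
```

Because cache size grows linearly with sequence length, the 8x context extension multiplies the cache by 8x, which is why the 262K configuration needs a much larger VRAM budget than the 32K one.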