Language Model by Alibaba
Qwen3-4B achieves 87.79% on GSM8K and 54.1% on MATH, outperforming Qwen2.5-7B despite having 43% fewer parameters, making it the highest-scoring sub-5B model on mathematical reasoning benchmarks. The model introduces a hybrid thinking mode that switches between chain-of-thought reasoning and direct response generation, while supporting 119 languages natively and extending context to 262K tokens through YaRN. At $0.11 per million input tokens, Qwen3-4B costs 78% less than GPT-4o-mini while matching or exceeding larger open models on 7 of 9 core benchmarks.
| Benchmark | Qwen3-4B | Comparison |
|---|---|---|
| MMLU | 72.99% | 74.16% |
| BBH | 72.59% | 70.4% |
| GSM8K | 87.79% | 85.36% |
| MATH | 54.1% | 49.8% |
| HumanEval (EvalPlus) | 63.53% | 62.18% |
| MBPP | 67% | 63.4% |
| MMLU-Pro | 50.58% | 45% |
| GPQA | 36.87% | 36.36% |
| MultiPL-E | 53.13% | 50.73% |
| Arena-Hard | 13.2% | 53.3% |
| MedQA | 80.8% | - |
| Artificial Analysis Intelligence Index | 12 | 8 |
- Initial Qwen3 series announcement and release
- Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507 release
- Qwen3.5-4B variant release with 262K context
Qwen3-4B is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
Qwen3-4B beats Qwen2.5-7B on 6 of 9 benchmarks despite having 3B fewer parameters: GSM8K (87.79% vs 85.36%), MATH (54.1% vs 49.8%), BBH (72.59% vs 70.4%), MBPP (67% vs 63.4%), MMLU-Pro (50.58% vs 45%), and MultiPL-E (53.13% vs 50.73%). The model achieves this through Grouped Query Attention with 32 query heads mapped to 8 KV heads, reducing KV-cache memory by 75% while maintaining attention quality. On complex reasoning tasks specifically, Qwen3-4B shows a 5.58 percentage point advantage on MMLU-Pro over its larger predecessor.
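The head-grouping arithmetic can be sketched in a few lines; this is an illustrative mapping assuming the head counts quoted above (32 query heads, 8 KV heads), not Qwen's actual implementation:

```python
# Illustrative Grouped Query Attention head mapping; head counts from the
# article (32 query heads, 8 KV heads), everything else is a sketch.

NUM_Q_HEADS = 32   # query heads
NUM_KV_HEADS = 8   # key/value heads shared across query groups

def kv_head_for_query(q_head: int, num_q: int = NUM_Q_HEADS,
                      num_kv: int = NUM_KV_HEADS) -> int:
    """Map a query head to the KV head its group shares."""
    group_size = num_q // num_kv          # 4 query heads per KV head
    return q_head // group_size

def kv_cache_reduction(num_q: int = NUM_Q_HEADS,
                       num_kv: int = NUM_KV_HEADS) -> float:
    """KV-cache memory saved relative to full multi-head attention."""
    return 1 - num_kv / num_q             # 1 - 8/32 = 0.75

print(kv_head_for_query(0), kv_head_for_query(31))   # 0 7
print(f"{kv_cache_reduction():.0%} smaller KV cache")  # 75% smaller KV cache
```

Because only the 8 KV heads are cached during generation, the cache shrinks to a quarter of the full multi-head size while all 32 query heads still attend normally.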
For a workload processing 10M input tokens and generating 2.5M output tokens daily, Qwen3-4B costs about $64.50/month ($0.11/M input + $0.42/M output, 30 days) compared to GPT-4o-mini's $300/month ($0.50/M + $2.00/M), a 78% reduction. Against Claude 3.5 Haiku at $1.00/M input and $5.00/M output ($675/month), the monthly savings reach roughly $610. The model maintains this pricing advantage while scoring 12 on the Artificial Analysis Intelligence Index, 50% higher than the 8-point average for similar-size open models.
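The comparison reduces to simple arithmetic over the per-token prices quoted above; a minimal calculator, assuming a 30-day month:

```python
# Back-of-the-envelope monthly cost from daily token volumes (in millions)
# and per-million-token prices; assumes a 30-day month.

def monthly_cost(in_m_per_day: float, out_m_per_day: float,
                 in_price_per_m: float, out_price_per_m: float,
                 days: int = 30) -> float:
    """Monthly USD cost for a fixed daily workload."""
    daily = in_m_per_day * in_price_per_m + out_m_per_day * out_price_per_m
    return daily * days

qwen = monthly_cost(10, 2.5, 0.11, 0.42)        # ≈ 64.50
gpt4o_mini = monthly_cost(10, 2.5, 0.50, 2.00)  # ≈ 300.00
haiku = monthly_cost(10, 2.5, 1.00, 5.00)       # ≈ 675.00

print(f"Qwen3-4B ${qwen:.2f} vs GPT-4o-mini ${gpt4o_mini:.2f} "
      f"(saves {1 - qwen / gpt4o_mini:.0%})")
```

Plugging in other providers' list prices lets the same function rank any workload mix.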
Qwen3-4B-Thinking-2507 implements automatic mode switching through a specialized attention mechanism that detects when chain-of-thought reasoning improves output quality. The model allocates up to 40% of its context window (12.8K tokens at a 32K base) for internal reasoning steps before generating the final response. Benchmarks show thinking mode adds 8-12 percentage points on complex reasoning tasks like MATH at the cost of 0.3-0.8 seconds of added latency. Developers should enable thinking mode for mathematical proofs, multi-step coding problems, and logical deduction tasks where the performance gain justifies the 3x token usage increase.
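A hypothetical budgeting helper makes the trade-off concrete; the function names and routing heuristic below are illustrative, not Qwen's API, and only the 40% budget and 3x token multiplier come from the figures above:

```python
# Hypothetical thinking-mode budget helper. The 40% reasoning share and 3x
# token multiplier are from the article; the routing heuristic is invented
# for illustration.

THINKING_FRACTION = 0.40   # share of context reserved for reasoning
TOKEN_MULTIPLIER = 3       # rough token-cost increase with thinking on

def thinking_budget(context_window: int) -> int:
    """Tokens reserved for internal reasoning before the final answer."""
    return int(context_window * THINKING_FRACTION)

def should_think(task: str) -> bool:
    """Crude router: enable thinking only where the gain justifies 3x tokens."""
    return task in {"math_proof", "multi_step_coding", "logical_deduction"}

print(thinking_budget(32_000))      # 12800 tokens at a 32K base window
print(should_think("math_proof"))   # True
print(should_think("chitchat"))     # False
```

In practice the router would key off prompt features rather than a task label, but the budget arithmetic is the same.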
Qwen3-4B combines Grouped Query Attention (32 query heads, 8 KV heads) with QK-Norm stabilization and SwiGLU activation functions, achieving 2.3x better parameter efficiency than standard transformers. The model uses RoPE with YaRN extension to scale from 32K native context to 262K tokens while maintaining 94% accuracy on needle-in-haystack tests. The architecture allocates 64% of parameters to FFN layers (14,336 hidden units) optimized for reasoning, compared to 50% in GPT-style models. Training on 15T tokens with curriculum learning focused 30% of compute on mathematical and coding datasets.
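The YaRN extension factor falls straight out of the two context lengths; the sketch below uses simple position interpolation (positions divided by the scale) as an approximation, not the full YaRN frequency-dependent formula, and the head dimension and RoPE base are assumed values:

```python
# Simplified sketch of RoPE context extension. The /scale position
# interpolation is an approximation of YaRN, not the exact formula;
# head_dim=128 and base=10000 are assumptions, not confirmed config.

NATIVE_CTX = 32_768
EXTENDED_CTX = 262_144
SCALE = EXTENDED_CTX / NATIVE_CTX      # 8.0x extension factor

def rope_angle(position: int, dim_pair: int, head_dim: int = 128,
               base: float = 10_000.0, scale: float = SCALE) -> float:
    """Rotation angle for one RoPE dimension pair, with positions
    compressed by the context-extension scale."""
    inv_freq = base ** (-2 * dim_pair / head_dim)
    return (position / scale) * inv_freq

print(SCALE)  # 8.0 — extended positions are squeezed into the trained range
```

Compressing positions by 8x keeps every rotation angle inside the range seen during 32K-token pretraining, which is why long-context accuracy degrades only modestly.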
Arena-Hard testing shows Qwen3-4B scores 13.2 compared to Qwen3-32B's 53.3, indicating a 40.1 point performance gap on adversarial prompts and edge cases. The model struggles with tasks requiring world knowledge after March 2025 (training cutoff) and shows 15-20% accuracy drops on non-English languages outside the top 30 most represented in training data. Memory requirements reach 16GB VRAM for full 32K context inference and 64GB for 262K extended context. Function calling accuracy drops from 82% to 67% when handling more than 5 simultaneous tool definitions.
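The VRAM figures are dominated by the KV cache at long context; a rough estimator, where the layer count and head dimension are illustrative assumptions (only the 8 KV heads come from the architecture described above):

```python
# Rough KV-cache size estimate for long-context inference. layers=36 and
# head_dim=128 are illustrative assumptions, not confirmed Qwen3-4B config;
# kv_heads=8 is from the article. Model weights are extra on top of this.

def kv_cache_bytes(seq_len: int, layers: int = 36, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes held in the K and V caches for one sequence (fp16 by default)."""
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes  # 2 = K + V

gib = 1024 ** 3
print(f"32K context:  {kv_cache_bytes(32_768) / gib:.1f} GiB")   # 4.5 GiB
print(f"262K context: {kv_cache_bytes(262_144) / gib:.1f} GiB")  # 36.0 GiB
```

Because cache size grows linearly with sequence length, the 8x context extension multiplies the cache by 8x, which is why the 262K configuration needs a much larger VRAM budget than the 32K one.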