Language Model by Alibaba
Alibaba's Qwen3-8B-Base introduces hybrid thinking/non-thinking modes controlled by /think and /no_think flags, achieving 89.84% on GSM8K while maintaining just 8B parameters - outperforming the larger Qwen2.5-14B on MATH (60.8% vs 55.64%) and HumanEval+ (67.65% vs 60.7%). The model leverages Grouped Query Attention (GQA) to reduce KV cache by 75% compared to standard multi-head attention, enabling efficient 32K context processing at $0.18/$0.20 per million tokens. Released April 2025 under Apache 2.0, it supports 119 languages and demonstrates particular strength in code generation with 72.23% on EvalPlus, beating Qwen2.5-32B's 66.25% despite being 4x smaller.
| Benchmark | Qwen3-8B-Base | Qwen2.5 comparison |
|---|---|---|
| MMLU | 76.89% | 79.66% |
| GSM8K | 89.84% | 90.22% |
| MATH | 60.8% | 55.64% |
| HumanEval+ | 67.65% | 60.7% |
| MBPP | 69.8% | 69% |
| BBH | 81.07% | 84.48% |
| GPQA | 39.9% | 47.97% |
| EvalPlus | 72.23% | 66.25% |
| MultiPL-E | 61.69% | 58.3% |
| MMLU-Pro | 61.03% | 55.1% |
| CRUX-O | 68.6% | 67.8% |
| MGSM | 79.2% | 78.12% |
| MMMLU | 79.69% | 82.4% |
Official release of Qwen3 series including 8B-Base model
Announcement and availability on HuggingFace, GitHub, ModelScope
Qwen3-8B-Base is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
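As a sketch of what getting started looks like, the snippet below loads the base checkpoint with Hugging Face transformers and runs a plain text completion. The repository id Qwen/Qwen3-8B-Base and the bf16 settings are assumptions to verify against the official model card.

```python
# Minimal sketch: loading the base model with Hugging Face transformers.
# The repo id "Qwen/Qwen3-8B-Base" is assumed to match the official release;
# check the Qwen organization page on HuggingFace for the exact name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 8B weights around 16 GB
    device_map="auto",
)

# Base models do plain text completion (no chat template), so prompt accordingly.
prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```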
The /think mode generates chain-of-thought reasoning wrapped in <think> tags before final outputs, typically adding 300-500ms latency for complex queries but improving accuracy by 15-20% on reasoning tasks like MATH where it scores 60.8%. Use /no_think for direct responses in production APIs where sub-100ms latency matters more than reasoning transparency. The model seamlessly switches modes mid-conversation, making it ideal for applications that need both quick factual retrieval and complex problem-solving.
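The mode switch is exposed through the chat template of the instruction-tuned Qwen3-8B rather than the raw Base checkpoint. The sketch below, based on the Qwen3 model-card conventions, shows how an enable_thinking argument and the /no_think soft switch change the rendered prompt; treat the exact parameter name and repository id as assumptions to confirm against the current documentation.

```python
# Sketch of the thinking toggle as exposed by the instruction-tuned Qwen3-8B
# chat template (the Base checkpoint has no chat template). The enable_thinking
# flag and the "/no_think" soft switch follow the Qwen3 model card; verify both
# against the current documentation before relying on them.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [{"role": "user", "content": "Solve 37 * 24 step by step."}]

# Hard switch: render the prompt with or without the reasoning preamble.
with_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
without_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Soft switch: append /no_think to a single turn to suppress the <think> block
# for that response only, leaving later turns free to reason again.
messages_soft = [{"role": "user", "content": "What is the capital of France? /no_think"}]
prompt_soft = tokenizer.apply_chat_template(
    messages_soft, tokenize=False, add_generation_prompt=True
)

print(with_thinking)
print(without_thinking)
print(prompt_soft)
```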
At $0.18 input/$0.20 output per million tokens, Qwen3-8B costs approximately $0.19 to process 1M tokens (assuming a 50/50 input/output split), compared to GPT-4o's $5 input/$15 output per million tokens - making it roughly 28x cheaper on input, 75x cheaper on output, and about 52x cheaper on a blended basis. For a startup processing 100M tokens monthly, this translates to roughly $19 with Qwen3-8B versus about $1,000 with GPT-4o. The 89.84% GSM8K score shows competitive reasoning capability at this price point.
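A quick back-of-the-envelope calculation makes the comparison concrete; the Qwen3-8B prices come from the figures above, while the GPT-4o prices are list prices that may have changed.

```python
# Blended cost comparison at a 50/50 input/output token split.
# Qwen3-8B pricing is taken from the text ($0.18 / $0.20 per million tokens);
# the GPT-4o figures ($5 / $15 per million) are list prices and may change.
def blended_cost(tokens: float, price_in: float, price_out: float, in_share: float = 0.5) -> float:
    """Cost in dollars for `tokens` total tokens at per-million-token prices."""
    millions = tokens / 1e6
    return millions * (in_share * price_in + (1 - in_share) * price_out)

qwen = blended_cost(1e6, 0.18, 0.20)    # ~$0.19 per million tokens
gpt4o = blended_cost(1e6, 5.00, 15.00)  # ~$10.00 per million tokens
print(f"Qwen3-8B: ${qwen:.2f} per 1M tokens")
print(f"GPT-4o:   ${gpt4o:.2f} per 1M tokens")
print(f"Ratio:    {gpt4o / qwen:.1f}x cheaper on a 50/50 blend")  # ~52.6x

# A startup pushing 100M tokens per month:
print(f"Monthly:  ${blended_cost(100e6, 0.18, 0.20):.0f} vs ${blended_cost(100e6, 5, 15):.0f}")
```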
GQA reduces the number of key-value heads from 32 to 8 while maintaining 32 query heads, cutting KV cache memory by 75% and enabling the full 32K context window on consumer GPUs with 24GB VRAM. This architectural choice, combined with QK-Norm for training stability and RoPE position embeddings with YaRN scaling, allows extending the context to 128K tokens with minimal performance degradation. The model maintains 81.07% on BBH despite the optimization, only 3.41 points behind the 4x larger Qwen2.5-32B.
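To see where the 75% figure comes from, the estimate below compares KV-cache size for 32 versus 8 key-value heads over the full 32K context. The layer count and head dimension are assumptions based on the published Qwen3-8B configuration and should be checked against config.json.

```python
# Rough KV-cache size estimate showing why 8 KV heads instead of 32 cuts the
# cache by 75%. Layer count (36) and head dimension (128) are assumptions
# based on the published Qwen3-8B configuration; plug in the real values from
# config.json if they differ.
def kv_cache_bytes(seq_len: int, num_kv_heads: int, num_layers: int = 36,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes needed to cache K and V for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

seq_len = 32_768  # full 32K context
mha = kv_cache_bytes(seq_len, num_kv_heads=32)  # standard multi-head attention
gqa = kv_cache_bytes(seq_len, num_kv_heads=8)   # grouped query attention

print(f"MHA cache: {mha / 2**30:.1f} GiB")  # ~18 GiB at bf16
print(f"GQA cache: {gqa / 2**30:.1f} GiB")  # ~4.5 GiB
print(f"Reduction: {1 - gqa / mha:.0%}")    # 75%
```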
Qwen3-8B achieves 67.65% on HumanEval+ and 72.23% on EvalPlus, outperforming Qwen2.5-32B by 6.95 and 5.98 percentage points respectively despite being 75% smaller. It scores 61.69% on MultiPL-E's multi-language suite, which includes Python, JavaScript, Java, C++, and Rust. Unlike pure coding models, it maintains strong general reasoning (76.89% MMLU), making it suitable for full-stack development tasks that require both code generation and system design discussions.
While supporting 119 languages, performance varies significantly - it achieves 79.2% on MGSM (multilingual grade-school math) but trails Qwen2.5-32B on MMMLU (79.69% vs 82.4%), a 2.71-point gap on multilingual knowledge tasks. Low-resource languages like Swahili or Welsh show 20-30% accuracy drops compared to English benchmarks. For production multilingual applications, expect near-native performance in Chinese, English, Spanish, and French, but consider language-specific fine-tuning for languages beyond the top 20.