Language Model by Alibaba
Alibaba's Qwen3-8B-Base introduces hybrid thinking/non-thinking modes controlled by /think and /no_think flags, achieving 89.84% on GSM8K while maintaining just 8B parameters - outperforming the larger Qwen2.5-14B on MATH (60.8% vs 55.64%) and HumanEval+ (67.65% vs 60.7%). The model leverages Grouped Query Attention (GQA) to reduce KV cache by 75% compared to standard multi-head attention, enabling efficient 32K context processing at $0.18/$0.20 per million tokens. Released April 2025 under Apache 2.0, it supports 119 languages and demonstrates particular strength in code generation with 72.23% on EvalPlus, beating Qwen2.5-32B's 66.25% despite being 4x smaller.
| Benchmark | Qwen3-8B-Base | Qwen2.5 comparison |
|---|---|---|
| MMLU | 76.89% | 79.66% |
| GSM8K | 89.84% | 90.22% |
| MATH | 60.8% | 55.64% |
| HumanEval+ | 67.65% | 60.7% |
| MBPP | 69.8% | 69% |
| BBH | 81.07% | 84.48% |
| GPQA | 39.9% | 47.97% |
| EvalPlus | 72.23% | 66.25% |
| MultiPL-E | 61.69% | 58.3% |
| MMLU-Pro | 61.03% | 55.1% |
| CRUX-O | 68.6% | 67.8% |
| MGSM | 79.2% | 78.12% |
| MMMLU | 79.69% | 82.4% |
Official release of Qwen3 series including 8B-Base model
Announcement and availability on HuggingFace, GitHub, ModelScope
Qwen3-8B-Base is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
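As a sketch of what getting started looks like, the snippet below loads the base checkpoint with Hugging Face transformers and runs a plain text completion. The repository id Qwen/Qwen3-8B-Base and the bf16 settings are assumptions to verify against the official model card.

```python
# Minimal sketch: loading the base model with Hugging Face transformers.
# The repo id "Qwen/Qwen3-8B-Base" is assumed to match the official release;
# check the Qwen organization page on HuggingFace for the exact name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 8B weights around 16 GB
    device_map="auto",
)

# Base models do plain text completion (no chat template), so prompt accordingly.
prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```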
The /think mode generates chain-of-thought reasoning wrapped in <think> tags before final outputs, typically adding 300-500ms latency for complex queries but improving accuracy by 15-20% on reasoning tasks like MATH where it scores 60.8%. Use /no_think for direct responses in production APIs where sub-100ms latency matters more than reasoning transparency. The model seamlessly switches modes mid-conversation, making it ideal for applications that need both quick factual retrieval and complex problem-solving.
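The mode switch is exposed through the chat template of the instruction-tuned Qwen3-8B rather than the raw Base checkpoint. The sketch below, based on the Qwen3 model-card conventions, shows how an enable_thinking argument and the /no_think soft switch change the rendered prompt; treat the exact parameter name and repository id as assumptions to confirm against the current documentation.

```python
# Sketch of the thinking toggle as exposed by the instruction-tuned Qwen3-8B
# chat template (the Base checkpoint has no chat template). The enable_thinking
# flag and the "/no_think" soft switch follow the Qwen3 model card; verify both
# against the current documentation before relying on them.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [{"role": "user", "content": "Solve 37 * 24 step by step."}]

# Hard switch: render the prompt with or without the reasoning preamble.
with_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
without_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

# Soft switch: append /no_think to a single turn to suppress the <think> block
# for that response only, leaving later turns free to reason again.
messages_soft = [{"role": "user", "content": "What is the capital of France? /no_think"}]
prompt_soft = tokenizer.apply_chat_template(
    messages_soft, tokenize=False, add_generation_prompt=True
)

print(with_thinking)
print(without_thinking)
print(prompt_soft)
```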
At $0.18 input/$0.20 output per million tokens, Qwen3-8B costs approximately $0.19 to process 1M tokens (assuming a 50/50 input/output split), compared to GPT-4o's $5 input/$15 output per million tokens - making it roughly 28x cheaper on input, 75x cheaper on output, and about 52x cheaper on a blended basis. For a startup processing 100M tokens monthly, this translates to roughly $19 with Qwen3-8B versus about $1,000 with GPT-4o. The 89.84% GSM8K score shows competitive reasoning capability at this price point.
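A quick back-of-the-envelope calculation makes the comparison concrete; the Qwen3-8B prices come from the figures above, while the GPT-4o prices are list prices that may have changed.

```python
# Blended cost comparison at a 50/50 input/output token split.
# Qwen3-8B pricing is taken from the text ($0.18 / $0.20 per million tokens);
# the GPT-4o figures ($5 / $15 per million) are list prices and may change.
def blended_cost(tokens: float, price_in: float, price_out: float, in_share: float = 0.5) -> float:
    """Cost in dollars for `tokens` total tokens at per-million-token prices."""
    millions = tokens / 1e6
    return millions * (in_share * price_in + (1 - in_share) * price_out)

qwen = blended_cost(1e6, 0.18, 0.20)    # ~$0.19 per million tokens
gpt4o = blended_cost(1e6, 5.00, 15.00)  # ~$10.00 per million tokens
print(f"Qwen3-8B: ${qwen:.2f} per 1M tokens")
print(f"GPT-4o:   ${gpt4o:.2f} per 1M tokens")
print(f"Ratio:    {gpt4o / qwen:.1f}x cheaper on a 50/50 blend")  # ~52.6x

# A startup pushing 100M tokens per month:
print(f"Monthly:  ${blended_cost(100e6, 0.18, 0.20):.0f} vs ${blended_cost(100e6, 5, 15):.0f}")
```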
GQA reduces the number of key-value heads from 32 to 8 while maintaining 32 query heads, cutting KV cache memory by 75% and enabling the full 32K context window on consumer GPUs with 24GB VRAM. This architectural choice, combined with QK-Norm for training stability and RoPE position embeddings with YaRN scaling, allows extending the context to 128K tokens with minimal performance degradation. The model maintains 81.07% on BBH despite the optimization, only 3.41 points behind the 4x larger Qwen2.5-32B.
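To see where the 75% figure comes from, the estimate below compares KV-cache size for 32 versus 8 key-value heads over the full 32K context. The layer count and head dimension are assumptions based on the published Qwen3-8B configuration and should be checked against config.json.

```python
# Rough KV-cache size estimate showing why 8 KV heads instead of 32 cuts the
# cache by 75%. Layer count (36) and head dimension (128) are assumptions
# based on the published Qwen3-8B configuration; plug in the real values from
# config.json if they differ.
def kv_cache_bytes(seq_len: int, num_kv_heads: int, num_layers: int = 36,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes needed to cache K and V for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

seq_len = 32_768  # full 32K context
mha = kv_cache_bytes(seq_len, num_kv_heads=32)  # standard multi-head attention
gqa = kv_cache_bytes(seq_len, num_kv_heads=8)   # grouped query attention

print(f"MHA cache: {mha / 2**30:.1f} GiB")  # ~18 GiB at bf16
print(f"GQA cache: {gqa / 2**30:.1f} GiB")  # ~4.5 GiB
print(f"Reduction: {1 - gqa / mha:.0%}")    # 75%
```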
Qwen3-8B achieves 67.65% on HumanEval+ and 72.23% on EvalPlus, outperforming Qwen2.5-32B by 6.95 and 5.98 percentage points respectively despite being 75% smaller. It scores 61.69% on MultiPL-E's multi-language suite, which includes Python, JavaScript, Java, C++, and Rust. Unlike pure coding models, it maintains strong general reasoning (76.89% MMLU), making it suitable for full-stack development tasks that require both code generation and system design discussions.
While supporting 119 languages, performance varies significantly - it achieves 79.2% on MGSM (multilingual grade-school math) but trails Qwen2.5-32B on MMMLU (79.69% vs 82.4%), a 2.71-point gap on multilingual knowledge tasks. Low-resource languages like Swahili or Welsh show 20-30% accuracy drops compared to English benchmarks. For production multilingual applications, expect near-native performance in Chinese, English, Spanish, and French, but consider language-specific fine-tuning for languages beyond the top 20.