Multimodal by Alibaba
Qwen3-VL-4B achieves 94.2% on DocVQA and 92.9% on ScreenSpot while maintaining a compact 4.83B-parameter footprint, outperforming Gemma 3 4B by 18.4 percentage points on document understanding. The model introduces Interleaved-MRoPE for spatial-temporal position modeling and DeepStack integration for multi-level ViT features, enabling accurate timestamp alignment across 2-hour videos within a 256K-token context window. At $0.10 per million input tokens (a fraction of GPT-4V pricing), it delivers production-ready vision-language capabilities, including 32-language OCR, GUI automation, and visual code generation, in both Instruct and Thinking variants with FP8 quantization support.
| Benchmark | Qwen3-VL-4B | Gemma 3 4B |
|---|---|---|
| MMMU (Val) | 59.5% | 52.3% |
| AI2D | 84.9% | 74.8% |
| MathVista-Mini | 68.2% | - |
| MMLU-Pro | 42.1% | 35.7% |
| GPQA | 44.8% | 38.2% |
| DocVQA | 94.2% | 75.8% |
| ScreenSpot | 92.9% | - |
| MMBench-V1.1 | 86.7% | - |
| MMLU-Redux | 86% | - |
| Artificial Analysis Intelligence Index | 14 | 11 |
- Qwen3-VL-30B-A3B models released
- Qwen3-VL-4B (Instruct/Thinking) and 8B variants released
- Qwen3-VL Technical Report published
Qwen3-VL-4B is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
Qwen3-VL-4B scores 59.5% on the MMMU validation set (7.2 points above Gemma 3 4B), 84.9% on AI2D diagram understanding (a 10.1-point lead), and 68.2% on MathVista-Mini for mathematical reasoning. Despite its 4B-parameter size, it reaches 86.7% on MMBench-V1.1 and scores 14 on the Artificial Analysis Intelligence Index, 3 points above the average 8B open-weight VLM, demonstrating efficient parameter use in its dense Transformer architecture.
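The per-benchmark leads quoted above are simple differences over the comparison column; a quick sanity check with values copied from the table:

```python
# Benchmark scores from the table above: (Qwen3-VL-4B, Gemma 3 4B)
scores = {
    "MMMU (Val)": (59.5, 52.3),
    "AI2D": (84.9, 74.8),
    "DocVQA": (94.2, 75.8),
}

# Lead in percentage points for each benchmark
leads = {name: round(ours - theirs, 1) for name, (ours, theirs) in scores.items()}
print(leads)  # {'MMMU (Val)': 7.2, 'AI2D': 10.1, 'DocVQA': 18.4}
```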
At $0.10 per million input tokens and $1.00 per million output tokens, Qwen3-VL-4B sits far below GPT-4V's quoted $10-50 per million tokens and below Claude 3 vision-model pricing. For an illustrative workload of 10 million images monthly, assuming roughly 1,000 input tokens per image and 500 output tokens per response (15 billion tokens in total), Qwen3-VL-4B comes to about $6,000 per month ($1,000 input + $5,000 output), versus $150,000-750,000 at GPT-4V's quoted rates, while maintaining 94.2% DocVQA accuracy and supporting FP8 quantization for further savings on compatible hardware.
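A back-of-the-envelope cost model for an image-processing workload at these list prices; note that the ~1,000 input tokens per image is an illustrative assumption, not a published figure:

```python
def monthly_cost(images, in_tok_per_image, out_tok_per_image,
                 in_price_per_m, out_price_per_m):
    """Estimate monthly API cost in dollars for an image-processing workload."""
    input_tokens = images * in_tok_per_image
    output_tokens = images * out_tok_per_image
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1e6

# Qwen3-VL-4B list prices: $0.10/M input, $1.00/M output
# Assumed workload: 10M images/month, ~1,000 input + 500 output tokens each
qwen = monthly_cost(10_000_000, 1_000, 500, 0.10, 1.00)
print(f"${qwen:,.0f}")  # $6,000
```

Swapping in a provider's own per-million rates gives the comparison for any other model under the same token assumptions.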
Interleaved-MRoPE enables Qwen3-VL-4B to maintain temporal coherence across 256K token sequences, allowing analysis of 2-hour videos with precise timestamp alignment at sub-second granularity. The DeepStack integration extracts multi-level ViT features from different depths, capturing both low-level visual details and high-level semantic understanding, which explains the model's 92.9% ScreenSpot accuracy for GUI understanding tasks where both spatial precision and semantic context are critical.
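The report's exact layout isn't reproduced here, but the core idea of Interleaved-MRoPE, distributing temporal, height, and width position indices across rotary frequency pairs in round-robin order rather than contiguous chunks, can be sketched as follows (dimension sizes and the round-robin order are illustrative assumptions):

```python
def interleaved_mrope_angles(t, h, w, head_dim=64, base=10000.0):
    """Rotary angles for one token at video position (t, h, w).

    Each rotary frequency pair is assigned to one spatial-temporal axis in
    round-robin (t, h, w, t, h, w, ...) order, so every axis spans the full
    frequency range -- the 'interleaved' part of the scheme, in contrast to
    giving each axis its own contiguous block of frequencies.
    """
    coords = (t, h, w)
    angles = []
    for i in range(head_dim // 2):          # one angle per rotary pair
        inv_freq = base ** (-2 * i / head_dim)
        pos = coords[i % 3]                  # round-robin axis assignment
        angles.append(pos * inv_freq)
    return angles

angles = interleaved_mrope_angles(t=12, h=3, w=5)
print(len(angles))  # 32
```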
With 4.83B parameters, Qwen3-VL-4B scores 44.8% on GPQA and 42.1% on MMLU-Pro, several points short of the 50%+ that larger models typically reach on GPQA, indicating limits on complex reasoning tasks that require extensive world knowledge. The model lacks native support for audio and for 3D scene reconstruction beyond basic spatial grounding, and while its OCR covers 32 languages, performance on non-Latin scripts reportedly drops 15-20% relative to English based on internal benchmarks.
The Thinking variant adds chain-of-thought reasoning, improving performance on mathematical and logical tasks by roughly 8-12% (reflected in its 68.2% MathVista-Mini score), at the cost of 2-3x higher latency and 4-5x higher token usage. For GUI automation and coding tasks where step-by-step reasoning improves accuracy, Thinking mode generates more reliable HTML/CSS/JS code; Instruct mode excels at rapid document extraction and OCR, where speed matters more than reasoning depth.
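As a sketch of how the two variants might be selected through an OpenAI-compatible chat endpoint (the model IDs and payload shape are assumptions; check your provider's documentation for the exact names):

```python
def build_vision_request(image_url, prompt, thinking=False):
    """Build an OpenAI-compatible chat payload for Qwen3-VL-4B.

    thinking=True targets the chain-of-thought variant; per the trade-offs
    above, expect 2-3x higher latency and 4-5x more tokens in return for
    more reliable multi-step reasoning.
    """
    model = "Qwen3-VL-4B-Thinking" if thinking else "Qwen3-VL-4B-Instruct"
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

# Fast OCR-style extraction -> Instruct; multi-step GUI/code task -> Thinking
req = build_vision_request("https://example.com/invoice.png",
                           "Extract the invoice total.")
print(req["model"])  # Qwen3-VL-4B-Instruct
```

Routing on task type this way lets a single pipeline pay the Thinking-mode overhead only where step-by-step reasoning actually improves accuracy.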