Multimodal by Alibaba
Qwen3-VL-Thinking achieves 96.5% on DocVQA and 85.9% on MathVista-Mini, beating the 72B Qwen2.5-VL on the latter by 6.5 percentage points while activating only 22B of its 235B MoE parameters. The model introduces a hybrid thinking/non-thinking framework, natively processes 256K tokens (expandable to 1M), and uses text-based time alignment for frame-level video comprehension, enabling precise timestamp extraction from 2-hour videos. At $0.26/$2.60 per million input/output tokens, it costs roughly 10x less than GPT-4V while matching or exceeding it on mathematical reasoning (85.7% on AIME 2024) and multimodal tasks, though it trails GPT-5 by 9.1 points on MMMU-Pro.
| Benchmark | Qwen3-VL-Thinking | Comparison |
|---|---|---|
| MMMU | 78.1% | 77.7% (Qwen2.5-VL-72B) |
| MathVista-Mini | 85.9% | 79.4% (Qwen2.5-VL-72B) |
| MathVision | 70.2% | 64.3% (Qwen2.5-VL-72B) |
| MMMU-Pro | 69.3% | 78.4% (GPT-5) |
| LiveCodeBench v5 | 70.7% | - |
| AIME 2024 | 85.7% | - |
| AIME 2025 | 81.5% | - |
| BFCL v3 | 70.8% | - |
| DocVQA | 96.5% | - |
| OCRBench | 875 (out of 1,000) | - |
| ScreenSpot Pro | 61.8% | - |
| AndroidWorld | 63.7% | - |
- Qwen3 family announcement
- Qwen3-VL-235B-A22B-Thinking official release
- Qwen3-VL-30B-A3B-Thinking release
- Qwen3-VL-4B/8B-Thinking release
- Qwen3-VL-2B/32B-Thinking release
- Qwen3-VL Technical Report published
Qwen3-VL-Thinking is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
Qwen3-VL-Thinking scores 78.1% on MMMU (vs Qwen2.5-VL's 77.7%), 85.9% on MathVista-Mini (a 6.5-point improvement), and 70.2% on MathVision (a 5.9-point gain). However, it scores 69.3% on MMMU-Pro against GPT-5's 78.4%, a 9.1-point gap on more complex reasoning tasks. The model excels at document understanding with 96.5% on DocVQA and scores 875 out of 1,000 on OCRBench, with OCR support for 39 languages.
At $0.26 per million input tokens and $2.60 per million output tokens, Qwen3-VL-Thinking costs approximately 10x less than GPT-4V ($10/$30) and 5x less than Claude 3.5 Sonnet ($3/$15). For a typical document processing workload of 10M input tokens and 1M output tokens daily, you'd pay $5.20/day with Qwen3-VL versus $130/day with GPT-4V - saving $3,750 per month while achieving 96.5% accuracy on DocVQA tasks.
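As a sanity check on those figures, here is a small Python sketch that reproduces the daily and monthly cost comparison from the published per-token prices; the 10M/1M daily token workload and the 30-day month are illustrative assumptions, not measured usage.

```python
# Rough cost comparison for a daily document-processing workload.
# Prices are USD per million tokens (input, output); workload figures are illustrative.
PRICES = {
    "Qwen3-VL-Thinking": (0.26, 2.60),
    "GPT-4V": (10.00, 30.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}

INPUT_TOKENS_PER_DAY = 10_000_000   # 10M input tokens/day (assumed workload)
OUTPUT_TOKENS_PER_DAY = 1_000_000   # 1M output tokens/day (assumed workload)
DAYS_PER_MONTH = 30

def daily_cost(input_price: float, output_price: float) -> float:
    """USD cost for one day of the assumed workload."""
    return (INPUT_TOKENS_PER_DAY / 1e6) * input_price + (OUTPUT_TOKENS_PER_DAY / 1e6) * output_price

for model, (p_in, p_out) in PRICES.items():
    per_day = daily_cost(p_in, p_out)
    print(f"{model:20s} ${per_day:8.2f}/day   ${per_day * DAYS_PER_MONTH:10.2f}/month")

# Qwen3-VL-Thinking comes to $5.20/day vs $130.00/day for GPT-4V,
# i.e. roughly $3,744 (~$3,750) saved per 30-day month.
```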
The Mixture-of-Experts design activates only 22B of the 235B total parameters per forward pass, so each token touches roughly 44GB of weights in FP16 rather than the full 470GB, keeping per-token compute and inference speed close to a dense ~20B model (the full 235B parameters still need to be stored, whether resident in GPU memory or offloaded). The architecture also incorporates DeepStack multi-level ViT feature integration and interleaved-MRoPE for spatial-temporal modeling. The native 256K context window processes long videos without chunking, though expanding to 1M tokens increases memory requirements proportionally.
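The 44GB and 470GB figures follow directly from the published parameter counts at 2 bytes per FP16 weight; the short sketch below shows the arithmetic (only the 22B-active / 235B-total counts are taken from the release, the rest is plain unit conversion).

```python
# FP16 weight footprint: 2 bytes per parameter, 1 GB taken as 1e9 bytes.
BYTES_PER_PARAM_FP16 = 2

def weight_gb(num_params: float) -> float:
    """Weight size in GB for a given parameter count."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e9

active_params = 22e9    # parameters activated per forward pass
total_params = 235e9    # total parameters that must be stored

print(f"Active weights per token: ~{weight_gb(active_params):.0f} GB")  # ~44 GB
print(f"Total weights to store:   ~{weight_gb(total_params):.0f} GB")   # ~470 GB
```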
While scoring 70.7% on LiveCodeBench v5 and 70.8% on BFCL v3 for function calling, Qwen3-VL-Thinking lags behind code-specialized models like DeepSeek-Coder-V2.5. The model achieves 61.8% on ScreenSpot Pro and 63.7% on AndroidWorld for GUI automation - functional but not state-of-the-art. Its March 2025 training cutoff means it lacks knowledge of recent events, and the 69.3% MMMU-Pro score indicates struggles with graduate-level multimodal reasoning compared to GPT-5's 78.4%.
The unified framework allows switching between standard inference (non-thinking) for simple tasks and chain-of-thought reasoning (thinking) for complex problems without separate model versions. Thinking mode lifts AIME 2024 to 85.7% and AIME 2025 to 81.5% but increases latency by roughly 3-5x and token usage by 10-20x. Use non-thinking mode for OCR (875 on OCRBench), basic Q&A, and real-time applications; reserve thinking mode for mathematical proofs, multi-step reasoning, or complex video analysis requiring frame-level precision.
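A minimal sketch of what that switching could look like against an OpenAI-compatible endpoint, assuming the provider exposes the same `enable_thinking` toggle (passed through `chat_template_kwargs`) that other Qwen3 deployments document; the base URL, model id, and flag name are assumptions, not confirmed details of this release.

```python
# Sketch: toggling thinking vs non-thinking mode on an OpenAI-compatible endpoint.
# The base_url, model id, and enable_thinking flag are assumptions modeled on
# other Qwen3 deployments (e.g. vLLM's chat_template_kwargs), not this release's API.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-provider.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

def ask(prompt: str, thinking: bool) -> str:
    response = client.chat.completions.create(
        model="qwen3-vl-235b-a22b-thinking",  # hypothetical model id
        messages=[{"role": "user", "content": prompt}],
        extra_body={"chat_template_kwargs": {"enable_thinking": thinking}},
    )
    return response.choices[0].message.content

# Fast path for simple Q&A or real-time use.
print(ask("What is the capital of France?", thinking=False))

# Chain-of-thought path for multi-step math; expect ~3-5x latency and far more tokens.
print(ask("Prove that the sum of the first n odd numbers is n squared.", thinking=True))
```

The design point is that both behaviors come from one deployed model: the flag only changes whether a reasoning trace is generated before the answer, so routing cheap requests to the non-thinking path avoids paying the 10-20x token overhead where it adds nothing.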