Multimodal by Alibaba
Qwen3-VL-Thinking-30A3B achieves 99.5% accuracy on 2-hour video needle-in-haystack tests while using only 3B active parameters from its 30B MoE architecture, outperforming GPT-5 by 4.5 points on MathVista (85.8 vs 81.3) and 8.8 points on MathVision (74.6 vs 65.8). At $0.13 per million input tokens, it costs 87% less than GPT-4o's vision pricing while delivering superior performance on mathematical reasoning and near-perfect 875/1000 OCR accuracy across 32 languages. The model's Interleaved-MRoPE architecture enables native 256K context processing without chunking, making it the first production model capable of analyzing feature-length videos in a single pass.
| Benchmark | Qwen3-VL-Thinking-30A3B | Comparison |
|---|---|---|
| MathVista | 85.8% | 81.3% |
| MathVision | 74.6% | 65.8% |
| MMMU-Pro | 69.3% | 78.4% |
| DocVQA | 96.5% | 94.2% |
| OCRBench | 875/1000 | - |
| ScreenSpot Pro | 61.8% | - |
| AndroidWorld | 63.7% | - |
| MMLongBench-Doc | 56.2% | - |
| CharXiv (Description) | 90.5% | - |
| CharXiv (Reasoning) | 66.2% | - |
| Video Needle-in-Haystack (2-hour) | 99.5% | - |
| Video Needle-in-Haystack (30-min) | 100% | - |
Qwen3-VL-30B-A3B-Thinking released
Qwen3-VL Technical Report published
Qwen3-VL-Thinking-30A3B is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
The model uses Interleaved-MRoPE (Multimodal Rotary Position Embedding) to efficiently encode spatial-temporal relationships across 256K tokens natively, achieving 99.5% accuracy on 2-hour video needle-in-haystack tests. Unlike models that chunk videos into segments, Qwen3-VL processes entire videos in one pass using DeepStack multi-level ViT features that maintain fine-grained temporal resolution. At roughly 1 token per video frame, a 2-hour video at 30fps consumes approximately 216K tokens, leaving 40K tokens for prompts and responses within the native context window.
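The token-budget arithmetic above can be sketched as a quick calculation. This is an illustrative estimate only, assuming the article's ~1 token per frame figure; the function name and structure are our own, not part of any Qwen API:

```python
# Token-budget sketch for single-pass long-video inference.
# Assumes ~1 token per frame, as quoted above (illustrative, not measured).

CONTEXT_WINDOW = 256_000  # native context window, no chunking


def video_token_budget(hours: float, fps: int = 30, tokens_per_frame: int = 1) -> dict:
    """Estimate how much of the context window a video consumes."""
    frames = int(hours * 3600 * fps)
    video_tokens = frames * tokens_per_frame
    return {
        "video_tokens": video_tokens,
        "remaining_for_prompt": CONTEXT_WINDOW - video_tokens,
    }


budget = video_token_budget(hours=2)
# 2 h * 3600 s * 30 fps = 216,000 frames -> ~216K tokens, ~40K left for prompt/response
print(budget)  # {'video_tokens': 216000, 'remaining_for_prompt': 40000}
```

A 3-hour video at the same frame rate would exceed the window (~324K tokens), which is why feature-length content sits near the practical ceiling of single-pass processing.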
Qwen3-VL-Thinking costs $0.13 per million input tokens and $1.56 per million output tokens, compared to GPT-4o's $1.00/$3.00 pricing for vision tasks. For a typical document analysis task processing 100 pages (approximately 50K tokens input, 5K output), Qwen3-VL costs $0.0143 versus GPT-4o's $0.065, representing a 78% cost reduction. The model achieves 96.5% on DocVQA, surpassing its smaller 4B variant by 2.3 points while maintaining the same aggressive pricing structure.
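The document-analysis cost comparison works out as follows, using only the per-million-token prices quoted above (the price table and function are an illustrative sketch, not vendor-published code):

```python
# Cost comparison using the per-million-token prices quoted in the article.
PRICES = {
    "qwen3-vl-thinking": {"in": 0.13, "out": 1.56},  # $ per million tokens
    "gpt-4o-vision":     {"in": 1.00, "out": 3.00},
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total dollar cost for one request at the listed rates."""
    p = PRICES[model]
    return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000


# 100-page document analysis: ~50K input tokens, ~5K output tokens
qwen = task_cost("qwen3-vl-thinking", 50_000, 5_000)  # $0.0143
gpt = task_cost("gpt-4o-vision", 50_000, 5_000)       # $0.065
savings = 1 - qwen / gpt                               # ~0.78 -> 78% cheaper
```

Note that the 87% figure in the introduction compares input-token prices alone ($0.13 vs $1.00); the blended per-task saving lands at 78% because output tokens narrow the gap.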
Qwen3-VL-Thinking scores 69.3 on MMMU-Pro versus GPT-5's 78.4, a 9.1 point deficit, while simultaneously outperforming GPT-5 on MathVista by 4.5 points and MathVision by 8.8 points. This divergence stems from MMMU-Pro's focus on multi-disciplinary college-level reasoning across subjects like physics, chemistry, and engineering, where GPT-5's broader training corpus provides an advantage. The MoE architecture's 3B active parameters excel at specialized mathematical visual reasoning but show limitations on tasks requiring diverse disciplinary knowledge that would benefit from activating more of the 30B total parameters.
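The head-to-head deltas against GPT-5 cited in this section come straight from the benchmark table; a minimal sketch of the arithmetic (score pairs taken from the article, nothing else assumed):

```python
# Qwen3-VL-Thinking-30A3B vs GPT-5 score deltas from the benchmark table above.
scores = {
    "MathVista":  (85.8, 81.3),
    "MathVision": (74.6, 65.8),
    "MMMU-Pro":   (69.3, 78.4),
}

# Positive delta = Qwen3-VL leads; negative = GPT-5 leads.
deltas = {name: round(qwen - gpt5, 1) for name, (qwen, gpt5) in scores.items()}
# {'MathVista': 4.5, 'MathVision': 8.8, 'MMMU-Pro': -9.1}
```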
The Thinking variant implements native step-by-step reasoning traces during inference, similar to OpenAI's o1 models, but optimized for visual tasks with explicit spatial grounding outputs. On CharXiv, it achieves 66.2% on reasoning tasks versus 90.5% on description tasks, a 24.3-point gap showing that multi-step visual reasoning remains substantially harder than description even with explicit thinking traces. The model generates intermediate reasoning tokens that aren't charged in the API pricing, making complex visual reasoning tasks cost-effective at scale.
Qwen3-VL-Thinking achieves 63.7% on AndroidWorld and 61.8% on ScreenSpot Pro, positioning it competitively for GUI automation tasks despite not being specialized for this domain. The model's 32-language OCR support with 875/1000 accuracy enables robust text extraction from UI elements across international applications. For comparison, these scores exceed many dedicated 7B GUI models while offering the flexibility to handle mixed workloads including document parsing, video analysis, and visual coding within the same deployment.