Multimodal by Alibaba
Qwen3-VL-Thinking achieves 96.5% on DocVQA and 85.9% on MathVista-Mini, beating the 72B Qwen2.5-VL on the latter by 6.5 percentage points while activating only 22B of its 235B MoE parameters. The model introduces a hybrid thinking/non-thinking framework, natively processes 256K tokens (expandable to 1M), and uses text-based time alignment for frame-level video comprehension, enabling precise timestamp extraction from 2-hour videos. At $0.26/$2.60 per million input/output tokens, it costs roughly 10x less than GPT-4V while matching or exceeding it on mathematical reasoning (85.7% on AIME 2024) and multimodal tasks, though it trails GPT-5 by 9.1 points on MMMU-Pro.
| Benchmark | Qwen3-VL-Thinking | Comparison |
|---|---|---|
| MMMU | 78.1% | 77.7% (Qwen2.5-VL-72B) |
| MathVista-Mini | 85.9% | 79.4% (Qwen2.5-VL-72B) |
| MathVision | 70.2% | 64.3% (Qwen2.5-VL-72B) |
| MMMU-Pro | 69.3% | 78.4% (GPT-5) |
| LiveCodeBench v5 | 70.7% | - |
| AIME 2024 | 85.7% | - |
| AIME 2025 | 81.5% | - |
| BFCL v3 | 70.8% | - |
| DocVQA | 96.5% | - |
| OCRBench | 875 (out of 1,000) | - |
| ScreenSpot Pro | 61.8% | - |
| AndroidWorld | 63.7% | - |
- Qwen3 family announcement
- Qwen3-VL-235B-A22B-Thinking official release
- Qwen3-VL-30B-A3B-Thinking release
- Qwen3-VL-4B/8B-Thinking release
- Qwen3-VL-2B/32B-Thinking release
- Qwen3-VL Technical Report published
Qwen3-VL-Thinking is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
Qwen3-VL-Thinking scores 78.1% on MMMU (vs Qwen2.5-VL's 77.7%), 85.9% on MathVista-Mini (a 6.5-point improvement), and 70.2% on MathVision (a 5.9-point gain). However, it scores 69.3% on MMMU-Pro against GPT-5's 78.4%, a 9.1-point gap on more complex reasoning tasks. The model excels at document understanding with 96.5% on DocVQA and scores 875 out of 1,000 on OCRBench, with OCR support for 39 languages.
At $0.26 per million input tokens and $2.60 per million output tokens, Qwen3-VL-Thinking costs approximately 10x less than GPT-4V ($10/$30) and 5x less than Claude 3.5 Sonnet ($3/$15). For a typical document processing workload of 10M input tokens and 1M output tokens daily, you'd pay $5.20/day with Qwen3-VL versus $130/day with GPT-4V - saving $3,750 per month while achieving 96.5% accuracy on DocVQA tasks.
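As a sanity check on those figures, here is a small Python sketch that reproduces the daily and monthly cost comparison from the published per-token prices; the 10M/1M daily token workload and the 30-day month are illustrative assumptions, not measured usage.

```python
# Rough cost comparison for a daily document-processing workload.
# Prices are USD per million tokens (input, output); workload figures are illustrative.
PRICES = {
    "Qwen3-VL-Thinking": (0.26, 2.60),
    "GPT-4V": (10.00, 30.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}

INPUT_TOKENS_PER_DAY = 10_000_000   # 10M input tokens/day (assumed workload)
OUTPUT_TOKENS_PER_DAY = 1_000_000   # 1M output tokens/day (assumed workload)
DAYS_PER_MONTH = 30

def daily_cost(input_price: float, output_price: float) -> float:
    """USD cost for one day of the assumed workload."""
    return (INPUT_TOKENS_PER_DAY / 1e6) * input_price + (OUTPUT_TOKENS_PER_DAY / 1e6) * output_price

for model, (p_in, p_out) in PRICES.items():
    per_day = daily_cost(p_in, p_out)
    print(f"{model:20s} ${per_day:8.2f}/day   ${per_day * DAYS_PER_MONTH:10.2f}/month")

# Qwen3-VL-Thinking comes to $5.20/day vs $130.00/day for GPT-4V,
# i.e. roughly $3,744 (~$3,750) saved per 30-day month.
```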
The Mixture-of-Experts design activates only 22B of the 235B total parameters per forward pass, so each token touches roughly 44GB of weights in FP16 rather than the full 470GB, keeping per-token compute and inference speed close to a dense ~20B model (the full 235B parameters still need to be stored, whether resident in GPU memory or offloaded). The architecture also incorporates DeepStack multi-level ViT feature integration and interleaved-MRoPE for spatial-temporal modeling. The native 256K context window processes long videos without chunking, though expanding to 1M tokens increases memory requirements proportionally.
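The 44GB and 470GB figures follow directly from the published parameter counts at 2 bytes per FP16 weight; the short sketch below shows the arithmetic (only the 22B-active / 235B-total counts are taken from the release, the rest is plain unit conversion).

```python
# FP16 weight footprint: 2 bytes per parameter, 1 GB taken as 1e9 bytes.
BYTES_PER_PARAM_FP16 = 2

def weight_gb(num_params: float) -> float:
    """Weight size in GB for a given parameter count."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e9

active_params = 22e9    # parameters activated per forward pass
total_params = 235e9    # total parameters that must be stored

print(f"Active weights per token: ~{weight_gb(active_params):.0f} GB")  # ~44 GB
print(f"Total weights to store:   ~{weight_gb(total_params):.0f} GB")   # ~470 GB
```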
While scoring 70.7% on LiveCodeBench v5 and 70.8% on BFCL v3 for function calling, Qwen3-VL-Thinking lags behind code-specialized models like DeepSeek-Coder-V2.5. The model achieves 61.8% on ScreenSpot Pro and 63.7% on AndroidWorld for GUI automation - functional but not state-of-the-art. Its March 2025 training cutoff means it lacks knowledge of recent events, and the 69.3% MMMU-Pro score indicates struggles with graduate-level multimodal reasoning compared to GPT-5's 78.4%.
The unified framework allows switching between standard inference (non-thinking) for simple tasks and chain-of-thought reasoning (thinking) for complex problems without separate model versions. Thinking mode lifts AIME 2024 to 85.7% and AIME 2025 to 81.5% but increases latency by roughly 3-5x and token usage by 10-20x. Use non-thinking mode for OCR (875 on OCRBench), basic Q&A, and real-time applications; reserve thinking mode for mathematical proofs, multi-step reasoning, or complex video analysis requiring frame-level precision.
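A minimal sketch of what that switching could look like against an OpenAI-compatible endpoint, assuming the provider exposes the same `enable_thinking` toggle (passed through `chat_template_kwargs`) that other Qwen3 deployments document; the base URL, model id, and flag name are assumptions, not confirmed details of this release.

```python
# Sketch: toggling thinking vs non-thinking mode on an OpenAI-compatible endpoint.
# The base_url, model id, and enable_thinking flag are assumptions modeled on
# other Qwen3 deployments (e.g. vLLM's chat_template_kwargs), not this release's API.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-provider.com/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

def ask(prompt: str, thinking: bool) -> str:
    response = client.chat.completions.create(
        model="qwen3-vl-235b-a22b-thinking",  # hypothetical model id
        messages=[{"role": "user", "content": prompt}],
        extra_body={"chat_template_kwargs": {"enable_thinking": thinking}},
    )
    return response.choices[0].message.content

# Fast path for simple Q&A or real-time use.
print(ask("What is the capital of France?", thinking=False))

# Chain-of-thought path for multi-step math; expect ~3-5x latency and far more tokens.
print(ask("Prove that the sum of the first n odd numbers is n squared.", thinking=True))
```

The design point is that both behaviors come from one deployed model: the flag only changes whether a reasoning trace is generated before the answer, so routing cheap requests to the non-thinking path avoids paying the 10-20x token overhead where it adds nothing.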