Language Model by Alibaba
Qwen3.6-35B-A3B uses a sparse MoE architecture with only 3B active parameters out of 35B total, achieving 49.5% on SWE-bench Pro (13.8 points above Gemma4-31B) while costing just $0.16/$0.90 per million tokens. The model implements Gated Delta Networks combined with early fusion multimodal training, enabling native vision-language capabilities with 92% accuracy on RefCOCO spatial reasoning tasks. With both thinking and non-thinking modes plus a 262K context window, it targets the gap between open-source models and GPT-5.2 (which beats it by only 4 points on AIME 2026) at 1/50th the typical cost of frontier models.
| Benchmark | Qwen3.6-35B-A3B-UD-Q4_K_S.gguf | Comparison |
|---|---|---|
| SWE-bench Verified | 73.4% | 75% (Qwen3.5-27B) |
| SWE-bench Pro | 49.5% | 35.7% (Gemma4-31B) |
| Terminal-Bench 2.0 | 51.5% | 42.9% (Gemma4-31B) |
| AIME 2026 | 92.7% | 96.7% (GPT-5.2) |
| GPQA Diamond | 86% | 93.3% |
| MMLU Pro | 85.2% | - |
| HMMT Feb 2026 | 83.6% | 94.8% |
| Humanity's Last Exam (HLE) | 21.4% | - |
| RefCOCO (Spatial Intelligence) | 92% | - |
| ODInW13 (Object Detection) | 50.8% | - |
| Artificial Analysis Intelligence Index | 37% | - |
Release timeline:
- Qwen3.5 series announcement
- Qwen3.5-35B-A3B medium series release
- Qwen3.6-35B-A3B release with enhanced agentic coding
Qwen3.6-35B-A3B-UD-Q4_K_S.gguf is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
The 35B total/3B active parameter design achieves 85.2% on MMLU Pro while activating only about 3B parameters per token, so per-token compute is comparable to a 3B dense model even though all 35B weights remain resident in memory. This sparse activation pattern enables deployment on a single A100 GPU with INT4 quantization, delivering 73.4% on SWE-bench Verified (within 1.6 points of the larger Qwen3.5-27B). Expert routing is handled per token by the MoE gating layer, which selects experts based on the input, while the Gated Delta Network layers provide efficient sequence mixing over long contexts; together they help explain why it excels at coding tasks (51.5% on Terminal-Bench 2.0) despite its compact active size.
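For intuition, here is a minimal, illustrative sketch of top-k expert gating as used in sparse MoE layers generally; it is not Qwen's actual router code, and the gate weights, expert shapes, and function names are invented for the example.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Illustrative top-k expert routing for one token.

    x       : (d_model,) token hidden state
    gate_w  : (d_model, n_experts) router weights
    experts : list of callables, each mapping (d_model,) -> (d_model,)
    Only the top_k experts run, so compute scales with active (not total) parameters.
    """
    logits = x @ gate_w                    # router score per expert
    top = np.argsort(logits)[-top_k:]      # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy usage: 8 experts, each token routed to 2 of them
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda h, W=rng.normal(size=(d, d)) * 0.1: h @ W for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
out = moe_forward(rng.normal(size=d), gate_w, experts, top_k=2)
print(out.shape)  # (16,)
```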
At $0.16 input/$0.90 output per million tokens, Qwen3.6-35B-A3B costs roughly 20-40x less than GPT-4o's typical $5-15 per million token pricing. For a workload processing 100M tokens daily (70/30 input/output split), that's $38.20/day versus $800-1500/day. The tradeoff: you lose 4-11 points on mathematical reasoning benchmarks (92.7% vs 96.7% on AIME 2026) but gain self-hosting capability and 262K context windows without rate limits.
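As a rough check on those figures, the arithmetic below recomputes the daily cost under the stated 100M-token, 70/30 workload; the GPT-4o prices use the approximate $5/$15 per-million-token range cited above rather than exact published rates.

```python
# Daily-cost comparison for 100M tokens/day at a 70/30 input/output split.
# Prices are dollars per million tokens.
def daily_cost(tokens_per_day, input_share, price_in, price_out):
    input_tok = tokens_per_day * input_share
    output_tok = tokens_per_day * (1 - input_share)
    return (input_tok * price_in + output_tok * price_out) / 1e6

qwen = daily_cost(100e6, 0.7, 0.16, 0.90)    # -> 38.20
gpt_low = daily_cost(100e6, 0.7, 5.0, 15.0)  # -> 800.00 (low end of the quoted range)
print(f"Qwen3.6-35B-A3B: ${qwen:,.2f}/day vs GPT-4o (low end): ${gpt_low:,.2f}/day")
```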
The model scores 49.5% on SWE-bench Pro (versus Gemma4-31B's 35.7%) and 51.5% on Terminal-Bench 2.0 (versus 42.9%), indicating strong performance on repository-level code understanding and terminal-based development workflows. Because thinking traces can be carried across turns, it supports iterative debugging over long agentic sessions, while the 262K context window handles entire codebases. However, it trails Qwen3.5-27B by 1.6 points on SWE-bench Verified (73.4% vs 75%), suggesting the larger model may be better for complex refactoring tasks.
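If the model follows the same chat-template convention as the Qwen3 series (an assumption; the model ID below is illustrative), switching between thinking and non-thinking modes is a single flag when building the prompt:

```python
from transformers import AutoTokenizer

# Assumes Qwen3.6 keeps the Qwen3 convention of an `enable_thinking` switch
# in the chat template; the repository name is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")

messages = [{"role": "user", "content": "Why does this regex fail on multiline input?"}]

# Thinking mode: the model emits an internal reasoning trace before its answer.
prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: direct answer, lower latency for simple turns.
prompt_direct = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```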
With 92% accuracy on RefCOCO spatial intelligence tasks and 50.8% on ODInW13 object detection, Qwen3.6-35B-A3B's early fusion training delivers competitive vision performance without requiring separate encoders. The unified vision-language foundation means you can process images and text in the same context window without mode switching. Because vision is native rather than bolted on post-training, there is no separate vision-encoder stack to load or route through at inference time.
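As a sketch of what this looks like in practice, the call below sends an image and a spatial question in one request, assuming the model is served behind an OpenAI-compatible endpoint (for example via vLLM); the base URL, API key, model name, and image URL are placeholders.

```python
from openai import OpenAI

# Placeholder endpoint: any OpenAI-compatible server hosting the model works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3.6-35B-A3B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Which shelf is the red mug on, counting from the top?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/kitchen.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```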
The Q4_K_S quantization reduces the 35B parameter model to approximately 20GB, fitting on consumer GPUs with 24GB VRAM. Benchmark degradation is minimal: expect 2-3% drops on reasoning tasks based on the architecture's robustness to quantization. The sparse MoE design means only 3B parameters activate per token, maintaining inference speeds of 40-50 tokens/second on RTX 4090 hardware even with the full 262K context loaded.
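A minimal way to try the GGUF locally is through llama-cpp-python, sketched below; the file path is a placeholder, and the context size is set well under the 262K maximum as a conservative starting point for 24GB cards.

```python
from llama_cpp import Llama

# Load the Q4_K_S GGUF (~20GB) and offload all layers to the GPU.
llm = Llama(
    model_path="./Qwen3.6-35B-A3B-UD-Q4_K_S.gguf",  # placeholder path
    n_gpu_layers=-1,   # -1 = offload every layer
    n_ctx=32768,       # context window requested for this session
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize sparse MoE routing in two sentences."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```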