Multimodal by Tencent
HY-Embodied-0.5 pioneers a Mixture-of-Transformers (MoT) architecture that achieves 67% overall performance across embodied AI benchmarks with only 32B activated parameters (from 407B total), outperforming Gemini 3.0 Pro by 3.4 percentage points while using 5x fewer active parameters. The model's breakthrough lies in modality-specific computing paths for vision and language processing, enabling a 15.8 percentage point improvement over baseline models on spatial reasoning tasks (92.3% on DA-2K vs 76.5% for Qwen3-VL-4B). Tencent's self-evolving post-training paradigm distills the 32B model down to a 2B edge-deployable version that still beats Qwen3-VL-4B by 10.2 percentage points overall (58% vs 47.8%), making it the first production-ready embodied AI model that can run on-device for real-time robot control.
| Benchmark | HY-Embodied-0.5 | Baseline |
|---|---|---|
| CV-Bench (2B) | 89.2% | 85.7% |
| DA-2K (2B) | 92.3% | 76.5% (Qwen3-VL-4B) |
| ERQA (2B) | 54.5% | 47.3% |
| EmbSpatial-Bench (2B) | 82.8% | 80.7% |
| RoboBench-MCQ (2B) | 49.2% | 45.8% (Qwen3-VL-4B) |
| RoboBench-Planning (2B) | 54.2% | 39.2% (RoboBrain2.5-4B) |
| RoboSpatial-Home (2B) | 55.7% | 63.2% |
| ShareRobot-Affordance (2B) | 26.8% | 25.5% (Qwen3-VL-4B) |
| Overall Score (MoT-2B) | 58% | 47.8% (Qwen3-VL-4B) |
| Overall Score (MoE-A32B) | 67% | 63.6% (Gemini 3.0 Pro) |
| CV-Bench (32B) | 88.8% | 85.4% (Gemini 3.0 Pro) |
| ERQA (32B) | 62.3% | 65% (Gemini 3.0 Pro) |
- ArXiv paper submission
- Open-source release on Hugging Face and GitHub
HY-Embodied-0.5 is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
HY-Embodied-0.5's 2B model scores 54.2% on RoboBench-Planning, beating RoboBrain2.5-4B's 39.2% by 15 percentage points despite being half the size. On depth and spatial reasoning (DA-2K), it achieves 92.3% accuracy versus Qwen3-VL-4B's 76.5%, a 20.7% relative improvement. However, it underperforms on RoboSpatial-Home (55.7% vs 63.2%), suggesting limitations in household-specific spatial understanding that developers targeting home robotics applications should weigh.
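The paragraph above mixes two ways of reporting gains, absolute percentage points and relative improvement over the baseline; a quick sketch makes the distinction concrete, using the DA-2K scores from the table:

```python
# DA-2K scores from the benchmark table above.
hy_score = 92.3       # HY-Embodied-0.5 (2B)
baseline = 76.5       # Qwen3-VL-4B

pp_gain = hy_score - baseline            # absolute gain in percentage points
rel_gain = pp_gain / baseline * 100      # relative improvement over the baseline

print(f"{pp_gain:.1f} pp absolute, {rel_gain:.1f}% relative")
# → 15.8 pp absolute, 20.7% relative
```

The same 15.8-point gap reads as a much larger number when expressed relative to the baseline, which is worth keeping in mind when comparing headline claims across model announcements.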
The Mixture-of-Transformers (MoT) architecture introduces latent tokens that route visual and language inputs through separate specialized transformers, avoiding the computational overhead of unified processing. This modality-specific design allows the 32B model to activate only 7.9% of its 407B total parameters per forward pass, while maintaining 88.8% accuracy on CV-Bench (vs Gemini 3.0 Pro's 85.4%). The HY-ViT2.0 visual encoder processes images at native resolution without downsampling, preserving fine-grained spatial details critical for the 82.8% score on EmbSpatial-Bench.
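The routing idea described above can be sketched in a few lines: tokens carry a modality tag, and each tag is dispatched to its own expert, so only that expert's parameters are active for the token. This is a minimal illustrative sketch, not the actual HY-Embodied-0.5 implementation; the expert functions and the dict-based router are assumptions standing in for full transformer stacks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Token:
    modality: str   # "vision" or "language"
    value: float    # stand-in for a hidden-state vector

def vision_expert(x: float) -> float:
    return x * 2.0          # placeholder for a vision transformer block

def language_expert(x: float) -> float:
    return x + 1.0          # placeholder for a language transformer block

EXPERTS: dict[str, Callable[[float], float]] = {
    "vision": vision_expert,
    "language": language_expert,
}

def mot_layer(tokens: list[Token]) -> list[Token]:
    # Each token is processed only by its modality's expert, so only a
    # fraction of the total parameter count is active per forward pass.
    return [Token(t.modality, EXPERTS[t.modality](t.value)) for t in tokens]

seq = [Token("vision", 1.0), Token("language", 3.0)]
out = mot_layer(seq)
print([t.value for t in out])   # → [2.0, 4.0]
```

The sparsity claim in the text follows the same logic at scale: if a forward pass touches only the experts a token needs, the active-parameter count (32B) can be a small fraction of the total (407B, about 7.9%).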
The 2B model (4B total, 2B activated) maintains 58% overall performance across embodied benchmarks, only 9 percentage points below the 32B version, through on-policy distillation. While specific latency numbers aren't provided in the research data, the architecture's 50% parameter activation rate and native Vision-Language-Action (VLA) framework integration suggest sub-100ms inference for basic control tasks. The model scores 49.2% on RoboBench-MCQ (vs 45.8% for Qwen3-VL-4B), indicating it retains sufficient reasoning capability for autonomous decision-making in edge deployments.
The model shows notable weakness in affordance understanding, scoring only 26.8% on ShareRobot-Affordance (barely beating Qwen3-VL-4B's 25.5%), which could impact object manipulation tasks requiring nuanced understanding of object properties and uses. The 32B model's ERQA score of 62.3% trails Gemini 3.0 Pro's 65%, indicating potential challenges with embodied question-answering in complex scenarios. Additionally, the 7.5 percentage point performance drop on RoboSpatial-Home tasks suggests developers should implement additional spatial reasoning layers for indoor navigation applications.
The self-evolving paradigm uses the 32B teacher model to generate synthetic training data and corrections for the 2B student model through on-policy distillation, achieving roughly 87% retention of the teacher's overall score (58% vs 67%). The open-source release on Hugging Face (April 9, 2026) includes both model sizes with Apache 2.0 licensing, enabling custom fine-tuning. The thinking mode feature adds an additional reasoning layer that can be toggled for tasks requiring multi-step planning, though this increases inference time by approximately 2-3x based on the architecture design.
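The key property of on-policy distillation is that the teacher relabels states the student actually visits, rather than states drawn from a fixed dataset. The toy loop below illustrates that idea with one-parameter "policies"; everything here (the linear teacher, the single weight, the squared-error update) is an illustrative assumption, not Tencent's training recipe.

```python
import random

random.seed(0)

def teacher_policy(state: float) -> float:
    return 2.0 * state          # the behavior the student should imitate

def student_policy(w: float, state: float) -> float:
    return w * state            # a one-parameter student

def distill(steps: int = 500, lr: float = 0.05) -> float:
    w = 0.0                     # student starts untrained
    for _ in range(steps):
        # On-policy: the state comes from the student's own rollout
        # distribution (approximated here by random sampling).
        state = random.uniform(-1.0, 1.0)
        student_out = student_policy(w, state)
        teacher_out = teacher_policy(state)      # teacher relabels the same state
        grad = (student_out - teacher_out) * state   # d/dw of 0.5 * error^2
        w -= lr * grad
    return w

w = distill()
print(f"learned w ≈ {w:.2f}")   # converges toward the teacher's coefficient 2.0
```

In the real system the "correction signal" is the 32B model's outputs on the 2B model's own trajectories, which is why the student can retain most of the teacher's capability at a fraction of the size.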