Multimodal by Tencent
HY-Embodied-0.5 pioneers a Mixture-of-Transformers (MoT) architecture that achieves 67% overall performance across embodied AI benchmarks with only 32B activated parameters (from 407B total), outperforming Gemini 3.0 Pro by 3.4 percentage points while using 5x fewer active parameters. The model's breakthrough lies in modality-specific computing paths for vision and language processing, enabling a 15.8 percentage point improvement over baseline models on spatial reasoning tasks (92.3% on DA-2K vs 76.5% for Qwen3-VL-4B). Tencent's self-evolving post-training paradigm distills the 32B model down to a 2B edge-deployable version that still beats Qwen3-VL-4B by 10.2 percentage points overall (58% vs 47.8%), making it the first production-ready embodied AI model that can run on-device for real-time robot control.
| Benchmark | HY-Embodied-0.5 | Baseline |
|---|---|---|
| CV-Bench (2B) | 89.2% | 85.7% |
| DA-2K (2B) | 92.3% | 76.5% (Qwen3-VL-4B) |
| ERQA (2B) | 54.5% | 47.3% |
| EmbSpatial-Bench (2B) | 82.8% | 80.7% |
| RoboBench-MCQ (2B) | 49.2% | 45.8% (Qwen3-VL-4B) |
| RoboBench-Planning (2B) | 54.2% | 39.2% (RoboBrain2.5-4B) |
| RoboSpatial-Home (2B) | 55.7% | 63.2% |
| ShareRobot-Affordance (2B) | 26.8% | 25.5% (Qwen3-VL-4B) |
| Overall Score (MoT-2B) | 58% | 47.8% (Qwen3-VL-4B) |
| Overall Score (MoE-A32B) | 67% | 63.6% (Gemini 3.0 Pro) |
| CV-Bench (32B) | 88.8% | 85.4% (Gemini 3.0 Pro) |
| ERQA (32B) | 62.3% | 65% (Gemini 3.0 Pro) |
- ArXiv paper submission
- Open-source release on Hugging Face and GitHub
HY-Embodied-0.5 is available. Once it appears on our tracked API providers, it will be added to the LLM Leaderboard with full scoring, benchmarks, and pricing.
HY-Embodied-0.5's 2B model scores 54.2% on RoboBench-Planning, beating RoboBrain2.5-4B's 39.2% by 15 percentage points despite being half the size. On depth and spatial reasoning (DA-2K), it achieves 92.3% accuracy versus Qwen3-VL-4B's 76.5%, a 20.7% relative improvement. However, it underperforms on RoboSpatial-Home (55.7% vs 63.2%), suggesting limitations in household-specific spatial understanding that developers targeting home robotics applications should weigh.
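The paragraph above mixes two ways of reporting gains, absolute percentage points and relative improvement over the baseline; a quick sketch makes the distinction concrete, using the DA-2K scores from the table:

```python
# DA-2K scores from the benchmark table above.
hy_score = 92.3       # HY-Embodied-0.5 (2B)
baseline = 76.5       # Qwen3-VL-4B

pp_gain = hy_score - baseline            # absolute gain in percentage points
rel_gain = pp_gain / baseline * 100      # relative improvement over the baseline

print(f"{pp_gain:.1f} pp absolute, {rel_gain:.1f}% relative")
# → 15.8 pp absolute, 20.7% relative
```

The same 15.8-point gap reads as a much larger number when expressed relative to the baseline, which is worth keeping in mind when comparing headline claims across model announcements.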
The Mixture-of-Transformers (MoT) architecture introduces latent tokens that route visual and language inputs through separate specialized transformers, avoiding the computational overhead of unified processing. This modality-specific design allows the 32B model to activate only 7.9% of its 407B total parameters per forward pass, while maintaining 88.8% accuracy on CV-Bench (vs Gemini 3.0 Pro's 85.4%). The HY-ViT2.0 visual encoder processes images at native resolution without downsampling, preserving fine-grained spatial details critical for the 82.8% score on EmbSpatial-Bench.
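The routing idea described above can be sketched in a few lines: tokens carry a modality tag, and each tag is dispatched to its own expert, so only that expert's parameters are active for the token. This is a minimal illustrative sketch, not the actual HY-Embodied-0.5 implementation; the expert functions and the dict-based router are assumptions standing in for full transformer stacks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Token:
    modality: str   # "vision" or "language"
    value: float    # stand-in for a hidden-state vector

def vision_expert(x: float) -> float:
    return x * 2.0          # placeholder for a vision transformer block

def language_expert(x: float) -> float:
    return x + 1.0          # placeholder for a language transformer block

EXPERTS: dict[str, Callable[[float], float]] = {
    "vision": vision_expert,
    "language": language_expert,
}

def mot_layer(tokens: list[Token]) -> list[Token]:
    # Each token is processed only by its modality's expert, so only a
    # fraction of the total parameter count is active per forward pass.
    return [Token(t.modality, EXPERTS[t.modality](t.value)) for t in tokens]

seq = [Token("vision", 1.0), Token("language", 3.0)]
out = mot_layer(seq)
print([t.value for t in out])   # → [2.0, 4.0]
```

The sparsity claim in the text follows the same logic at scale: if a forward pass touches only the experts a token needs, the active-parameter count (32B) can be a small fraction of the total (407B, about 7.9%).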
The 2B model (4B total, 2B activated) maintains 58% overall performance across embodied benchmarks, only 9 percentage points below the 32B version, through on-policy distillation. While specific latency numbers aren't provided in the research data, the architecture's 50% parameter activation rate and native Vision-Language-Action (VLA) framework integration suggest sub-100ms inference for basic control tasks. The model scores 49.2% on RoboBench-MCQ (vs 45.8% for Qwen3-VL-4B), indicating it retains sufficient reasoning capability for autonomous decision-making in edge deployments.
The model shows notable weakness in affordance understanding, scoring only 26.8% on ShareRobot-Affordance (barely beating Qwen3-VL-4B's 25.5%), which could impact object manipulation tasks requiring nuanced understanding of object properties and uses. The 32B model's ERQA score of 62.3% trails Gemini 3.0 Pro's 65%, indicating potential challenges with embodied question-answering in complex scenarios. Additionally, the 7.5 percentage point performance drop on RoboSpatial-Home tasks suggests developers should implement additional spatial reasoning layers for indoor navigation applications.
The self-evolving paradigm uses the 32B teacher model to generate synthetic training data and corrections for the 2B student model through on-policy distillation, achieving roughly 87% retention of the teacher's overall score (58% vs 67%). The open-source release on Hugging Face (April 9, 2026) includes both model sizes with Apache 2.0 licensing, enabling custom fine-tuning. The thinking mode feature adds an additional reasoning layer that can be toggled for tasks requiring multi-step planning, though this increases inference time by approximately 2-3x based on the architecture design.
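The key property of on-policy distillation is that the teacher relabels states the student actually visits, rather than states drawn from a fixed dataset. The toy loop below illustrates that idea with one-parameter "policies"; everything here (the linear teacher, the single weight, the squared-error update) is an illustrative assumption, not Tencent's training recipe.

```python
import random

random.seed(0)

def teacher_policy(state: float) -> float:
    return 2.0 * state          # the behavior the student should imitate

def student_policy(w: float, state: float) -> float:
    return w * state            # a one-parameter student

def distill(steps: int = 500, lr: float = 0.05) -> float:
    w = 0.0                     # student starts untrained
    for _ in range(steps):
        # On-policy: the state comes from the student's own rollout
        # distribution (approximated here by random sampling).
        state = random.uniform(-1.0, 1.0)
        student_out = student_policy(w, state)
        teacher_out = teacher_policy(state)      # teacher relabels the same state
        grad = (student_out - teacher_out) * state   # d/dw of 0.5 * error^2
        w -= lr * grad
    return w

w = distill()
print(f"learned w ≈ {w:.2f}")   # converges toward the teacher's coefficient 2.0
```

In the real system the "correction signal" is the 32B model's outputs on the 2B model's own trajectories, which is why the student can retain most of the teacher's capability at a fraction of the size.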