DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
To build a generalizable Vision–Language–Action (VLA) model with reasoning capability, a common approach is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then fine-tune it on reasoning-annotated robot data mixed with general multimodal data to restore general reasoning. However, we observe that the resulting reasoning VLA exhibits degraded action performance compared to the specialist VLA before fine-tuning. We define this phenomenon as Action Degeneration.
To tackle this issue, we propose DualVLA, which improves action performance through carefully designed post-training while preserving reasoning ability. We first propose a dual-layer data pruning method that removes redundant embodied reasoning and alleviates its adverse guidance on action learning. To further enhance action generation, we adopt a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability.
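As a rough illustration of how such a dual-teacher objective could be wired up, the Python sketch below routes each batch to one teacher by data domain and mixes a temperature-scaled distillation term with ground-truth cross-entropy. The function name, the domain flag, the temperature, and the mixing weight are assumptions for illustration, not DualVLA's exact formulation.

import torch.nn.functional as F

def dual_teacher_distillation_loss(student_logits, action_teacher_logits,
                                   reasoning_teacher_logits, labels,
                                   domain, tau=2.0, lam=0.5):
    """Hypothetical per-batch loss: robot-action batches are distilled from the
    action teacher, multimodal batches from the reasoning teacher."""
    teacher_logits = (action_teacher_logits if domain == "action"
                      else reasoning_teacher_logits)

    # Temperature-scaled KL divergence toward the selected teacher's soft targets.
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)

    # Ground-truth token supervision on the same batch.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1))
    return lam * kd + (1.0 - lam) * ce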
To fill the evaluation gap for generalist VLAs, we introduce VLA Score, which decouples VLA capability into reasoning, intention, action, and alignment, enabling more fine-grained evaluation. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between action execution and multimodal understanding.
Overview of DualVLA. DualVLA introduces a sparse yet information-dense embodied reasoning dataset by integrating video event prediction with kinematic cues, selectively retaining reasoning segments that are truly relevant to action execution and suppressing redundant or low-value descriptions that typically hinder policy learning. Building on this foundation, DualVLA adopts a dual-teacher distillation strategy: an action teacher that delivers fine-grained manipulation guidance for precise control, and a reasoning teacher that maintains broad multimodal reasoning ability without overfitting to specific tasks. By jointly enhancing actionable skills and preserving general reasoning, DualVLA demonstrates strong adaptability and performance across both simulation benchmarks and real-world robotic evaluations, offering a scalable paradigm for robust vision-language-action models.
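As a toy illustration of the segment-level pruning idea described above, the sketch below keeps only reasoning segments that are grounded in both a predicted video event and an actual kinematic change; the data fields, thresholds, and function names are hypothetical.

from dataclasses import dataclass
from typing import List

@dataclass
class ReasoningSegment:
    text: str
    event_confidence: float   # agreement with the predicted next video event
    kinematic_delta: float    # magnitude of gripper/joint motion the segment refers to

def prune_reasoning(segments: List[ReasoningSegment],
                    event_thresh: float = 0.6,
                    motion_thresh: float = 0.05) -> List[ReasoningSegment]:
    """Keep only segments that both agree with the predicted event and refer to
    meaningful motion; everything else is treated as redundant narration and dropped."""
    return [s for s in segments
            if s.event_confidence >= event_thresh
            and s.kinematic_delta >= motion_thresh]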
VLA Score evaluation pipeline overview: Given the policy trajectory, task description, and optional reasoning as input, VLA Score first performs dual retrieval to fetch task-relevant textual examples and visually similar trajectories from a curated knowledge base. The retrieved samples serve as few-shot context for the VLM judge, which evaluates the trajectory along four dimensions: Reasoning, Action, Intention, and Alignment. These scores are then combined with the simulation outcome to produce the final VLA Score; a toy aggregation sketch follows the dimension descriptions below.
Reasoning: Measures the correctness, logical consistency, and usefulness of the reasoning process.
Action: Measures the coherence and smoothness of the action sequence.
Intention: Determines whether the model's actions contribute constructively to solving the task.
Alignment: Measures how well the action sequences align with the reasoning content.
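The snippet below is a minimal sketch of the final aggregation, assuming the four judge scores lie on a 0-10 scale and that the judge average and the binary simulation outcome are weighted equally; these choices, and the function name, are illustrative assumptions rather than the paper's exact formula.

from statistics import mean

def vla_score(reasoning: float, action: float, intention: float,
              alignment: float, sim_success: bool,
              judge_weight: float = 0.5) -> float:
    """Combine the four VLM-judge dimensions with the simulation outcome."""
    judge = mean([reasoning, action, intention, alignment]) / 10.0  # normalize to [0, 1]
    return 100.0 * (judge_weight * judge
                    + (1.0 - judge_weight) * float(sim_success))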
Our model, DualVLA, achieves the highest VLA Score among reasoning VLAs. Reasoning VLAs display a substantially higher reasoning score than both their action and alignment scores. Failure-case analysis shows that their low action scores mainly stem from an inability to produce effective execution throughout the trajectory: although these models often reason correctly about how a task should be completed, they still struggle to approach or manipulate the target properly. DualVLA inherits strong reasoning capability from the reasoning teacher while learning refined, smooth action behaviors from the action teacher, capabilities that cross-entropy training alone cannot provide. We argue that this combination is a critical factor in DualVLA's effectiveness.
@article{fang2026dual,
title={DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action},
author={Zhen Fang and Zhuoyang Liu and Jiaming Liu and Hao Chen and Yu Zeng and Shiting Huang and Zehui Chen and Lin Chen and Shanghang Zhang and Feng Zhao},
journal={arXiv:2511.22134},
year={2025}
}