DualVLA:
Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action

1MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC   
2State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University   3CUHK
*Equal Contribution   Project Lead   #Corresponding Authors

Abstract

To build a generalizable Vision–Language–Action (VLA) model with reasoning capability, a common approach is to first train a specialist VLA on robot demonstrations to acquire reliable manipulation skills, and then continue training on annotated robot data mixed with multimodal data to restore general reasoning. However, we observe that the resulting reasoning VLA exhibits degraded action performance compared to the specialist VLA before fine-tuning. We define this phenomenon as Action Degeneration.

To tackle this issue, we propose DualVLA, which improves action performance through carefully designed post-training while preserving reasoning ability. We first propose a dual-layer data pruning method that removes redundant embodied reasoning and alleviates its adverse guidance on action learning. To further enhance the model's action generation capability, we adopt a dual-teacher adaptive distillation strategy that assigns different supervision signals to different data domains while maintaining reasoning ability.

To fill the evaluation gap of generalist VLAs, we introduce VLA Score, which decouples VLA capabilities into reasoning, intention, action, and alignment, enabling a more fine-grained evaluation. Experiments show that DualVLA achieves an average success rate of 61.0 in SimplerEnv and an average score of 65.4 across eight competitive multimodal benchmarks, demonstrating a stronger balance between action execution and multimodal understanding.

Framework

DualVLA Framework

Overview of DualVLA. DualVLA builds a sparse yet information-dense embodied reasoning dataset by integrating video event prediction with kinematic cues, selectively retaining reasoning segments that are truly relevant to action execution and suppressing the redundant, low-value descriptions that typically hinder policy learning. Building on this foundation, DualVLA adopts a dual-teacher distillation strategy: an action teacher delivers fine-grained manipulation guidance for precise control, while a reasoning teacher maintains broad multimodal reasoning ability without overfitting to specific tasks. By jointly enhancing actionable skills and preserving general reasoning, DualVLA demonstrates strong adaptability and performance across both simulation benchmarks and real-world robotic evaluations, offering a scalable paradigm for robust vision–language–action models.
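Below is a minimal sketch of how dual-teacher supervision could be combined during post-training, assuming PyTorch-style outputs. The teacher tensors, loss choices (L2 for actions, temperature-scaled KL for reasoning tokens), and the `alpha`/`beta` weights are illustrative placeholders rather than the exact training recipe.

```python
# Hedged sketch of dual-teacher adaptive distillation (not the official recipe).
# Assumption: robot batches carry continuous action targets from the action
# teacher, multimodal batches carry token logits from the reasoning teacher,
# and each teacher supervises only its own data domain.
import torch
import torch.nn.functional as F

def action_distill_loss(student_actions, teacher_actions):
    """Fine-grained action guidance: regress the student's continuous actions
    toward the action teacher's predictions (L2 here; the paper may differ)."""
    return F.mse_loss(student_actions, teacher_actions)

def reasoning_distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Preserve general reasoning: KL between student and reasoning-teacher
    token distributions on multimodal (non-robot) data."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

def dual_teacher_loss(batch, alpha=1.0, beta=1.0):
    """Assign a different supervision signal per data domain, as described in
    the framework overview. `alpha`/`beta` are illustrative weights."""
    if batch["domain"] == "robot":
        return alpha * action_distill_loss(batch["student_actions"],
                                           batch["action_teacher_actions"])
    return beta * reasoning_distill_loss(batch["student_logits"],
                                         batch["reasoning_teacher_logits"])

# Toy usage with random tensors standing in for model outputs.
robot_batch = {
    "domain": "robot",
    "student_actions": torch.randn(8, 7, requires_grad=True),
    "action_teacher_actions": torch.randn(8, 7),
}
print(dual_teacher_loss(robot_batch))
```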

Evaluation: VLA Score

VLA Score Framework

VLA Score evaluation pipeline overview: Given the policy trajectory, task description, and optional reasoning as input, VLA Score first performs dual retrieval to fetch task-relevant textual examples and visually similar trajectories from a curated knowledge base. The retrieved samples serve as few-shot context for the VLM judge, which evaluates the trajectory along four dimensions: Reasoning, Action, Intention, and Alignment. These scores are then combined with the simulation outcome to produce the final VLA Score.
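Before breaking down the four dimensions, here is a minimal sketch of the dual-retrieval step under simplifying assumptions: precomputed text and visual embeddings for the knowledge base, cosine similarity as the retrieval metric, and a placeholder for the VLM judge call. None of these names or choices are taken from the paper.

```python
# Hedged sketch of dual retrieval for building few-shot context for the VLM
# judge. Embedding sources, similarity metric, and the judge are assumptions.
import numpy as np

def cosine_top_k(query, bank, k=3):
    """Return indices of the k rows of `bank` most similar to `query`."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return np.argsort(b @ q)[::-1][:k]

def retrieve_fewshot_context(task_emb, traj_emb, kb_text_embs, kb_vis_embs,
                             kb_examples, k=3):
    """Dual retrieval: task-relevant textual examples plus visually similar
    trajectories, merged into one few-shot context for the judge."""
    text_idx = cosine_top_k(task_emb, kb_text_embs, k)
    vis_idx = cosine_top_k(traj_emb, kb_vis_embs, k)
    picked = sorted(set(text_idx) | set(vis_idx))
    return [kb_examples[i] for i in picked]

# Toy usage with random embeddings; in the real pipeline a VLM judge would be
# prompted with this context plus the trajectory to score R, A, I, and RA.
rng = np.random.default_rng(0)
kb_examples = [f"annotated example {i}" for i in range(10)]
context = retrieve_fewshot_context(rng.normal(size=64), rng.normal(size=64),
                                   rng.normal(size=(10, 64)),
                                   rng.normal(size=(10, 64)),
                                   kb_examples)
print(context)
```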

Reasoning

$R$: How to Think

Measures the correctness, logical consistency, and usefulness of the reasoning process.

Action

$A$: How to Act

Measures the coherence and smoothness of the action sequence.

Intention

$I$: How to Try

Determines whether the model’s actions contribute constructively to solving the task.

Alignment

$RA$: How to Align

Measures how well the action sequences align with reasoning content.


Combined with the trajectory’s simulation result $B$, we compute the overall VLA Score as follows:
$$
\text{VLA Score} =
\begin{cases}
\left( \dfrac{R + A \cdot I}{2} \right) \cdot RA \cdot B, & \pi_\theta \in \text{RVLA} \\[4pt]
A \cdot I \cdot RA \cdot B, & \pi_\theta \notin \text{RVLA}
\end{cases}
\qquad \text{where } B =
\begin{cases}
1, & \text{if success} \\
0, & \text{if failure}
\end{cases}
$$
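As a worked example, the formula transcribes directly into a small helper; it assumes the four judge scores are already on a common normalized scale, and the function name and signature are illustrative.

```python
def vla_score(R, A, I, RA, success, is_reasoning_vla=True):
    """Combine the per-dimension judge scores with the binary rollout
    outcome B, following the formula above."""
    B = 1.0 if success else 0.0
    if is_reasoning_vla:          # pi_theta in RVLA: reasoning term included
        return ((R + A * I) / 2.0) * RA * B
    return A * I * RA * B         # non-reasoning VLA: reasoning term dropped

# Example: a reasoning VLA with R=0.9, A=0.7, I=0.8, RA=0.85 on a successful
# rollout scores ((0.9 + 0.56) / 2) * 0.85 = 0.62 (approximately).
print(round(vla_score(0.9, 0.7, 0.8, 0.85, success=True), 2))
```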

Results

SimplerEnv Results
simplerenv
Comparison of manipulation success rates between DualVLA and specialist & generalist baselines in SimplerEnv. Google Robot and WidowX Robot denote the two embodiments in SimplerEnv. VM refers to visual matching and VA refers to variant aggregation. $^{\dagger}$ denotes models without released checkpoints; results are taken from their papers.
Real-World Tasks
real
To systematically evaluate our approach, we design two real-world dual-arm tasks on the Galaxea R1-lite robot. For the dual-arm setting, three RealSense 455 cameras provide image observations: one on the head and one on each wrist. The model takes the three camera views as image observation and outputs a 14-DoF vector as the dual-arm action. We design two complex tasks: (1) Move Objects and (2) Handover Objects. Both tasks require the model to move three objects from right to left in the order specified by the language instruction. For each task, we collected 50 high-quality demonstration trajectories. We run 10 rollouts per task and report the average success rate as the quantitative result. The results show that DualVLA significantly improves manipulation performance, raising the average success rate from 45.0% to 60.0% in real-world tasks. The gains on both the Move and Handover tasks demonstrate more reliable and coordinated action generation in real robotic settings.
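A minimal sketch of the real-world inference interface described above (three RGB views in, one 14-DoF action out). The stub policy, 224x224 resolution, and camera names are assumptions, not the deployed model.

```python
# Hedged sketch of the dual-arm inference loop on the Galaxea R1-lite setup:
# three RealSense views as observation, one 14-DoF action vector per step.
import numpy as np

CAMERAS = ("head", "left_wrist", "right_wrist")  # three RealSense 455 views
ACTION_DIM = 14                                   # 7 DoF per arm

def get_observation(rng):
    """Placeholder for grabbing synchronized frames from the three cameras."""
    return {cam: rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
            for cam in CAMERAS}

def policy_step(observation, instruction, rng):
    """Stub standing in for the VLA forward pass; returns a 14-DoF action."""
    assert set(observation) == set(CAMERAS)
    return rng.normal(size=ACTION_DIM)

rng = np.random.default_rng(0)
obs = get_observation(rng)
action = policy_step(obs, "move the cup, then the bowl, then the plate", rng)
print(action.shape)  # (14,)
```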
VLA Score Results
vla_score_result

Our model, DualVLA, achieves the highest score in VLA Score among reasoning VLAs. Reasoning VLAs display a substantially higher reasoning score than both their action and alignment scores. Failure case analysis shows that their low action scores mainly stem from the inability to produce effective execution throughout the trajectory: although these models often reason correctly about how a task should be completed, they still struggle to approach or manipulate the target properly. DualVLA inherits strong reasoning capability from the reasoning teacher while learning refined and smooth action behaviors from the action teacher—capabilities that cross-entropy training alone cannot provide. We argue that this combination is a critical factor contributing to the effectiveness of DualVLA.

Visualizations

SimplerEnv
google
WidowX
Visual examples of SimplerEnv Google robot tasks driven by DualVLA (top) and WidowX robot tasks (bottom). Compared to the generalist baseline, DualVLA generates more accurate and smoother action sequences that better align with the reasoning process, leading to successful task completion.
Real-World Tasks
Demonstration videos of two real-world manipulation tasks: the top row shows the Move Objects task, and the bottom row shows the Handover Objects task.
Multimodal Reasoning
mm reasoning
Distillation from the reasoning teacher effectively preserves multimodal reasoning capabilities, with no significant difference compared to the reasoning teacher or the base VLM.

BibTeX

@article{fang2026dual,
  title={DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action},
  author={Zhen Fang and Zhuoyang Liu and Jiaming Liu and Hao Chen and Yu Zeng and Shiting Huang and Zehui Chen and Lin Chen and Shanghang Zhang and Feng Zhao},
  journal={arXiv preprint arXiv:2511.22134},
  year={2025}
}