RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models
RoboAlign addresses the modality gap between high-level language reasoning and low-level robot control in Vision-Language-Action (VLA) models. The framework first uses supervised fine-tuning to teach a multimodal LLM to generate FAST action tokens through zero-shot chain-of-thought reasoning, then applies Group Relative Policy Optimization (GRPO) to refine reasoning based on token-level action accuracy. This matters because prior work showed that improving embodied reasoning via language supervision often fails to translate into better robot performance or even degrades it.
The paper presents a compelling case that direct alignment with low-level actions via RL outperforms language-only or trajectory-based supervision, particularly on long-horizon tasks. The two-stage pipeline is principled and the gains on LIBERO Long (70.0% vs 65.6% for SFT-only) and CALVIN are consistent. However, the framing of data efficiency is misleading—the RL stage builds on 2.28M SFT samples, and the headline 17.5% improvement on LIBERO compares against the base model rather than the SFT checkpoint. Real-world evaluation is limited to four tasks with 96 trials each, which is modest evidence for the claimed 106.6% gain.
The core insight—that optimizing embodied reasoning purely through language supervision does not guarantee improvements in actual action generation—empirically holds. The two-stage design successfully transfers reasoning ability to FAST token generation (Table 1), and the RL stage demonstrably sharpens representations: KNN classification accuracy on underlying robot states improves from 43.23% to 69.79% (Table 8). The ablation showing that low-level action-based RL beats high-level language actions (86.8% vs 83.6% on LIBERO) and 2D trajectory prediction (85.1%) supports the central claim that direct action alignment is superior.
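For readers unfamiliar with this kind of probe, a k-nearest-neighbour check on frozen features can be sketched as below. This is a toy illustration under stated assumptions, not the paper's evaluation protocol: the 2-D feature vectors, the `open`/`closed` state labels, the choice of `k`, and the function names `knn_predict` and `knn_accuracy` are all hypothetical, since the review does not describe how Table 8 was produced.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query vector.
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


def knn_accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fraction of test points whose predicted label matches the true label."""
    correct = sum(knn_predict(train_x, train_y, x, k) == y
                  for x, y in zip(test_x, test_y))
    return correct / len(test_y)


# Hypothetical stand-ins for frozen backbone features and robot-state labels.
train_x = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (1.0, 0.9), (0.9, 1.0), (1.1, 1.0)]
train_y = ["open", "open", "open", "closed", "closed", "closed"]
test_x = [(0.1, 0.1), (1.0, 1.0)]
test_y = ["open", "closed"]

probe_accuracy = knn_accuracy(train_x, train_y, test_x, test_y, k=3)
```

Because the probe has no trainable parameters, a higher accuracy after RL can be read as the representation itself separating robot states more cleanly, which is the interpretation the review gives to the 43.23% to 69.79% change.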
The data efficiency claim obscures that the 12.8K RL samples represent only the final tuning stage atop 2.28M SFT samples. The marginal gain from RL over the RoboAlign SFT baseline on LIBERO is 8.1 percentage points (78.7% to 86.8%), or roughly 10% in relative terms, not the headline 17.5% (which compares against the raw Qwen2.5-VL base). The comparison with ECoT (Table 7) suggests SFT-based alignment degrades performance, but this may reflect distribution mismatch rather than a fundamental SFT limitation. The real-world evaluation (Table 4) covers only four pick-and-place tasks, and the 106.6% relative improvement is calculated against a weak base model (32.3%) rather than the SFT model (55.2%).
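The dependence on the choice of baseline can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this review. The final real-world success rate is not quoted here, so `real_final` is derived from the base score and the 106.6% headline and should be treated as an inferred value, not a reported one. The helper names are illustrative.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


def implied_score(baseline: float, gain_pct: float) -> float:
    """Score implied by applying a relative gain (in percent) to a baseline."""
    return baseline * (1.0 + gain_pct / 100.0)


# LIBERO: RoboAlign SFT checkpoint vs. SFT + RL (figures quoted in the review).
libero_sft, libero_rl = 78.7, 86.8
libero_abs = libero_rl - libero_sft                 # about 8.1 percentage points
libero_rel = relative_gain(libero_sft, libero_rl)   # about 10.3% relative

# Real world: base model 32.3%, SFT model 55.2%, headline gain 106.6% over base.
real_base, real_sft = 32.3, 55.2
real_final = implied_score(real_base, 106.6)            # inferred, about 66.7%
real_rel_vs_sft = relative_gain(real_sft, real_final)   # about 20.9% relative
```

Under these assumptions, the same real-world result reads as a gain of roughly 107% against the base model but only about 21% against the SFT model.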
The evidence supports the primary claim that RL-based low-level action alignment outperforms alternatives. Table 6 demonstrates superiority over language-based and visual-based RL alignment, particularly for long-horizon tasks (70.0% vs 58.2% and 64.6%). However, comparisons to prior VLA methods in Table 2 mix different backbones (e.g., OpenVLA, Octo) and training setups, making it unclear how much improvement stems from RoboAlign versus simply using Qwen2.5-VL. The paper does not adequately explain why RoboBrain 2.0—a model with higher embodied reasoning scores—performs worse as a VLA backbone, beyond attributing it to a vague modality gap.
Implementation details are reasonably thorough. The authors specify hyperparameters ($2\times 10^{-5}$ learning rate for SFT, $1\times 10^{-6}$ for RL), the use of EasyR1 for GRPO training, and GR00T-N1.5 for VLA conversion. Compute requirements are stated ($8\times$ H200 GPUs for ~30 hours SFT, 1 hour RL). However, the RoboAlign VQA dataset generated by Gemini-2.5 Pro lacks reproducible prompts and filtering criteria, and the code is not released. The RL reward function depends on exact FAST token matching (Equation 2), which is deterministic but sensitive to tokenization nuances not fully specified.
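To make the tokenization concern concrete, the sketch below scores sampled token sequences by position-wise agreement with a reference and turns the scores into group-relative advantages in the style of GRPO. It is a minimal illustration under assumptions, not the paper's Equation 2: the position-wise reward form, the handling of length mismatches, the use of the population standard deviation, the token ids, and the function names are all hypothetical.

```python
from statistics import mean, pstdev


def token_match_reward(predicted, target):
    """Fraction of positions where the predicted token id equals the target.

    Assumed reward form: positions missing or extra in `predicted`
    count as misses, so the denominator is the longer of the two lengths.
    """
    if not target:
        return 0.0
    hits = sum(p == t for p, t in zip(predicted, target))
    return hits / max(len(predicted), len(target))


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical action-token ids for one prompt and a group of four samples.
target = [12, 7, 45, 31]
samples = [
    [12, 7, 45, 31],     # exact match
    [12, 7, 5, 31],      # one wrong token
    [3, 7, 45],          # wrong first token, one token short
    [12, 7, 45, 31, 9],  # correct prefix, one extra token
]
rewards = [token_match_reward(s, target) for s in samples]  # [1.0, 0.75, 0.5, 0.8]
advantages = group_relative_advantages(rewards)

# A single spurious leading token shifts every position and zeroes the reward.
shifted_reward = token_match_reward([99] + target, target)  # 0.0
```

The last line shows the sensitivity the review points to: a sequence that contains every correct token, offset by one position, receives no credit under a strictly positional match. How the actual reward treats such cases depends on details that are not fully specified.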
Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them that readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through vision-question-answering-style supervision. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose RoboAlign, a more systematic MLLM training framework that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and to refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.