RoboAlign: Learning Test-Time Reasoning for Language-Action Alignment in Vision-Language-Action Models

cs.AI Dongyoung Kim, Sumin Park, Woomin Song, Seungku Kim, Taeyoung Kim, Huiwon Jang, Jinwoo Shin, Jaehyung Kim, Younggyo Seo · Mar 22, 2026


Plain-language introduction

RoboAlign addresses the modality gap between high-level language reasoning and low-level robot control in Vision-Language-Action (VLA) models. The framework first uses supervised fine-tuning to teach a multimodal LLM to generate FAST action tokens through zero-shot chain-of-thought reasoning, then applies Group Relative Policy Optimization (GRPO) to refine reasoning based on token-level action accuracy. This matters because prior work showed that improving embodied reasoning via language supervision often fails to translate into better robot performance or even degrades it.

Critical review

Verdict

Bottom line

The paper presents a compelling case that direct alignment with low-level actions via RL outperforms language-only or trajectory-based supervision, particularly on long-horizon tasks. The two-stage pipeline is principled, and the gains on LIBERO Long (70.0% vs 65.6% for SFT-only) and CALVIN are consistent. However, the framing of data efficiency is misleading: the RL stage builds on 2.28M SFT samples, and the headline 17.5% improvement on LIBERO is measured against the base model rather than the SFT checkpoint. Real-world evaluation is limited to four tasks with 96 trials each, which is modest evidence for the claimed 106.6% gain.

“by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.”

paper · Abstract

What holds up

The core insight, that optimizing embodied reasoning purely through language supervision does not guarantee improvements in actual action generation, holds empirically. The two-stage design successfully transfers reasoning ability to FAST token generation (Table 1), and the RL stage demonstrably sharpens representations: KNN classification accuracy on underlying robot states improves from 43.23% to 69.79% (Table 8). The ablation shows that low-level action-based RL beats both high-level language actions (86.8% vs 83.6% on LIBERO) and 2D trajectory prediction (85.1%), which supports the central claim that direct action alignment is superior.

“optimizing embodied reasoning purely through language supervision does not guarantee improvements in actual action generation.”

paper · Section 1

“w/ RoboAlign (Ours) ... 69.79”

paper · Table 8

Main concerns

The data efficiency claim obscures the fact that the 12.8K RL samples cover only the final tuning stage, which sits on top of 2.28M SFT samples. The marginal gain from RL over the RoboAlign SFT baseline is roughly 10% in relative terms on LIBERO (78.7% to 86.8%), not the headline 17.5% (which compares against the raw Qwen2.5-VL base). The comparison with ECoT (Table 7) suggests SFT-based alignment degrades performance, but this may reflect distribution mismatch rather than a fundamental SFT limitation. The real-world evaluation (Table 4) covers only four pick-and-place tasks, and the 106.6% relative improvement is calculated against a weak base model (32.3%) rather than the SFT model (55.2%).
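The relative-gain arithmetic behind this concern can be checked with a short sketch. The success rates are the ones quoted in this review; the function name `relative_gain` and the "implied" real-world final score are illustrative assumptions, the latter back-computed from the 106.6% headline and the 32.3% base rather than read from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# LIBERO success rates (percent) as quoted in this review.
libero_sft = 78.7  # RoboAlign SFT checkpoint
libero_rl = 86.8   # after the RL stage
print(f"LIBERO, RL over SFT: {relative_gain(libero_sft, libero_rl):.1f}% relative")

# Real-world success rates (percent) as quoted in this review.
real_base = 32.3
real_sft = 55.2
# ASSUMPTION: the final score is not quoted here; it is inferred from the
# 106.6% headline being relative to the 32.3% base, so it is approximate.
real_final_implied = real_base * (1.0 + 106.6 / 100.0)
print(f"Real-world, implied final score: {real_final_implied:.1f}%")
print(f"Real-world, RL over SFT: {relative_gain(real_sft, real_final_implied):.1f}% relative")
```

If the implied final score is roughly right, the real-world gain attributable to the RL stage alone is far smaller than the headline figure, which is the same pattern as on LIBERO.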

“We define the reward as the arithmetic mean of two components: a format reward $r_{f}\in\{0,1\}$ indicating whether the output correctly adheres to the required reasoning format, and an accuracy reward $r_{a}\in[0,1]$ measuring FAST token prediction accuracy.”

paper · Section 4.2
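A minimal sketch of how the quoted reward could feed GRPO, under stated assumptions. The mean of $r_f$ and $r_a$ follows the quote above; the group-relative standardization is the commonly described GRPO advantage, and whether the paper's EasyR1 configuration matches it exactly is not stated in this review. The function names, the use of the population standard deviation, and the `eps` constant are all illustrative choices.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def combined_reward(format_ok: bool, accuracy: float) -> float:
    """Arithmetic mean of the format reward r_f in {0, 1} and accuracy r_a in [0, 1]."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    r_f = 1.0 if format_ok else 0.0
    return (r_f + accuracy) / 2.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of sampled responses.

    ASSUMPTION: population std and a small eps for numerical safety; real
    implementations may differ in both details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses sampled for the same prompt.
group = [
    combined_reward(True, 0.5),
    combined_reward(True, 1.0),
    combined_reward(False, 0.0),
    combined_reward(True, 0.25),
]
advantages = group_relative_advantages(group)
print(group)
print(advantages)
```

Because advantages are computed relative to the group, a response is rewarded only for beating its siblings on the same prompt; a group in which every response scores the same contributes no learning signal.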

Evidence and comparison

The evidence supports the primary claim that RL-based low-level action alignment outperforms alternatives. Table 6 demonstrates superiority over language-based and visual-based RL alignment, particularly for long-horizon tasks (70.0% vs 58.2% and 64.6%). However, comparisons to prior VLA methods in Table 2 mix different backbones (e.g., OpenVLA, Octo) and training setups, making it unclear how much improvement stems from RoboAlign versus simply using Qwen2.5-VL. The paper does not adequately explain why RoboBrain 2.0—a model with higher embodied reasoning scores—performs worse as a VLA backbone, beyond attributing it to a vague modality gap.

“w/ Action-base RL (Ours) ... 70.0”

paper · Table 6

“RoboBrain 2.0 ... yielded the lowest VLA performance”

paper · Table 2

Reproducibility

Implementation details are reasonably thorough. The authors specify hyperparameters (learning rates of $2\times 10^{-5}$ for SFT and $1\times 10^{-6}$ for RL), the use of EasyR1 for GRPO training, and GR00T-N1.5 for VLA conversion. Compute requirements are stated ($8\times$ H200 GPUs for ~30 hours of SFT and 1 hour of RL). However, the RoboAlign VQA dataset generated by Gemini-2.5 Pro lacks reproducible prompts and filtering criteria, and the code is not released. The RL reward function depends on exact FAST token matching (Equation 2), which is deterministic but sensitive to tokenization nuances not fully specified.

“$r_{a}=\frac{1}{m}\max\{i\in\{1,\dots,m\}:T^{\text{gen}}_{1:i}=T^{\text{target}}_{1:i}\}$”

paper · Equation 2
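Read literally, Equation 2 scores a response by the length of the longest exactly matching token prefix, divided by the target length $m$. The sketch below is one way to compute that; the function name is hypothetical, tokens are represented as plain integers for illustration, and returning 0 when even the first token differs is an assumption, since the quoted formula leaves that case (a maximum over an empty set) undefined.

```python
from typing import Sequence


def prefix_accuracy(generated: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of the target covered by the longest exactly matching prefix."""
    m = len(target)
    if m == 0:
        raise ValueError("target must contain at least one token")
    matched = 0
    for gen_tok, tgt_tok in zip(generated, target):
        if gen_tok != tgt_tok:
            break  # the first mismatch ends the matching prefix
        matched += 1
    # ASSUMPTION: a mismatch on the very first token yields 0.0.
    return matched / m


print(prefix_accuracy([1, 2, 3, 4], [1, 2, 9, 4]))  # only the first two tokens match
print(prefix_accuracy([9, 2, 3], [1, 2, 3]))        # mismatch at position 1
```

The sensitivity the review points to is visible here: a single early mismatch zeroes out credit for every later token, even when the later tokens are correct, so any difference in how actions are tokenized shifts the reward.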

Abstract

Improving embodied reasoning in multimodal large language models (MLLMs) is essential for building vision-language-action models (VLAs) on top of them that readily translate multimodal understanding into low-level actions. Accordingly, recent work has explored enhancing embodied reasoning in MLLMs through visual question answering-style supervision. However, these approaches have been reported to result in unstable VLA performance, often yielding only marginal or even negative gains. In this paper, we propose RoboAlign, a more systematic MLLM training framework that reliably improves VLA performance. Our key idea is to sample action tokens via zero-shot natural language reasoning and refine this reasoning using reinforcement learning (RL) to improve action accuracy. As a result, RoboAlign bridges the modality gap between language and low-level actions in MLLMs and facilitates knowledge transfer from MLLM to VLA. To validate the effectiveness of RoboAlign, we train VLAs by adding a diffusion-based action head on top of an MLLM backbone and evaluate them on major robotics benchmarks. Remarkably, by performing RL-based alignment after SFT using less than 1% of the data, RoboAlign achieves performance improvements of 17.5%, 18.9%, and 106.6% over SFT baselines on LIBERO, CALVIN, and real-world environments, respectively.
