DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

cs.CV cs.RO Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li · Mar 23, 2026

Plain-language introduction

Vision-Language-Action models excel at direct visuomotor mapping but struggle with tasks requiring both fine-grained 3D spatial understanding and long-horizon logical planning. DualCoT-VLA proposes a parallel dual-stream reasoning mechanism that processes visual Chain-of-Thought for spatial perception and linguistic Chain-of-Thought for task planning simultaneously in latent space, using learnable query tokens to bypass autoregressive decoding and achieve single-step inference.

Critical review

Verdict

Bottom line

DualCoT-VLA presents a compelling technical solution to the modality-isolation and latency problems of existing CoT-based VLA models. The method achieves state-of-the-art results on the LIBERO (98.8%) and RoboCasa GR1 benchmarks with an inference latency of 83.2 ms, versus more than 3 seconds for autoregressive alternatives. However, three issues temper the generalizability claims: heavy reliance on frozen auxiliary teachers (Depth Anything 3 and Qwen3-0.6B) during training, a limited real-world evaluation (only three tasks), and an unsubstantiated claim of avoiding "collapse" relative to concurrent work.

“while autoregressive CoT increases VLM forward time to 3156.0 ms, our DualCoT-VLA increases VLM inference time by only 4.4 ms over a Non-CoT baseline (58.1 ms vs. 53.7 ms)”

paper · Section 4.4
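
The arithmetic behind the quoted latency figures can be checked with a short sketch. The three constants are taken from the quote above; the function names are illustrative and do not come from the paper.

```python
# VLM forward times in milliseconds, as quoted from Section 4.4.
NON_COT_MS = 53.7
PARALLEL_COT_MS = 58.1
AUTOREGRESSIVE_COT_MS = 3156.0


def absolute_overhead(with_cot_ms: float, baseline_ms: float) -> float:
    """Extra forward time added on top of the non-CoT baseline."""
    return with_cot_ms - baseline_ms


def relative_overhead(with_cot_ms: float, baseline_ms: float) -> float:
    """Extra forward time as a fraction of the non-CoT baseline."""
    return absolute_overhead(with_cot_ms, baseline_ms) / baseline_ms


def slowdown_factor(slow_ms: float, fast_ms: float) -> float:
    """How many times longer the slow path takes than the fast one."""
    return slow_ms / fast_ms


print(f"parallel CoT overhead: {absolute_overhead(PARALLEL_COT_MS, NON_COT_MS):.1f} ms")
print(f"relative overhead: {relative_overhead(PARALLEL_COT_MS, NON_COT_MS):.1%}")
print(f"autoregressive vs parallel: {slowdown_factor(AUTOREGRESSIVE_COT_MS, PARALLEL_COT_MS):.1f}x")
```

These numbers cover only the VLM forward time in the quote, not any other stage of the pipeline.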

“DualCoT-VLA (Our) achieves 98.8% average success rate on LIBERO”

paper · Table 1

What holds up

The parallel latent reasoning mechanism is rigorously validated. The dual-stream architecture successfully disentangles spatial perception from logical planning: the ablation study confirms that visual CoT drives performance on the Spatial suite (99.4%) while linguistic CoT specifically improves long-horizon tasks (Long suite improves from 92.0% to 98.2% when both are combined). The qualitative probe visualization (Figure 3) demonstrates that the compressed hidden states ($M=16$ tokens) actually encode meaningful geometric structure. The latency analysis is particularly convincing, showing that parallel CoT adds only 4.4 ms overhead versus non-CoT baselines while avoiding the compounding errors of autoregressive generation.

“the model trained with the Visual-only CoT significantly improves performance on visually demanding tasks, reaching 99.4% on the Spatial suite. The model trained with Linguistic-only CoT specifically enhances the performance on the Long suite”

paper · Section 4.5

“visual CoT query tokens $\mathbf{Q}_{\text{vis}}\in\mathbb{R}^{M\times d_{\text{VLM}}}$ (where $M=16$), and a set of linguistic CoT query tokens $\mathbf{Q}_{\text{lin}}\in\mathbb{R}^{N\times d_{\text{VLM}}}$ (where $N=4$)”

paper · Section 3.2
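
To make the query-token idea concrete, here is a minimal shape-level sketch in plain Python. Only $M=16$ and $N=4$ come from the paper. The hidden size, the token ordering (context, then visual queries, then linguistic queries), and the `toy_backbone` stand-in are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

M, N = 16, 4   # query-token counts stated in Section 3.2
D_VLM = 8      # hypothetical hidden size, kept tiny for the demo


def make_tokens(count, dim, rng):
    """Random stand-ins for token embeddings (learned in a real model)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def build_input(context, q_vis, q_lin):
    """Assumed layout: context tokens, then visual queries, then linguistic queries."""
    return context + q_vis + q_lin


def toy_backbone(tokens):
    """Placeholder for the VLM: a per-token tanh, with no attention at all."""
    return [[math.tanh(x) for x in token] for token in tokens]


def split_output(hidden, m=M, n=N):
    """Read the hidden states back out at the query positions."""
    return hidden[-(m + n):-n], hidden[-n:]


rng = random.Random(0)
context = make_tokens(10, D_VLM, rng)   # stand-in for image and instruction tokens
q_vis = make_tokens(M, D_VLM, rng)
q_lin = make_tokens(N, D_VLM, rng)

hidden = toy_backbone(build_input(context, q_vis, q_lin))  # one forward pass
h_vis, h_lin = split_output(hidden)
print(len(h_vis), len(h_lin), len(h_vis[0]))
```

The point of the sketch is that both reasoning streams are read out of a single forward pass, whereas an autoregressive CoT would call the backbone once per generated token.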

Main concerns

The method relies on frozen auxiliary models (Depth Anything 3 and Qwen3-0.6B) for distillation during training, creating a brittle dependency on specific external representations that may not transfer across visual domains or model updates. The real-world evaluation is preliminary: it covers only three tabletop manipulation tasks with 100 demonstrations each, which is insufficient to support claims about "robust task planning" in unstructured environments. While the paper distinguishes itself from concurrent LaRA-VLA by claiming to avoid "collapse and retain the ability to decode these latent tokens into explicit text," no empirical evidence or direct comparison validates this claim. Additionally, the visual CoT visualization relies on a separately trained "lightweight visual probe," raising the question of whether the tokens naturally encode spatial structure or whether the probe extracts it post-hoc.

“Unlike LaRA-VLA, our method maintains supervision on the implicit textual CoT, avoiding collapse and retaining the ability to decode these latent tokens into explicit text during inference”

paper · Section 2.1

“We design three tabletop manipulation tasks of increasing complexity... For each task, we collect 100 human-teleoperated demonstrations for fine-tuning”

paper · Section 4.1

“we train a lightweight visual probe to map the compressed hidden states of the visual query tokens into depth maps”

paper · Section 4.3

Evidence and comparison

The quantitative evidence supports the primary claims: DualCoT-VLA achieves 98.8% on LIBERO (outperforming LaRA-VLA's 97.9% and GR00T-N1.6's 97.0%) and 55.1% on RoboCasa GR1 (next best is 48.8%). However, the comparison with concurrent LaRA-VLA is limited to a single accuracy metric without analysis of failure modes or statistical significance. The claim that explicit CoT suffers from "severe information redundancy" (Section 1) is asserted but not quantified against the proposed implicit approach. The paper does not provide confidence intervals or variance across seeds, making it difficult to assess whether the gains (e.g., 98.8% vs 97.9% on LIBERO) are statistically meaningful.

“explicit CoT inherently suffers from severe information redundancy and requires time-consuming multi-step decoding”

paper · Section 1

“LaRA-VLA (bai2026latent) achieves 97.9% average vs DualCoT-VLA 98.8%”

paper · Table 1
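
One rough way to gauge whether a 98.8% vs 97.9% gap is meaningful is a Wilson score interval on each success rate. The sketch below is a back-of-the-envelope check, not an analysis from the paper. The trial count `N_TRIALS` is an assumption (the review does not state how many evaluation episodes were run), and the result changes with it.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def intervals_overlap(a, b) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_TRIALS = 2000  # ASSUMPTION: hypothetical episode count, not taken from the paper

dualcot = wilson_interval(round(0.988 * N_TRIALS), N_TRIALS)
lara = wilson_interval(round(0.979 * N_TRIALS), N_TRIALS)

print("DualCoT-VLA:", tuple(round(x, 4) for x in dualcot))
print("LaRA-VLA:   ", tuple(round(x, 4) for x in lara))
print("overlap:", intervals_overlap(dualcot, lara))
```

Interval overlap is only a coarse indicator and does not replace seed-level variance, which is the information the review notes is missing.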

Reproducibility

Implementation details are reasonably thorough: the paper specifies hyperparameters (learning rates 2.5e-5 and 3e-5, batch sizes 48 and 256, $\lambda_{\text{vis}}=0.1$, $\lambda_{\text{lin}}=0.1$, $\lambda_{\text{act}}=1.0$) and model architectures (Qwen3-VL-4B backbone, Flow-Matching DiT head). However, there is no mention of code release or data availability. Reproduction requires the specific auxiliary models (Depth Anything 3, Qwen3-0.6B) and the generated CoT annotations for RoboCasa (created via Qwen3-VL-32B prompting), which may not be publicly available. The exact prompts used to generate the three-part CoT structure (state tracking, spatial location, action formulation) are not provided, which could significantly impact reproducibility.

“we set $\lambda_{vis}$ to 0.1, $\lambda_{lin}$ to 0.1, and $\lambda_{act}$ to 1.0... For task-specific hyperparameters, models on LIBERO are trained with a learning rate of 2.5e-5... For the RoboCasa GR1, we use a learning rate of 3e-5”

paper · Section 4.1
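
The quoted weights suggest a combined training objective. A minimal sketch follows, assuming the three terms are simply added with their weights; the quote gives only the weight values, so the additive form and all names here are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Weights quoted from Section 4.1."""
    vis: float = 0.1   # lambda_vis
    lin: float = 0.1   # lambda_lin
    act: float = 1.0   # lambda_act


def total_loss(l_act: float, l_vis: float, l_lin: float,
               w: LossWeights = LossWeights()) -> float:
    """Assumed form: weighted sum of action, visual CoT, and linguistic CoT losses."""
    return w.act * l_act + w.vis * l_vis + w.lin * l_lin


# Placeholder loss values, chosen only to show the relative scaling.
print(total_loss(l_act=1.0, l_vis=2.0, l_lin=3.0))
```

Under this assumed form, each auxiliary CoT loss contributes one tenth as much per unit as the action loss.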

“For the RoboCasa GR1 benchmark, we use Qwen3-VL-32B to autonomously generate dense CoT annotations”

paper · Section 4.1

Abstract

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a “thinking before acting” capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.
