Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation

cs.CV Jingnan Luo, Mingqi Gao, Jun Liu, Bin-Bin Gao, Feng Zheng · Mar 23, 2026

Plain-language introduction

This paper addresses video reasoning segmentation, the task of segmenting objects in videos based on complex human instructions, by proposing TrajSeg, a unified framework built on Multimodal Large Language Models (MLLMs). The core innovation is bidirectional text-trajectory alignment, where the model learns both text-to-trajectory grounding and trajectory-to-text captioning. This is paired with a Frame-level Content Integration (FCI) module and a unified mask decoder that eliminates the need for separate key-frame and tracking models. The work matters because it simplifies training pipelines and aims to improve trajectory perception in dynamic video contexts.

Critical review

Verdict

Bottom line

TrajSeg presents a technically sound and well-motivated approach to unifying video reasoning segmentation. The bidirectional alignment and unified mask generator are meaningful architectural contributions that appear to improve temporal consistency and reasoning capabilities. However, the paper overstates its achievements by claiming it 'outperforms all video reasoning segmentation methods on all metrics' (Abstract). This claim is contradicted by the results in Table I, where AL-Ref-SAM2 (using GPT-4) achieves higher $\mathcal{J}\&\mathcal{F}$ scores on Ref-YouTube-VOS (67.9 vs. 67.0) and Ref-DAVIS (74.2 vs. 69.6). While TrajSeg outperforms methods using comparable backbones (e.g., VISA-7B), the blanket superiority claim is inaccurate.

“outperforms all video reasoning segmentation methods on all metrics”

paper · Abstract

“AL-Ref-SAM2 ... 67.9 ... 74.2 ... 42.8 | Ours ... 67.0 ... 69.6 ... 48.7”

paper · Table I
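
The mismatch between the blanket claim and Table I can be checked mechanically. The sketch below uses only the scores quoted in this review; the first two columns are the Ref-YouTube-VOS and Ref-DAVIS $\mathcal{J}\&\mathcal{F}$ values named above, while labelling the third column as MeViS is an assumption, and the function name is illustrative.

```python
# Scores as quoted from Table I in this review.
# Labelling the third column "MeViS" is an assumption, not confirmed by the quote.
TABLE_I = {
    "AL-Ref-SAM2": {"Ref-YouTube-VOS": 67.9, "Ref-DAVIS": 74.2, "MeViS (assumed)": 42.8},
    "TrajSeg": {"Ref-YouTube-VOS": 67.0, "Ref-DAVIS": 69.6, "MeViS (assumed)": 48.7},
}


def benchmarks_lost(ours, rival, table):
    """Return the benchmarks on which `rival` scores strictly higher than `ours`."""
    return [name for name, score in table[ours].items() if table[rival][name] > score]


lost = benchmarks_lost("TrajSeg", "AL-Ref-SAM2", TABLE_I)
claim_holds = not lost  # "outperforms all methods on all metrics" needs an empty list

print(lost)
print(claim_holds)
```

Any non-empty result is enough to falsify an "all methods, all metrics" statement, which is why a single stronger baseline matters here even though TrajSeg leads on the remaining column.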

What holds up

The technical contributions are supported by ablations. The full model (Bi-Align + FCI) achieves $\mathcal{J}\&\mathcal{F}$ of 55.0/45.5 on ReVOS referring/reasoning versus 53.0/42.8 without either component (Table III); note that this comparison measures the two components together rather than FCI in isolation. The unified mask generator enables end-to-end training, and even with a single key frame the model achieves strong temporal consistency (Avg-IoU$_{t\leftrightarrow t+1}$ of 67.9, Table V). The trajectory-to-text captioning task (Figure 5) draws on the MLLM's captioning ability to enhance trajectory understanding and is applied only during training, so it is not needed at inference time.

“Bi-Align ✓, FCI ✓ | 55.0 | 53.1 | 57.0 | 45.5 | 43.5 | 47.4”

paper · Table III
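
As a quick arithmetic check on the Table III figures cited in this section, the gains of the full model over the variant with neither component work out as follows. The $\Delta$ symbols are shorthand introduced here, not notation from the paper.

$$
\Delta_{\text{referring}} = 55.0 - 53.0 = 2.0, \qquad \Delta_{\text{reasoning}} = 45.5 - 42.8 = 2.7
$$

Both gains are in $\mathcal{J}\&\mathcal{F}$ points, and the larger one falls on the reasoning subset.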

“KF 1 | 67.9 | 3.2 | 50.9”

paper · Table V
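
The review does not reproduce the paper's definition of Avg-IoU$_{t\leftrightarrow t+1}$. One plausible reading, sketched below under that assumption, is the mean intersection-over-union between predicted masks of consecutive frames. The function names and the convention that two empty masks count as identical are illustrative choices, not taken from the paper.

```python
def mask_iou(a, b):
    """IoU of two binary masks given as equally sized lists of 0/1 rows."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Assumed convention: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def avg_consecutive_iou(masks):
    """Mean IoU over all (frame t, frame t+1) pairs of a mask sequence."""
    pairs = list(zip(masks, masks[1:]))
    if not pairs:
        raise ValueError("need at least two frames")
    return sum(mask_iou(a, b) for a, b in pairs) / len(pairs)


# Toy 2x2 masks for three frames: identical, then half of the object disappears.
frames = [
    [[1, 1], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
]
score = avg_consecutive_iou(frames)
print(score)  # (1.0 + 0.5) / 2 = 0.75
```

Under this reading, a higher value means the predicted mask changes less between neighbouring frames, which is a proxy for temporal stability rather than for accuracy against ground truth.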

“The caption-style task in the trajectory-to-text direction is only used during the training stage to enhance the model's understanding of trajectories.”

paper · Section III-B

Main concerns

The primary concern is the discrepancy between the claimed universal superiority and the empirical results against GPT-4-based methods. Additionally, the ablation interpretation in Section IV-C contains a questionable assertion: the text claims bidirectional alignment 'degrades the referring ones,' yet Table III shows referring $\mathcal{J}\&\mathcal{F}$ improving when Bi-Align is added, from 53.0 to 54.4 without FCI and from 53.5 to 55.0 with FCI. The failure case analysis (Figure 6c) is minimal, mentioning only ambiguous instructions without discussing fundamental limitations such as error propagation in long videos or computational cost. The conclusion acknowledges being 'limited by the quantity of sampled frames,' but the experiments do not quantify this constraint.

“It is observed that bidirectional alignment works better on the reasoning subset and degrades the referring ones.”

paper · Section IV-C
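
The tension between this sentence and Table III can be made explicit by isolating the effect of Bi-Align at each FCI setting. The sketch below uses the referring-subset $\mathcal{J}\&\mathcal{F}$ values cited in this review and assumes they form a 2x2 grid over (Bi-Align, FCI); the dictionary layout and function name are illustrative.

```python
# Referring-subset J&F from Table III as cited in this review,
# keyed by (bi_align_enabled, fci_enabled). The 2x2 layout is an assumption.
REFERRING_JF = {
    (False, False): 53.0,
    (True, False): 54.4,
    (False, True): 53.5,
    (True, True): 55.0,
}


def bi_align_effect(scores):
    """Change in score from enabling Bi-Align, reported per FCI setting."""
    return {
        fci: round(scores[(True, fci)] - scores[(False, fci)], 1)
        for fci in (False, True)
    }


effect = bi_align_effect(REFERRING_JF)
degrades = any(delta < 0 for delta in effect.values())

print(effect)
print(degrades)
```

If the cited values are read correctly, Bi-Align raises the referring score in both settings, so the "degrades" wording would need either a different metric or a different table row to be justified.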

“TrajSeg is limited by the quantity of sampled frames, hindering further improvement.”

paper · Section V

Evidence and comparison

The evidence covers both referring (Ref-YouTube-VOS, Ref-DAVIS, MeViS) and reasoning (ReVOS) benchmarks, which is comprehensive. However, the comparisons mix methods with vastly different computational budgets: TrajSeg uses LLaVA-7B while AL-Ref-SAM2 uses GPT-4, so direct metric comparisons are misleading without normalization or a computational cost analysis. The paper correctly notes that TrajSeg outperforms VISA even with a larger backbone (VISA-13B) on ReVOS reasoning (46.1 vs. 44.2 $\mathcal{J}\&\mathcal{F}$), supporting the claim that bidirectional alignment specifically benefits reasoning. Comparisons to specialized RVOS methods like ReferDINO are mixed: in the quoted Table I rows, ReferDINO leads on two of the three scores (69.3 vs. 67.0 and 49.3 vs. 48.7) while TrajSeg leads on the third (69.6 vs. 68.9), suggesting the unified MLLM approach may trade some accuracy for generality.

“VISA ... LLaVA-13B ... 44.2 | Ours ... LLaVA-7B ... 46.1”

paper · Table II

“ReferDINO ... 69.3 ... 68.9 ... 49.3 | Ours ... 67.0 ... 69.6 ... 48.7”

paper · Table I

Reproducibility

The paper states that 'The code will be publicly available at https://github.com/haodi19/TrajSeg' (Abstract), so as of the review date reproducibility depends on a future code release. Training details are moderately comprehensive: the two-stage pipeline (image pre-training then video fine-tuning), optimizer (AdamW, lr=0.0003), batch size (80 on 4 GPUs), and loss weights ($\lambda_{\text{text}}=1.0$, $\lambda_{\text{bce}}=2.0$) are provided. However, critical hyperparameters for the LoRA adaptation (rank, alpha) are unspecified, and the exact data sampling strategy for pseudo-videos lacks detail. The reliance on pretrained initialization (LLaVA-7B, SAM-2 decoder) is standard but noted.

“The code will be publicly available at https://github.com/haodi19/TrajSeg”

paper · Abstract

“The trainable components ... include the MLLM with LoRA ... learning rate and weight decay of 0.0003 and 0”

paper · Section IV-A
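
One way to track what a reproduction would still need is to record the reported settings in a config object and leave unreported ones empty. The sketch below is a bookkeeping aid only: the class and field names are hypothetical, the values are those stated in this review, and `None` marks settings the review found unspecified.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ReportedTrainingConfig:
    """Training settings as summarized in this review (field names are hypothetical)."""
    optimizer: Optional[str] = "AdamW"
    learning_rate: Optional[float] = 3e-4
    weight_decay: Optional[float] = 0.0
    batch_size: Optional[int] = 80
    num_gpus: Optional[int] = 4
    lambda_text: Optional[float] = 1.0
    lambda_bce: Optional[float] = 2.0
    lora_rank: Optional[int] = None      # not reported
    lora_alpha: Optional[float] = None   # not reported


def unspecified(cfg):
    """Names of settings still needed before training could be reproduced."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]


missing = unspecified(ReportedTrainingConfig())
print(missing)
```

Checking `is None` rather than truthiness matters here, because a reported weight decay of 0 is a real value and should not be flagged as missing.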

Abstract

The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at https://github.com/haodi19/TrajSeg.
