Learning Trajectory-Aware Multimodal Large Language Models for Video Reasoning Segmentation
This paper addresses video reasoning segmentation—segmenting objects in videos based on complex human instructions—by proposing TrajSeg, a unified framework built on Multimodal Large Language Models (MLLMs). The core innovation is bidirectional text-trajectory alignment, where the model learns both text-to-trajectory grounding and trajectory-to-text captioning, alongside a Frame-level Content Integration (FCI) module and a unified mask decoder that eliminates the need for separate key-frame and tracking models. The work matters because it simplifies training pipelines and aims to improve trajectory perception in dynamic video contexts.
TrajSeg presents a technically sound and well-motivated approach to unifying video reasoning segmentation. The bidirectional alignment and unified mask generator are meaningful architectural contributions that appear to improve temporal consistency and reasoning capabilities. However, the paper overstates its achievements by claiming it 'outperforms all video reasoning segmentation methods on all metrics' (Abstract). This claim is contradicted by the results in Table I, where AL-Ref-SAM2 (using GPT-4) achieves higher $\mathcal{J}\&\mathcal{F}$ scores on Ref-YouTube-VOS (67.9 vs. 67.0) and Ref-DAVIS (74.2 vs. 69.6). While TrajSeg outperforms methods using comparable backbones (e.g., VISA-7B), the blanket superiority claim is inaccurate.
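The mismatch between the blanket claim and Table I can be checked mechanically. The sketch below is illustrative only: it uses just the four $\mathcal{J}\&\mathcal{F}$ values quoted in this review, and the helper name `benchmarks_where_not_best` is hypothetical, not something from the paper or its code.

```python
# J&F scores as quoted in this review from Table I (higher is better).
# Only the two methods and two benchmarks discussed above are included.
reported = {
    "Ref-YouTube-VOS": {"TrajSeg": 67.0, "AL-Ref-SAM2": 67.9},
    "Ref-DAVIS": {"TrajSeg": 69.6, "AL-Ref-SAM2": 74.2},
}


def benchmarks_where_not_best(scores, method):
    """Return (benchmark, best_method, gap) wherever another method beats `method`."""
    losing = []
    for bench, by_method in scores.items():
        best = max(by_method, key=by_method.get)
        gap = by_method[best] - by_method[method]
        if gap > 0:
            losing.append((bench, best, round(gap, 1)))
    return losing


gaps = benchmarks_where_not_best(reported, "TrajSeg")
for bench, best, gap in gaps:
    print(f"{bench}: {best} leads by {gap} J&F")
```

Any non-empty result is enough to contradict an "all methods, all metrics" statement; restricting the dictionary to methods with comparable backbones would be the way to test the narrower claim.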
The technical contributions are well supported by ablations. Bi-Align and the Frame-level Content Integration (FCI) module together improve performance, with the full model (Bi-Align + FCI) achieving $\mathcal{J}\&\mathcal{F}$ of 55.0/45.5 on ReVOS referring/reasoning versus 53.0/42.8 without either component (Table III). The unified mask generator enables end-to-end training, and even with a single key frame the model achieves strong temporal consistency (Avg-IoU$_{t\leftrightarrow t+1}$ of 67.9, Table V). The trajectory-to-text captioning task (Figure 5) leverages MLLM capabilities to enhance trajectory understanding without requiring inference-time supervision.
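For readers unfamiliar with a consecutive-frame consistency score, the following is a minimal sketch of one plausible reading of Avg-IoU$_{t\leftrightarrow t+1}$: the mean IoU between the predicted mask at frame $t$ and the predicted mask at frame $t+1$. The paper's exact definition is not reproduced in this review, so the function names, the mask representation (2D lists of 0/1), and the handling of empty masks are all assumptions.

```python
def mask_iou(a, b):
    """IoU of two binary masks given as equally sized 2D lists of 0/1."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Assumption: two empty masks count as perfectly consistent.
    return 1.0 if union == 0 else inter / union


def avg_consecutive_iou(masks):
    """Mean IoU between each frame's mask and the mask of the following frame."""
    if len(masks) < 2:
        raise ValueError("need at least two frames")
    ious = [mask_iou(a, b) for a, b in zip(masks, masks[1:])]
    return sum(ious) / len(ious)


# Toy three-frame clip on a 2x2 grid: the mask is stable, then loses one pixel.
clip = [
    [[1, 1], [0, 0]],
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
]
score = avg_consecutive_iou(clip)  # (1.0 + 0.5) / 2
```

A score of this kind rewards smooth predictions regardless of whether they match the ground truth, so it is best read alongside $\mathcal{J}\&\mathcal{F}$ rather than in place of it.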
The primary concern is the discrepancy between the claimed universal superiority and the empirical results against GPT-4-based methods. Additionally, the ablation interpretation in Section IV-C contains a questionable assertion: the text claims bidirectional alignment 'degrades the referring ones,' yet Table III shows that adding Bi-Align improves the referring score both without FCI (53.0 to 54.4) and with FCI (53.5 to 55.0). The failure case analysis (Figure 6c) is minimal, mentioning only ambiguous instructions without discussing fundamental limitations such as error propagation in long videos or computational cost. The conclusion acknowledges being 'limited by the quantity of sampled frames,' but the experiments do not quantify this constraint.
The evidence covers both referring (Ref-YouTube-VOS, Ref-DAVIS, MeViS) and reasoning (ReVOS) benchmarks, which is comprehensive. However, comparisons mix methods with vastly different computational budgets—TrajSeg uses LLaVA-7B while AL-Ref-SAM2 uses GPT-4, making direct metric comparisons misleading without normalization or computational cost analysis. The paper correctly notes that TrajSeg outperforms VISA variants with similar or larger backbones (VISA-13B) on ReVOS reasoning tasks (46.1 vs. 44.2 $\mathcal{J}\&\mathcal{F}$), supporting the claim that bidirectional alignment specifically benefits reasoning. However, comparisons to specialized RVOS methods like ReferDINO are less favorable, suggesting the unified MLLM approach trades some accuracy for generality.
The paper states that 'The code will be publicly available at https://github.com/haodi19/TrajSeg' (Abstract), so as of the review date reproducibility depends on a future code release. Training details are moderately comprehensive: the two-stage pipeline (image pre-training then video fine-tuning), optimizer (AdamW, lr=0.0003), batch size (80 on 4 GPUs), and loss weights ($\lambda_{\text{text}}=1.0$, $\lambda_{\text{bce}}=2.0$) are provided. However, critical hyperparameters for the LoRA adaptation (rank, alpha) are unspecified, and the exact data sampling strategy for pseudo-videos lacks detail. The reliance on pretrained initialization (LLaVA-7B, SAM-2 decoder) is standard but noted.
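The reproducibility gap can be summarized as a checklist. The sketch below records only the settings named in this review and marks the LoRA values as missing; the class and field names are hypothetical and do not come from the paper or its (unreleased) code.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ReportedTrainingConfig:
    """Settings as quoted in this review; None marks a value the paper does not give."""
    optimizer: str = "AdamW"
    learning_rate: float = 3e-4
    batch_size: int = 80            # reported as "80 on 4 GPUs"
    num_gpus: int = 4
    lambda_text: float = 1.0
    lambda_bce: float = 2.0
    lora_rank: Optional[int] = None     # unspecified in the paper
    lora_alpha: Optional[float] = None  # unspecified in the paper


def unspecified(cfg):
    """List the fields a reproducer would have to guess."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]


missing = unspecified(ReportedTrainingConfig())
print("Values to be guessed:", missing)
```

The pseudo-video sampling strategy is not represented here because the review gives no parameters for it at all; it would add further entries to the list.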
The prosperity of Multimodal Large Language Models (MLLMs) has stimulated the demand for video reasoning segmentation, which aims to segment video objects based on human instructions. Previous studies rely on unidirectional and implicit text-trajectory alignment, which struggles with trajectory perception when faced with severe video dynamics. In this work, we propose TrajSeg, a simple and unified framework built upon MLLMs. Concretely, we introduce bidirectional text-trajectory alignment, where MLLMs accept grounding-intended (text-to-trajectory) and captioning-intended (trajectory-to-text) instructions. This way, MLLMs can benefit from enhanced correspondence and better perceive object trajectories in videos. The mask generation from trajectories is achieved via a frame-level content integration (FCI) module and a unified mask decoder. The former adapts the MLLM-parsed trajectory-level token to frame-specific information. The latter unifies segmentation for all frames into a single structure, enabling the proposed framework to be simplified and end-to-end trainable. Extensive experiments on referring and reasoning video segmentation datasets demonstrate the effectiveness of TrajSeg, which outperforms all video reasoning segmentation methods on all metrics. The code will be publicly available at https://github.com/haodi19/TrajSeg.