VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

cs.CV Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu · Mar 23, 2026
AI Review
Plain-language introduction

Long video understanding remains challenging for multimodal large language models due to limited context windows. VideoDetective addresses this by modeling videos as visual–temporal affinity graphs that fuse visual similarity with temporal continuity. The framework propagates query relevance through an iterative hypothesis–verification–refinement loop, enabling sparse but informed sampling of critical segments for question answering.

Critical review
Verdict
Bottom line

The paper presents a well-motivated framework that convincingly demonstrates accuracy gains across multiple MLLM backbones. The integration of graph-based manifold propagation with active inference is novel for long-video QA, and the ablations rigorously isolate component contributions. However, the reliance on VLM self-reflection for verification feedback and the use of estimated (rather than measured) token costs for proprietary baselines introduce uncertainty regarding robustness and efficiency claims.

“Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long.”
paper · Abstract
What holds up

The core insight, that intrinsic video structure should inform relevance propagation alongside extrinsic query matching, is sound and well-executed. The ablation in Table 3 validates each component: removing graph propagation drops accuracy by 4.2%, while removing facet decomposition and iterative refinement causes a more severe 7.8% degradation, confirming that blind similarity propagation introduces noise. The cross-backbone consistency (Figure 2) demonstrates genuine plug-and-play capability, with gains ranging from 4.2% to 7.5% across architectures.

“w/o Graph Propagation ... -4.2”
paper · Table 3
“w/o Facet Decomposition & Iterative Refinement ... -7.8”
paper · Table 3
Main concerns

The token efficiency comparison (Figure 3) relies on estimated lower bounds for proprietary models ($\sim 10^5$ tokens) rather than measured API consumption. These estimates explicitly exclude text prompts and system instructions (Appendix E.3), so they may substantially undercount actual usage. The method also depends critically on the VLM's ability to reliably emit "missing keywords" feedback during verification, a brittle assumption for ambiguous visual content. Additionally, graph construction relies on fixed thresholds ($\theta_{\mathrm{sim}} = 0.82$) and heuristic sparsification (top-$k = 8$) that may not transfer across video genres without tuning.

“These estimates include only image tokens and exclude text prompts, system instructions, and other textual overhead.”
paper · Appendix E.3
“explicitly outputting "missing keywords xx" if the keywords xx in $\mathcal{K}_{r}$ are not observed”
paper · Section 3.3.2
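
To make the brittleness concern concrete, here is a minimal sketch of how a pipeline might read the "missing keywords xx" feedback from a VLM reply. The regular expression, the comma-separated list format, and the function name `parse_missing_keywords` are assumptions for illustration; the paper's actual parsing logic is not described in this review.

```python
import re

# Assumed format: the marker phrase followed by a comma- or
# semicolon-separated keyword list that ends at a period or newline.
_MISSING = re.compile(r"missing keywords[:\s]+([^.\n]+)", re.IGNORECASE)


def parse_missing_keywords(vlm_output, expected):
    """Return (missing, marker_found) for one VLM reply.

    missing: keywords the VLM reported as unobserved, restricted to the
             expected keyword set so unrelated terms are ignored.
    marker_found: False when the marker phrase never appears.
    """
    match = _MISSING.search(vlm_output)
    if match is None:
        # Ambiguous case: either every keyword was observed, or the VLM
        # did not follow the output format. The text alone cannot tell.
        return [], False
    named = [k.strip().lower() for k in re.split(r"[,;]", match.group(1)) if k.strip()]
    expected_lower = {k.lower() for k in expected}
    return [k for k in named if k in expected_lower], True


if __name__ == "__main__":
    keywords = ["chef", "knife", "red apron"]
    print(parse_missing_keywords("The chef is visible. Missing keywords: knife, red apron", keywords))
    print(parse_missing_keywords("I can see a kitchen.", keywords))
```

Under these assumptions, a reply without the marker phrase is indistinguishable from a fully successful observation, which is one way the verification step could silently accept an irrelevant segment.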
Evidence and comparison

Comparisons to VideoRAG, VideoAgent, DVD, and LVNet in Table 1 are fair, using identical backbones (Qwen3-VL-8B, SeedVL-1.5) and fixed 32-frame budgets. The claim of outperforming GPT-4o and Gemini-1.5-Pro on LongVideoBench (67.9% vs 66.7% and 64.0%) is supported by Table 2, though the restriction to 32 frames for VideoDetective versus 384/256 frames for proprietary models complicates direct capability comparison. The modality scaling analysis (Table 4) revealing that VLM upgrades (+9.5%) matter more than LLM upgrades (+0.2%) is a valuable empirical insight.

“VideoDetective (SeedVL-1.5) ... 67.9 ... GPT-4o ... 66.7 ... Gemini-1.5-Pro ... 64.0”
paper · Table 2
“Scaling LLM ... +0.2 ... Scaling VLM ... +9.5”
paper · Table 4
Reproducibility

Code is provided at https://videodetective.github.io/, and hyperparameters are exhaustively documented in Appendix E (Tables 5–9). However, exact prompts for the LLM planner and VLM observer (Appendix D) use placeholder JSON schemas that omit specific in-context examples, and the IDF corpus for lexical scoring is unspecified. The retry mechanism (Table 9) and temperature settings (0.0) suggest sensitivity to API-side stochasticity. Reproducing the proprietary model baselines is hindered by the lack of exact API version timestamps and the estimated nature of their token counts.

“Visual-temporal fusion weight $\alpha$ ... 0.6 ... Temporal decay factor $\tau$ ... 30.0 ... Top-k sparsification ... 8”
paper · Appendix E.6
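
As a rough illustration of how the quoted hyperparameters could interact, the sketch below builds a visual-temporal affinity matrix in plain Python. The fusion formula (a convex combination of cosine similarity and an exponential temporal decay), the union-style symmetrization, and the names `cosine` and `build_affinity` are assumptions; the review quotes only the parameter values ($\alpha = 0.6$, $\tau = 30.0$, top-$k = 8$), not the paper's equations. The $\theta_{\mathrm{sim}}$ threshold is left out because its exact role is not described here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def build_affinity(features, times, alpha=0.6, tau=30.0, top_k=8):
    """Return a symmetric, top-k sparsified affinity matrix as nested lists.

    features: one embedding per segment; times: segment timestamps in seconds.
    """
    n = len(features)
    dense = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Assumption: negative similarities are clipped to zero.
            visual = max(0.0, cosine(features[i], features[j]))
            # Assumption: temporal affinity decays exponentially with the gap.
            temporal = math.exp(-abs(times[i] - times[j]) / tau)
            dense[i][j] = dense[j][i] = alpha * visual + (1 - alpha) * temporal
    # Keep each node's top_k strongest neighbours; an edge survives if
    # either endpoint selects it, which keeps the matrix symmetric.
    keep = [[False] * n for _ in range(n)]
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: dense[i][j], reverse=True)
        for j in ranked[:top_k]:
            keep[i][j] = keep[j][i] = True
    return [[dense[i][j] if keep[i][j] else 0.0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    stamps = [0.0, 10.0, 200.0, 210.0]
    for row in build_affinity(feats, stamps, top_k=1):
        print([round(w, 3) for w in row])
```

With fixed values for `alpha`, `tau`, and `top_k`, the same graph density is imposed on every video, which is the transferability concern raised above: a slow documentary and a fast-cut sports clip would likely call for different settings.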
“Output JSON format: ... [base64 encoded placeholder]”
paper · Appendix D
Abstract

Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
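
The propagation step described in the abstract can be sketched with a standard label-propagation iteration over an affinity matrix. This is a swapped-in textbook technique, not necessarily the paper's update rule; the damping parameter `beta`, the iteration count, and the function name `propagate` are hypothetical.

```python
import math


def propagate(W, seeds, beta=0.8, iters=50):
    """Diffuse observed relevance scores to unobserved segments.

    W: symmetric non-negative affinity matrix (nested lists).
    seeds: dict mapping observed segment index -> relevance in [0, 1].
    Iterates f <- beta * S f + (1 - beta) * y with S = D^{-1/2} W D^{-1/2}.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    # Symmetric normalisation; isolated nodes get all-zero rows.
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if deg[i] > 0 and deg[j] > 0 else 0.0
          for j in range(n)] for i in range(n)]
    y = [seeds.get(i, 0.0) for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [beta * sum(S[i][j] * f[j] for j in range(n)) + (1 - beta) * y[i]
             for i in range(n)]
    return f


if __name__ == "__main__":
    # Four segments in a chain (0-1-2-3) plus one isolated segment (4).
    chain = [
        [0, 1, 0, 0, 0],
        [1, 0, 1, 0, 0],
        [0, 1, 0, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0],
    ]
    scores = propagate(chain, {0: 1.0})
    # Rank unobserved segments by propagated relevance to pick the next one to inspect.
    unseen = sorted((i for i in range(len(chain)) if i != 0),
                    key=lambda i: scores[i], reverse=True)
    print([round(s, 3) for s in scores], unseen)
```

In this toy chain, relevance decays with graph distance from the observed segment and never reaches the isolated node, which illustrates how sparse observations could still yield a global relevance ranking.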
