VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
Long video understanding remains challenging for multimodal large language models due to limited context windows. VideoDetective addresses this by modeling videos as visual–temporal affinity graphs that fuse visual similarity with temporal continuity. The framework propagates query relevance through an iterative hypothesis–verification–refinement loop, enabling sparse but informed sampling of critical segments for question answering.
The paper presents a well-motivated framework that convincingly demonstrates accuracy gains across multiple MLLM backbones. The integration of graph-based manifold propagation with active inference is novel for long-video QA, and the ablations rigorously isolate component contributions. However, the reliance on VLM self-reflection for verification feedback and the use of estimated (rather than measured) token costs for proprietary baselines introduce uncertainty regarding robustness and efficiency claims.
The core insight—that intrinsic video structure should inform relevance propagation alongside extrinsic query matching—is sound and well-executed. The ablation in Table 3 validates each component: removing graph propagation drops accuracy by 4.2%, while removing facet decomposition causes a severe 7.8% degradation, confirming that blind similarity propagation introduces noise. The cross-backbone consistency (Figure 2) demonstrates genuine plug-and-play capability, with gains ranging from 4.2% to 7.5% across architectures.
The token efficiency comparison (Figure 3) relies on estimated lower bounds for proprietary models ($\sim 10^5$ tokens) rather than measured API consumption. These estimates explicitly exclude text prompts and system instructions (Appendix E.3), so they may substantially undercount actual usage. The method also depends critically on the VLM reliably emitting "missing keywords" feedback during verification, a brittle assumption for ambiguous visual content. Additionally, graph construction relies on a fixed similarity threshold ($\theta_{\mathrm{sim}} = 0.82$) and heuristic sparsification (top-$k$ with $k = 8$) that may not transfer across video genres without tuning.
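To make the sensitivity concern concrete, the following is a minimal sketch of how a thresholded, top-$k$ sparsified visual-temporal graph might be built. It is not the authors' implementation: the function name `build_affinity_graph`, the exponential temporal kernel, the decay scale `tau`, and the fusion weight `alpha` are all assumptions introduced for illustration; only the default values `theta_sim=0.82` and `top_k=8` come from the paper as cited above.

```python
import numpy as np

def build_affinity_graph(embeddings, theta_sim=0.82, top_k=8, tau=2.0, alpha=0.5):
    """Hypothetical visual-temporal affinity graph over video segments.

    embeddings: (n, d) array, one visual embedding per segment in temporal order.
    tau, alpha: assumed parameters, not taken from the paper.
    """
    n = embeddings.shape[0]
    # Cosine similarity between L2-normalized segment embeddings.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    visual = unit @ unit.T
    # Keep only visual edges at or above the similarity threshold.
    visual = np.where(visual >= theta_sim, visual, 0.0)
    # Temporal proximity decays with segment-index distance (assumed kernel).
    idx = np.arange(n)
    temporal = np.exp(-np.abs(idx[:, None] - idx[None, :]) / tau)
    # Fuse both cues with an assumed convex combination; remove self-loops.
    affinity = alpha * visual + (1.0 - alpha) * temporal
    np.fill_diagonal(affinity, 0.0)
    # Top-k sparsification: zero out all but each row's k strongest edges.
    if 0 < top_k < n:
        drop = np.argsort(affinity, axis=1)[:, :-top_k]
        np.put_along_axis(affinity, drop, 0.0, axis=1)
    # Symmetrize so the graph is undirected (a node may end up with more than k edges).
    return np.maximum(affinity, affinity.T)
```

In a sketch like this, raising `theta_sim` removes long-range visual edges and leaves mostly temporal neighbors, while lowering it admits edges between segments that are only loosely similar. That is the kind of genre-dependent behavior the fixed settings leave untested.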
Comparisons to VideoRAG, VideoAgent, DVD, and LVNet in Table 1 are fair, using identical backbones (Qwen3-VL-8B, SeedVL-1.5) and fixed 32-frame budgets. The claim of outperforming GPT-4o and Gemini-1.5-Pro on LongVideoBench (67.9% vs 66.7% and 64.0%) is supported by Table 2, though the restriction to 32 frames for VideoDetective versus 384/256 frames for proprietary models complicates direct capability comparison. The modality scaling analysis (Table 4) revealing that VLM upgrades (+9.5%) matter more than LLM upgrades (+0.2%) is a valuable empirical insight.
Code is provided at https://videodetective.github.io/, and hyperparameters are exhaustively documented in Appendix E (Tables 5–9). However, exact prompts for the LLM planner and VLM observer (Appendix D) use placeholder JSON schemas that omit specific in-context examples, and the IDF corpus for lexical scoring is unspecified. The retry mechanism (Table 9) and temperature settings (0.0) suggest sensitivity to API-side stochasticity. Reproducing the proprietary model baselines is hindered by the lack of exact API version timestamps and the estimated nature of their token counts.
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
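The propagation step described in the abstract, where relevance scores of observed segments spread to unseen ones over the affinity graph, can be illustrated with a standard manifold-ranking update. This is a stand-in technique chosen for illustration and is not claimed to be the paper's exact rule; the function name `propagate_relevance`, the damping factor `beta`, and the iteration count are assumptions.

```python
import numpy as np

def propagate_relevance(affinity, observed_scores, beta=0.8, iters=50):
    """Spread relevance from observed segments to unseen ones (illustrative).

    affinity: (n, n) symmetric non-negative segment graph.
    observed_scores: (n,) relevance of observed segments, 0 for unseen ones.
    beta: assumed damping factor; higher values trust the graph more.
    """
    y = np.asarray(observed_scores, dtype=float)
    deg = affinity.sum(axis=1)
    # Symmetric normalization D^{-1/2} W D^{-1/2}; isolated nodes get zero weight.
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    S = affinity * inv_sqrt[:, None] * inv_sqrt[None, :]
    f = y.copy()
    for _ in range(iters):
        # Each segment blends its neighbors' scores with its own observation.
        f = beta * (S @ f) + (1.0 - beta) * y
    return f

# Usage: a 5-segment chain where only the first segment was observed as relevant.
chain = np.zeros((5, 5))
for i in range(4):
    chain[i, i + 1] = chain[i + 1, i] = 1.0
scores = propagate_relevance(chain, np.array([1.0, 0.0, 0.0, 0.0, 0.0]))
```

On this chain the propagated score decays with distance from the observed segment, so unseen neighbors of a confirmed clue rank above distant segments when choosing what to sample next.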