Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention
Video-LLMs struggle with high computational costs from massive visual token volumes (e.g., 6,272 tokens for a 32-frame video). This paper challenges the standard two-stage spatiotemporal compression paradigm—which assumes spatial and temporal redundancy are separable—by reformulating compression as a global allocation problem. The authors propose a unified selection mechanism combining attention weights and semantic similarity to identify high-contribution, low-redundancy tokens, plus a text-aware merging module for secondary compression inside the LLM. The result is a training-free, plug-and-play method that retains ~90% performance with only 2% of tokens.
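To make the scale of the budget concrete, the figures quoted above can be turned into a quick calculation. This is only arithmetic on the numbers in the summary (6,272 tokens, 32 frames, 2% and 1% retention); rounding the budget down is an assumption, since the review does not say how the paper rounds.

```python
# Token-budget arithmetic using the figures quoted in the summary.
TOTAL_TOKENS = 6272  # visual tokens for a 32-frame video
NUM_FRAMES = 32

tokens_per_frame = TOTAL_TOKENS // NUM_FRAMES  # tokens per frame before compression


def retained_tokens(total: int, ratio: float) -> int:
    """Tokens kept at a given retention ratio (rounded down; an assumption)."""
    return int(total * ratio)


budget_2pct = retained_tokens(TOTAL_TOKENS, 0.02)
budget_1pct = retained_tokens(TOTAL_TOKENS, 0.01)

# Average tokens available per frame if the 2% budget were spread evenly.
avg_per_frame_2pct = budget_2pct / NUM_FRAMES

print(tokens_per_frame, budget_2pct, budget_1pct, avg_per_frame_2pct)
```

At 2% retention the whole video gets about 125 tokens, fewer than four per frame on average, which is why the way the budget is allocated across space and time matters so much in this regime.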
The paper presents a compelling and well-executed solution to ultra-low token retention in Video-LLMs. The unified spatiotemporal pool approach is intuitive and empirically validated, achieving state-of-the-art results at extreme compression ratios (2% retention). The ablation studies substantiate the design choices, particularly the combination of attention and similarity metrics. However, the case for why unification outperforms staged approaches rests on empirical results rather than theoretical justification, and the method relies on several fixed hyperparameters (τ, λ, and the clustering ratio) whose sensitivity is shown empirically but whose chosen values are not theoretically grounded. The gains over strong baselines like HoliTom are consistent but sometimes marginal, suggesting the field is approaching saturation for this architecture class.
The core insight—that staged spatiotemporal compression leads to unbalanced allocation at ultra-low retention—is well-motivated and visually supported (Figure 1). The proposed mechanism integrating CLS-token attention scores with cosine similarity pruning (Eq. 2) effectively balances informativeness against redundancy, as evidenced by ablations showing similarity pruning alone improves performance by ~14% over attention-only selection. The text-aware merging module inside the LLM is a sophisticated addition that addresses positional bias from RoPE, using a weighted combination of cross-attention and text-token similarity (Eq. 9). The method demonstrates strong cross-backbone generalization (LLaVA-OneVision, LLaVA-Video, Qwen2.5-VL) and rigorous evaluation across five diverse benchmarks.
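The idea of trading informativeness against redundancy in a single global pool can be illustrated with a small sketch. This is not the paper's Eq. 2, which the review does not reproduce: the greedy rule, the function names `cosine` and `select_tokens`, and the reading of τ as a cosine-similarity cutoff are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_tokens(features, attn_scores, budget, tau=0.7):
    """Hypothetical greedy selection over one pool of tokens from all frames.

    Tokens are visited in order of decreasing attention score. A token is
    kept only if its cosine similarity to every already-kept token is below
    tau, so high-attention but redundant tokens are skipped.
    """
    order = sorted(range(len(features)), key=lambda i: attn_scores[i], reverse=True)
    kept = []
    for i in order:
        if len(kept) >= budget:
            break
        if all(cosine(features[i], features[j]) < tau for j in kept):
            kept.append(i)
    return sorted(kept)


# Toy pool: tokens 0 and 1 are near-duplicates, as are tokens 2 and 3.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
attn = [0.9, 0.8, 0.3, 0.2]

with_pruning = select_tokens(features, attn, budget=2, tau=0.7)
attention_only = select_tokens(features, attn, budget=2, tau=1.1)  # cutoff never triggers
print(with_pruning, attention_only)
```

In the toy pool, attention-only selection spends both slots on the two near-duplicate tokens, while the similarity check frees one slot for a distinct token. That is the kind of effect the ablation cited above is measuring, though the actual mechanism in the paper may differ in its details.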
Despite strong results, the method relies on several hand-tuned hyperparameters ($\tau=0.7$, $\lambda=0.5$, clustering ratio 0.3, layer $K=18$) whose optimal values are found via grid search (Figure 5) but lack principled selection criteria. While the paper claims unified compression avoids the 'implicit assumption of spatiotemporal separability,' the experiments do not isolate this factor from the specific combination of attention and similarity metrics used. The performance gap over HoliTom shrinks at higher retention ratios (Table 4: 99.6% vs. 99.5% at 15% retention), suggesting the unified approach provides diminishing returns when token budgets are less constrained. Additionally, at 1% retention, performance drops to ~84%, which may be unacceptable for precision-critical applications, and the 'process' time overhead (74 ms vs. FastVID's 8.6 ms) indicates non-trivial pre-processing costs.
The evidence broadly supports the central claim of achieving SOTA at ultra-low retention ratios. The comparisons against FastV, VisionZip, FastVID, and HoliTom are fair in that they use identical backbones (LLaVA-OneVision-7B) and evaluation protocols (LMMs-Eval). The 90.1% performance retention at 2% tokens (vs. HoliTom's 87.7%) and 84.1% at 1% (vs. 82.9%) are statistically meaningful across multiple benchmarks. However, the comparison to FastVID at 2% (83.3% retention) is less fair to that baseline than the paper suggests, since FastVID was designed for different retention regimes. The ablation studies (Table 5) clearly demonstrate that both attention-based selection and similarity-based pruning are necessary, but they do not conclusively prove that the 'global pool' aspect (as opposed to sequential filtering) is the primary driver of improvement over HoliTom.
The method is described with sufficient algorithmic detail (equations 1-10) to permit reproduction, including the DPC-KNN clustering procedure and the text-aware merging mechanism. The authors specify exact hyperparameters ($\tau=0.7$, clustering ratio 0.3, $\lambda=0.5$, $K=18$, $R=50\%$) and use standard public benchmarks (MVBench, VideoMME, etc.) and open-source models (LLaVA-OneVision, Qwen2.5-VL). However, the paper does not mention code availability or provide a supplementary code repository link in the provided text. Reproduction would require implementing the custom clustering and merging logic, and the performance is sensitive to the exact threshold settings shown in Figure 5, which may vary across different video distributions or model checkpoints not tested here.
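As a starting point for anyone reimplementing the merging logic, the text-aware scoring step might look like the sketch below. The review only says that Eq. 9 is a weighted combination of cross-attention and text-token similarity, so the convex form `lam * a + (1 - lam) * s`, the helper names, and the choice to merge the lowest-scoring fraction $R$ of tokens are assumptions, not the authors' implementation.

```python
def text_aware_scores(cross_attn, text_sim, lam=0.5):
    """Hypothetical per-token relevance: a convex mix of two signals.

    cross_attn[i]: attention the text query pays to visual token i.
    text_sim[i]:   similarity between visual token i and the text tokens.
    """
    if len(cross_attn) != len(text_sim):
        raise ValueError("cross_attn and text_sim must have the same length")
    return [lam * a + (1.0 - lam) * s for a, s in zip(cross_attn, text_sim)]


def split_by_ratio(scores, merge_ratio=0.5):
    """Split token indices into (keep, to_merge).

    The lowest-scoring merge_ratio fraction is marked for merging; that the
    low scorers are the ones merged is an assumption of this sketch.
    """
    n_merge = int(len(scores) * merge_ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[n_merge:]), sorted(order[:n_merge])


cross_attn = [0.8, 0.1, 0.4, 0.2]
text_sim = [0.2, 0.1, 0.9, 0.3]

scores = text_aware_scores(cross_attn, text_sim, lam=0.5)
keep, to_merge = split_by_ratio(scores, merge_ratio=0.5)
print(keep, to_merge)
```

With $\lambda=0.5$ the two signals count equally, so a token the query attends to only weakly can still be kept if it is semantically close to the text, as token 2 is here. How the marked tokens are then merged into the kept ones is left out, since the review does not describe that step.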
Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.