Unified Spatiotemporal Token Compression for Video-LLMs at Ultra-Low Retention

cs.CV Junhao Du, Jialong Xue, Anqi Li, Jincheng Dai, Guo Lu · Mar 23, 2026

AI Review

Plain-language introduction

Video-LLMs struggle with high computational costs from massive visual token volumes (e.g., 6,272 tokens for a 32-frame video). This paper challenges the standard two-stage spatiotemporal compression paradigm—which assumes spatial and temporal redundancy are separable—by reformulating compression as a global allocation problem. The authors propose a unified selection mechanism combining attention weights and semantic similarity to identify high-contribution, low-redundancy tokens, plus a text-aware merging module for secondary compression inside the LLM. The result is a training-free, plug-and-play method that retains ~90% performance with only 2% of tokens.
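To make the scale concrete, the token budget implied by these figures can be worked out directly. This is only arithmetic on the numbers quoted above (6,272 tokens, 32 frames, about 2% retention); the variable names are illustrative.

```python
# Token-budget arithmetic using the figures quoted in the introduction.
total_tokens = 6272   # visual tokens for a 32-frame video
num_frames = 32
retention = 0.02      # the "about 2%" retention ratio

tokens_per_frame = total_tokens // num_frames   # tokens produced per frame
kept_tokens = round(total_tokens * retention)   # global budget after compression
kept_per_frame = kept_tokens / num_frames       # average budget per frame

print(tokens_per_frame, kept_tokens, round(kept_per_frame, 1))
```

At this budget each frame keeps fewer than four tokens on average, which is why how the budget is split across frames matters so much.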

Critical review

Verdict

Bottom line

The paper presents a compelling and well-executed solution to ultra-low token retention in Video-LLMs. The unified spatiotemporal pool is intuitive and empirically validated, achieving state-of-the-art results at extreme compression (2% retention). The ablation studies support the design choices, particularly the combination of attention and similarity metrics. However, the case for why unification outperforms staged approaches is empirical rather than theoretical, and the method relies on several fixed hyperparameters (the similarity threshold τ, the weight λ, and the clustering ratio) whose sensitivity is measured but whose values have no principled derivation. The gains over strong baselines such as HoliTom are consistent but sometimes marginal, suggesting the field is approaching saturation for this architecture class.

“Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%.”

paper · Abstract

“We reformulate video token compression as a spatiotemporal token allocation problem under a global constraint, aiming to select tokens that maximize informational contribution while minimizing semantic redundancy.”

paper · Section 3.1

What holds up

The core insight, that staged spatiotemporal compression leads to unbalanced allocation at ultra-low retention, is well-motivated and visually supported (Figure 1). The proposed mechanism integrates CLS-token attention scores with cosine-similarity pruning (Eq. 2) to balance informativeness against redundancy; the ablations report that adding similarity pruning improves performance by about 14%. The text-aware merging module inside the LLM addresses positional bias from RoPE by using a weighted combination of text-to-visual attention and text-visual token similarity (Eq. 9). The method generalizes across backbones (LLaVA-OneVision, LLaVA-Video, Qwen2.5-VL) and is evaluated on five diverse benchmarks.

“incorporating similarity pruning improves model performance by about 14%”

paper · Section 4.3
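The interplay between attention-based selection and similarity pruning can be illustrated with a small greedy sketch. It is not the paper's Eq. 2, which this review does not reproduce. It assumes tokens from all frames sit in one flat pool, are visited in descending attention order, and are kept only if their cosine similarity to every already-kept token is at most $\tau$. The function names `cosine` and `select_tokens` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_tokens(features, attn_scores, budget, tau=0.7):
    """Greedy selection over one global pool (assumed procedure, not the paper's).

    Tokens are visited from highest to lowest attention score. A token is kept
    only while the budget allows and it is not too similar (cosine > tau) to a
    token that was already kept. Everything else goes to a recycle list.
    """
    order = sorted(range(len(features)), key=lambda i: attn_scores[i], reverse=True)
    kept, recycled = [], []
    for i in order:
        redundant = any(cosine(features[i], features[j]) > tau for j in kept)
        if len(kept) < budget and not redundant:
            kept.append(i)
        else:
            recycled.append(i)
    return kept, recycled

# Toy pool: tokens 0/1 and 2/3 are near-duplicates, token 4 is distinct.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99], [-1.0, 0.0]]
attn_scores = [0.9, 0.8, 0.5, 0.4, 0.1]
kept, recycled = select_tokens(features, attn_scores, budget=3, tau=0.7)
print(kept, recycled)
```

In the toy pool, attention alone would spend two of the three slots on the near-duplicate pair 0 and 1. With the similarity check, the low-attention but distinct token 4 gets a slot instead.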

“The approach leverages the attention distribution from text to visual tokens to identify and retain semantically relevant visual content. At the same time, it incorporates a semantic similarity measure between text and visual tokens to reduce positional sensitivity.”

paper · Section 3.3
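A rough sketch of the text-aware scoring step follows. The exact form of Eq. 9 is not given in this review, so the convex combination $\lambda \cdot \text{attention} + (1-\lambda) \cdot \text{similarity}$, the min-max normalization, and the name `text_aware_keep` are all assumptions. Only $\lambda=0.5$ and the 50% keep ratio come from the quoted settings. The merging of dropped tokens is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def min_max(xs):
    """Rescale values to [0, 1]; all zeros if the values are constant."""
    lo, hi = min(xs), max(xs)
    return [0.0 for _ in xs] if hi == lo else [(x - lo) / (hi - lo) for x in xs]

def text_aware_keep(visual, text_mean, text_to_visual_attn, lam=0.5, keep_ratio=0.5):
    """Score visual tokens by a weighted mix of attention and text similarity.

    Assumed scoring rule (not the paper's Eq. 9):
        score = lam * attention + (1 - lam) * similarity
    Returns the indices of the top `keep_ratio` tokens, in original order.
    """
    attn = min_max(text_to_visual_attn)
    sim = min_max([cosine(v, text_mean) for v in visual])
    scores = [lam * a + (1 - lam) * s for a, s in zip(attn, sim)]
    n_keep = max(1, round(keep_ratio * len(visual)))
    ranked = sorted(range(len(visual)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy case: attention favours the last token, which is unrelated to the text.
visual = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text_mean = [1.0, 0.0]
attn = [0.2, 0.1, 0.3, 0.4]

mixed = text_aware_keep(visual, text_mean, attn, lam=0.5)
attn_only = text_aware_keep(visual, text_mean, attn, lam=1.0)
print(mixed, attn_only)
```

With attention alone (`lam=1.0`) the last token is kept despite pointing away from the text. The mixed score swaps it for token 0, which is the kind of positional sensitivity the similarity term is meant to dampen.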

Main concerns

Despite strong results, the method relies on several hand-tuned hyperparameters ($\tau=0.7$, $\lambda=0.5$, clustering ratio 0.3, layer $K=18$) whose values are chosen by grid search (Figure 5) without principled selection criteria. While the paper claims unified compression avoids the 'implicit assumption of spatiotemporal separability,' the experiments do not isolate this factor from the specific combination of attention and similarity metrics used. The performance gap over HoliTom shrinks at higher retention ratios (Table 4: 100.9% vs. 99.8% at 15%), suggesting the unified approach provides diminishing returns when token budgets are less constrained. Additionally, at 1% retention, performance drops to ~84%, which may be unacceptable for precision-critical applications. The 'process' time (74.0 ms, against 8.6 ms for FastVID and 88.7 ms for HoliTom) indicates non-trivial pre-processing costs.

“Ablation Experiment Results for Each Parameter”

paper · Figure 5

“Ours: 57.9 (100.9%), HoliTom: 57.3 (99.8%) at 15% retention”

paper · Table 4

“Process time: Ours 74.0ms, FastVID 8.6ms, HoliTom 88.7ms”

paper · Table 6

Evidence and comparison

The evidence broadly supports the central claim of state-of-the-art accuracy at ultra-low retention ratios. The comparisons against FastV, VisionZip, FastVID, and HoliTom are fair, using the same backbone (LLaVA-OneVision-7B) and evaluation protocol (LMMs-Eval). The margins hold across multiple benchmarks: 90.1% performance retention at 2% of tokens (vs. HoliTom's 87.7%) and 84.1% at 1% (vs. 82.9%). However, the comparison to FastVID at 2% (83.3% retention) is less fair to that baseline than the paper suggests, as FastVID was designed for different retention regimes. The ablation studies (Table 5) clearly demonstrate that both attention-based selection and similarity-based pruning are necessary, but they do not conclusively prove that the 'global pool' aspect (as opposed to sequential filtering) is the primary driver of improvement over HoliTom.

“Ours: 50.7 Avg (90.1%), HoliTom: 49.4 Avg (87.7%) at 2% retention”

paper · Table 1

“FastVID [24] achieves only 83.3% of its original performance at a 2% retention ratio”

paper · Introduction

“Ablation showing attention only: 41.8%, similarity only: 49.2%, both: 49.5%, both+clustering: 50.4%”

paper · Table 5
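The margins over HoliTom can be tabulated from the relative-performance figures quoted in this review. This is only a subtraction of reported numbers. The 15% row comes from a different table (Table 4) than the 2% row (Table 1), so the comparison across rows is indicative at best.

```python
# Relative performance (% of the uncompressed baseline) as quoted in this review,
# stored as (ours, HoliTom) and keyed by retention ratio in percent.
reported = {
    1: (84.1, 82.9),    # quoted in the review text
    2: (90.1, 87.7),    # Table 1
    15: (100.9, 99.8),  # Table 4 (different table, so only loosely comparable)
}

# Gap in percentage points at each retention ratio.
gaps = {ratio: round(ours - holitom, 1) for ratio, (ours, holitom) in reported.items()}

for ratio, gap in gaps.items():
    print(f"{ratio:>2}% retention: +{gap} points over HoliTom")
```

On these figures the margin is largest at 2% retention and roughly halves at both 1% and 15%.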

Reproducibility

The method is described with sufficient algorithmic detail (Equations 1-10) to permit reproduction, including the DPC-KNN clustering procedure and the text-aware merging mechanism. The authors specify exact hyperparameters ($\tau=0.7$, clustering ratio 0.3, $\lambda=0.5$, $K=18$, $R=50\%$) and use standard public benchmarks (MVBench, VideoMME, etc.) and open-source models (LLaVA-OneVision, Qwen2.5-VL). However, the provided text does not mention code availability or link to a code repository. Reproduction would require implementing the custom clustering and merging logic, and performance is sensitive to the exact threshold settings shown in Figure 5, which may vary across video distributions or model checkpoints not tested here.

“Parameters of the method are set as follows: the token similarity threshold $\tau$ is 0.7, and the clustering ratio is 0.3... preserving the top $R=50\%$ of visual tokens, and the $\lambda$ is 0.5”

paper · Section 4.1

“tokens assigned to the recycle pool are merged using DPC-KNN”

paper · Section 3.2
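For readers attempting a reimplementation, the following is a minimal sketch of generic DPC-KNN clustering (density peaks clustering with k-nearest-neighbour density), followed by mean-merging within each cluster. It is not the authors' code. The neighbour count `k`, the mean-merge rule, and the name `dpc_knn_merge` are assumptions, and only the clustering ratio of 0.3 comes from the quoted settings.

```python
import math

def dpc_knn_merge(tokens, cluster_ratio=0.3, k=2):
    """Cluster tokens with DPC-KNN and return one averaged token per cluster."""
    n = len(tokens)
    n_clusters = max(1, round(cluster_ratio * n))
    dist = [[math.dist(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]

    # Local density: close to 1 when the k nearest neighbours are very near.
    density = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        mean_sq = sum(d * d for d in nearest) / max(1, len(nearest))
        density.append(math.exp(-mean_sq))

    # Separation: distance to the closest token with strictly higher density.
    # The densest token has no such neighbour and gets the maximum distance.
    max_dist = max(max(row) for row in dist)
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if density[j] > density[i]]
        delta.append(min(higher) if higher else max_dist)

    # Cluster centres are the tokens with the largest density * separation.
    centres = sorted(range(n), key=lambda i: density[i] * delta[i], reverse=True)
    centres = centres[:n_clusters]

    # Assign every token to its nearest centre, then average each group.
    groups = {c: [] for c in centres}
    for i in range(n):
        groups[min(centres, key=lambda c: dist[i][c])].append(i)

    dim = len(tokens[0])
    return [
        [sum(tokens[i][d] for i in members) / len(members) for d in range(dim)]
        for members in groups.values()
        if members  # skip a centre that lost all members to an identical centre
    ]

# Two tight groups of three tokens each; a ratio of 0.3 yields two clusters.
recycle_pool = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
merged = dpc_knn_merge(recycle_pool, cluster_ratio=0.3, k=2)
print(sorted(merged))
```

The pairwise distance matrix makes this quadratic in the number of recycled tokens, which is one plausible contributor to the pre-processing overhead reported in Table 6.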

Abstract

Video large language models (Video-LLMs) face high computational costs due to large volumes of visual tokens. Existing token compression methods typically adopt a two-stage spatiotemporal compression strategy, relying on stage-specific metrics and an implicit assumption of spatiotemporal separability. Under extremely low retention ratios, however, such approaches often result in unbalanced allocation and loss of visual evidence essential for question answering. We reformulate token compression as a spatiotemporal allocation task within a global token retention pool. We propose a unified selection mechanism that integrates attention weights and semantic similarity to globally select tokens with high contribution and low redundancy. Unselected tokens are merged via clustering and refilled, preserving information integrity. Inside the LLM, we further introduce text-aware merging to perform secondary compression based on query relevance. Without requiring retraining, our method serves as a plug-and-play module compatible with existing Video-LLMs. Experiments show that retaining only about 2% of visual tokens preserves 90.1% of baseline performance across multiple benchmarks, while reducing FLOPs to roughly 2.6%. These benefits generalize across diverse backbones, decreasing end-to-end inference latency and memory consumption. Our unified spatiotemporal token compression strategy establishes the state-of-the-art in video understanding under ultra-low token retention.
