Rethinking Token Reduction for Large Vision-Language Models

cs.CV cs.AI Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang · Mar 23, 2026

Plain-language introduction

MetaCompress addresses token reduction for multi-turn VQA in Large Vision-Language Models, where future questions are unpredictable and may target any image region. The paper proposes a learning-based prompt-agnostic compression module trained via KL divergence minimization between original and compressed outputs, demonstrating that heuristic attention-based pruning is suboptimal for this scenario. The method achieves strong efficiency-accuracy trade-offs across five LVLM architectures while training on only ~20k samples.
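As a rough illustration of the training signal described above, the sketch below computes a KL divergence between a model's next-token distribution with the full visual tokens and with compressed tokens. The logits, the function names, and the direction KL(full || compressed) are assumptions made for illustration; the review does not state the exact form of the paper's loss.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical next-token logits over a 3-word vocabulary (not from the paper).
full_logits = [2.0, 0.5, -1.0]        # model fed all visual tokens
compressed_logits = [1.6, 0.9, -1.0]  # same model fed compressed visual tokens

p_full = softmax(full_logits)
p_comp = softmax(compressed_logits)

# Assumed direction: the full-token output acts as the reference distribution.
loss = kl_divergence(p_full, p_comp)
print(f"KL(full || compressed) = {loss:.6f}")
```

The loss is zero only when the two distributions coincide, so driving it down pushes the compressed model's outputs toward the uncompressed ones without requiring any answer labels.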

Critical review

Verdict

Bottom line

The paper presents a compelling solution to an important practical problem: reducing visual tokens in multi-turn VQA scenarios where prompt-dependent methods fail. The empirical demonstration that the tokens retained by the learned compression matrices are largely unrelated to attention scores provides strong evidence against existing heuristic approaches. While the core idea is solid and the results are promising, the evaluation relies partly on comparisons with methods not designed for multi-turn settings, and the experiments apply compression only before the LLM decoder.

“the vast majority of the retained tokens are unrelated to their attention scores, especially with regards to the attention to the language prompts”

paper · Section 4

“we apply our reduction module only before the LLM decoder, although MetaCompress can in principle be inserted at any layer”

paper · Section 5.1

What holds up

The unification of pruning and merging under the linear compression formulation $\tilde{X}_{\text{IMG}} = P X_{\text{IMG}}$ with $P \in \mathbb{R}_{+}^{m \times n}$ provides a clean theoretical framework for learning token reduction. The finding that tokens retained by the learned matrix exhibit no obvious relationship with heuristic attention cues (e.g., [CLS] token or prompt attention) directly validates the authors' claim that existing prompt-agnostic heuristics are suboptimal. The data-efficient training paradigm—requiring only about 20k samples and 30 GPU hours—along with compatibility with dynamic-resolution multi-scale vision towers (LLaVA-NeXT, XComposer-2.5) demonstrates strong practical applicability.

“the tokens retained by the learned matrix do not exhibit an obvious relationship with the heuristic attention cues commonly used in prior methods”

paper · Section 4

“training of LLaVA-NeXT-7B with a 90% reduction rate takes approximately 30 GPU hours”

paper · Section 6.1
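To make the linear formulation $\tilde{X}_{\text{IMG}} = P X_{\text{IMG}}$ concrete, the following minimal sketch shows how pruning and merging both arise as particular choices of a non-negative matrix $P$. The token values, the matrices `P_prune` and `P_merge`, and the helper `compress` are hypothetical and chosen only for illustration; in the paper $P$ is learned rather than hand-written.

```python
def compress(P, X):
    """Return P @ X for P of shape (m, n) and X of shape (n, d)."""
    n, d = len(X), len(X[0])
    assert all(len(row) == n for row in P), "P must have n columns"
    assert all(w >= 0 for row in P for w in row), "P must be non-negative"
    return [[sum(row[i] * X[i][j] for i in range(n)) for j in range(d)]
            for row in P]

# Four hypothetical visual tokens (n = 4), each with d = 2 features.
X = [
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
    [5.0, 1.0],
]

# Pruning: each row of P is one-hot, so it copies a single input token.
P_prune = [
    [1, 0, 0, 0],  # keep token 0
    [0, 0, 1, 0],  # keep token 2
]

# Merging: each row of P averages a group of input tokens.
P_merge = [
    [0.5, 0.5, 0.0, 0.0],  # average tokens 0 and 1
    [0.0, 0.0, 0.5, 0.5],  # average tokens 2 and 3
]

pruned = compress(P_prune, X)
merged = compress(P_merge, X)
print(pruned)  # [[1.0, 0.0], [3.0, 3.0]]
print(merged)  # [[0.5, 1.0], [4.0, 2.0]]
```

Both cases reduce four tokens to two (m = 2), which is why a single learnable $P$ can cover either strategy, or anything in between.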

Main concerns

The evaluation strategy relies heavily on comparing against FastV, a prompt-dependent method explicitly designed for single-turn VQA, in a multi-turn setting where it is expected to fail, potentially inflating the perceived performance gap. The experiments apply compression only before the LLM decoder, even though the authors note that the module can in principle be inserted at any layer, so potential efficiency gains from intermediate-layer compression remain untested. Furthermore, training requires careful stabilization: according to the paper, using the collapse regularization $\mathcal{L}_{\text{collapse}}$ alone leads to divergence, especially when the reduction rate is below 70%, which suggests that the optimization is sensitive to how the loss terms are balanced.

“FastV performs significantly worse than both the Sample and even the Random methods”

paper · Section 6.2

“training utilizing the $\mathcal{L}_{\text{collapse}}$ alone leads to divergence because of the relatively high penalty on the collapse objective, especially when the reduction rate is small (less than 70%)”

paper · Table 3

Evidence and comparison

The evidence strongly supports the claim that attention scores are suboptimal guidance for MT-VQA token reduction, as Figure 1 shows retained tokens under the learned matrix do not correlate with [CLS] or prompt attention distributions. However, the comparison framework is uneven: FastV's poor performance is predictable given its prompt-dependent design, while the lack of comparison against other learning-based or task-specific multi-turn methods limits the assessment of true state-of-the-art performance. The comparison with PruMerge is more appropriate but constrained by its incompatibility with multi-scale vision towers, restricting fair evaluation to LLaVA-1.5 only.

“Despite that some tokens with high attention to the [CLS] token are retained... the vast majority of the retained tokens are unrelated to their attention scores”

paper · Section 4

“LLaVA-PruMerge... is not compatible with the multi-scale visual tower”

paper · Section 6.1

Reproducibility

The paper provides detailed implementation specifics including hyperparameters (SGD with lr=$10^{-3}$, $\alpha_{\text{entropy}}=\alpha_{\text{collapse}}=1$, gradient clipping with a maximum value of $10^{-2}$, 2 epochs, batch size 36), architectures (LLaVA-1.5, LLaVA-NeXT, XComposer-2.5), and training data (20k samples from MT-GQA and MT-VQA-v2). Code is publicly available. However, reproduction requires significant compute (4×A6000 GPUs) and the meta-generator must be retrained for each LVLM architecture and reduction rate. The theoretical analysis relies on a $W_q = W_k$ initialization that is unlikely to persist once training updates the weights, potentially complicating verification of the "weighted pooling" initialization property described in Section 5.2 and Section 9.

“optimize the proposed MetaCompress with SGD... learning rate of $10^{-3}$... Gradient clipping is adopted with a maximum value of $10^{-2}$... train all the settings for 2 epochs with a batch size of 36 on four commercial NVIDIA RTX A6000 GPUs”

paper · Section 6.1

“with a specific initialization (i.e., $W_q = W_k$), MetaCompress will initially behave as a weighted pooling”

paper · Section 5.2
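For readers attempting a reproduction, the quoted optimizer settings (SGD, learning rate $10^{-3}$, gradient clipping with a maximum value of $10^{-2}$) can be sketched as a single update step. This sketch assumes the clipping threshold is a cap on the global L2 norm of the gradient; the quote does not say whether clipping is by norm or by value, and the parameter and gradient values below are made up.

```python
import math

LEARNING_RATE = 1e-3  # quoted in the review
MAX_GRAD_NORM = 1e-2  # assumption: treated as a global L2-norm cap

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def sgd_step(params, grads, lr=LEARNING_RATE, max_norm=MAX_GRAD_NORM):
    """One plain SGD update applied to clipped gradients."""
    clipped = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]

# Hypothetical parameters and a large gradient (L2 norm = 5).
params = [0.5, -0.25]
grads = [3.0, 4.0]

clipped = clip_by_global_norm(grads, MAX_GRAD_NORM)
new_params = sgd_step(params, grads)
print(clipped)
print(new_params)
```

Under this interpretation, each step can move the parameters by at most lr × max_norm = $10^{-5}$ in L2 norm, which gives a sense of how conservative the quoted settings are.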

Abstract

Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn Visual Question Answering (VQA), leaving the more practical multi-turn VQA (MT-VQA) scenario largely unexplored. MT-VQA introduces additional challenges, as subsequent questions are unknown beforehand and may refer to arbitrary image regions, making existing reduction strategies ineffective. Specifically, current approaches fall into two categories: prompt-dependent methods, which bias toward the initial text prompt and discard information useful for subsequent turns; prompt-agnostic ones, which, though technically applicable to multi-turn settings, rely on heuristic reduction metrics such as attention scores, leading to suboptimal performance. In this paper, we propose a learning-based prompt-agnostic method, termed MetaCompress, overcoming the limitations of heuristic designs. We begin by formulating token reduction as a learnable compression mapping, unifying existing formats such as pruning and merging into a single learning objective. Upon this formulation, we introduce a data-efficient training paradigm capable of learning optimal compression mappings with limited computational costs. Extensive experiments on MT-VQA benchmarks and across multiple LVLM architectures demonstrate that MetaCompress achieves superior efficiency-accuracy trade-offs while maintaining strong generalization across dialogue turns. Our code is available at https://github.com/MArSha1147/MetaCompress.
