ResPrune: Text-Conditioned Subspace Reconstruction for Visual Token Pruning in Large Vision-Language Models

cs.LG Xu Li, Yi Zheng, Yuxuan Liang, Zhe Liu, Xiaolei Chen, Haotian Chen, Rui Zhu, Xiangyang Xue · Mar 22, 2026



Plain-language introduction

Large Vision-Language Models (LVLMs) suffer from quadratic self-attention costs when processing high-resolution images that generate thousands of visual tokens. ResPrune addresses this by formulating token pruning as a subspace reconstruction problem: it greedily selects tokens that maximize residual energy (the orthogonal component unexplained by the current subset) in the LLM input embedding space. To align selection with user queries, it modulates these residuals by a text relevance score computed via cosine similarity with embedded nouns from the prompt. This yields a training-free, plug-in method that preserves semantic coverage while reducing compute.
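
The greedy selection described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' Algorithm 1: the function names, the random data, and the choice to start from the largest-norm token are assumptions, and the basis is recomputed with a QR factorization at every step for clarity rather than speed.

```python
import numpy as np

def residual_energy(V, selected):
    """Squared norm of each row of V after removing its projection onto span(V[selected])."""
    if not selected:
        return np.sum(V**2, axis=1)
    # Columns of Q form an orthonormal basis of the selected tokens' span
    # (assumes the selected tokens are linearly independent).
    Q, _ = np.linalg.qr(V[selected].T)
    residual = V - (V @ Q) @ Q.T
    return np.sum(residual**2, axis=1)

def greedy_select(V, k, weights=None):
    """Pick k rows of V, each time taking the token with the largest weighted residual energy."""
    T = V.shape[0]
    weights = np.ones(T) if weights is None else np.asarray(weights, dtype=float)
    selected = []
    for _ in range(k):
        score = residual_energy(V, selected) * weights
        if selected:
            score[selected] = -np.inf  # never pick the same token twice
        selected.append(int(np.argmax(score)))
    return selected

# Hypothetical data: 32 visual tokens of dimension 16, keep 8.
rng = np.random.default_rng(0)
V = rng.normal(size=(32, 16))
keep = greedy_select(V, k=8)
```

The optional `weights` argument is where a per-token text relevance factor would enter; with uniform weights the procedure is purely geometric.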

Critical review

Verdict

Bottom line

ResPrune presents a principled geometric alternative to attention-based token pruning, supported by strong empirical gains across LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL. The proposed greedy subspace expansion consistently outperforms 10+ recent baselines, achieving 99.3% relative performance while pruning 77.8% of tokens on LLaVA-1.5-7B. However, the method relies on manually tuned hyperparameters for text guidance strength and lacks released code, which limits immediate reproducibility and deployment flexibility.

What holds up

The core insight—that token utility is context-dependent and should be measured by residual energy $E_S(\mathbf{v}) = \|\mathbf{v} - P_S\mathbf{v}\|_2^2$ within a progressively expanded subspace—is well-motivated and empirically validated. The ablation in Table IV is particularly compelling: removing subspace reconstruction and using only single-pass text-relevance ranking (Setting-3) causes performance to collapse from 98.4% to 82.4% at a 77.8% pruning ratio, confirming that geometric coverage is essential. The text-conditioning mechanism $g(r) = r^\alpha$ provides consistent incremental gains (Table IV, Setting-2 vs. Full) without requiring LLM forward passes, preserving compatibility with FlashAttention.

“Setting-3 ... 82.4 ... Full Method ... 98.4”

paper · Table IV

“$\widetilde{E}_{S}(\mathbf{v}_{i})=E_{S}(\mathbf{v}_{i})\cdot g\!\left(R(\mathbf{v}_{i},\mathbf{U})\right)$”

paper · Section III-D
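
The quoted modulation can be illustrated with a short NumPy sketch. The review only states that relevance comes from cosine similarity with embedded prompt nouns and that $g(r) = r^\alpha$; taking the maximum over nouns and clipping negative similarities to zero are assumptions made here so the example is complete, and the function names are hypothetical.

```python
import numpy as np

def text_relevance(V, U, eps=1e-8):
    """Relevance of each visual token (rows of V) to the prompt nouns (rows of U)."""
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + eps)
    Un = U / (np.linalg.norm(U, axis=1, keepdims=True) + eps)
    sim = Vn @ Un.T              # (T, L) cosine similarities
    r = sim.max(axis=1)          # assumption: aggregate over nouns with max
    return np.clip(r, 0.0, 1.0)  # assumption: clip so r**alpha stays real-valued

def modulated_energy(energy, relevance, alpha=0.75):
    """Text-conditioned score: residual energy times g(r) = r**alpha."""
    return energy * relevance**alpha

# Toy example: three 2-D visual tokens and one noun embedding.
V = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
U = np.array([[2.0, 0.0]])
r = text_relevance(V, U)
scores = modulated_energy(np.ones(3), r, alpha=0.75)
```

For relevance values strictly between 0 and 1, a smaller $\alpha$ pushes $g(r)$ toward 1, so the text signal reshapes the residual ranking less.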

Main concerns

First, the surrogate objective in Eq. (9) minimizes the reconstruction error $\|\mathbf{V}-P_S\mathbf{V}\|_F^2$ in the embedding space, but the paper offers no theoretical or empirical justification that minimizing this Frobenius norm preserves downstream task performance, the objective of Eq. (6); it is simply assumed to be a tractable proxy. Second, the text guidance strength $\alpha$ requires manual tuning per model family ($\alpha=0.75$ for LLaVA vs. $\alpha=0.3$ for Qwen2.5-VL), and the authors acknowledge that this fixed global value cannot adapt to per-sample instruction specificity. Third, seed token selection is architecture-dependent (CLS attention for LLaVA, $\ell_2$ norm for Qwen), which makes the implementation fragile when porting to new architectures that lack CLS tokens.

“the strength of textual guidance is currently controlled by a fixed hyperparameter $\alpha$, which is manually tuned for different model families and remains constant across inputs”

paper · Section V

“Directly optimizing Eq. (6) is intractable in practice”

paper · Section III-A

“For LLaVA-1.5 and LLaVA-NeXT, we set ... $\alpha=0.75$ ... For Qwen2.5-VL ... reduce ... to $\alpha=0.3$”

paper · Section IV-A1
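
The relation between the Frobenius proxy and the per-token residual energy can be written out explicitly. Assuming the rows of $\mathbf{V}$ are the $T$ visual tokens $\mathbf{v}_i$ and $P_S$ is the orthogonal projector onto the span of the selected tokens, applied token by token (a reading of the notation, not something the review states), the squared Frobenius norm splits into a sum over tokens:

$$
\|\mathbf{V}-P_S\mathbf{V}\|_F^2 \;=\; \sum_{i=1}^{T}\|\mathbf{v}_i-P_S\mathbf{v}_i\|_2^2 \;=\; \sum_{i=1}^{T}E_S(\mathbf{v}_i)
$$

Selected tokens lie in the subspace and contribute zero, so under this reading the proxy is the total residual energy of the pruned tokens. The identity is purely geometric and does not by itself connect the proxy to downstream accuracy, which is the gap the first concern points to.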

Evidence and comparison

The evidence strongly supports the claim that ResPrune outperforms existing methods under aggressive pruning (88.9% pruning ratio), where it maintains 98.0% relative performance on LLaVA-1.5 versus 96.8% for the next best method (SCOPE) and 94.3% for DART (Table I). Comparisons appear fair: all methods are evaluated under identical token budgets on the same nine benchmarks. However, the efficiency results in Table VIII compare ResPrune only against the full-token baseline and do not include wall-clock comparisons against other pruning methods like FastV or SparseVLM that require partial LLM inference. ResPrune’s $\mathcal{O}(TLd)+\mathcal{O}(kTd+k^{2}d)$ preprocessing is lightweight, but a head-to-head latency comparison against all baselines would strengthen the efficiency argument.

“ResPrune (Ours) ... 98.0 ... SCOPE ... 96.8 ... DART ... 94.3 ... pruning ratio = 88.9%”

paper · Table I

“overall time complexity of ResPrune is $\mathcal{O}(TLd)+\mathcal{O}(kTd+k^{2}d)$”

paper · Section III-F
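
To give a feel for the quoted complexity, the sketch below converts pruning ratios into token budgets and evaluates the three leading terms with constants dropped. The symbol meanings ($T$ visual tokens, $L$ text noun tokens, $d$ embedding dimension, $k$ kept tokens) are an assumed reading of the notation, and the numbers are illustrative rather than taken from the paper: 576 visual tokens and $d=4096$ correspond to the usual LLaVA-1.5-7B configuration, and $L=8$ is a hypothetical prompt.

```python
def kept_tokens(total_tokens, pruning_ratio):
    """Number of visual tokens that remain after pruning the given fraction."""
    return round(total_tokens * (1.0 - pruning_ratio))

def selection_cost_terms(T, L, d, k):
    """Leading terms of O(TLd) + O(kTd + k^2 d), with constants dropped."""
    return {
        "text_relevance": T * L * d,  # similarity of T tokens with L nouns
        "greedy_scoring": k * T * d,  # k rounds of scoring all T tokens
        "basis_update": k * k * d,    # orthogonalizing against up to k directions
    }

T, d, L = 576, 4096, 8           # assumed values, see the paragraph above
k_778 = kept_tokens(T, 0.778)    # budget at a 77.8% pruning ratio
k_889 = kept_tokens(T, 0.889)    # budget at an 88.9% pruning ratio
terms = selection_cost_terms(T, L, d, k_778)
```

Under these assumptions the greedy scoring term dominates, and none of the terms involve a pass through the LLM layers; the counts say nothing about measured latency, which is why a wall-clock comparison would still be informative.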

Reproducibility

The paper describes implementation details (Algorithm 1, hyperparameters, and the spaCy model en_core_web_sm for noun extraction) but provides neither a code repository link nor a reference to supplementary material containing code. Critical settings, namely $\alpha \in \{0.3, 0.75\}$, $\epsilon$ for numerical stability, and the exact regex patterns for text cleaning, are specified, but random seeds and full preprocessing scripts are not detailed. Reproduction would require reimplementing the Gram–Schmidt orthogonalization update (line 18 of Algorithm 1) and the residual energy tracking. This is feasible from the pseudocode, but without released code it leaves room for implementation drift.

“regex to remove instruction formatting ... en_core_web_sm ... to extract noun tokens”

paper · Section IV-A1

“Update $\mathbf{Q}$ via Gram–Schmidt orthogonalization”

paper · Algorithm 1
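
A reimplementation of this step might look like the following incremental update, which extends an orthonormal basis by one token and adjusts every token's residual energy without recomputing projections from scratch. It is a sketch under assumptions, not the paper's line 18: the function name is hypothetical, and the way `eps` is used (skipping tokens already in the span) is a guess at the role of the stability constant.

```python
import numpy as np

def add_token(Q, energy, V, j, eps=1e-12):
    """Extend basis Q (d x m) with token V[j] and update all residual energies."""
    v = V[j]
    r = v - Q @ (Q.T @ v)        # Gram-Schmidt: strip components along the current basis
    norm = np.linalg.norm(r)
    if norm < eps:               # token already lies in the span, nothing to add
        return Q, energy
    q = r / norm                 # new unit direction, orthogonal to the columns of Q
    coeff = V @ q                # component of every token along q
    # Each residual loses exactly its component along q: E_new = E_old - (q^T v)^2.
    energy = np.maximum(energy - coeff**2, 0.0)
    return np.column_stack([Q, q]), energy

# Hypothetical data: 20 tokens of dimension 8, adding tokens 3, 7 and 11 in turn.
rng = np.random.default_rng(0)
V = rng.normal(size=(20, 8))
Q = np.zeros((8, 0))
energy = np.sum(V**2, axis=1)
for j in [3, 7, 11]:
    Q, energy = add_token(Q, energy, V, j)
```

The tracked energies can be checked against a direct projection, `np.sum((V - (V @ Q) @ Q.T)**2, axis=1)`, which is one way to catch the kind of implementation drift mentioned above.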

Abstract

Large Vision-Language Models (LVLMs) rely on dense visual tokens to capture fine-grained visual information, but processing all these tokens incurs substantial computational and memory overhead during inference. To address this issue, we propose ResPrune, a training-free visual token pruning framework that enables efficient LVLM inference by selecting a compact yet informative subset of visual tokens. ResPrune formulates visual token pruning as a subspace reconstruction problem and employs a greedy subspace expansion strategy guided by residual energy, allowing it to preserve the geometric structure of the original visual token space. To further incorporate cross-modal alignment, the selection process is conditioned on textual relevance, encouraging the retention of tokens that are both informative and instruction-relevant. The proposed method is lightweight and model-agnostic, and can be seamlessly integrated into existing LVLM pipelines without retraining or architectural modifications. Extensive experiments on multiple LVLM backbones, including LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL, demonstrate that ResPrune consistently outperforms existing pruning approaches across a wide range of benchmarks, while achieving effective reductions in computation, memory consumption, and inference latency.
