QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression

cs.CV cs.AI Zhongyang Li, Yaqian Li, Faming Fang, Rinyoichi Takezoe, Zi-Hao Bo, Cheng Qian, Mo Guang, Guixu Zhang, Kaiwen Long · Mar 22, 2026
AI Review
Plain-language introduction

QMoP targets the compute and memory bottleneck in multimodal LLMs caused by visual tokens, which far outnumber text tokens. The paper proposes a Query Guided Mixture-of-Projector that uses a learned router to dynamically combine three compression strategies: pooling for global semantics, resampling for high-level features, and pruning for fine-grained details. This adaptivity matters because any fixed compression rule sacrifices some type of information (global context or local detail), and which loss hurts most depends on the task.

Critical review
Verdict
Bottom line

The paper presents a well-motivated approach that achieves meaningful efficiency gains with minimal accuracy loss. The Query Guided Router effectively balances the trade-offs between compression paradigms, but it relies on relatively simple features (the CLS and EOS tokens), and the margin over the best single-strategy baselines (e.g., TokenPacker) remains modest: 100.73% vs 100.05% of the uncompressed model's average performance on LLaVA-1.5-7B.

“QMoP (ours) ... 100.73% ... TokenPacker ... 100.05%”
paper · Table 1
“we extract the class token $v_{CLS}\in\mathbb{R}^{C_{1}}$ from the penultimate layer of the vision encoder ... the input text $I_{text}$ is fed into a pre-trained CLIP text encoder, from which we extract the EOS token $t_{EOS}\in\mathbb{R}^{C_{2}}$”
paper · Section 3.2
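The routing step quoted above can be pictured with a minimal sketch. It assumes, hypothetically, that the router concatenates the two tokens and applies a single linear layer followed by a softmax over the three branches; the quoted excerpt does not specify the router architecture, and the names `route`, `weight`, `bias` and the toy dimensions `C1`, `C2` are illustrative only.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def route(v_cls, t_eos, weight, bias):
    """Map the concatenated [v_CLS; t_EOS] feature to one gate per branch.

    Assumption: a single linear layer. The paper may use a different router.
    """
    features = list(v_cls) + list(t_eos)
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)

rng = random.Random(0)
C1, C2 = 8, 6  # toy feature sizes, not the real encoder widths
BRANCHES = ("pooling", "resampler", "pruning")

v_cls = [rng.gauss(0.0, 1.0) for _ in range(C1)]  # stand-in for the vision CLS token
t_eos = [rng.gauss(0.0, 1.0) for _ in range(C2)]  # stand-in for the text EOS token
weight = [[rng.gauss(0.0, 0.1) for _ in range(C1 + C2)] for _ in BRANCHES]
bias = [0.0 for _ in BRANCHES]

gate = route(v_cls, t_eos, weight, bias)
for name, g in zip(BRANCHES, gate):
    print(f"{name}: {g:.3f}")
```

Because the only inputs are two global vectors, a router of this shape has no access to per-patch information, which is the granularity concern raised in this review.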
What holds up

The empirical justification for the three-branch design is compelling: Figure 2 and Table 4 show that different compression strategies excel at distinct tasks (pooling for global understanding, pruning for local details, resampling for style/emotion). The efficiency analysis substantiates significant computational savings: at 144 tokens, QMoP uses only 25% of the baseline FLOPs and KV cache while maintaining 100.73% of the uncompressed model's average performance (Table 6). VTCBench fills a genuine evaluation gap by isolating compression-induced degradation across five interpretable dimensions.

“Tokens 144 ... TFLOPs 0.94 ... KVcache 75.5M ... Performance 100.73%”
paper · Table 6
“Pooling excels on global understanding, pruning leads on local details, and resampling performs best on style and emotion”
paper · Figure 2
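The 25% figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes the uncompressed baseline keeps every patch token of the CLIP-ViT-Large-Patch14-336 encoder named in the paper's setup (336 / 14 = 24 patches per side), and that the KV cache is counted for visual tokens only, in fp16, with the standard Vicuna-7B shape of 32 layers and hidden size 4096. None of these accounting choices is stated in the quoted excerpts, so treat them as assumptions.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    side = image_size // patch_size
    return side * side

def kv_cache_bytes(tokens: int, layers: int, hidden: int, bytes_per_elem: int) -> int:
    """Keys plus values (factor 2) stored for every layer and every token."""
    return 2 * layers * hidden * bytes_per_elem * tokens

baseline_tokens = num_patch_tokens(336, 14)   # 24 * 24 = 576
compressed_tokens = 144
ratio = compressed_tokens / baseline_tokens   # 0.25

# Assumed model shape: 32 layers, hidden size 4096, fp16 (2 bytes per element).
kv_bytes = kv_cache_bytes(compressed_tokens, layers=32, hidden=4096, bytes_per_elem=2)

print(baseline_tokens)             # 576
print(ratio)                       # 0.25
print(f"{kv_bytes / 1e6:.1f}M")    # 75.5M
```

Under these assumptions the result is 75,497,472 bytes, which is consistent with the "KVcache 75.5M" entry quoted from Table 6.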
Main concerns

The router relies solely on the CLS and EOS tokens for routing decisions, which may lack the granularity needed for fine-grained visual tasks that depend on specific spatial regions. Table 4 shows that naively combining all branches without selection (A+B+C) lowers the VTCBench score to 60.2%, versus 61.8% for QMoP, which indicates that indiscriminate merging introduces noise. Even so, the paper offers limited analysis of the router's learned behavior or interpretability beyond the gating weights. VTCBench construction also depends on GPT-4.1 as an automatic judge for sample filtering, which may carry over the judge model's biases, and no inter-annotator agreement metrics are reported.

“A+B+C ... 60.2 ... QMoP ... 61.8”
paper · Table 4
“we employ an automatic filtering pipeline using Azure GPT-4.1 as the multimodal judge”
paper · Supplementary Section 1.2
Evidence and comparison

The evidence supports the central claim that adaptive compression outperforms fixed strategies: QMoP achieves the best score averaged over the five VTCBench dimensions (61.8% vs 60.5% for the best baseline). However, the comparison with intra-LLM compression methods (Table 2) is not matched on token budget (144 for QMoP vs 192 for methods such as DART), so it is not strictly like-for-like, and the paper does not establish statistical significance for the small margins observed on standard benchmarks (e.g., 0.68 percentage points over TokenPacker). The comparison with LLaVA-PruMerge and other pruning methods shows that QMoP generally wins on aggregate metrics but does not dominate on every individual benchmark.

“QMoP (Ours) ... 144 ... DART ... 192”
paper · Table 2
“QMoP ... 61.8 ... TokenPacker ... 60.5”
paper · Table 3
Reproducibility

The paper specifies the base architecture (LLaVA-1.5-7B/13B, Vicuna-7B, CLIP-ViT-Large-Patch14-336) and the training datasets (LAION-CC-SBU-558K for pretraining, a 665K mixed set for instruction tuning). However, critical hyperparameters for the two-stage training are omitted from the main text, including learning rates, warmup steps, batch sizes, and the temperature annealing schedule for the router. The paper mentions Gumbel noise injection and temperature reduction without quantitative specifics (Section 3.4). Code and the VTCBench dataset are not publicly released as of this review, blocking independent reproduction.

“We employ Vicuna-7B as the LLM and CLIP-ViT-Large-Patch14-336 as the visual encoder”
paper · Section 4.1
“As training proceeds, $\tau$ is gradually decreased to sharpen the expert selection. To enhance diversity ... Gumbel noise is introduced during training”
paper · Section 3.4
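To make concrete what is left unspecified, here is a minimal sketch of a temperature-scaled gate with Gumbel noise. It is a generic Gumbel-softmax, not the authors' implementation: the linear schedule in `linear_tau` and the values `tau_start=1.0` and `tau_end=0.1` are hypothetical placeholders for exactly the quantities the paper does not report.

```python
import math
import random

def gumbel_softmax(logits, tau, rng=None, training=True):
    """Softmax of (logits + Gumbel noise) / tau; noise is skipped at inference."""
    if training:
        rng = rng or random.Random()
        noisy = []
        for l in logits:
            u = max(rng.random(), 1e-12)  # guard against log(0)
            noisy.append(l - math.log(-math.log(u)))  # Gumbel(0, 1) sample
    else:
        noisy = list(logits)
    scaled = [x / tau for x in noisy]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def linear_tau(step, total_steps, tau_start=1.0, tau_end=0.1):
    """Hypothetical linear annealing schedule for the temperature."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return tau_start + frac * (tau_end - tau_start)

logits = [1.0, 0.5, -0.5]  # toy router logits for three branches
soft = gumbel_softmax(logits, tau=linear_tau(0, 100), training=False)
sharp = gumbel_softmax(logits, tau=linear_tau(100, 100), training=False)
noisy = gumbel_softmax(logits, tau=1.0, rng=random.Random(0), training=True)

print([round(p, 3) for p in soft])
print([round(p, 3) for p in sharp])
```

Lowering the temperature pushes the gate toward a near one-hot selection, so the start value, end value, and shape of the schedule directly affect how many branches contribute at convergence. That is why their absence matters for reproduction.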
Abstract

Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios. In this paper, we first propose Query Guided Mixture-of-Projector (QMoP), a novel and flexible framework that adaptively compresses visual tokens via three collaborative branches: (1) a pooling-based branch for coarse-grained global semantics, (2) a resampler branch for extracting high-level semantic representations, and (3) a pruning-based branch for fine-grained token selection to preserve critical visual detail. To adaptively coordinate these branches, we introduce the Query Guided Router (QGR), which dynamically selects and weights the outputs from different branches based on both visual input and textual queries. A Mixture-of-Experts-style fusion mechanism is designed to aggregate the outputs, harnessing the strengths of each strategy while suppressing noise. To systematically evaluate the effects of Visual Token Compression, we also develop VTCBench, a dedicated benchmark for evaluating the information loss induced by visual token compression. Extensive experiments demonstrate that despite relying on fundamental compression modules, QMoP outperforms strong baselines and delivers significant savings in memory, computation, and inference time.
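The three branches and the fusion step described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: 2x2 average pooling, top-k pruning by a hypothetical importance score, a single-head cross-attention resampler with fixed queries, and fusion as a weighted sum of equally sized outputs. The abstract does not specify any of these details, and the function names are not the paper's.

```python
import math

def avg_pool_2x2(grid):
    """Pooling branch: average each 2x2 window of an H x W grid of token vectors."""
    H, W, C = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            window = [grid[i + di][j + dj] for di in (0, 1) for dj in (0, 1)]
            out.append([sum(t[c] for t in window) / 4.0 for c in range(C)])
    return out

def prune_top_k(tokens, scores, k):
    """Pruning branch: keep the k highest-scoring tokens in their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:k])]

def resample(tokens, queries):
    """Resampler branch: each query attends over all tokens (cross-attention)."""
    C = len(tokens[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, t)) / math.sqrt(C) for t in tokens]
        m = max(logits)
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        out.append([sum(wi / z * t[c] for wi, t in zip(w, tokens)) for c in range(C)])
    return out

def fuse(branch_outputs, gate):
    """Assumed MoE-style fusion: gate-weighted sum of same-shaped branch outputs."""
    n, C = len(branch_outputs[0]), len(branch_outputs[0][0])
    return [
        [sum(g * out[i][c] for g, out in zip(gate, branch_outputs)) for c in range(C)]
        for i in range(n)
    ]

# Toy input: a 4 x 4 grid of 2-dimensional tokens, compressed from 16 to 4 tokens.
grid = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
tokens = [t for row in grid for t in row]
scores = [t[0] + t[1] for t in tokens]  # hypothetical importance score
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]  # stand-in for learned queries

pooled = avg_pool_2x2(grid)
pruned = prune_top_k(tokens, scores, k=4)
resampled = resample(tokens, queries)
fused = fuse([pooled, resampled, pruned], gate=[0.5, 0.3, 0.2])
print(len(tokens), "->", len(fused))
```

The sketch makes one design constraint visible: a weighted sum requires every branch to emit the same number of tokens, so the compression budget has to be shared across branches under this assumed fusion rule.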
