QMoP: Query Guided Mixture-of-Projector for Efficient Visual Token Compression
QMoP tackles the computational bottleneck in multimodal LLMs caused by excessive visual tokens, which dwarf text tokens in memory and compute costs. The paper proposes a Query Guided Mixture-of-Projector that dynamically combines three compression strategies—pooling for global semantics, resampling for high-level features, and pruning for fine-grained details—via a learned router. This adaptive approach matters because fixed compression rules inherently sacrifice different information types (global context vs. local details) depending on the task.
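To make the three-branch idea concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes each branch emits the same number of tokens k and that fusion is a gate-weighted sum; the review does not state either detail. All function names, the router weight matrix, and the use of the EOS text feature as the pruning score are hypothetical choices made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pool_branch(tokens, k):
    """Average consecutive groups of tokens down to k tokens (global semantics)."""
    group = len(tokens) // k
    out = []
    for i in range(k):
        chunk = tokens[i * group:(i + 1) * group]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out

def resample_branch(tokens, queries):
    """Each query attends over all tokens (a one-layer cross-attention stand-in)."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in queries:
        attn = softmax([dot(q, t) / scale for t in tokens])
        out.append([sum(a * t[j] for a, t in zip(attn, tokens))
                    for j in range(len(q))])
    return out

def prune_branch(tokens, text_feat, k):
    """Keep the k tokens scoring highest against the text feature, in original order."""
    ranked = sorted(range(len(tokens)),
                    key=lambda i: dot(tokens[i], text_feat), reverse=True)
    return [tokens[i] for i in sorted(ranked[:k])]

def route(cls_feat, eos_feat, router_weights):
    """Softmax gate over three branches from concatenated CLS and EOS features."""
    feat = cls_feat + eos_feat
    return softmax([dot(w, feat) for w in router_weights])

def compress(tokens, cls_feat, eos_feat, queries, router_weights, k):
    """Assumed fusion: gate-weighted sum of the three k-token branch outputs."""
    branches = [
        pool_branch(tokens, k),
        resample_branch(tokens, queries),
        prune_branch(tokens, eos_feat, k),
    ]
    gates = route(cls_feat, eos_feat, router_weights)
    dim = len(tokens[0])
    fused = [[sum(g * b[i][j] for g, b in zip(gates, branches))
              for j in range(dim)] for i in range(k)]
    return fused, gates

# Toy input: 8 two-dimensional "visual tokens" compressed to k = 2.
tokens = [[float(i), float(i % 2)] for i in range(8)]
cls_feat = [1.0, 0.0]            # stand-in for the visual CLS feature
eos_feat = [0.0, 1.0]            # stand-in for the text EOS feature
queries = [[0.1, 0.0], [0.0, 0.1]]
router_weights = [[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]]
fused, gates = compress(tokens, cls_feat, eos_feat, queries, router_weights, k=2)
```

In this toy setup the gate depends only on the two summary features, which mirrors the granularity concern raised later in the review: the router never sees individual spatial tokens.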
The paper presents a well-motivated approach that achieves meaningful efficiency gains with minimal accuracy loss. The Query Guided Router effectively balances trade-offs between compression paradigms. However, the mechanism relies on relatively simple features (CLS and EOS tokens), and the margin over the best single-strategy baselines (e.g., TokenPacker) remains modest: 100.73% vs 100.05% of the uncompressed model's average performance on LLaVA-1.5-7B.
The empirical justification for the three-branch design is compelling: Figure 2 and Table 4 demonstrate that different compression strategies excel at distinct tasks (pooling for global understanding, pruning for local details, resampling for style/emotion). The efficiency analysis substantiates significant computational savings: at 144 tokens, QMoP uses only 25% of the baseline FLOPs and KV cache while maintaining 100.73% of the uncompressed model's average performance (Table 6). VTCBench fills a genuine evaluation gap by isolating compression-induced degradation across five interpretable dimensions.
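The 25% figure is consistent with simple token-count arithmetic. The short sketch below assumes the uncompressed baseline feeds every patch token of a CLIP ViT-L/14 encoder at 336 px resolution into the LLM, which gives (336 / 14)^2 = 576 visual tokens. The helper names are hypothetical. The ratio covers only the visual-token share: KV cache for those tokens grows linearly with their count, while total FLOPs also depend on text length and on attention's quadratic term, so the result is an approximation rather than a reproduction of Table 6.

```python
def num_patch_tokens(image_size, patch_size):
    """Number of patch tokens produced by a ViT on a square image."""
    side = image_size // patch_size
    return side * side

def kept_fraction(kept_tokens, image_size=336, patch_size=14):
    """Fraction of the visual tokens that survive compression."""
    return kept_tokens / num_patch_tokens(image_size, patch_size)

full_tokens = num_patch_tokens(336, 14)   # 24 * 24 = 576
fraction_at_144 = kept_fraction(144)      # 144 / 576 = 0.25
```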
The router relies solely on CLS and EOS tokens for routing decisions, which may lack the granularity needed for fine-grained visual tasks that depend on specific spatial regions. Table 4 shows that naively combining all branches without selection (A+B+C) degrades performance to 60.2% on VTCBench versus QMoP's 61.8%, indicating that indiscriminate merging introduces significant noise. Even so, the paper offers limited analysis of the router's learned behavior or interpretability beyond the gating weights. VTCBench construction depends on GPT-4 for sample classification, which may carry over biases from the judge model, and the paper reports no inter-annotator agreement metrics.
The evidence supports the central claim that adaptive compression outperforms fixed strategies, with QMoP achieving the best average score on VTCBench across all five dimensions (61.8% vs 60.5% for the best baseline). However, comparisons with intra-LLM compression methods (Table 2) use different token counts (144 vs 192), potentially favoring QMoP, and the paper does not establish statistical significance for the small margins observed on standard benchmarks (e.g., a 0.68 percentage point improvement over TokenPacker). The comparison to LLaVA-PruMerge and other pruning methods shows that QMoP generally wins on aggregate metrics but does not dominate on every individual benchmark.
While the paper specifies the base architecture (LLaVA-1.5-7B/13B, Vicuna-7B, CLIP-ViT-Large-Patch14-336) and training datasets (LAION-CC-SBU-558K pretraining, 665K mixed instruction tuning), critical hyperparameters for the two-stage training—including learning rates, warmup steps, batch sizes, and the temperature annealing schedule for the router—are omitted from the main text. The paper mentions Gumbel noise injection and temperature reduction without quantitative specifics (Section 3.4). Code and the VTCBench dataset are not publicly released as of this review, blocking independent reproduction.
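To show what the missing specifics would need to pin down, here is a minimal sketch of Gumbel-softmax gating with a temperature schedule. It is an illustration of the general technique, not the paper's procedure: the linear schedule, the start and end temperatures, the step count, and the function names are all hypothetical, since the review notes that the paper gives no quantitative values.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Add Gumbel(0, 1) noise to each logit, then apply a tempered softmax."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def linear_temperature(step, total_steps, start=1.0, end=0.1):
    """Hypothetical schedule: interpolate linearly from start to end."""
    frac = min(step / total_steps, 1.0)
    return start * (1.0 - frac) + end * frac

# Same noise draw at a high and a low temperature (placeholder router logits).
logits = [0.2, 0.5, 0.1]
warm = gumbel_softmax(logits, linear_temperature(0, 1000), random.Random(0))
cold = gumbel_softmax(logits, linear_temperature(1000, 1000), random.Random(0))
```

With identical noise, the lower temperature yields a sharper gate, so the schedule controls how quickly routing moves from soft mixing toward near-discrete branch selection. That is why the omitted values matter for reproduction.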
Multimodal large language models suffer from severe computational and memory bottlenecks, as the number of visual tokens far exceeds that of textual tokens. While recent methods employ projector modules to align and compress visual tokens into text-aligned features, they typically depend on fixed heuristics that limit adaptability across diverse scenarios. In this paper, we first propose Query Guided Mixture-of-Projector (QMoP), a novel and flexible framework that adaptively compresses visual tokens via three collaborative branches: (1) a pooling-based branch for coarse-grained global semantics, (2) a resampler branch for extracting high-level semantic representations, and (3) a pruning-based branch for fine-grained token selection to preserve critical visual detail. To adaptively coordinate these branches, we introduce the Query Guided Router (QGR), which dynamically selects and weights the outputs from different branches based on both visual input and textual queries. A Mixture-of-Experts-style fusion mechanism is designed to aggregate the outputs, harnessing the strengths of each strategy while suppressing noise. To systematically evaluate the effects of Visual Token Compression, we also develop VTCBench, a dedicated benchmark for evaluating the information loss induced by visual token compression. Extensive experiments demonstrate that despite relying on fundamental compression modules, QMoP outperforms strong baselines and delivers significant savings in memory, computation, and inference time.