Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification
Whole Slide Images (WSIs) present a unique challenge for computational pathology due to their gigapixel scale and the scarcity of annotated data. This paper addresses few-shot weakly supervised WSI classification (FSWC) by proposing HIPSS, which combines parameter-efficient prompt tuning via Scaling and Shifting Features (SSF) in the text encoder with a hierarchical textual guidance strategy for WSI representation learning. The core innovation replaces expensive cross-attention mechanisms with lightweight linear transformations $y = \gamma \cdot x + \beta$ while avoiding hard instance filtering through soft cosine-similarity-based attention refinement, achieving up to 13.8% accuracy gains with 18.1% fewer parameters than state-of-the-art methods.
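To make the transformation $y = \gamma \cdot x + \beta$ concrete, here is a minimal pure-Python sketch of a scale-and-shift layer. It is an illustration only, not the authors' implementation: the class name `SSF`, the identity initialisation ($\gamma = 1$, $\beta = 0$), and the toy feature vector are assumptions made for this example.

```python
from typing import List


class SSF:
    """Hypothetical scale-and-shift layer: y = gamma * x + beta, per feature."""

    def __init__(self, dim: int) -> None:
        # Assumed identity initialisation: the layer starts as a no-op,
        # so the frozen encoder's features pass through unchanged.
        self.gamma: List[float] = [1.0] * dim
        self.beta: List[float] = [0.0] * dim

    def __call__(self, x: List[float]) -> List[float]:
        if len(x) != len(self.gamma):
            raise ValueError("feature dimension mismatch")
        # Element-wise scale, then shift.
        return [g * xi + b for g, xi, b in zip(self.gamma, x, self.beta)]

    def num_parameters(self) -> int:
        # Only gamma and beta are trainable: 2 * dim values per layer.
        return len(self.gamma) + len(self.beta)


layer = SSF(dim=4)
features = [0.5, -1.0, 2.0, 0.0]
unchanged = layer(features)           # identity at initialisation

# Pretend training has updated the two vectors (toy values).
layer.gamma = [2.0, 1.0, 0.5, 1.0]
layer.beta = [0.0, 0.1, 0.0, -0.3]
adapted = layer(features)
```

Because each such layer adds only two vectors of the feature dimension, the trainable-parameter count grows linearly with the number of tuned layers, which is consistent with the small parameter budgets reported in the review.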
HIPSS delivers a pragmatic and effective solution to FSWC by substituting compute-heavy cross-attention with SSF layers and eliminating aggressive instance filtering. The empirical gains across three cancer datasets are substantial and consistent, though the contribution represents a targeted adaptation of existing techniques (SSF from vision encoders to text encoders) rather than a fundamental methodological breakthrough. The paper's claim that "scaling and shifting features is unexplored in the domain of VLM-based prompt tuning" is technically defensible but overstates the novelty given the direct lineage from Lian et al. 2022.
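The difference between hard instance filtering and a soft, similarity-based weighting can be illustrated with a small sketch. The review does not give the paper's exact refinement formula, so the code below is only one possible contrast: the function names, the `threshold` and `temperature` arguments, and the toy embeddings are all hypothetical and are not the paper's $\lambda$ or $\alpha$.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hard_filter_weights(instances, text, threshold: float) -> List[float]:
    """Hard filtering: instances below the threshold get weight exactly 0."""
    keep = [cosine(x, text) >= threshold for x in instances]
    n = sum(keep)
    if n == 0:
        return [0.0] * len(instances)
    return [1.0 / n if k else 0.0 for k in keep]


def soft_weights(instances, text, temperature: float) -> List[float]:
    """Soft weighting: every instance keeps a non-zero share of attention."""
    return softmax([temperature * cosine(x, text) for x in instances])


text_emb = [1.0, 0.0]                              # toy text embedding
patches = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]     # toy instance embeddings

hard = hard_filter_weights(patches, text_emb, threshold=0.7)
soft = soft_weights(patches, text_emb, temperature=5.0)
```

In this toy case the hard variant discards the second and third patches entirely, whereas the soft variant merely down-weights them, so weakly aligned instances can still contribute to the slide-level representation.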
The SSF-based prompt tuning is convincingly effective, reducing trainable parameters to 0.2844M compared to 0.7053M for LoRA while achieving superior AUC scores (0.9105 vs 0.8513 on Camelyon16 16-shot). The hierarchical textual guidance elegantly leverages the inherent WSI structure without discarding potentially informative instances via hard thresholds. This approach yields strong weakly-supervised localization capabilities with a Dice coefficient of 0.732, significantly outperforming MIL baselines. The ablation studies rigorously demonstrate that both SSF tuning and the text-guided attention refinement contribute meaningfully to the final performance.
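For readers unfamiliar with the localization metric, the Dice coefficient is $2|A \cap B| / (|A| + |B|)$ for a predicted mask $A$ and a reference mask $B$. The sketch below computes it on flattened binary masks; the function name and the toy masks are assumptions for illustration and say nothing about how the paper binarises its attention maps.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy flattened masks: 1 = patch marked as tumour, 0 = normal.
predicted = [1, 1, 0, 0, 1, 0]
annotated = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(predicted, annotated)   # 2 * 2 / (3 + 3)
```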
The reliance on ChatGPT-4o for generating pathological descriptions introduces reproducibility and validation risks that the authors only acknowledge as a limitation in the conclusion: "A key limitation is the lack of validation of LLM-generated descriptions, which could be addressed through expert-in-the-loop evaluations." The method requires dataset-specific hyperparameter choices for the SSF tuning depth ($d_s$=2 for Camelyon16/TCGA-Lung vs $d_s$=8 for UBC-OCEAN) without providing clear selection criteria, suggesting potential instability when applied to new domains. Additionally, the attention refinement factor $\lambda=10$ and threshold $\alpha=0.2$ appear empirically determined with limited sensitivity analysis beyond Figure 3.
The experimental protocol is generally rigorous, employing CONCH as a consistent backbone across all baselines and reporting mean AUC "averaged over 3 different folds, each evaluated using 20 random seeds" to mitigate few-shot variance. However, the comparison with LoRA is limited to a single dataset (Camelyon16), weakening generalizability claims about parameter efficiency. Furthermore, the authors admit that FAST baseline results are imported directly from the original paper rather than reproduced under the same experimental setup because "this approach requires instance-level annotations... which are not publicly available," potentially confounding cross-method comparisons.
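The averaging protocol quoted above (mean AUC over folds and seeds) can be sketched as follows. This is a toy illustration under stated assumptions: AUC is computed with the pairwise rank (Mann-Whitney) formulation, the nested `runs[fold][seed]` layout is hypothetical, and the example uses 2 folds with 2 seeds rather than the 3 folds with 20 seeds reported in the paper.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Run = Tuple[Sequence[int], Sequence[float]]   # (labels, predicted scores)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative slide")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(runs: List[List[Run]]) -> float:
    """Average AUC over every (fold, seed) run; runs[fold][seed] is one run."""
    return mean(auc(labels, scores) for fold in runs for labels, scores in fold)


labels = [1, 1, 0, 0]
good = [0.9, 0.8, 0.3, 0.1]    # perfectly ranked: AUC = 1.0
noisy = [0.9, 0.2, 0.3, 0.1]   # one misranked pair: AUC = 0.75

runs = [
    [(labels, good), (labels, noisy)],    # fold 0, two seeds
    [(labels, noisy), (labels, noisy)],   # fold 1, two seeds
]
result = mean_auc(runs)
```

Averaging over many seeds per fold matters in the few-shot setting because a single draw of a handful of training slides can shift AUC substantially.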
Independent reproduction faces significant obstacles due to the absence of released code and the dependence on proprietary ChatGPT-4o for generating class descriptions, which may yield inconsistent outputs across API versions. While key hyperparameters are disclosed—including the attention weights refinement factor ($\lambda=10$) and threshold ($\alpha=0.2$)—the paper notes these were "determined empirically" without providing selection protocols. The requirement for dataset-specific tuning depths (varying $d_s$ from 2 to 8 layers) without principled guidance limits generalizability to new pathology datasets. Additionally, no memory consumption metrics or training time measurements are provided beyond inference latency, obscuring the true computational cost of the hierarchical attention mechanism.
Whole Slide Images (WSIs) are gigapixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance-level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we propose two key contributions. First, we introduce a new parameter-efficient prompt tuning method that scales and shifts features in the text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy that does not rely on hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements of up to 10.9%, 7.8%, and 13.8%, respectively, over the state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and 5.8% on the ovarian cancer dataset, while also excelling at weakly-supervised tumor localization. Code at https://github.com/Jayanie/HIPSS.