Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification

cs.CV Jayanie Bogahawatte, Sachith Seneviratne, Saman Halgamuge · Mar 23, 2026
Plain-language introduction

Whole Slide Images (WSIs) present a unique challenge for computational pathology due to their gigapixel scale and the scarcity of annotated data. This paper addresses few-shot weakly supervised WSI classification (FSWC) by proposing HIPSS, which combines parameter-efficient prompt tuning via Scaling and Shifting Features (SSF) in the text encoder with a hierarchical textual guidance strategy for WSI representation learning. The core innovation replaces expensive cross-attention mechanisms with lightweight linear transformations $y = \gamma \cdot x + \beta$, where $\gamma$ is a learnable scale and $\beta$ a learnable shift, and replaces hard instance filtering with soft cosine-similarity-based attention refinement. Per the abstract, this yields improvements of up to 10.9%, 7.8%, and 13.8% over state-of-the-art methods on the breast, lung, and ovarian cancer datasets respectively, with 18.1% fewer trainable parameters on the breast and lung datasets and 5.8% fewer on the ovarian dataset.
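
To make the scale-and-shift idea concrete, here is a minimal sketch of an SSF-style transformation in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the class name `SSF`, the identity initialisation ($\gamma = 1$, $\beta = 0$), and the per-feature vector shapes are all hypothetical choices made here, and where exactly HIPSS inserts such layers in the text encoder is not covered by this review.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SSF:
    """Per-feature scale and shift: y = gamma * x + beta (hypothetical sketch)."""

    dim: int
    gamma: List[float] = field(default_factory=list)
    beta: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumption: start as the identity map so the frozen encoder's
        # features pass through unchanged before any tuning.
        if not self.gamma:
            self.gamma = [1.0] * self.dim
        if not self.beta:
            self.beta = [0.0] * self.dim

    def __call__(self, x: List[float]) -> List[float]:
        if len(x) != self.dim:
            raise ValueError(f"expected {self.dim} features, got {len(x)}")
        return [g * xi + b for g, xi, b in zip(self.gamma, x, self.beta)]

    def num_parameters(self) -> int:
        # One scale and one shift per feature.
        return 2 * self.dim


layer = SSF(dim=4)
features = [0.5, -1.0, 2.0, 0.0]
unchanged = layer(features)  # identity at initialisation

# Simulate a tuning step that touched only the first feature.
layer.gamma[0], layer.beta[0] = 2.0, 0.1
adapted = layer(features)
```

The parameter count is the point of the design: a scale-and-shift over a width-$d$ feature vector adds $2d$ trainable values, whereas a single dense $d \times d$ projection, as used inside attention blocks, carries $d^2$ weights.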

Critical review
Verdict
Bottom line

HIPSS delivers a pragmatic and effective solution to FSWC by substituting compute-heavy cross-attention with SSF layers and eliminating aggressive instance filtering. The empirical gains across three cancer datasets are substantial and consistent, though the contribution represents a targeted adaptation of existing techniques (SSF from vision encoders to text encoders) rather than a fundamental methodological breakthrough. The paper's claim that "scaling and shifting features is unexplored in the domain of VLM-based prompt tuning" is technically defensible but overstates the novelty given the direct lineage from Lian et al. 2022.

“To the best of our knowledge, scaling and shifting features is unexplored in the domain of VLM-based prompt tuning.”
Bogahawatte et al. (this paper) · Introduction, paragraph 4
“Pathology related descriptions were generated using ChatGPT-4o, based on a set of queries for each class.”
Bogahawatte et al. (this paper) · Section 4.1, paragraph 3
What holds up

The SSF-based prompt tuning is convincingly effective: it reduces trainable parameters to 0.2844M, compared to 0.7053M for LoRA (about 60% fewer), while achieving a higher AUC (0.9105 vs 0.8513 on Camelyon16, 16-shot). The hierarchical textual guidance leverages the inherent WSI structure without discarding potentially informative instances via hard thresholds. This approach also yields strong weakly supervised localization, with a Dice coefficient of 0.732 obtained without any localization-specific fine-tuning, outperforming MIL baselines. The ablation studies indicate that both SSF tuning and the text-guided attention refinement contribute meaningfully to the final performance.

“Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets”
Bogahawatte et al. (this paper) · Abstract
“Our method achieves a dice coefficient of 0.732 without any localization-specific finetuning.”
Bogahawatte et al. (this paper) · Section 4.3, paragraph 6
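
For readers unfamiliar with the localization metric quoted above, the Dice coefficient between a predicted mask $A$ and a ground-truth mask $B$ is $2|A \cap B| / (|A| + |B|)$. The sketch below computes it over flattened binary masks. The function name and the toy masks are illustrative only; how the paper converts attention scores into a binary tumor mask is not described in this review, so that step is assumed to have already happened.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks (1 = tumor, 0 = normal)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Convention chosen here: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / total


# Toy example with six instances: two overlap, three positives in each mask.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```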
Main concerns

The reliance on ChatGPT-4o for generating pathological descriptions introduces reproducibility and validation risks that the authors only acknowledge as a limitation in the conclusion: "A key limitation is the lack of validation of LLM-generated descriptions, which could be addressed through expert-in-the-loop evaluations." The method requires dataset-specific hyperparameter choices for the SSF tuning depth ($d_s = 2$ for Camelyon16 and TCGA-Lung vs $d_s = 8$ for UBC-OCEAN) without providing clear selection criteria, suggesting potential instability when applied to new domains. Additionally, the attention refinement factor $\lambda = 10$ and threshold $\alpha = 0.2$ appear empirically determined with limited sensitivity analysis beyond Figure 3.

“A key limitation is the lack of validation of LLM-generated descriptions, which could be addressed through expert-in-the-loop evaluations.”
Bogahawatte et al. (this paper) · Section 5
“2 layers were used for C16 and TCGA-Lung datasets, while 8 layers were found optimal for UBC-OCEAN dataset”
Bogahawatte et al. (this paper) · Section 4.1, paragraph 3
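
The role of a refinement factor and a threshold is easier to see with a toy contrast between hard filtering and soft reweighting. The sketch below is a generic illustration and not the paper's formula: the rule `logit + lam * (cos - alpha)` is a hypothetical stand-in chosen here, as are the function names and the 2-D toy embeddings. Only the values `lam = 10` and `alpha = 0.2` come from the paper's reported settings.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_filter(instances, text, alpha: float):
    """Drop every instance whose text similarity falls below alpha."""
    return [x for x in instances if cosine(x, text) >= alpha]


def soft_refine(logits, instances, text, lam: float, alpha: float) -> List[float]:
    """Hypothetical soft rule: shift each attention logit by its similarity margin."""
    refined = [z + lam * (cosine(x, text) - alpha) for z, x in zip(logits, instances)]
    return softmax(refined)


text_embedding = [1.0, 0.0]
instances = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
attention_logits = [0.0, 0.0, 0.0]

kept = hard_filter(instances, text_embedding, alpha=0.2)
weights = soft_refine(attention_logits, instances, text_embedding, lam=10.0, alpha=0.2)
```

Under hard filtering the third instance is removed outright, whereas the soft variant keeps all three with non-zero weight and merely down-weights poorly aligned ones. A sensitivity analysis would repeat this over a grid of `lam` and `alpha` values.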
Evidence and comparison

The experimental protocol is generally rigorous, employing CONCH as a consistent backbone across all baselines and reporting mean AUC "averaged over 3 different folds, each evaluated using 20 random seeds" to mitigate few-shot variance. However, the comparison with LoRA is limited to a single dataset (Camelyon16), weakening generalizability claims about parameter efficiency. Furthermore, the authors admit that FAST baseline results are imported directly from the original paper rather than reproduced under the same experimental setup because "this approach requires instance-level annotations... which are not publicly available," potentially confounding cross-method comparisons.

“Mean AUC values averaged over 3 different folds, each evaluated using 20 random seeds”
Bogahawatte et al. (this paper) · Table 1 caption
“For the FAST method, we report the results directly from [6], as this approach requires instance-level annotations for WSIs during training, which are not publicly available.”
Bogahawatte et al. (this paper) · Section 4.1, paragraph 4
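
The evaluation protocol quoted above (3 folds, 20 random seeds each) can be sketched as a small aggregation loop. This is an assumption-laden illustration: `placeholder_run` returns synthetic numbers and stands in for a real train-and-evaluate call, and pooling all 60 runs before averaging is one plausible reading of the caption, since the paper may instead average within folds first.

```python
import random
import statistics
from typing import Callable, Dict


def evaluate_protocol(
    run: Callable[[int, int], float],
    n_folds: int = 3,
    n_seeds: int = 20,
) -> Dict[str, float]:
    """Run every (fold, seed) pair and summarise the resulting AUC values."""
    scores = [run(fold, seed) for fold in range(n_folds) for seed in range(n_seeds)]
    return {
        "n_runs": len(scores),
        "mean_auc": statistics.mean(scores),
        "std_auc": statistics.stdev(scores),
    }


def placeholder_run(fold: int, seed: int) -> float:
    # Synthetic stand-in, NOT a result from the paper: a real version would
    # train on the few-shot split for this fold and seed and return test AUC.
    rng = random.Random(fold * 1000 + seed)
    return 0.5 + 0.5 * rng.random()


summary = evaluate_protocol(placeholder_run)
```

Reporting the standard deviation alongside the mean matters in the few-shot setting, where the choice of support slides can move results considerably from one seed to the next.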
Reproducibility

Independent reproduction faces significant obstacles. Although the abstract links a code repository (https://github.com/Jayanie/HIPSS), the class descriptions depend on the proprietary ChatGPT-4o, which may yield inconsistent outputs across API versions. Key hyperparameters are disclosed, including the attention weights refinement factor ($\lambda=10$) and threshold ($\alpha=0.2$), both held fixed across all experiments, but the paper provides no selection protocol for them and notes only that the SSF tuning depth was "determined empirically". The requirement for dataset-specific tuning depths (varying $d_s$ from 2 to 8 layers) without principled guidance limits generalizability to new pathology datasets. Additionally, no memory consumption metrics or training time measurements are provided beyond inference latency, obscuring the true computational cost of the hierarchical attention mechanism.

“Attention weights refinement factor was 10 and threshold value was 0.2 through all experiments.”
Bogahawatte et al. (this paper) · Section 4.1, paragraph 3
“The number of layers that we tune using SSF in the text encoder was determined empirically.”
Bogahawatte et al. (this paper) · Section 4.1, paragraph 3
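
A reproduction attempt could start by pinning the disclosed settings in one place. The sketch below records only values quoted in this review (refinement factor 10, threshold 0.2, SSF depth 2 or 8 by dataset, ChatGPT-4o as the description source). The class name, field names, and dictionary layout are hypothetical and do not correspond to any configuration format in the authors' code.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HipssSettings:
    """Hypothetical record of the hyperparameters quoted in the review."""

    dataset: str
    ssf_depth: int                   # number of text-encoder layers tuned with SSF
    refinement_factor: float = 10.0  # lambda, fixed across all experiments
    threshold: float = 0.2           # alpha, fixed across all experiments
    description_source: str = "ChatGPT-4o"


SETTINGS = {
    "Camelyon16": HipssSettings(dataset="Camelyon16", ssf_depth=2),
    "TCGA-Lung": HipssSettings(dataset="TCGA-Lung", ssf_depth=2),
    "UBC-OCEAN": HipssSettings(dataset="UBC-OCEAN", ssf_depth=8),
}

# Serialise so the exact settings can be stored next to experiment outputs.
settings_json = json.dumps({name: asdict(s) for name, s in SETTINGS.items()}, indent=2)
```

Storing the generated class descriptions themselves next to such a record would also sidestep drift between LLM API versions.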
Abstract

Whole Slide Images (WSIs) are giga-pixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we propose two key contributions. First, we introduce a new parameter efficient prompt tuning method by scaling and shifting features in text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs, but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy without utilizing hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements up-to 10.9%, 7.8%, and 13.8% respectively, over the state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and 5.8% on the ovarian cancer dataset, while also excelling at weakly-supervised tumor localization. Code at https://github.com/Jayanie/HIPSS.
