Mixture of Mini Experts: Overcoming the Linear Layer Bottleneck in Multiple Instance Learning

cs.CV Daniel Shao, Joel Runevic, Richard J. Chen, Drew F.K. Williamson, Ahrong Kim, Andrew H. Song, Faisal Mahmood · Mar 23, 2026
AI Review
Plain-language introduction

Multiple Instance Learning (MIL) for gigapixel pathology images relies on a single linear layer to transform general patch features into task-specific representations before aggregation. This paper identifies this linear layer as a critical yet overlooked bottleneck and proposes Mammoth, a parameter-efficient mixture-of-experts module that replaces it with multi-headed soft routing to specialized low-rank experts. By routing morphologically similar patches to distinct expert slots, Mammoth achieves superior performance without increasing model size, demonstrating that the feature transformation step matters more than the choice of aggregation function.
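To make "without increasing model size" concrete, here is a back-of-the-envelope sketch comparing the parameter count of one dense linear layer with that of a bank of low-rank experts. The feature dimensions (1024 to 512) and the sharing scheme (one shared down-projection, per-expert up-projections) are illustrative assumptions, not details taken from the paper; only the expert count of 30 comes from the review.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights plus bias of a single linear layer."""
    return d_in * d_out + d_out


def low_rank_expert_params(d_in: int, d_out: int, rank: int,
                           num_experts: int, share_down: bool = True) -> int:
    """Parameters of experts factorized as down (d_in x rank) then up (rank x d_out).

    With share_down=True, all experts reuse one down-projection (an assumed
    sharing scheme, used here only for illustration).
    """
    down = d_in * rank
    up = rank * d_out + d_out  # per-expert up-projection plus bias
    if share_down:
        return down + num_experts * up
    return num_experts * (down + up)


D_IN, D_OUT = 1024, 512   # hypothetical feature sizes
NUM_EXPERTS = 30          # E = 30, as reported in the review

budget = dense_params(D_IN, D_OUT)

# Largest rank whose expert bank still fits inside the dense layer's budget.
max_rank_shared = max(
    r for r in range(1, D_OUT)
    if low_rank_expert_params(D_IN, D_OUT, r, NUM_EXPERTS, share_down=True) <= budget
)
max_rank_unshared = max(
    r for r in range(1, D_OUT)
    if low_rank_expert_params(D_IN, D_OUT, r, NUM_EXPERTS, share_down=False) <= budget
)

print("dense layer parameters:", budget)
print("max rank with shared down-projection:", max_rank_shared)
print("max rank without sharing:", max_rank_unshared)
```

Under these assumed sizes, sharing the down-projection lets each expert use a noticeably higher rank for the same budget, which is the general reason low-rank factorization and weight sharing make many experts affordable.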

Critical review
Verdict
Bottom line

The paper presents a strong empirical case that the task-specific linear layer is indeed a bottleneck in MIL pipelines. Across 8 MIL architectures and 19 clinical tasks, Mammoth improves performance in 130 of 152 configurations (+3.8% average) while maintaining the same parameter budget as the original linear layer, suggesting the findings are robust and clinically relevant. As noted in the abstract, "when equipped with Mammoth, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer," indicating that the quality of task-specific feature transformation outweighs architectural sophistication in the aggregation stage.

“Mammoth improves performance in 130 of the 152 examined configurations, with an average +3.8% change in performance.”
paper · Abstract
“when equipped with Mammoth, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer”
paper · Abstract
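The headline counts can be sanity-checked with simple arithmetic. The short sketch below uses only the figures quoted above (8 architectures, 19 tasks, 130 improved configurations); the variable names are illustrative.

```python
num_methods = 8      # MIL architectures evaluated
num_tasks = 19       # clinical tasks
improved = 130       # configurations where Mammoth improved performance

total_configs = num_methods * num_tasks
not_improved = total_configs - improved
improved_share = improved / total_configs

print(f"total configurations: {total_configs}")
print(f"improved share: {improved_share:.1%}")
print(f"configurations without improvement: {not_improved}")
```

The product of methods and tasks matches the 152 configurations cited. The quoted passages do not say how the remaining configurations split between ties and regressions.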
What holds up

The multi-head soft MoE design is particularly well-suited to computational pathology, where hard expert assignment suffers from training instability due to limited data (<1,000 patients) and massive patch counts (≈10,000 per sample). The low-rank decomposition and weight sharing enable scaling to 30 experts without parameter inflation, and the interpretability analyses convincingly demonstrate that experts specialize in distinct morphological concepts. For instance, the authors observe that "the patches with high weights routed to slot 5 of expert 21 overlap heavily with the tumor region," while other experts specialize in stroma or lymphocytes. Furthermore, efficiency analyses confirm that "Mammoth is both faster and more lightweight than all Sparse MoE methods," achieving better performance without computational penalties.

“the patches with high weights routed to slot 5 of expert 21 (Fig. 3B) overlap heavily with the tumor region of both LUAD and LUSC slides.”
paper · Section 5.2
“Mammoth is both faster and more lightweight than all Sparse MoE methods.”
paper · Section 5.3
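For readers unfamiliar with soft routing, the following is a minimal sketch of a generic Soft MoE-style dispatch into low-rank experts. It is not the Mammoth implementation: it omits the multi-head structure and weight sharing, all sizes are toy values, and every function and variable name is hypothetical. Pure Python is used for portability; a real implementation would use a tensor library.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def soft_route(patches, prototypes, experts, slots_per_expert):
    """Softly dispatch patches to slots, then transform each slot with its expert.

    patches:    n patch feature vectors of length d
    prototypes: one length-d vector per slot (learned in a real model)
    experts:    list of (down, up) low-rank factor pairs, one per expert
    Returns (slot_outputs, dispatch_weights).
    """
    dim = len(patches[0])
    # logits[j][i]: similarity between slot j's prototype and patch i
    logits = [matvec(patches, proto) for proto in prototypes]
    # Soft routing: each slot holds a distribution over ALL patches,
    # rather than each patch being hard-assigned to one expert.
    dispatch = [softmax(row) for row in logits]
    outputs = []
    for slot_idx, weights in enumerate(dispatch):
        # Slot input is a weighted average of patch features.
        slot_in = [sum(w * p[k] for w, p in zip(weights, patches)) for k in range(dim)]
        down, up = experts[slot_idx // slots_per_expert]
        # Low-rank expert: project down to rank r, then up to the output size.
        outputs.append(matvec(up, matvec(down, slot_in)))
    return outputs, dispatch


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions): 6 patches, 4-dim features, rank 2, 3-dim output,
# 2 experts with 2 slots each.
N, D, RANK, D_OUT, E, S = 6, 4, 2, 3, 2, 2
patches = rand_matrix(N, D)
prototypes = rand_matrix(E * S, D)
experts = [(rand_matrix(RANK, D), rand_matrix(D_OUT, RANK)) for _ in range(E)]

slot_features, dispatch = soft_route(patches, prototypes, experts, S)
print(len(slot_features), "slot features of size", len(slot_features[0]))
```

Because every slot receives a differentiable mixture of all patches, no patch is dropped and no discrete assignment has to be learned, which is the usual argument for why soft routing trains more stably than hard assignment when data are scarce.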
Main concerns

The paper occasionally overstates novelty by claiming that "none have yet addressed the second point, the critical layer which transforms general-purpose features into task-specific features," whereas recent "pre-aggregation modules" (e.g., regional transformers, local self-attention) do modify patch features before aggregation, albeit not via MoE. The survival prediction improvements (+2.78% c-index) are more modest than the classification gains, and the fixed hyperparameter configuration (E=30, H=16, S=9) across diverse tasks suggests limited exploration of task-specific architectures. The authors acknowledge this in their limitations: "Limitations include the use of a fixed configuration of experts, slots, and heads for each task." Additionally, the Instance Gradient Interference analysis, while intuitively appealing, is presented without rigorous statistical validation or comparison against alternative explanatory mechanisms.

“While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second point, the critical layer which transforms general-purpose features into task-specific features.”
paper · Introduction
“Limitations include the use of a fixed configuration of experts, slots, and heads for each task.”
paper · Section 6
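Since the survival results are reported as a c-index, a brief sketch of how that metric is typically computed may help in interpreting the +2.78% figure. This is a simplified version of Harrell's concordance index (tied event times are not treated specially); it is not taken from the paper, and the toy data are made up for illustration.

```python
def concordance_index(times, events, risks):
    """Simplified Harrell's c-index.

    A pair (i, j) is comparable when patient i had an observed event and
    time_i < time_j. It is concordant when the model assigned i the higher
    risk; tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (made-up values): follow-up times, event flags (1 = event
# observed, 0 = censored), and predicted risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]

c_index = concordance_index(times, events, risks)
print(f"c-index: {c_index:.2f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so gains of a few points can be meaningful but are harder to obtain than comparable gains in classification accuracy.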
Evidence and comparison

The evidence robustly supports Mammoth's superiority over standard linear layers and alternative MoE variants (Soft MoE, Sparse MoE, PaMoE), with comprehensive ablations showing that each design component contributes to performance. However, comparisons to other pre-aggregation approaches such as regional feature re-embedding (RRT) or local self-attention are confined to ablations on a small subset of tasks, making it unclear whether the gains stem specifically from the MoE mechanism or simply from consolidating similar patches prior to aggregation. The authors note that, for ablations, "Performance is measured with ABMIL averaged across six tasks," and Table 4 shows RRT at only −2.9% relative to Mammoth, suggesting that while Mammoth is superior, its margin over other pre-aggregation methods is narrower than its margin over the baselines.

“Extensive ablations reveal that Mammoth surpasses other MoE adaptations in CPath.”
paper · Abstract
“Performance is measured with ABMIL averaged across six tasks: BRACS C/F, EBRAINS C/F, and GBMLGG C/F.”
paper · Section 5.3
Reproducibility

Reproducibility is strong: code is available at the GitHub repository, all datasets are public (TCGA, EBRAINS, BRACS, etc.), and hyperparameters are explicitly documented (E=30, H=16, S=9, AdamW with learning rate 1×10⁻⁴). The paper provides detailed implementation in Appendices A1-A4, including exact training procedures and random seeding strategies for slot initialization. As stated, "We train all models with the AdamW optimizer with a learning rate of 1×10⁻⁴, a cosine decay scheduler, and mixed precision." However, exact training times per experiment are not reported, and the dependency of soft routing on randomly initialized slot prototypes could introduce variance in reproduction that is not fully characterized.

“Code available at https://github.com/mahmoodlab/mammoth”
paper · Header
“We train all models with the AdamW optimizer with a learning rate of 1×10⁻⁴, a cosine decay scheduler, and mixed precision according to PyTorch's native implementation.”
paper · Appendix A2
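The reported settings can be collected into a single configuration object, which is a convenient starting point for a reproduction attempt. The sketch below records only values quoted in this review; the class name and field names are hypothetical and do not reflect the repository's actual configuration format, and reading S as slots per expert is an assumption based on the phrase "slot 5 of expert 21".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MammothRunConfig:
    """Hypothetical container for the hyperparameters quoted in the review."""

    num_experts: int = 30        # E
    num_heads: int = 16          # H
    num_slots: int = 9           # S (assumed here to mean slots per expert)
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    lr_schedule: str = "cosine"  # cosine decay scheduler
    mixed_precision: bool = True

    @property
    def total_slots(self) -> int:
        """Total slot count, valid only under the slots-per-expert assumption."""
        return self.num_experts * self.num_slots


config = MammothRunConfig()
print(config)
print("total slots (assumed E x S):", config.total_slots)
```

Freezing the dataclass mirrors the paper's use of one fixed configuration across tasks; a task-specific sweep would instead vary the first three fields.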
Abstract

Multiple Instance Learning (MIL) is the predominant framework for classifying gigapixel whole-slide images in computational pathology. MIL follows a sequence of 1) extracting patch features, 2) applying a linear layer to obtain task-specific patch features, and 3) aggregating the patches into a slide feature for classification. While substantial efforts have been devoted to optimizing patch feature extraction and aggregation, none have yet addressed the second point, the critical layer which transforms general-purpose features into task-specific features. We hypothesize that this layer constitutes an overlooked performance bottleneck and that stronger representations can be achieved with a low-rank transformation tailored to each patch's phenotype, yielding synergistic effects with any of the existing MIL approaches. To this end, we introduce MAMMOTH, a parameter-efficient, multi-head mixture of experts module designed to improve the performance of any MIL model with minimal alterations to the total number of parameters. Across eight MIL methods and 19 different classification tasks, we find that such task-specific transformation has a larger effect on performance than the choice of aggregation method. For instance, when equipped with MAMMOTH, even simple methods such as max or mean pooling attain higher average performance than any method with the standard linear layer. Overall, MAMMOTH improves performance in 130 of the 152 examined configurations, with an average +3.8% change in performance. Code is available at https://github.com/mahmoodlab/mammoth.
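The three-step sequence described in the abstract can be sketched in a few lines. This is a minimal baseline illustration with mean or max pooling, assuming patch features have already been extracted (step 1); the weights and toy inputs are made up, and the function names are hypothetical. The first `linear` call is the layer that MAMMOTH is said to replace.

```python
def linear(x, weight, bias):
    """Apply a linear layer: one output per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mean_pool(features):
    """Average each feature dimension across patches."""
    count = len(features)
    return [sum(column) / count for column in zip(*features)]


def max_pool(features):
    """Take the maximum of each feature dimension across patches."""
    return [max(column) for column in zip(*features)]


def mil_forward(patch_features, proj_w, proj_b, cls_w, cls_b, pool=mean_pool):
    # Step 2: the task-specific linear layer applied to every patch.
    task_features = [linear(p, proj_w, proj_b) for p in patch_features]
    # Step 3: aggregate patches into one slide feature, then classify.
    slide_feature = pool(task_features)
    return linear(slide_feature, cls_w, cls_b)


# Toy slide with three 2-dimensional patch features (step 1 assumed done).
patch_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
proj_w, proj_b = [[1.0, 0.0], [0.0, 2.0]], [0.0, 0.0]
cls_w, cls_b = [[1.0, 1.0]], [0.0]

mean_logits = mil_forward(patch_features, proj_w, proj_b, cls_w, cls_b, pool=mean_pool)
max_logits = mil_forward(patch_features, proj_w, proj_b, cls_w, cls_b, pool=max_pool)
print("mean-pooling logits:", mean_logits)
print("max-pooling logits:", max_logits)
```

Swapping the `pool` argument changes only step 3, which makes clear how the paper can hold the aggregator fixed while varying the transformation in step 2.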
