Pruned Adaptation Modules: A Simple yet Strong Baseline for Continual Foundation Models

cs.LG Elif Ceren Gok Yildirim, Murat Onur Yildirim, Joaquin Vanschoren · Mar 22, 2026

Plain-language introduction

The paper challenges the rapid shift toward Vision Transformer-based continual learning by demonstrating that lightweight, pruned Convolutional Networks can outperform existing foundation model approaches. The authors propose Pruned Adaptation Modules (PAM), which freeze early ResNet layers and introduce sparsely structured task-specific modules, yielding significant parameter reductions while improving accuracy. This work fills a critical methodological gap by establishing a strong, efficient baseline that questions whether recent advances reflect genuine progress or merely the absence of rigorous ConvNet comparisons.

Critical review

Verdict

Bottom line

PAM is a compelling baseline for class-incremental learning and delivers on its promise of parameter efficiency and competitive performance. The core mechanism, structured pruning of task-specific ResNet layers combined with confidence-based inference, is well-motivated and empirically validated across diverse benchmarks. However, the ResNets are pre-trained only on ImageNet-1K, whereas the ViT-B/16 baselines are pre-trained on ImageNet-21K. This difference in representational capacity is a confounding factor, so the performance gains cannot be attributed solely to the proposed architectural innovations.

What holds up

The parameter efficiency claims are substantiated by clear evidence: the paper reports up to a "~5× reduction in trainable parameters and a ~6× reduction in total parameters" compared to existing FM-based methods. The ablation studies support the design choices, showing that pruning after the first epoch outperforms later pruning (Figure 4a) and that confidence-based module selection exceeds distance-based alternatives (Figure 4c). In long-horizon experiments on ImageNet-R (Figure 6), PAM remains stable and resists catastrophic forgetting better than prompt-based and adapter-based alternatives.

“PAM yields up to a ~5× reduction in trainable parameters and a ~6× reduction in total parameters”

paper · Abstract

“PAM uses 5× and 2× fewer trainable parameters than the state-of-the-art prompt-based CODA-Prompt and adapter-based EASE”

paper · Section 5.1

Main concerns

The evaluation protocol employs ResNets pre-trained on ImageNet-1K while comparing against ViT-B/16 models initialized from ImageNet-21K checkpoints, as noted in the implementation details: "existing methods utilize the pre-trained ViT-B/16-IN21K model which initially trained on ImageNet-21K... we employ pre-trained ResNet... models that are trained solely on ImageNet-1K." This mismatch in pre-training data scale and diversity obscures whether PAM's success stems from superior architectural design or from inherent differences in the base models' learned representations. Additionally, the confidence-based selection mechanism relies on maximum softmax probabilities (Equation 6: $\hat{b}=\arg\max_{b}\frac{1}{|\mathbf{x}_{test}|}\sum_{x_{i}\in\mathbf{x}_{test}}\max_{y\in\mathcal{Y}_{b}}p_{b}(y\mid x_{i})$), and the paper provides neither calibration nor a theoretical analysis of failure modes when task distributions overlap significantly. Finally, the method's reliance on ResNet-specific architectural constraints limits generalizability to other backbone types.

“existing methods utilize the pre-trained ViT-B/16-IN21K model which initially trained on ImageNet-21K and subsequently fine-tuned on ImageNet-1K, we employ pre-trained ResNet18, ResNet50, ResNet101 and ResNet152 models that are trained solely on ImageNet-1K”

paper · Section 4 (Implementation Details)

“$\hat{b}=\arg\max_{b}\frac{1}{|\mathbf{x}_{test}|}\sum_{x_{i}\in\mathbf{x}_{test}}\max_{y\in\mathcal{Y}_{b}}p_{b}(y\mid x_{i})$”

paper · Equation 6
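
To make the selection rule in Equation 6 concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each task-specific module $b$ outputs raw logits over its own label set $\mathcal{Y}_b$ for every sample in a non-empty test batch. The names `softmax`, `select_module`, and `toy_logits` are hypothetical and chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one sample's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_module(logits_per_module):
    """Pick the module with the highest mean max-softmax confidence.

    logits_per_module[b][i] holds the logits that module b assigns to
    test sample i over its own classes (assumed layout, not from the paper).
    Returns (index of the selected module, per-module scores).
    """
    scores = []
    for module_logits in logits_per_module:
        # max_y p_b(y | x_i) for each sample, then the batch average
        confidences = [max(softmax(sample)) for sample in module_logits]
        scores.append(sum(confidences) / len(confidences))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy batch of two samples scored by two modules (made-up numbers).
toy_logits = [
    [[2.0, 0.1, 0.0], [1.5, 0.2, 0.1]],  # module 0: peaked predictions
    [[0.3, 0.2, 0.1], [0.1, 0.1, 0.2]],  # module 1: near-uniform predictions
]
best, scores = select_module(toy_logits)
```

Because the score is an average of uncalibrated softmax maxima, a module that is confidently wrong on overlapping task distributions would still be selected, which is the failure mode raised above.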

Evidence and comparison

The experimental evidence strongly supports the efficiency claims and demonstrates that PAM outperforms state-of-the-art methods on CIFAR-100, CUB-200, and Cars-196, though average accuracy occasionally lags on certain benchmarks. Table 1 shows PAM (RN152) achieves 93.79% final accuracy on CIFAR B0 Inc5 versus 85.97% for EASE and 81.46% for CODA-Prompt. However, comparisons to related work would benefit from controlled experiments using identical pre-training regimes to disentangle the impact of architectural choices from pre-training data advantages. The saliency-based pruning using $L_1$-norm ($s_c = \sum|W_c^i|$ in Equation 2) is standard but effective, though the paper does not compare against unstructured pruning or alternative importance metrics.

“$s_{c}=\sum|W_{c}^{i}|$”

paper · Equation 2

“PAM (RN152) 94.17 ± 1.4 93.79 ± 1.7”

paper · Table 1
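
The $L_1$-norm saliency of Equation 2 can be illustrated with a short sketch. It is an illustration under stated assumptions, not the paper's code: $c$ is taken to index the output channels of one layer, each channel's weights are flattened into a list, and at least one channel is always kept. The helper names `l1_saliency` and `channels_to_keep` are hypothetical.

```python
def l1_saliency(filters):
    """s_c = sum of absolute weights of channel c (one flat list per channel)."""
    return [sum(abs(w) for w in channel) for channel in filters]


def channels_to_keep(filters, prune_ratio):
    """Return sorted indices of the channels that survive structured pruning.

    prune_ratio is the fraction of channels removed, e.g. 0.96.
    Keeping at least one channel is an assumption of this sketch.
    """
    scores = l1_saliency(filters)
    n_keep = max(1, round(len(filters) * (1.0 - prune_ratio)))
    # Rank channels from most to least salient and keep the top n_keep.
    ranked = sorted(range(len(filters)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:n_keep])


# Toy layer with four channels of three weights each (made-up numbers).
toy_filters = [
    [0.5, -0.5, 0.1],
    [0.01, 0.02, -0.01],
    [-1.0, 2.0, 0.5],
    [0.2, 0.0, -0.1],
]
kept = channels_to_keep(toy_filters, prune_ratio=0.5)
```

Removing whole channels, rather than individual weights, is what makes the pruning structured: the surviving layer is a smaller dense layer, so the parameter savings translate directly into a smaller module.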

Reproducibility

The paper provides detailed implementation specifics: 25 training epochs with the Adam optimizer, learning rate $0.001$, batch size 48, and 96% pruning magnitude applied after the first epoch ($e=1$ in Algorithm 1). Experiments were conducted using PyTorch and the PILOT framework with five random seeds, enhancing reliability. The abstract links a code repository (https://github.com/ElifCerenGokYildirim/PAM), but the provided text does not describe data preparation scripts, so practical reproducibility of the confidence-based inference algorithm and the exact structured pruning implementation depends on what that repository contains. The pseudocode in Algorithm 1 clarifies the training loop but omits low-level details such as batch normalization handling during pruning or the specific random seed initialization for the classification head.

“if $e=1$ then Train $\gamma_{b}$... Rank saliency... Obtain $\mathcal{S}_{b}$”

paper · Algorithm 1

“For our method, PAM, we train the models for 25 epochs using the Adam optimizer with a batch size of 48 and a learning rate of 0.001”

paper · Section 4
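
The reported hyperparameters and the prune-after-first-epoch schedule can be collected into a small configuration sketch. The values come from the quotes above. The `PamConfig` and `epoch_plan` names are hypothetical, and the assumption that the remaining epochs train only the pruned module is this sketch's reading of Algorithm 1, not something the quoted text states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PamConfig:
    """Hyperparameters as reported in the review (field names are hypothetical)."""
    optimizer: str = "adam"
    learning_rate: float = 0.001
    batch_size: int = 48
    epochs: int = 25
    prune_ratio: float = 0.96  # fraction of the task-specific module pruned
    prune_epoch: int = 1       # e = 1 in Algorithm 1
    num_seeds: int = 5         # seed values themselves are not given


def epoch_plan(cfg):
    """List (epoch, action) pairs for one task under the assumed schedule."""
    plan = []
    for e in range(1, cfg.epochs + 1):
        if e < cfg.prune_epoch:
            plan.append((e, "train dense module"))
        elif e == cfg.prune_epoch:
            plan.append((e, "train dense module, rank saliency, prune"))
        else:
            # Assumption: later epochs fine-tune only the pruned module.
            plan.append((e, "train pruned module"))
    return plan


plan = epoch_plan(PamConfig())
```

A sketch like this makes the open questions visible: it has no field for batch normalization handling during pruning or for the classification head's initialization, the details the pseudocode leaves out.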

Abstract

The continual learning literature has rapidly shifted from traditional class incremental learning (CIL) techniques to foundation model (FM)-based CIL methods without a clear understanding of how these newer approaches compare to strong, lightweight convolutional baselines. This abrupt transition has created a substantial methodological gap, making it difficult to assess whether recent FM-based CIL progress reflects genuine advances or merely the absence of rigorous baselines. To address this gap, we introduce Pruned Adaptation Modules (PAM), a simple yet effective method that freezes the vast majority of the pre-trained ResNet while enabling scalable continual adaptation through sparse task-specific layers. PAM yields up to a ~5x reduction in trainable parameters and a ~6x reduction in total parameters, significantly reducing the cost of continual updates. Across diverse benchmarks, PAM consistently mitigates catastrophic forgetting and outperforms state-of-the-art FM-based CIL approaches. Our findings position PAM as a strong and transparent baseline that helps bridge the gap between traditional and FM-based CIL, guiding future research for a more accurate assessment of true progress in continual adaptation. The code can be found at: https://github.com/ElifCerenGokYildirim/PAM.
