Reframing Long-Tailed Learning via Loss Landscape Geometry
This paper addresses long-tailed (LT) learning by proposing that the head-tail performance trade-off stems from "tail performance degradation"—where models overfit to head classes and forget tail classes. The core idea reframes LT learning as continual learning, using a Grouped Knowledge Preservation (GKP) module to maintain class-specific optimal parameters and a Grouped Sharpness Aware (GSA) module to find flatter minima. The method operates without external data or pre-trained models, showing improvements on CIFAR-LT, ImageNet-LT, and iNaturalist benchmarks.
The paper presents a reasonable framework combining continual learning with sharpness-aware minimization for long-tailed recognition. However, the core contributions are largely incremental: GKP adapts Elastic Weight Consolidation (EWC) with spectral clustering, while GSA extends class-conditional SAM ideas to a group level. The theoretical justification for the group-specific perturbation radius (Eq. 13) is weak: the supplementary explicitly concedes that it "requires empirical tuning" (Section 9.2, Remarks). The performance gains (1.3% over BCL on CIFAR100-LT) are modest but consistent. The paper claims state-of-the-art results without requiring external data, yet methods like Meta-CALA [72] achieve 52.3%, only 0.9 points below this paper's 53.2%. Because no run-to-run variance is reported, that gap cannot be judged significant, suggesting the improvements are more marginal than advertised.
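For context on the EWC comparison, the standard diagonal EWC penalty can be sketched as below. This is the textbook form, not the paper's Eq. 6: the function names, and the idea of summing the penalty over stored group anchors, are illustrative assumptions about how a grouped variant might look.

```python
from typing import Sequence


def ewc_penalty(
    theta: Sequence[float],
    theta_star: Sequence[float],
    fisher_diag: Sequence[float],
    lam: float,
) -> float:
    """Standard diagonal EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    if not (len(theta) == len(theta_star) == len(fisher_diag)):
        raise ValueError("theta, theta_star and fisher_diag must have equal length")
    return 0.5 * lam * sum(
        f * (t - t_star) ** 2
        for t, t_star, f in zip(theta, theta_star, fisher_diag)
    )


def grouped_penalty(
    theta: Sequence[float],
    anchors: Sequence[Sequence[float]],
    fishers: Sequence[Sequence[float]],
    lam: float,
) -> float:
    """Hypothetical grouped variant (an assumption, not the paper's GKP):
    sum the EWC penalty over one stored anchor per group."""
    return sum(ewc_penalty(theta, a, f, lam) for a, f in zip(anchors, fishers))
```

With one anchor per group instead of one per class, the number of stored parameter copies scales with G rather than with the number of classes, which is the efficiency argument the grouping is meant to serve.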
The intuition about "tail performance degradation" aligns with established continual learning phenomena, and the gradient decomposition in GSA (Eq. 12) reasonably addresses the head-dominated gradient problem in SAM. The ablation studies show that the projected component alone (GSA-proj) yields a large drop (46.4% vs 53.2% for full GSA, a 6.8-point gap), confirming that removing the head-dominated component matters. The convergence proof showing a linear rate to a neighborhood (Eq. 36) is technically sound given the Polyak-Łojasiewicz (P-L) condition assumptions. The method requires no external data, an important practical constraint the authors consistently emphasize.
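To make the SAM discussion concrete, here is a minimal sketch of one standard SAM update, together with a generic split of a gradient into components parallel and orthogonal to a reference direction. Neither function reproduces the paper's Eq. 12 or Eq. 13; the names `sam_step` and `decompose`, and the use of a head-class gradient as the reference direction, are assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def sam_step(
    theta: Vector,
    grad_fn: Callable[[Vector], Vector],
    rho: float,
    lr: float,
) -> Vector:
    """One standard SAM update: perturb along the normalized gradient,
    then descend using the gradient evaluated at the perturbed point."""
    g = grad_fn(theta)  # first forward-backward pass
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(theta)  # stationary point: no perturbation direction
    perturbed = [t + rho * x / norm for t, x in zip(theta, g)]
    g_sharp = grad_fn(perturbed)  # second forward-backward pass
    return [t - lr * gs for t, gs in zip(theta, g_sharp)]


def decompose(g: Vector, ref: Vector) -> Tuple[Vector, Vector]:
    """Split g into its component along ref and the orthogonal remainder."""
    denom = sum(r * r for r in ref)
    if denom == 0.0:
        return [0.0] * len(g), list(g)
    coef = sum(a * r for a, r in zip(g, ref)) / denom
    parallel = [coef * r for r in ref]
    orthogonal = [a - p for a, p in zip(g, parallel)]
    return parallel, orthogonal
```

The two calls to `grad_fn` are the source of SAM's two passes per step; a group-level variant that perturbs separately per group would add one perturbed pass per group.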
First, the theoretical justification for the group-specific radius (Eq. 13) admits in the supplementary that it relies on approximations requiring empirical tuning, undermining the claim of a principled derivation. Second, the effectiveness of spectral clustering for grouping classes based on parameter similarity is asserted but not validated: no analysis shows whether grouped classes actually share meaningful data distributions or merely happen to have similar convergence trajectories. Third, the computational overhead is significant: the method requires (G+1) forward-backward passes per step compared to SAM's 2 passes, and Table 7 reports 1.67h versus BCL's 1.05h, a 59% increase, raising deployment concerns. Fourth, the claim of "significant performance gains over state-of-the-art methods" (Abstract) is overstated; the improvement over BCL is 1.3% on CIFAR100-LT and 1.9% on ImageNet-LT. Fifth, the paper uses an adaptive parameter α that is scheduled from 0.95 to 0.6 (Eq. 20) but provides no ablation showing whether this scheduling actually matters compared with a fixed weight.
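The overhead figures quoted above can be checked with a few lines of arithmetic. The helper names are illustrative; the inputs are the Table 7 timings and the pass counts stated in this review. Note that the wall-clock ratio is measured against BCL while the pass count is compared against SAM, so the two numbers are not directly comparable.

```python
def relative_overhead(method_hours: float, baseline_hours: float) -> float:
    """Fractional increase in training time over a baseline."""
    return method_hours / baseline_hours - 1.0


def passes_per_step(num_groups: int) -> int:
    """Forward-backward passes per step as stated in the review: G + 1."""
    return num_groups + 1


# Wall-clock increase over BCL (1.67h vs 1.05h).
measured = relative_overhead(1.67, 1.05)

# Pass-count ratio against SAM's 2 passes, for G = 4.
pass_ratio_vs_sam = passes_per_step(4) / 2

print(f"wall-clock overhead vs BCL: {measured:.1%}")
print(f"passes per step vs SAM (G=4): {pass_ratio_vs_sam:.1f}x")
```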
The evidence supports that the combination of GKP and GSA improves over either alone (Table 4: +0.8% for BCL+GKP, +0.5% for BCL+GSA, +1.3% for the full method). However, comparisons to related work require scrutiny. The BCL baseline of 51.9% differs from published BCL results, suggesting potential implementation differences. The GBG method [28] achieves 52.3% on CIFAR100-LT r=100, only 0.9 points below this paper's 53.2%, yet is evaluated on different splits. The "Many/Med/Few" breakdown shows this method trades 0.9 percentage points on "Many" classes for gains of 1.8 to 2.0 points on "Med" and "Few" compared to BCL, suggesting the method shifts the bias rather than fundamentally resolving the seesaw dilemma. The iNaturalist 2018 result (74.4% vs Meta-CALA's 74.0%) is a marginal improvement given the dataset scale. The paper does not report standard deviations across runs, making statistical significance impossible to assess.
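As a sketch of the kind of reporting that would address the missing-variance concern, the snippet below computes mean and standard deviation over seeds and a Welch t-statistic for the gap between two methods. The accuracy lists are placeholder values invented purely to exercise the code; they are not results from the paper or from any baseline.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(accs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(accs), statistics.stdev(accs)


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t-statistic for the difference in means (unequal variances)."""
    se = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    if se == 0.0:
        raise ValueError("zero variance in both samples; t-statistic undefined")
    return (statistics.mean(a) - statistics.mean(b)) / se


# PLACEHOLDER accuracies over three seeds (not real results).
placeholder_method = [70.0, 70.5, 71.0]
placeholder_baseline = [69.0, 69.5, 70.0]

mean_m, std_m = summarize(placeholder_method)
mean_b, std_b = summarize(placeholder_baseline)
print(f"method:   {mean_m:.2f} +/- {std_m:.2f}")
print(f"baseline: {mean_b:.2f} +/- {std_b:.2f}")
print(f"Welch t:  {welch_t(placeholder_method, placeholder_baseline):.2f}")
```

Even three to five seeds per configuration would let readers judge whether gaps of about one point exceed run-to-run noise.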
The paper provides a code URL (<https://gkp-gsa.github.io/>) but this was inaccessible at review time. Key hyperparameters are reported: λ=100 for preservation strength (Table 6), G=4 groups for CIFAR and iNaturalist, G=6 for ImageNet-LT, and α scheduled from 0.95 to 0.6. However, critical details are missing: (1) how the group number G is selected per dataset without validation-set tuning, (2) the computational cost of spectral clustering at each epoch, (3) memory requirements for storing per-class optimal parameters θ_enc^c, and (4) the exact Fisher Information Matrix approximation used for F_{j,i} in Eq. 6. The supplementary mentions ImageNet-LT uses 90 epochs vs iNaturalist's 100 with different learning rates (0.1 vs 0.2), but no rationale is given. Without code access and with significant method complexity (two loss components, adaptive weighting, memory bank updates, clustering), independent reproduction would face substantial challenges.
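The memory question in point (3) can be framed with a back-of-the-envelope estimate. The sketch below assumes float32 storage and one diagonal Fisher vector per stored anchor; the encoder size of one million parameters is a hypothetical round number, not a figure from the paper.

```python
def preservation_memory_bytes(
    num_anchors: int,
    num_params: int,
    bytes_per_value: int = 4,   # float32, an assumption
    store_fisher: bool = True,  # one diagonal Fisher vector per anchor, an assumption
) -> int:
    """Bytes needed to store anchor parameters (and optionally Fisher diagonals)."""
    copies = 2 if store_fisher else 1
    return num_anchors * num_params * bytes_per_value * copies


HYPOTHETICAL_ENCODER_PARAMS = 1_000_000  # placeholder size, not from the paper

per_class = preservation_memory_bytes(100, HYPOTHETICAL_ENCODER_PARAMS)  # 100 classes
per_group = preservation_memory_bytes(4, HYPOTHETICAL_ENCODER_PARAMS)    # G = 4 groups

print(f"per-class storage: {per_class / 1e6:.0f} MB")
print(f"per-group storage: {per_group / 1e6:.0f} MB")
```

Reporting the actual encoder size and which of these quantities are stored would let readers scale the estimate to ImageNet-LT and iNaturalist, where the class counts are far larger.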
Balancing the performance trade-off on long-tail (LT) data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called "tail performance degradation" (the model tends to severely overfit on head classes while quickly forgetting tail classes) and pose a solution from a loss landscape perspective. We observe that different classes possess divergent convergence points in the loss landscape. Moreover, this divergence is aggravated when the model settles into sharp and non-robust minima, rather than a shared and flat solution that is beneficial for all classes. In light of this, we propose a continual learning inspired framework to prevent "tail performance degradation". To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module is proposed to memorize group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware module to seek flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, facilitating broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods. The code is available at: https://gkp-gsa.github.io/.