Aggregation Alignment for Federated Learning with Mixture-of-Experts under Data Heterogeneity
The paper addresses federated fine-tuning of Mixture-of-Experts (MoE) based large language models under non-IID data distributions, where direct parameter aggregation causes gating preference misalignment and expert semantic blurring. The proposed FedAlign-MoE framework introduces consistency-based gating distribution alignment using routing consistency weighting ($\omega_i(e) = s_i(e)/\sum_j s_j(e)$) and semantic-aware expert aggregation via region-conditioned gated weights ($\gamma_{i,j}(e)$). This matters because MoE architectures are increasingly vital for scaling LLMs efficiently, yet data heterogeneity across federated clients undermines their specialization benefits.
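The routing consistency weighting quoted above is a per-expert normalization across clients. A minimal sketch follows, assuming the scores $s_i(e)$ are non-negative and stored as a clients-by-experts array. The function name and the uniform fallback for an expert whose scores are all zero are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def routing_consistency_weights(s):
    """Return omega with omega[i, e] = s[i, e] / sum_j s[j, e].

    s: array-like of shape (num_clients, num_experts) holding the
       consistency scores s_i(e); assumed non-negative.
    """
    s = np.asarray(s, dtype=float)
    totals = s.sum(axis=0, keepdims=True)          # sum_j s_j(e), shape (1, E)
    safe_totals = np.where(totals > 0, totals, 1.0)  # avoid division by zero
    # Assumption (not from the paper): if every client scores zero for an
    # expert, fall back to uniform weights over clients.
    uniform = np.full_like(s, 1.0 / s.shape[0])
    return np.where(totals > 0, s / safe_totals, uniform)
```

By construction, each expert's weights sum to one over clients, so a client whose routing is more consistent for expert $e$ contributes proportionally more to that expert's aggregated gating behavior.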
The paper presents a technically sound and well-motivated solution to the aggregation challenges in federated MoE fine-tuning. The shift from parameter-space to function-space alignment (routing distributions and semantic roles) is conceptually rigorous, and the empirical validation demonstrates consistent improvements over baselines. However, the semantic aggregation step requires $O(N^2)$ pairwise comparisons across $N$ clients for each expert, the paper offers no theoretical convergence guarantees, and it leaves unaddressed the privacy concerns raised by transmitting hidden-state representations.
The motivation section provides compelling empirical evidence for the problem: Figures 2-4 demonstrate that direct parameter aggregation severely disrupts gating decisions and causes expert semantic divergence. The ablation studies in Section V-C rigorously validate each component, showing that removing routing consistency weighting or adaptive expert-level weights degrades performance. The adaptive threshold mechanism ($\tau_e^t = M(e) - \beta \cdot \Sigma(e)$) is a thoughtful design that dynamically calibrates aggregation sensitivity based on semantic dispersion.
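The adaptive threshold can be illustrated with a short sketch. The review does not define $M(e)$ or $\Sigma(e)$, so the code below assumes they are the mean and standard deviation of the off-diagonal pairwise client similarities for expert $e$. That reading, and the function name, are assumptions rather than the paper's stated definitions.

```python
import numpy as np

def adaptive_threshold(sim, beta):
    """Compute tau = M - beta * Sigma for one expert.

    sim:  (N, N) symmetric matrix of pairwise client similarities.
    beta: non-negative scalar controlling sensitivity to dispersion.

    Assumption: M is the mean and Sigma the (population) standard
    deviation of the off-diagonal entries of sim.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]  # drop self-similarities
    return off_diag.mean() - beta * off_diag.std()
```

Under this reading, greater semantic dispersion among clients lowers the threshold, so aggregation becomes more permissive when client experts disagree widely and stricter when they are tightly clustered.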
The paper has four main weaknesses. First, the semantic-aware expert aggregation computes pairwise similarities between all $N$ clients for each expert ($O(N^2)$ comparisons), which scales poorly, and transmitting hidden-state representations ($\mu_i(e)$) alongside weight updates incurs significant communication overhead. Second, the paper under-discusses privacy implications: sharing input-space assignments (hidden-state averages) and detailed routing distributions may leak sensitive information about client data, undermining the privacy-preserving premise of federated learning. Third, there is no theoretical analysis of convergence for the proposed aggregation scheme. Finally, although the paper claims applicability to LLMs, the evaluation uses relatively modest models (Switch-base-16, DeepSeek-MoE-16B) and does not validate the approach at the scale of current frontier models.
The experimental evidence supports the main claims: FedAlign-MoE achieves 4-7% accuracy improvements over FedMoE and FedAvg across the AGNews, PIQA, HellaSwag, and MMLU benchmarks (Table I), with faster convergence (Figure 9). The comparison to related work is fair, covering FedAvg, FedProx, PFL-MoE, and FedMoE. However, the paper could better position itself against recent federated MoE approaches (e.g., FedMoE-DA and pFedMoE, both cited in Section VI) by including direct experimental comparisons rather than only conceptual distinctions.
Reproducibility is moderately supported but incomplete. Algorithm 1 provides detailed steps, and hyperparameters are specified ($\lambda=0.1$, $\eta=0.1$, learning rate $1\times 10^{-4}$). However, the paper does not mention a code release. The protocol requires clients to upload routing distributions, semantic statistics, and hidden-state representations (input-space assignments), which significantly increases communication overhead compared to standard FedAvg and may hinder reproduction in resource-constrained settings. The $O(N^2)$ pairwise similarity computation for semantic aggregation (Equation 14), where $N$ is the number of clients, may become prohibitive as $N$ grows beyond the tested range (10-50 clients).
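To make the scaling concern concrete, the sketch below counts the client pairs compared per expert and builds a pairwise similarity matrix from per-client representation vectors. Cosine similarity and both function names are assumptions for illustration. The review does not state which similarity measure Equation 14 uses.

```python
import numpy as np

def num_client_pairs(n):
    """Number of unordered client pairs compared per expert: N(N-1)/2."""
    return n * (n - 1) // 2

def pairwise_cosine(mu):
    """Pairwise cosine similarity between per-client vectors.

    mu: array-like of shape (N, d), one representation mu_i(e) per
        client for a fixed expert e.
    Assumption: cosine similarity stands in for the paper's measure.
    """
    mu = np.asarray(mu, dtype=float)
    norms = np.linalg.norm(mu, axis=1, keepdims=True)
    unit = mu / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T                      # (N, N) similarity matrix
```

At the ends of the tested range, `num_client_pairs(10)` is 45 and `num_client_pairs(50)` is 1225 comparisons per expert, and this count is multiplied by the number of experts in each aggregation round.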
Large language models (LLMs) increasingly adopt Mixture-of-Experts (MoE) architectures to scale model capacity while reducing computation. Fine-tuning these MoE-based LLMs often requires access to distributed and privacy-sensitive data, making centralized fine-tuning impractical. Federated learning (FL) therefore provides a paradigm to collaboratively fine-tune MoE-based LLMs, enabling each client to integrate diverse knowledge without compromising data privacy. However, the integration of MoE-based LLM fine-tuning into FL encounters two critical aggregation challenges due to inherent data heterogeneity across clients: (i) divergent local data distributions drive clients to develop distinct gating preferences for localized expert selection, causing direct parameter aggregation to produce a "one-size-fits-none" global gating network, and (ii) same-indexed experts develop disparate semantic roles across clients, leading to expert semantic blurring and the degradation of expert specialization. To address these challenges, we propose FedAlign-MoE, a federated aggregation alignment framework that jointly enforces routing consistency and expert semantic alignment. Specifically, FedAlign-MoE aggregates gating behaviors by aligning routing distributions through consistency weighting and optimizes local gating networks through distribution regularization, maintaining cross-client stability without overriding discriminative local preferences. Meanwhile, FedAlign-MoE explicitly quantifies semantic consistency among same-indexed experts across clients and selectively aggregates updates from semantically aligned clients, ensuring stable and specialized functional roles for global experts. Extensive experiments demonstrate that FedAlign-MoE outperforms state-of-the-art benchmarks, achieving faster convergence and superior accuracy in non-IID federated environments.