Aggregation Alignment for Federated Learning with Mixture-of-Experts under Data Heterogeneity

cs.LG cs.AI Zihan Fang, Qianru Wang, Haonan An, Zheng Lin, Yiqin Deng, Xianhao Chen, Yuguang Fang · Mar 22, 2026
AI Review
Plain-language introduction

The paper addresses federated fine-tuning of Mixture-of-Experts (MoE) based large language models under non-IID data distributions, where direct parameter aggregation causes gating preference misalignment and expert semantic blurring. The proposed FedAlign-MoE framework introduces consistency-based gating distribution alignment using routing consistency weighting ($\omega_i(e) = s_i(e)/\sum_j s_j(e)$) and semantic-aware expert aggregation via region-conditioned gated weights ($\gamma_{i,j}(e)$). This matters because MoE architectures are increasingly vital for scaling LLMs efficiently, yet data heterogeneity across federated clients undermines their specialization benefits.
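The routing consistency weighting quoted above is a plain normalization of per-client scores. A minimal Python sketch for a single expert $e$ follows; the function names, the uniform fallback, and the weighted average of routing mass are illustrative assumptions, not the paper's implementation.

```python
from typing import List


def consistency_weights(scores: List[float]) -> List[float]:
    """Turn per-client consistency scores s_i(e) into weights w_i(e) that sum to 1."""
    if not scores:
        return []
    total = sum(scores)
    if total <= 0:
        # Assumption (not from the paper): fall back to uniform weights
        # when no client reports a positive score.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def weighted_routing_mass(scores: List[float], routing_mass: List[float]) -> float:
    """Weighted average of per-client routing mass for one expert.

    Hypothetical use of the weights; the paper's exact aggregation rule
    is not reproduced here.
    """
    weights = consistency_weights(scores)
    return sum(w * p for w, p in zip(weights, routing_mass))


if __name__ == "__main__":
    s = [1.0, 3.0]  # made-up consistency scores for two clients
    print(consistency_weights(s))
    print(weighted_routing_mass(s, [0.2, 0.6]))
```

Clients with higher consistency scores pull the aggregate toward their own routing behavior, which is the intuition the weighting formula expresses.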

Critical review
Verdict
Bottom line

The paper presents a technically sound and well-motivated solution to the aggregation challenges in federated MoE fine-tuning. The shift from parameter-space to function-space alignment (routing distributions and semantic roles) is conceptually rigorous, and the empirical validation demonstrates consistent improvements over baselines. However, semantic aggregation costs $O(N^2)$ pairwise comparisons in the number of clients $N$, the paper offers no theoretical convergence guarantees, and it leaves unaddressed the privacy concerns raised by transmitting hidden-state representations.

“We characterize each expert's functional behavior using two semantic indicators derived from its input-space assignments and local weight updates...”
paper · Section III-D1
“This formulation tightens the threshold when expert semantics are well aligned across clients and relaxes it when expert roles remain diverse...”
paper · Section III-D2
What holds up

The motivation section provides compelling empirical evidence for the problem: Figures 2-4 demonstrate that direct parameter aggregation severely disrupts gating decisions and causes expert semantic divergence. The ablation studies in Section V-C rigorously validate each component, showing that removing routing consistency weighting or adaptive expert-level weights degrades performance. The adaptive threshold mechanism ($\tau_e^t = M(e) - \beta \cdot \Sigma(e)$) is a thoughtful design that dynamically calibrates aggregation sensitivity based on semantic dispersion.

“Fig. 3a reveals that directly averaging gating parameters across clients severely disrupts learned routing decisions, yielding a global gating network whose expert selection is misaligned with all clients...”
paper · Section II-A
“When direction consensus is removed... experts serving similar regions may still learn incompatible feature-label mappings, which introduces inconsistent optimization directions...”
paper · Section V-C2
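The adaptive threshold $\tau_e^t = M(e) - \beta \cdot \Sigma(e)$ can be illustrated with a short Python sketch. It assumes that $M(e)$ is the mean and $\Sigma(e)$ the (population) standard deviation of the similarity scores for expert $e$; the review does not spell out these definitions, so treat both, and the function names, as assumptions.

```python
from statistics import mean, pstdev
from typing import List


def adaptive_threshold(similarities: List[float], beta: float) -> float:
    """tau = M - beta * Sigma, with M the mean and Sigma the population
    standard deviation of the similarity scores (assumed definitions)."""
    return mean(similarities) - beta * pstdev(similarities)


def select_aligned(similarities: List[float], beta: float) -> List[int]:
    """Indices of clients whose similarity meets the adaptive threshold."""
    tau = adaptive_threshold(similarities, beta)
    return [j for j, s in enumerate(similarities) if s >= tau]


if __name__ == "__main__":
    tight = [0.90, 0.92, 0.88]   # well-aligned expert: small dispersion
    loose = [0.20, 0.90, 0.55]   # diverse expert roles: large dispersion
    print(adaptive_threshold(tight, beta=1.0))
    print(adaptive_threshold(loose, beta=1.0))
```

Under these assumptions, low dispersion keeps the threshold close to the mean similarity, while high dispersion lowers it and admits more clients, matching the tighten-or-relax behavior the paper describes.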
Main concerns

First, the semantic-aware expert aggregation computes pairwise similarities between all clients for each expert ($O(N^2)$ in the number of clients $N$), which scales poorly, and it adds communication overhead for transmitting hidden-state representations ($\mu_i(e)$) and weight updates. Second, the paper gives little attention to privacy implications: sharing input-space assignments (hidden-state averages) and detailed routing distributions may leak sensitive information about client data, undermining the privacy-preserving premise of federated learning. Third, the paper offers no theoretical convergence analysis for the proposed aggregation scheme. Finally, although the paper claims applicability to LLMs, the evaluation uses relatively modest models (Switch-base-16, DeepSeek-MoE-16B) and does not validate the approach at the scale of current frontier models.

“Let $\mathcal{H}_i(e) = \{h \mid \text{argmax } G(h) = e\}$ denote the set of hidden states routed to expert $e$...”
paper · Section III-D1
“Upload $\{\overline{p}_i(e), \Delta\theta_i^e, \mu_i(e), o_i(e), m_i(e)\}$ to server...”
paper · Algorithm 1
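To make the quoted definition of $\mathcal{H}_i(e)$ concrete, the sketch below groups hidden states by their top-1 expert and averages each group. It assumes $\mu_i(e)$ is the plain mean of the hidden states routed to expert $e$ (the review calls these "hidden state averages"); the function name and the list-of-lists data layout are hypothetical.

```python
from typing import List, Optional, Tuple


def routed_centroids(
    hidden_states: List[List[float]],
    gate_logits: List[List[float]],
    num_experts: int,
) -> Tuple[List[Optional[List[float]]], List[int]]:
    """Average the hidden states assigned to each expert by top-1 routing.

    Returns (centroids, counts); a centroid is None if no state was routed
    to that expert.
    """
    dim = len(hidden_states[0])
    sums = [[0.0] * dim for _ in range(num_experts)]
    counts = [0] * num_experts
    for h, logits in zip(hidden_states, gate_logits):
        # argmax over gate outputs picks the expert, as in H_i(e).
        e = max(range(num_experts), key=lambda k: logits[k])
        counts[e] += 1
        for d in range(dim):
            sums[e][d] += h[d]
    centroids = [
        [v / c for v in row] if c > 0 else None
        for row, c in zip(sums, counts)
    ]
    return centroids, counts


if __name__ == "__main__":
    h = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]   # toy hidden states
    g = [[2.0, 1.0], [5.0, 0.0], [0.0, 1.0]]   # toy gate logits, 2 experts
    print(routed_centroids(h, g, num_experts=2))
```

Even this compact summary is a function of client inputs, which is why the privacy concern raised above applies to the uploaded $\mu_i(e)$.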
Evidence and comparison

The experimental evidence supports the main claims: FedAlign-MoE improves accuracy by about 4% over FedMoE and 7% over FedAvg across the AGNews, PIQA, HellaSwag, and MMLU benchmarks (Table I), and it converges faster (Figure 9). The comparison to related work is fair, covering FedAvg, FedProx, PFL-MoE, and FedMoE. However, the paper could better position itself against very recent federated MoE approaches (e.g., FedMoE-DA and pFedMoE, both cited in Section VI) by including direct experimental comparisons rather than only conceptual distinctions.

“FedAlign-MoE consistently outperforms all baselines in all four datasets, achieving an accuracy improvement of 4% over FedMoE and 7% over FedAvg...”
paper · Table I
“FedAlign-MoE consistently achieves fastest convergence, yielding over $1.3\times$ and $1.7\times$ speedups on Switch-base-16 and DeepSeek-MoE-16B models...”
paper · Section V-A2
Reproducibility

Reproducibility is moderately supported but incomplete. Algorithm 1 provides detailed steps, and hyperparameters are specified ($\lambda=0.1$, $\eta=0.1$, learning rate $1\times 10^{-4}$). However, the paper does not mention a code release. The protocol requires clients to upload routing distributions, semantic statistics, and hidden-state representations (input-space assignments). This adds communication overhead relative to standard FedAvg and may hinder reproduction in resource-constrained settings. The $O(N^2)$ pairwise similarity computation for semantic aggregation (Equation 14) may become prohibitive as the number of clients grows beyond the tested range (10-50 clients).

“The key hyper-parameters in FedAlign-MoE are set to $\lambda=0.1$ and $\eta=0.1$...”
paper · Section IV-4
“Calculate region-conditioned weight $\gamma_{i,j}(e)$ using Eqn. (14)...”
paper · Algorithm 1
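The quadratic cost noted above comes from comparing every pair of clients for each expert: $N$ clients give $N(N-1)/2$ unordered pairs, so 50 clients already require 1225 comparisons per expert. The Python sketch below uses cosine similarity between client centroids as a stand-in; it is not the paper's Equation 14, and the function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def pairwise_similarities(centroids: List[List[float]]) -> Dict[Tuple[int, int], float]:
    """Similarity for every unordered client pair (i, j) with i < j.

    Stand-in for the region-conditioned weights; the pair count, not the
    similarity measure, is the point of this sketch.
    """
    return {
        (i, j): cosine(centroids[i], centroids[j])
        for i, j in combinations(range(len(centroids)), 2)
    }


if __name__ == "__main__":
    for n in (10, 50):
        fake = [[float(k + 1), 1.0] for k in range(n)]  # made-up centroids
        print(n, len(pairwise_similarities(fake)))
```

The server repeats this work for every expert in every round, so the total scales with the number of experts as well as with $N^2$.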
Abstract

Large language models (LLMs) increasingly adopt Mixture-of-Experts (MoE) architectures to scale model capacity while reducing computation. Fine-tuning these MoE-based LLMs often requires access to distributed and privacy-sensitive data, making centralized fine-tuning impractical. Federated learning (FL) therefore provides a paradigm to collaboratively fine-tune MoE-based LLMs, enabling each client to integrate diverse knowledge without compromising data privacy. However, the integration of MoE-based LLM fine-tuning into FL encounters two critical aggregation challenges due to inherent data heterogeneity across clients: (i) divergent local data distributions drive clients to develop distinct gating preferences for localized expert selection, causing direct parameter aggregation to produce a "one-size-fits-none" global gating network, and (ii) same-indexed experts develop disparate semantic roles across clients, leading to expert semantic blurring and the degradation of expert specialization. To address these challenges, we propose FedAlign-MoE, a federated aggregation alignment framework that jointly enforces routing consistency and expert semantic alignment. Specifically, FedAlign-MoE aggregates gating behaviors by aligning routing distributions through consistency weighting and optimizes local gating networks through distribution regularization, maintaining cross-client stability without overriding discriminative local preferences. Meanwhile, FedAlign-MoE explicitly quantifies semantic consistency among same-indexed experts across clients and selectively aggregates updates from semantically aligned clients, ensuring stable and specialized functional roles for global experts. Extensive experiments demonstrate that FedAlign-MoE outperforms state-of-the-art benchmarks, achieving faster convergence and superior accuracy in non-IID federated environments.
