OmniFM: Toward Modality-Robust and Task-Agnostic Federated Learning for Heterogeneous Medical Imaging

cs.CV Meilin Liu, Jiaying Wang, Jing Shan · Mar 23, 2026
AI Review
Plain-language introduction

Federated learning for medical imaging typically requires task-specific pipelines and assumes homogeneous modalities across institutions, limiting real-world deployment where hospitals use diverse scanners (MRI, CT, PET) and need to support multiple downstream tasks. OmniFM proposes a frequency-domain insight: low-frequency spectral components exhibit cross-modality consistency and encode modality-invariant anatomical structures, enabling a single reusable optimization pipeline. The framework combines Global Spectral Knowledge Retrieval, Embedding-wise Cross-Attention Fusion, and Prefix-Suffix Spectral Prompting, regularized by Spectral-Proximal Alignment to stabilize aggregation under severe modality heterogeneity.

Critical review
Verdict
Bottom line

OmniFM presents a compelling frequency-domain approach to cross-modality federated learning and reports strong empirical results across five distinct medical imaging tasks. The insight that the low-frequency spectral components $\mathcal{P}_{\mathrm{LP}}(\mathcal{F}(x))$ provide modality-invariant structural priors is well motivated and is operationalized through the global knowledge bank and spectral alignment mechanisms. However, the 'task-agnostic' claim needs qualification: the framework still requires task-specific heads $h_\psi$ and different input pipelines for VQA versus segmentation, although the core optimization mechanism is shared. The ablations confirm that each component contributes, but the marginal gains on some tasks suggest the frequency-domain prior may be complementary rather than transformative.

“low-frequency spectral structures remain strikingly consistent across modalities”
paper · Section 3.1
“We factorize the local model into a backbone $g_\phi$ and a task head $h_\psi$”
paper · Section 3.2
What holds up

The frequency-domain insight is empirically grounded: the paper demonstrates that $\|\mathcal{P}_{\mathrm{LP}}(\mathcal{F}(x_i)) - \mathcal{P}_{\mathrm{LP}}(\mathcal{F}(x_j))\|_2 \ll \|\mathcal{F}(x_i) - \mathcal{F}(x_j)\|_2$ across modalities, meaning that low-pass spectra differ far less between modalities than full spectra do. This provides a principled foundation for cross-modality alignment. The evaluation spans classification, segmentation, super-resolution, VQA, and multimodal fusion, which is unusually broad for an FL paper and strengthens the modality-robustness claims. The ablation studies (Table 8) show that removing GSKR, ECA, or PSP each degrades performance consistently across tasks, which indicates that none of the three components is redundant.

“modality-specific variability is predominantly concentrated in higher frequencies, while low-pass components encode coarse anatomical structures that are shared across modalities”
paper · Section 3.1
“Removing any module leads to consistent performance drops across tasks”
paper · Table 8
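The spectral inequality above can be checked numerically. The following is a minimal sketch, not the paper's procedure: it assumes $\mathcal{F}$ is a 2D discrete Fourier transform and $\mathcal{P}_{\mathrm{LP}}$ is a circular mask of hypothetical radius `radius` around the zero frequency, and it uses synthetic images (a shared smooth blob plus independent noise) as stand-ins for two modalities. The function names are illustrative only.

```python
import numpy as np

def low_pass_spectrum(image, radius):
    """Centered 2D FFT of `image` with every coefficient outside a disc zeroed."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    # Assumed form of the low-pass projection: keep a disc around the DC term.
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return spectrum * mask

def spectral_distances(x_i, x_j, radius):
    """Return (low-pass distance, full-spectrum distance) between two images."""
    full = np.linalg.norm(np.fft.fft2(x_i) - np.fft.fft2(x_j))
    low = np.linalg.norm(
        low_pass_spectrum(x_i, radius) - low_pass_spectrum(x_j, radius)
    )
    return low, full

# Synthetic stand-ins: shared coarse "anatomy" plus modality-specific noise.
rng = np.random.default_rng(0)
yy, xx = np.ogrid[:64, :64]
anatomy = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / (2 * 10.0 ** 2))
modality_a = anatomy + 0.5 * rng.standard_normal((64, 64))
modality_b = anatomy + 0.5 * rng.standard_normal((64, 64))

low, full = spectral_distances(modality_a, modality_b, radius=4)
print(f"low-pass distance / full distance = {low / full:.3f}")
```

Because masking coefficients can never increase the norm of a difference, the low-pass distance is at most the full distance for any pair of images. Only the size of the gap is informative, and a meaningful test would need co-registered scans of the same anatomy in different modalities rather than synthetic data.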
Main concerns

The 'task-agnostic' framing is somewhat misleading. The optimization pipeline $\min_{\phi,\psi} \mathcal{L}_{\text{task}} + \lambda\mathcal{L}_{\text{align}}$ is reused, but the paper acknowledges using task-specific heads $h_\psi$ and different backbones for different tasks (ResNet-18 for classification, U-Net for segmentation, U2Fusion for fusion), which contradicts the implication of a single unified model architecture. The communication overhead of uploading spectral embeddings $\mathbf{s}_k^{(r)}$ to a growing global knowledge bank $\mathcal{K}^{(r)}$ is not analyzed. Unlike standard FL, which shares only gradients or model weights, this design transmits learned embeddings and maintains a server-side bank, which may conflict with strict privacy constraints in some medical settings. The 'extra-hard' scenario is not clearly defined in the main text (the paper says only that it will 'further introduce an extra-hard modality-heterogeneous setup', with details deferred to an appendix that was not provided), and the comparison omits recent domain adaptation baselines that explicitly handle modality shift.

“further introduce an extra-hard modality-heterogeneous setup”
paper · Section 4.1
“Clients periodically upload spectral embeddings to refine the global knowledge bank”
paper · Section 3.3
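To make the overhead concern concrete, here is a minimal sketch of a server-side bank that accepts uploaded embeddings and serves top-k retrieval. Everything in it is an assumption for illustration: the `KnowledgeBank` class, the append-only update, cosine similarity as the retrieval score, and 4 bytes per value are not taken from the paper. The norm clipping follows the ball projection $\{\mathbf{s} \mid \|\mathbf{s}\|_2 \leq \rho\}$ that the paper describes in Section 3.3.

```python
import math

def project_to_ball(s, rho):
    """Project s onto {s : ||s||_2 <= rho}; vectors already inside are unchanged."""
    norm = math.sqrt(sum(v * v for v in s))
    if norm <= rho:
        return list(s)
    return [v * rho / norm for v in s]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class KnowledgeBank:
    """Hypothetical append-only store of client spectral embeddings."""

    def __init__(self, rho):
        self.rho = rho
        self.entries = []

    def upload(self, embedding):
        # Clip the norm before storing, as in the quoted ball projection.
        self.entries.append(project_to_ball(embedding, self.rho))

    def retrieve(self, query, k):
        # Assumed retrieval rule: the k entries most cosine-similar to the query.
        ranked = sorted(self.entries, key=lambda e: cosine(e, query), reverse=True)
        return ranked[:k]

    def payload_bytes(self, bytes_per_value=4):
        # Storage grows linearly with the number of uploaded embeddings.
        return sum(len(e) for e in self.entries) * bytes_per_value

bank = KnowledgeBank(rho=1.0)
bank.upload([3.0, 4.0])   # norm 5, scaled down onto the unit ball
bank.upload([0.0, 0.5])   # already inside the ball, stored unchanged
nearest = bank.retrieve([0.0, 1.0], k=1)
```

Without pruning, both the bank size and the cumulative upload traffic scale with clients × rounds × embedding dimension, which is the quantity the paper would need to report for a fair comparison against weight-only FL.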
Evidence and comparison

The evidence supports the modality-robustness claims well: OmniFM outperforms FedPer, FedDyn, and FedProx across all tested scenarios, with particularly strong gains in cross-modality classification (96.85% vs 84.47% for FedPer in Scenario 1 Hard, a 12.38-point gap) and smaller gains in segmentation (average Dice 77.41 vs 73.86 for FedPer, a 3.55-point gap). However, the main medical imaging results do not include recent foundation-model-based FL approaches such as FedDAT or F3OCUS (they appear only in Table 5 for VQA), so stronger baselines may be missing. The paper also does not establish whether the frequency-domain approach outperforms simple domain adversarial training or gradient matching techniques designed specifically for modality shift, which leaves open whether the spectral insight is strictly necessary or whether generic domain alignment suffices.

“OmniFM (Ours) 96.85 97.71 97.82 vs FedPer 84.47 91.02 92.57”
paper · Table 1
“OmniFM 77.41 vs FedPer 73.86 average Dice”
paper · Table 6
Reproducibility

Reproducibility is moderate. The paper provides implementation details including learning rates, batch sizes, and communication rounds (e.g., 100 rounds with 5 local epochs for classification), and specifies hyperparameters such as $\lambda$ and $k$ (top-k retrieval) with sensitivity analysis in Figure 5. However, the paper does not mention a public release of code or data pre-processing scripts, and critical details of the knowledge bank update are not specified, including the pruning threshold $\delta$, the Hilbert ball radius $\rho$, and the dimensionality $d$ of the spectral embeddings. The random seeds and exact data partitioning indices for the Dirichlet distribution ($\gamma \in \{0.1, 0.5\}$) are not provided, which could affect reproduction of the non-IID splits. Without access to the implementation of the spectral tokenization module, independent reproduction of the FreqMix and pooling operations would be challenging.

“projection onto a Hilbert ball $\mathbb{B}(\rho)=\{\mathbf{s} \mid \|\mathbf{s}\|_2 \leq \rho\}$”
paper · Section 3.3
“train for 100 rounds with 5 local epochs per round”
paper · Section 4.1
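The point about missing seeds and partition indices can be illustrated with a seeded label-skew split. This is a sketch of one common Dirichlet partitioning scheme, not necessarily the one the authors used: the function name, the per-class allocation, and the client count are assumptions. It samples each Dirichlet vector by normalizing independent Gamma($\gamma$, 1) draws, which is a standard construction.

```python
import random

def dirichlet_partition(labels, num_clients, gamma, seed):
    """Assign sample indices to clients using per-class Dirichlet(gamma) shares."""
    rng = random.Random(seed)
    clients = [[] for _ in range(num_clients)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Normalised Gamma(gamma, 1) draws form one Dirichlet(gamma) sample.
        draws = [rng.gammavariate(gamma, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        start, cumulative = 0, 0.0
        for client, draw in enumerate(draws):
            cumulative += draw / total
            # The last client takes the remainder so no index is dropped.
            if client == num_clients - 1:
                end = len(idx)
            else:
                end = int(cumulative * len(idx))
            clients[client].extend(idx[start:end])
            start = end
    return [sorted(c) for c in clients]

# Toy labels: 4 classes with 50 samples each; gamma = 0.1 gives a strong skew.
labels = [c for c in range(4) for _ in range(50)]
parts = dirichlet_partition(labels, num_clients=5, gamma=0.1, seed=0)
sizes = [len(p) for p in parts]
```

Rerunning with the same `seed` reproduces the split exactly, while a different seed can change which clients hold which classes. Publishing either the seed and the partitioning code or the resulting index lists would remove this source of variance.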
Abstract

Federated learning (FL) has become a promising paradigm for collaborative medical image analysis, yet existing frameworks remain tightly coupled to task-specific backbones and are fragile under heterogeneous imaging modalities. Such constraints hinder real-world deployment, where institutions vary widely in modality distributions and must support diverse downstream tasks. To address this limitation, we propose OmniFM, a modality- and task-agnostic FL framework that unifies training across classification, segmentation, super-resolution, visual question answering, and multimodal fusion without re-engineering the optimization pipeline. OmniFM builds on a key frequency-domain insight: low-frequency spectral components exhibit strong cross-modality consistency and encode modality-invariant anatomical structures. Accordingly, OmniFM integrates (i) Global Spectral Knowledge Retrieval to inject global frequency priors, (ii) Embedding-wise Cross-Attention Fusion to align representations, and (iii) Prefix-Suffix Spectral Prompting to jointly condition global and personalized cues, together regularized by a Spectral-Proximal Alignment objective that stabilizes aggregation. Experiments on real-world datasets show that OmniFM consistently surpasses state-of-the-art FL baselines across intra- and cross-modality heterogeneity, achieving superior results under both fine-tuning and training-from-scratch setups.
