SegMaFormer: A Hybrid State-Space and Transformer Model for Efficient Segmentation
SegMaFormer proposes a hybrid encoder for 3D medical image segmentation that places Mamba state-space layers in early high-resolution stages (for linear-complexity sequence mixing) and self-attention only in deeper low-resolution stages (where quadratic cost is manageable). The goal is to reduce the prohibitive compute of full 3D attention while preserving global context. With just 2M parameters and 15 GFLOPs, the authors claim competitive results on BraTS, Synapse, and ACDC benchmarks against models up to 75× larger.
SegMaFormer delivers genuine efficiency gains but trades away meaningful accuracy. The core idea—using Mamba's $\mathcal{O}(N/r \cdot d^2)$ complexity early and attention only when $N \ll d$—is sound and produces a very compact model (2.02M params vs. nnFormer's 150.5M). However, the claim of 'competitive' performance is overstated: on Synapse it trails nnFormer by 3.2 Dice points (83.33 vs. 86.57) and U-Mamba by 4.7 points; on BraTS it is 2.6 points behind nnFormer; on ACDC it sits 1.4 points below the actual SOTA (Primus-S). The paper also omits comparison to the true baseline—nnU-Net—which large-scale benchmarks (Isensee et al.) show still dominates these datasets.
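To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It is not the paper's FLOP accounting. It assumes a hypothetical 128×128×128 input, a stride-4 patch embedding followed by three stride-2 downsamplings, and illustrative channel widths; none of these sizes come from the paper. It compares the textbook attention term (about 2·N²·d) with the N·d² cost of the dense per-token projections that any token mixer pays.

```python
# Back-of-the-envelope cost model. All sizes are illustrative assumptions,
# not values reported by the SegMaFormer authors.

def stage_token_counts(volume=(128, 128, 128), strides=(4, 2, 2, 2)):
    """Return the number of tokens N at each encoder stage."""
    counts = []
    d, h, w = volume
    for s in strides:
        d, h, w = d // s, h // s, w // s
        counts.append(d * h * w)
    return counts


def attention_score_cost(n, dim):
    """Textbook cost of QK^T plus the attention-weighted sum over V: ~2 * N^2 * d."""
    return 2 * n * n * dim


def projection_cost(n, dim):
    """Cost of a dense d-by-d projection applied to every token: ~N * d^2."""
    return n * dim * dim


def report(dims=(32, 64, 160, 256)):
    """Return (stage, N, d, attention/projection cost ratio) per stage."""
    rows = []
    for stage, (n, dim) in enumerate(zip(stage_token_counts(), dims), start=1):
        ratio = attention_score_cost(n, dim) / projection_cost(n, dim)
        rows.append((stage, n, dim, ratio))
    return rows


if __name__ == "__main__":
    for stage, n, dim, ratio in report():
        print(f"stage {stage}: N={n:>6} d={dim:>3} attention/projection = {ratio:8.1f}")
```

Under these assumed sizes the quadratic term exceeds the projection cost by roughly three orders of magnitude at the first stage and only falls below it at the last stage, where N is smaller than d. That is the regime in which the review considers self-attention affordable.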
The staged architecture reasoning holds: placing Mamba blocks where token counts are large and self-attention where $N \ll d$ correctly addresses the quadratic bottleneck of 3D volumetric attention. The efficiency claims are valid—2M parameters and 15 GFLOPs is genuinely lightweight compared to the 150M+ parameter models it cites. The addition of 3D-RoPE to patch embeddings is a sensible enhancement for spatial awareness, and the all-MLP decoder design is consistent with prior efficient segmentation work. The paper also correctly cites the growing concern that hybrid CNN-Transformer models derive most performance from convolutions, not attention.
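For readers unfamiliar with rotary embeddings in 3D, the following is a minimal sketch of one common "axial" variant: the channel vector is split into three equal groups, and each group is rotated according to the token's coordinate along one spatial axis. This is an assumption about the general technique, not the authors' formulation; the function names, the base of 10000, and the equal three-way split are all hypothetical choices.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by pos * base**(-2k / len(vec)),
    where k is the pair index (standard 1D rotary embedding)."""
    n = len(vec)
    out = []
    for i in range(0, n, 2):  # i = 2k is the index of the pair's first element
        theta = pos * base ** (-i / n)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_3d(vec, coord, base=10000.0):
    """Axial 3D RoPE sketch: one third of the channels per spatial axis.

    `coord` is the (z, y, x) grid position of the token. The equal split is an
    illustrative assumption, not a detail taken from the paper.
    """
    if len(vec) % 6 != 0:
        raise ValueError("channel count must be divisible by 6")
    third = len(vec) // 3
    out = []
    for axis, pos in enumerate(coord):
        chunk = vec[axis * third:(axis + 1) * third]
        out.extend(rope_1d(chunk, pos, base))
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))
```

The property that makes this attractive for attention is that the dot product between a rotated query and a rotated key depends only on their relative offset along each axis, not on their absolute positions, and the rotation leaves vector norms unchanged.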
The central flaw is selective benchmarking against weaker baselines while omitting the actual SOTA. The paper does not report nnU-Net numbers, despite Isensee et al.'s comprehensive study showing properly configured nnU-Net outperforms nearly all Transformer and Mamba variants. On Synapse, SegMaFormer particularly struggles with small organs (gallbladder 57.29 vs. U-Mamba's 73.80, pancreas 70.57 vs. 79.3), confirming the authors' own admission that Mamba-based architectures suffer on fine-grained structures. The '75× fewer parameters' comparison is technically true versus nnFormer but misleading as it compares against an attention-heavy model known to be suboptimal. The BraTS dataset is explicitly criticized in nnU-Net Revisited as having 'low systematic variance' and being unsuitable for methodological benchmarking, yet it constitutes one-third of the evaluation.
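The headline numbers in this critique can be rechecked with a few lines of arithmetic. The sketch below uses only figures quoted in this review (parameter counts, Synapse Dice scores, and the two small-organ scores); the dictionary and function names are illustrative.

```python
# Figures quoted in this review; nothing here is newly measured.
PARAMS_M = {"SegMaFormer": 2.02, "nnFormer": 150.5}      # millions of parameters
SYNAPSE_DICE = {"SegMaFormer": 83.33, "nnFormer": 86.57}  # mean Dice
SMALL_ORGANS = {  # organ: (SegMaFormer Dice, U-Mamba Dice)
    "gallbladder": (57.29, 73.80),
    "pancreas": (70.57, 79.3),
}


def param_ratio(small, large):
    """How many times more parameters `large` has than `small`."""
    return PARAMS_M[large] / PARAMS_M[small]


def dice_gap(scores, ours, theirs):
    """Dice points by which `theirs` leads `ours`, rounded to two decimals."""
    return round(scores[theirs] - scores[ours], 2)


def organ_gaps():
    """Per-organ Dice deficit of SegMaFormer relative to U-Mamba."""
    return {organ: round(b - a, 2) for organ, (a, b) in SMALL_ORGANS.items()}


if __name__ == "__main__":
    print(f"parameter ratio: {param_ratio('SegMaFormer', 'nnFormer'):.1f}x")
    print(f"Synapse gap to nnFormer: {dice_gap(SYNAPSE_DICE, 'SegMaFormer', 'nnFormer')}")
    print(f"small-organ gaps to U-Mamba: {organ_gaps()}")
```

The ratio comes out near 74.5, so "75×" is a fair rounding. The small-organ deficits (about 16.5 and 8.7 Dice points) are both well above the 4.7-point average gap to U-Mamba, which is consistent with the fine-structure weakness described above.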
The comparisons to SegFormer3D are fair (same pipeline, no pretraining), and SegMaFormer does consistently outperform this baseline by 1–2 Dice points. However, comparisons to UNETR, TransUNet, and TransBTS are largely irrelevant: these are known to be suboptimal baselines that Primus and nnU-Net Revisited have already shown to underperform modern CNNs. The paper fails to acknowledge that its 'competitive' results still trail the actual state-of-the-art by significant margins (roughly 1.4 to 4.7 Dice points, per the gaps listed above). No statistical significance testing is provided, and the use of dual RTX 4060 Ti GPUs (an unusual hardware choice) raises questions about training stability versus the standard A100/V100 setups used in comparable works.
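As an illustration of the kind of significance check that is missing, here is a minimal sketch of an exact paired sign-flip permutation test on per-case Dice scores. It is one reasonable choice among several (a Wilcoxon signed-rank test is another) and is not something the paper reports. The function name is hypothetical, and the scores in the usage block are made-up placeholders, not results from any model.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-case scores.

    Returns (mean difference a - b, p-value). All 2**n sign patterns are
    enumerated, so keep n small (roughly n <= 20) or sample patterns instead.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired case by case")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    hits = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - 1e-12:  # small tolerance for float round-off
            hits += 1
    return sum(diffs) / n, hits / (2 ** n)


if __name__ == "__main__":
    # Placeholder per-case Dice values for two hypothetical models.
    model_a = [0.85, 0.80, 0.90, 0.75, 0.88]
    model_b = [0.80, 0.78, 0.85, 0.70, 0.86]
    mean_diff, p_value = paired_sign_flip_test(model_a, model_b)
    print(f"mean Dice difference = {mean_diff:.3f}, p = {p_value:.4f}")
```

With only five paired cases the smallest attainable two-sided p-value is 2/32 = 0.0625, even when one model wins on every case. This is why per-case scores on the full test split, rather than a single mean Dice per dataset, are needed to support a claim of a 1–2 point improvement.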
The paper adopts the nnU-Net framework for training but does not follow nnU-Net's automatic configuration paradigm, instead using fixed hyperparameters (AdamW, lr=3e-4, cosine annealing). This makes fair reproduction difficult because nnU-Net's strength lies in its dataset-specific optimization. No code or trained weights are released, and critical implementation details are vague, including the exact Mamba block configuration (expansion ratio, state dimension), the 3D-RoPE frequency settings, and whether depth-wise convolutions are used. The efficiency numbers (15.2 GFLOPs) are reported but not validated against actual inference timing on standard hardware. The paper mentions using 'optional Deep Supervision' without specifying when it is enabled, a significant confounder for reproducibility.
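A wall-clock measurement of the kind requested here could be as simple as the sketch below. It is a generic timing harness, not the authors' protocol: `time_inference` and its parameters are hypothetical names, and `run_once` stands for any zero-argument callable that performs one forward pass on a fixed input.

```python
import statistics
import time


def time_inference(run_once, warmup=3, repeats=10):
    """Return (median, min, max) wall-clock seconds over `repeats` calls.

    `run_once` is a zero-argument callable doing one forward pass. Warm-up
    calls are executed first and discarded so that one-time setup costs
    (lazy initialisation, caching) do not distort the measurement.
    """
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), min(samples), max(samples)


if __name__ == "__main__":
    # Stand-in workload; replace with a real model call.
    # For GPU models, `run_once` must block until the device has finished
    # (for example by synchronising the device) or the timings are meaningless.
    median_s, fastest_s, slowest_s = time_inference(lambda: sum(range(100_000)))
    print(f"median {median_s * 1e3:.2f} ms (min {fastest_s * 1e3:.2f}, max {slowest_s * 1e3:.2f})")
```

Reporting the median together with the spread, on named hardware and with a stated input size, would let readers judge whether the 15.2 GFLOPs figure translates into a real latency advantage, since FLOP counts and measured speed do not always move together.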
The advent of Transformer and Mamba-based architectures has significantly advanced 3D medical image segmentation by enabling global contextual modeling, a capability traditionally limited in Convolutional Neural Networks (CNNs). However, state-of-the-art Transformer models often entail substantial computational complexity and parameter counts, which is particularly prohibitive for volumetric data and further exacerbated by the limited availability of annotated medical imaging datasets. To address these limitations, this work introduces SegMaFormer, a lightweight hybrid architecture that synergizes Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model strategically employs Mamba-based layers in early, high-resolution stages to reduce computational overhead while capturing essential spatial context, and reserves self-attention mechanisms for later, lower-resolution stages to refine feature representation. This design is augmented with generalized rotary position embeddings to enhance spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), matching the Dice coefficient of significantly larger models. Empirically, our approach reduces parameters by up to 75x and substantially decreases FLOPs compared to current state-of-the-art models, establishing an efficient and high-performing solution for 3D medical image segmentation.