SegMaFormer: A Hybrid State-Space and Transformer Model for Efficient Segmentation

cs.CV cs.AI Duy D. Nguyen, Phat T. Tran-Truong · Mar 23, 2026


AI Review

Plain-language introduction

SegMaFormer proposes a hybrid encoder for 3D medical image segmentation that places Mamba state-space layers in early high-resolution stages (for linear-complexity sequence mixing) and self-attention only in deeper low-resolution stages (where quadratic cost is manageable). The goal is to reduce the prohibitive compute of full 3D attention while preserving global context. With just 2M parameters and 15 GFLOPs, the authors claim competitive results on BraTS, Synapse, and ACDC benchmarks against models up to $75\times$ larger.

Critical review

Verdict

Bottom line

SegMaFormer delivers genuine efficiency gains but trades away meaningful accuracy. The core idea—using Mamba's $\mathcal{O}(N/r \cdot d^2)$ complexity early and attention only when $N \ll d$—is sound and produces a very compact model (2.02M params vs. nnFormer's 150.5M). However, the claim of 'competitive' performance is overstated: on Synapse it trails nnFormer by 3.2 Dice points (83.33 vs. 86.57) and U-Mamba by 4.7 points; on BraTS it is 2.6 points behind nnFormer; on ACDC it sits 1.4 points below the actual SOTA (Primus-S). The paper also omits comparison to the true baseline—nnU-Net—which large-scale benchmarks (Isensee et al.) show still dominates these datasets.

“Ours: 83.33 Avg(%)... nnFormer: 86.57 Avg(%)”

SegMaFormer paper · Table 3

“Ours: 83.79 Avg(%)... nnFormer: 86.4 Avg(%)”

SegMaFormer paper · Table 2

“we find that the recipe for state-of-the-art performance is 1) employing CNN-based U-Net models... 2) using the nnU-Net framework...”

Isensee et al., nnU-Net Revisited · Abstract
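The gaps and the parameter ratio quoted in this section can be rechecked with a few lines of arithmetic. The sketch below uses only the figures cited above (Tables 2 and 3 and the stated parameter counts); the variable names are illustrative and not from the paper.

```python
# Figures quoted in this review (average Dice in %, parameters in millions).
dice = {
    "Synapse": {"SegMaFormer": 83.33, "nnFormer": 86.57},
    "BraTS": {"SegMaFormer": 83.79, "nnFormer": 86.4},
}
params_millions = {"SegMaFormer": 2.02, "nnFormer": 150.5}

# Dice gap to nnFormer on each benchmark, in percentage points.
gaps = {
    name: round(scores["nnFormer"] - scores["SegMaFormer"], 2)
    for name, scores in dice.items()
}

# How many times larger nnFormer is, measured by parameter count.
param_ratio = params_millions["nnFormer"] / params_millions["SegMaFormer"]

print(gaps)
print(f"nnFormer / SegMaFormer parameters: {param_ratio:.1f}x")
```

The ratio comes out slightly under 75, so the headline "75×" is a rounded figure measured against nnFormer specifically.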

What holds up

The staged architecture reasoning holds: placing Mamba blocks where token counts are large and self-attention where $N \ll d$ correctly addresses the quadratic bottleneck of 3D volumetric attention. The efficiency claims are valid: 2M parameters and 15 GFLOPs are genuinely lightweight compared to the 150M+ parameter models the paper cites. The addition of 3D-RoPE to patch embeddings is a sensible enhancement for spatial awareness, and the all-MLP decoder design is consistent with prior efficient segmentation work. The paper also correctly cites the growing concern that hybrid CNN-Transformer models derive most performance from convolutions, not attention.

“Mamba-based state-space layers model perform sequence mixing with a computational cost of $\mathcal{O}(N/r \cdot d^2)$... self-attention becomes particularly effective, as the reduced sequence length $N$ significantly lowers its quadratic computational burden”

SegMaFormer paper · Section 2

“parameters outside of Transformer blocks in Transformer-CNN hybrids are the primary driver of performance”

Wald et al., Primus · Section 2.1
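To make the staging argument concrete, here is a rough sketch comparing how the quadratic attention term ($N^2 d$) scales against a linear-in-$N$ mixing cost ($N d^2$) at each encoder stage. Constants and the paper's $r$ factor are dropped, so the ratio reduces to $N/d$. The input size, strides, and channel widths are hypothetical placeholders, not the paper's configuration.

```python
def tokens_per_stage(volume_side, strides):
    """Token count after each successive isotropic downsampling stride."""
    counts = []
    side = volume_side
    for stride in strides:
        side //= stride
        counts.append(side ** 3)
    return counts


def attention_to_linear_ratio(num_tokens, channels):
    """(N^2 * d) / (N * d^2) simplifies to N / d once constants are dropped."""
    return num_tokens / channels


# Hypothetical configuration, NOT taken from the paper.
VOLUME_SIDE = 128
STRIDES = [4, 2, 2, 2]
CHANNELS = [32, 64, 160, 256]

token_counts = tokens_per_stage(VOLUME_SIDE, STRIDES)
for stage, (n, d) in enumerate(zip(token_counts, CHANNELS), start=1):
    ratio = attention_to_linear_ratio(n, d)
    print(f"stage {stage}: N={n:>6} d={d:>3} N/d={ratio:.2f}")
```

Under these assumed settings the ratio falls from about a thousand at the first stage to below one at the last, which is the regime where reserving attention for deep stages pays off.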

Main concerns

The central flaw is selective benchmarking against weaker baselines while omitting the actual SOTA. The paper does not report nnU-Net numbers, despite Isensee et al.'s comprehensive study showing properly configured nnU-Net outperforms nearly all Transformer and Mamba variants. On Synapse, SegMaFormer particularly struggles with small organs (gallbladder 57.29 vs. U-Mamba's 73.80, pancreas 70.57 vs. 79.3), confirming the authors' own admission that Mamba-based architectures suffer on fine-grained structures. The '75× fewer parameters' comparison is technically true versus nnFormer but misleading as it compares against an attention-heavy model known to be suboptimal. The BraTS dataset is explicitly criticized in nnU-Net Revisited as having 'low systematic variance' and being unsuitable for methodological benchmarking, yet it constitutes one-third of the evaluation.

“GAL: 57.29... PAN: 70.57”

SegMaFormer paper · Table 3

“neither of the two datasets BTCV and BraTS... provide a reliable foundation for assessing general methodological advancements. This is due to a high statistical variance (BTCV) and a low systematic variance (BraTS)”

Isensee et al., nnU-Net Revisited · Section 2.2

“Mamba-based architectures have witnessed a notable performance drop in small organs”

SegMaFormer paper · Section 3.2
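One reason small organs expose weaknesses so sharply is that Dice is sensitive to structure size: the same absolute number of missed voxels costs far more on a small target. The sketch below illustrates this with the standard Dice formula; the voxel counts are made-up illustrative values, not measurements from any dataset.

```python
def dice_from_counts(true_pos, false_pos, false_neg):
    """Dice = 2*TP / (2*TP + FP + FN); defined as 1.0 when both masks are empty."""
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        return 1.0
    return 2 * true_pos / denominator


# Illustrative values only: both predictions miss the same 50 voxels.
MISSED_VOXELS = 50
small_organ_voxels = 200      # hypothetical small structure
large_organ_voxels = 20_000   # hypothetical large structure

small_dice = dice_from_counts(small_organ_voxels - MISSED_VOXELS, 0, MISSED_VOXELS)
large_dice = dice_from_counts(large_organ_voxels - MISSED_VOXELS, 0, MISSED_VOXELS)

print(f"small structure Dice: {small_dice:.3f}")
print(f"large structure Dice: {large_dice:.3f}")
```

This does not by itself explain the gallbladder and pancreas gaps reported above, but it shows why per-organ numbers deserve more weight than the average when fine-grained structures matter.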

Evidence and comparison

The comparisons to SegFormer3D are fair (same pipeline, no pretraining), and SegMaFormer does consistently outperform this baseline by 1–2 Dice points. However, comparisons to UNETR, TransUNet, and TransBTS are largely irrelevant: Primus and nnU-Net Revisited have already shown these baselines to underperform modern CNNs. The paper fails to acknowledge that its 'competitive' results still trail the actual state of the art by significant margins (roughly 1.4 to 4.7 Dice points across the three benchmarks). No statistical significance testing is provided, and the use of dual RTX 4060Ti GPUs (an unusual hardware choice) raises questions about training stability versus standard A100/V100 setups used in comparable works.

“All experiments are conducted on a dual NVIDIA RTX 4060Ti GPU”

SegMaFormer paper · Section 3

“UNet Index... UNETR: 0.26, TransUNet: 0.70, nnFormer: 0.47”

Wald et al., Primus · Table 1
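As an illustration of the kind of significance testing that is missing, the following is a minimal sketch of an exact paired sign-flip permutation test on per-case Dice scores. It is one reasonable choice among several (a Wilcoxon signed-rank test would be another), and it is not something the paper or the cited works prescribe. It enumerates every sign assignment, so it is only practical for small test sets.

```python
from itertools import product


def paired_sign_flip_p_value(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same cases")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so every sign pattern is enumerated.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder per-case Dice values, invented purely to exercise the function.
placeholder_model_a = [0.84, 0.81, 0.86, 0.79, 0.83, 0.85]
placeholder_model_b = [0.82, 0.80, 0.83, 0.78, 0.80, 0.84]
p_value = paired_sign_flip_p_value(placeholder_model_a, placeholder_model_b)
print(f"two-sided p-value: {p_value}")
```

With real per-case scores from two models evaluated on the same test split, a test like this would show whether a 1–2 point average difference is distinguishable from case-to-case noise.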

Reproducibility

The paper adopts the nnU-Net framework for training but does not follow nnU-Net's automatic configuration paradigm, instead using fixed hyperparameters (AdamW, lr=3e-4, cosine annealing). This makes fair reproduction difficult because nnU-Net's strength lies in its dataset-specific optimization. No code or trained weights are released, and critical implementation details are left unspecified, including the exact Mamba block configuration (expansion ratio, state dimension), the 3D-RoPE frequency settings, and whether depth-wise convolutions are used. The efficiency numbers (15.2 GFLOPs) are reported but not validated against actual inference timing on standard hardware. The paper mentions using 'optional Deep Supervision' without specifying when it is enabled, a significant confounder for reproducibility.

“We adopt the nnUnet framework setup for training... A weighted combination of Dice and Cross-Entropy losses is utilized... Training uses the AdamW optimizer with a base learning rate of 3e-4”

SegMaFormer paper · Section 3

“this work implements optional Deep Supervision (DS) auxiliary heads... for tasks involving small anatomical structures, it can be observed that such auxiliary supervision does not improve performance”

SegMaFormer paper · Section 2, Decoder
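For anyone attempting a reproduction, the fixed learning-rate schedule is one of the few fully specified pieces. Below is a minimal sketch of a cosine-annealed learning rate with the stated base value of 3e-4. The total step count, the floor of zero, and the absence of warmup are assumptions, since the review does not record them.

```python
import math


def cosine_annealed_lr(step, total_steps, base_lr=3e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (base_lr - min_lr) * cosine


# Hypothetical schedule length; the paper's value is not recorded here.
TOTAL_STEPS = 1000
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>4}: lr = {cosine_annealed_lr(step, TOTAL_STEPS):.2e}")
```

Choices such as the schedule length, the minimum rate, and any warmup phase all change the effective training recipe, which is why leaving them unstated hampers reproduction.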

Abstract

The advent of Transformer and Mamba-based architectures has significantly advanced 3D medical image segmentation by enabling global contextual modeling, a capability traditionally limited in Convolutional Neural Networks (CNNs). However, state-of-the-art Transformer models often entail substantial computational complexity and parameter counts, which is particularly prohibitive for volumetric data and further exacerbated by the limited availability of annotated medical imaging datasets. To address these limitations, this work introduces SegMaFormer, a lightweight hybrid architecture that synergizes Mamba and Transformer modules within a hierarchical volumetric encoder for efficient long-range dependency modeling. The model strategically employs Mamba-based layers in early, high-resolution stages to reduce computational overhead while capturing essential spatial context, and reserves self-attention mechanisms for later, lower-resolution stages to refine feature representation. This design is augmented with generalized rotary position embeddings to enhance spatial awareness. Despite its compact structure, SegMaFormer achieves competitive performance on three public benchmarks (Synapse, BraTS, and ACDC), matching the Dice coefficient of significantly larger models. Empirically, our approach reduces parameters by up to 75x and substantially decreases FLOPs compared to current state-of-the-art models, establishing an efficient and high-performing solution for 3D medical image segmentation.
