STENet: Superpixel Token Enhancing Network for RGB-D Salient Object Detection

cs.CV Jianlin Chen, Gongyang Li, Zhijiang Zhang, Liang Chang, Dan Zeng · Mar 23, 2026
AI Review
Plain-language introduction

The paper addresses the quadratic complexity of transformer attention and limited local detail extraction in RGB-D Salient Object Detection (SOD). It proposes STENet, which introduces superpixels as intermediate tokens to reduce computational overhead while preserving structural coherence. The core idea replaces global pixel-to-pixel attention with two modules: one for pixel-to-superpixel global enhancement and another for intra-superpixel local refinement, aiming to balance efficiency and accuracy.

Critical review
Verdict
Bottom line

STENet presents a technically sound approach to reducing attention complexity from $O(N^2)$ to roughly $O(MN)$, where $N$ is the number of pixels and $M \ll N$ is the number of superpixels that summarize pixel regions. The ablation studies demonstrate clear incremental benefits from both the global (SAGEM) and local (SALRM) modules. However, the paper's efficiency claims are clouded by the fact that superpixel generation itself requires $T=2$ iterations of local cross-attention, a cost that is not fully integrated into the complexity comparisons with competing methods. While the method achieves competitive accuracy on seven benchmarks, its advantage over recent efficient transformers such as CAVER is marginal on certain datasets, suggesting the gains may be dataset-dependent rather than transformative.
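
To make the complexity concern concrete, here is a minimal counting sketch. It tallies pairwise similarity entries only, which is a rough proxy and not a FLOP count. The feature-map size (24×24), the superpixel grid (6×6), and the assumption that each generation iteration compares every pixel with at most 5×5 neighbouring superpixels are all hypothetical choices for illustration, not values taken from the paper.

```python
def attention_entries(num_pixels: int, num_superpixels: int,
                      window: int = 5, iterations: int = 2) -> dict:
    """Count pairwise similarity entries (a proxy for cost, not FLOPs)."""
    # Global pixel-to-pixel attention map: O(N^2) entries.
    pixel_to_pixel = num_pixels * num_pixels
    # Global pixel-to-superpixel attention map: O(MN) entries.
    pixel_to_superpixel = num_superpixels * num_pixels
    # Assumption: in each generation iteration, every pixel is compared with
    # at most window * window neighbouring superpixels.
    generation = iterations * num_pixels * window * window
    return {
        "pixel_to_pixel": pixel_to_pixel,
        "pixel_to_superpixel": pixel_to_superpixel,
        "generation": generation,
    }


# Hypothetical sizes: N = 24 * 24 pixels, M = 6 * 6 superpixels.
counts = attention_entries(num_pixels=24 * 24, num_superpixels=6 * 6)
for name, value in counts.items():
    print(f"{name}: {value}")
```

Under these made-up sizes the generation term is of the same order as the pixel-to-superpixel map itself, which is why a per-stage cost breakdown would be informative. The real ratio depends on the actual feature-map and grid sizes, which are not stated here.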

What holds up

The ablation evidence strongly supports the complementary roles of global and local superpixel processing. Table III shows that combining SAGEM and SALRM yields the best average $F_m$ (0.921) and $E_m$ (0.949), with the synergy being particularly pronounced on the complex SIP dataset ($F_m^\omega$ improves from 0.885 to 0.911). The superpixel generation's expanded neighborhood (5×5 masking) demonstrates improved semantic discrimination over standard 3×3 constraints (Table VI), validating the claim that "expanding to 5×5 allows pixels to search for similar superpixels over a wider spatial range." The efficiency metrics are genuine: with Swin-B backbone, STENet achieves 180.5M parameters versus CPNet's 216.5M while maintaining comparable accuracy.

“+SAGEM+SALRM (Ours) achieves 0.023 MAE on NJUD versus 0.029 for baseline”
paper · Table III
“expanding to 5×5 allows pixels to search for similar superpixels over a wider spatial range”
paper · Section III-B
“Ours with Swin-B has 180.5M parameters and 118.3G FLOPs”
paper · Table I
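
The effect of widening the mask from 3×3 to 5×5 can be illustrated with a small sketch that lists which superpixels a pixel is allowed to match. It assumes superpixels start on a regular grid and that the mask is a square window of grid cells centred on the pixel's own cell; the function name, the grid layout, and the sizes are hypothetical and may differ from the paper's implementation.

```python
def candidate_superpixels(py: int, px: int, cell: int,
                          grid_h: int, grid_w: int, window: int = 5) -> list:
    """Return flat indices of superpixels inside a window x window block of
    grid cells centred on the cell that contains pixel (py, px)."""
    radius = window // 2
    cy, cx = py // cell, px // cell  # grid cell containing the pixel
    candidates = []
    for gy in range(cy - radius, cy + radius + 1):
        for gx in range(cx - radius, cx + radius + 1):
            # Cells outside the grid are masked out.
            if 0 <= gy < grid_h and 0 <= gx < grid_w:
                candidates.append(gy * grid_w + gx)
    return candidates


# Hypothetical setup: 24x24 feature map, 6x6 superpixel grid, 4x4 pixels per cell.
interior_3x3 = candidate_superpixels(12, 12, cell=4, grid_h=6, grid_w=6, window=3)
interior_5x5 = candidate_superpixels(12, 12, cell=4, grid_h=6, grid_w=6, window=5)
print(len(interior_3x3), len(interior_5x5))
```

In this setup an interior pixel goes from 9 to 25 candidate superpixels, so a wider search range also means more similarity comparisons per pixel in each generation iteration.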
Main concerns

First, the complexity analysis is incomplete. While SAGEM reduces the attention map size to $M \times HW$, superpixel generation involves $T=2$ iterations of local cross-attention with top-k masking, which are not free; yet the paper reports only final FLOPs, without separating the overhead of superpixel creation from that of standard partitioning. The claim that "the selective cross-modal fusion mechanism reduces redundant computations by 37.2% compared to CAVER" is metric-specific and not uniformly supported across all datasets. Second, the comparison with CAVER's "parameter-free spatial attention" is somewhat asymmetric: CAVER avoids learned parameters entirely for spatial mixing, whereas STENet introduces substantial superpixel generation machinery. Third, the dimensional notation in equations (4)-(5) appears inconsistent: $P_R^i$ is defined via $K_{SR}^i \otimes Q_{SR}^{i\top}$ (both superpixel terms) but is then used to "propagate information back into the original pixel space," suggesting a dimensional mismatch or notation error that hinders reproducibility.

“Through $T$ iterations of this process... $T$ is set to 2”
paper · Section III-B
“reduces redundant computations by 37.2% compared to CAVER”
paper · Section IV-B.3
“utilizing superpixel keys and pixel queries... obtaining attention maps $P_R^i$”
paper · Section III-C
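
The dimensional concern can be checked with plain shape bookkeeping. The sketch below assumes superpixel features of shape $(M, d)$ and pixel features of shape $(N, d)$; these shapes, the sizes, and the helper `matmul_shape` are illustrative assumptions, not the paper's definitions.

```python
def matmul_shape(left: tuple, right: tuple) -> tuple:
    """Return the shape of left @ right for 2-D shapes, or raise if incompatible."""
    (rows, inner_left), (inner_right, cols) = left, right
    if inner_left != inner_right:
        raise ValueError(f"cannot multiply {left} by {right}")
    return (rows, cols)


N, M, d = 576, 36, 64  # hypothetical pixel count, superpixel count, channel width

# Both operands are superpixel terms: the map relates superpixels to superpixels.
superpixel_only_map = matmul_shape((M, d), (d, M))               # (M, M)
superpixel_output = matmul_shape(superpixel_only_map, (M, d))    # (M, d)

# Pixel queries against superpixel keys: the map relates pixels to superpixels.
pixel_superpixel_map = matmul_shape((N, d), (d, M))              # (N, M)
pixel_output = matmul_shape(pixel_superpixel_map, (M, d))        # (N, d)

print(superpixel_output, pixel_output)
```

Under these assumed shapes, a map built from two superpixel terms yields an output that stays in superpixel space, whereas a map built from pixel queries and superpixel keys returns one feature per pixel. Only the second form is compatible with propagating information back to pixel space, which suggests the equation and the accompanying description cannot both be read literally.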
Evidence and comparison

The evidence supports the superiority of superpixel-based attention over naive channel-only fusion: +SAGEM+SALRM outperforms +CAF+SALRM by 0.005 $F_m$ on NJUD (0.941 vs. 0.936, Table III). However, the paper understates that CAVER achieves nearly identical $E_m$ scores on NLPR (0.963 vs. 0.964) with far fewer FLOPs on certain configurations. The failure case analysis (Fig. 8) honestly acknowledges limitations with noisy depth maps and weak-texture scenes, though the suggested fixes (cross-modal consistency constraints) are vague. The comparison to K-means and self-attention superpixel methods (Table V) validates their specific cross-attention generation approach, showing a 0.001 MAE improvement on SIP (0.032 vs. 0.033) over the cross-attention based method.

“Ours achieves 0.032 MAE on SIP versus 0.033 for cross-attention based method”
paper · Table V
“noisy depth maps make our method difficult to distinguish foreground from background”
paper · Figure 8
“+CAF+SALRM achieves 0.936 $F_m$ on NJUD versus 0.941 for +SAGEM+SALRM”
paper · Table III
Reproducibility

The paper provides standard implementation details: PyTorch, Swin-B pre-trained on ImageNet, Adam optimizer with cosine annealing (learning rate $5 \times 10^{-5}$, batch size 5), and 384×384 input resolution. However, critical details for reproducing the superpixel generation are missing: the exact initialization strategy for superpixel features $S$, the specific implementation of the top-k selection mechanism (gather/scatter operations), and the gradient flow through the argmax-based masking. The paper states "we adopt the local cross-attention [75] to generate superpixel tokens" but does not clarify whether this is fully differentiable or requires straight-through estimators. No code repository is linked in the provided text, and the exact split of 700 NLPR + 1485 NJUD + 800 DUT-RGBD training images, while cited from prior work, lacks checksums or file lists to ensure identical data preprocessing.

“We implement STENet using PyTorch... inputs are resized to 384×384”
paper · Section IV-A.2
“we adopt the local cross-attention [75] to generate superpixel tokens”
paper · Section III-B
“700 NLPR, 1485 NJUD, and 800 DUT-RGBD images”
paper · Section IV-A.1
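
One lightweight way to close the data-split gap is a checksum manifest published alongside the code. The sketch below is a generic illustration using only the Python standard library; the function name `build_manifest`, the directory layout, and the placeholder file are hypothetical and do not describe anything the authors provide.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


# Demonstration on a throwaway directory with one placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "NJUD").mkdir()
    (root / "NJUD" / "0001.png").write_bytes(b"placeholder image bytes")
    demo_manifest = build_manifest(root)

for name, digest in demo_manifest.items():
    print(name, digest)
```

Comparing such a manifest against one's own copy of the training split would reveal missing, extra, or re-encoded images before any training run.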
Abstract

Transformer-based methods for RGB-D Salient Object Detection (SOD) have gained significant interest, owing to the transformer's exceptional capacity to capture long-range pixel dependencies. Nevertheless, current RGB-D SOD methods face challenges, such as the quadratic complexity of the attention mechanism and the limited local detail extraction. To overcome these limitations, we propose a novel Superpixel Token Enhancing Network (STENet), which introduces superpixels into cross-modal interaction. STENet follows the two-stream encoder-decoder structure. Its cores are two tailored superpixel-driven cross-modal interaction modules, responsible for global and local feature enhancement. Specifically, we update the superpixel generation method by expanding the neighborhood range of each superpixel, allowing for flexible transformation between pixels and superpixels. With the updated superpixel generation method, we first propose the Superpixel Attention Global Enhancing Module to model the global pixel-to-superpixel relationship rather than the traditional global pixel-to-pixel relationship, which can capture region-level information and reduce computational complexity. We also propose the Superpixel Attention Local Refining Module, which leverages pixel similarity within superpixels to filter out a subset of pixels (i.e., local pixels) and then performs feature enhancement on these local pixels, thereby capturing concerned local details. Furthermore, we fuse the globally and locally enhanced features along with the cross-scale features to achieve comprehensive feature representation. Experiments on seven RGB-D SOD datasets reveal that our STENet achieves competitive performance compared to state-of-the-art methods. The code and results of our method are available at https://github.com/Mark9010/STENet.
