FISformer: Replacing Self-Attention with a Fuzzy Inference System in Transformer Models for Time Series Forecasting
FISformer proposes replacing the dot-product self-attention in Transformers with a Sugeno-type Fuzzy Inference System (FIS) for time series forecasting. Instead of computing query-key similarities, the model fuzzifies tokens using learnable Gaussian membership functions, applies fuzzy rules, and defuzzifies to produce interaction weights. The paper suggests this approach captures uncertainty and nonlinearity better than standard attention, reporting state-of-the-art results on benchmarks like ETT, ECL, and Weather.
The paper presents an interesting hybrid of fuzzy logic and deep learning, but its central claims are weakened by methodological inconsistencies and insufficient experimental rigor. The FIS interaction mechanism does not actually replace the pairwise token-token attention computation as claimed; instead, it produces per-token weights per dimension that are element-wise multiplied with values, fundamentally altering the architecture's relational modeling capacity. The complexity analysis comparing $\mathcal{O}(T^2 d)$ vs $\mathcal{O}(T d R)$ is misleading because the mechanisms compute different quantities. The empirical evaluation lacks critical baselines like DLinear (which the cited Zeng et al. showed outperforms many Transformers), uses only 10 training epochs without variance reporting, and claims noise robustness and interpretability without supporting experiments.
The core intellectual contribution—embedding a differentiable Sugeno FIS into the Transformer architecture—is novel and technically sound. The fuzzification procedure using learnable Gaussian MFs (Eq. 5), product-based rule firing (Eq. 6), and normalized defuzzification (Eq. 7-10) forms a coherent, end-to-end differentiable fuzzy reasoning system. The paper correctly identifies that standard dot-product attention produces deterministic scores and that fuzzy logic offers natural uncertainty modeling. The ablation in Table VII showing FIS interaction improves Informer when swapped into ProbSparse attention demonstrates the mechanism's potential portability across architectures.
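The fuzzify, fire, and defuzzify pipeline described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the zero-order (constant) consequents, the pairing of the r-th query set with the r-th key set to form R rules, and all parameter values are assumptions chosen for illustration. Every operation is smooth, which is why such a layer can in principle be trained end to end in an autodiff framework.

```python
import numpy as np

def gaussian_mf(x, centers, sigmas):
    """Membership of every entry of x in R Gaussian fuzzy sets.

    x: any shape; centers, sigmas: shape (R,). Returns shape x.shape + (R,).
    """
    return np.exp(-0.5 * ((x[..., None] - centers) / sigmas) ** 2)

def sugeno_interaction(Q, K, c_q, s_q, c_k, s_k, consequents, eps=1e-12):
    """Zero-order Sugeno inference applied per token and per feature.

    Assumption: rule r pairs the r-th query set with the r-th key set.
    """
    mu_q = gaussian_mf(Q, c_q, s_q)            # fuzzification, (T, d, R)
    mu_k = gaussian_mf(K, c_k, s_k)            # fuzzification, (T, d, R)
    firing = mu_q * mu_k                       # product-based rule firing
    normalized = firing / (firing.sum(axis=-1, keepdims=True) + eps)
    # Defuzzification: firing-strength-weighted average of rule consequents.
    return (normalized * consequents).sum(axis=-1)   # (T, d)

# Hypothetical sizes and parameters, for illustration only.
rng = np.random.default_rng(0)
T, d, R = 5, 4, 3
Q = rng.uniform(-1.0, 1.0, size=(T, d))
K = rng.uniform(-1.0, 1.0, size=(T, d))
centers = np.array([-1.0, 0.0, 1.0])
sigmas = np.full(R, 0.5)
consequents = np.array([0.1, 0.5, 0.9])

W = sugeno_interaction(Q, K, centers, sigmas, centers, sigmas, consequents)
print(W.shape)  # one weight per token and feature, not per token pair
```

Because the output is a normalized weighted average, each weight stays within the range spanned by the rule consequents. Note also that the result has shape (T, d) rather than (T, T).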
The paper's fundamental claim of replacing 'token-to-token affinities' is misleading. Standard self-attention computes pairwise scores $A \in \mathbb{R}^{T \times T}$ enabling global token interaction via $H' = AV$. FISformer produces weights $A \in \mathbb{R}^{T \times d}$ (Eq. 11) applied via element-wise multiplication $O_{i,j} = A_{i,j} \cdot V_{i,j}$ (Eq. 12), which acts as feature-wise gating without cross-token aggregation. This is not an attention mechanism but a fuzzy feature modulation, eliminating the Transformer's capacity to aggregate information across positions in a single layer. The complexity comparison (Table IX) is therefore apples-to-oranges: achieving linear complexity $\mathcal{O}(TdR)$ by removing pairwise interactions is not a computational saving but a representational limitation. Furthermore, the paper claims 'superior forecasting accuracy' yet omits comparison with DLinear and other simple baselines that the cited Zeng et al. (2023) showed outperform complex Transformers. Most critically, claims of 'noise robustness' and 'interpretability' (highlighted in the abstract and contributions) are never experimentally validated—no noise injection experiments are conducted, and no rule visualization or linguistic interpretation of learned fuzzy sets is provided.
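The difference between $H' = AV$ and the element-wise product of Eq. 12 can be checked numerically. The sketch below is a toy illustration under stated assumptions, not the paper's code: the weights are random and held fixed, so any coupling that the token-axis softmax introduces into the weights themselves is not modeled. It perturbs the value vector of one token and measures how far the change propagates to the other tokens' outputs.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d = 4, 3                      # hypothetical token count and feature size
V = rng.normal(size=(T, d))

# Fixed, random weights standing in for learned ones (assumption).
A_attn = softmax(rng.normal(size=(T, T)), axis=-1)  # pairwise scores, (T, T)
A_gate = softmax(rng.normal(size=(T, d)), axis=0)   # per-feature weights, (T, d)

def attention_out(A, V):
    return A @ V                 # each output row mixes all value rows

def gating_out(A, V):
    return A * V                 # each output row uses only its own value row

# Perturb the value vector of token 0 only.
V_perturbed = V.copy()
V_perturbed[0] += 1.0

delta_attn = np.abs(attention_out(A_attn, V_perturbed) - attention_out(A_attn, V))
delta_gate = np.abs(gating_out(A_gate, V_perturbed) - gating_out(A_gate, V))

print("attention, change at tokens 1..T-1:", delta_attn[1:].max())
print("gating,    change at tokens 1..T-1:", delta_gate[1:].max())
```

Under matrix multiplication the perturbation reaches every other token. Under element-wise gating with fixed weights it stays confined to token 0, which is the sense in which value information is not aggregated across positions.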
The experimental protocol raises significant concerns. Training for only 10 epochs with a fixed batch size of 32 (Section IV-B) is insufficient for convergence on many of these benchmarks, and the lack of standard deviation reporting across multiple runs makes statistical significance impossible to assess. The comparison with FANTF (Table V) shows large improvements, but FANTF's fuzzy mechanism is underspecified in both papers. The paper cites Zeng et al.'s critical finding that 'standard designs may underperform compared to simpler models' yet fails to include the DLinear baseline that demonstrated this finding, potentially inflating the apparent superiority of FISformer. The inverted architecture (variate-as-token) adopted from iTransformer means the FIS interaction is applied across variates (channels), not time steps, modeling cross-variable rather than temporal dependencies—a distinction not clearly communicated when claiming general forecasting superiority.
Reproducibility is severely limited. No code repository or implementation details beyond basic hyperparameter ranges are provided. Critical hyperparameters such as the number of membership functions ($R=3$) appear fixed without ablation. The training protocol uses a single GPU with unspecified random seeds. The paper does not report inference time measurements despite complexity claims. Table II reports average results across prediction lengths, but the raw data in Tables III-IV shows FISformer underperforms iTransformer on some individual horizons (e.g., ETTh1 at 96 steps: 0.378 vs 0.386 MSE is actually worse for FISformer, contrary to the 'best on 6 of 7' claim when considering all metrics and horizons). Without open-sourced code, exact reproduction of the FIS interaction layer's implementation—particularly the backpropagation through the Sugeno defuzzification—is impractical.
Transformers have achieved remarkable progress in time series forecasting, yet their reliance on deterministic dot-product attention limits their capacity to model uncertainty and nonlinear dependencies across multivariate temporal dimensions. To address this limitation, we propose FISFormer, a Fuzzy Inference System-driven Transformer that replaces conventional attention with a FIS Interaction mechanism. In this framework, each query-key pair undergoes a fuzzy inference process for every feature dimension, where learnable membership functions and rule-based reasoning estimate token-wise relational strengths. These FIS-derived interaction weights capture uncertainty and provide interpretable, continuous mappings between tokens. A softmax operation is applied along the token axis to normalize these weights, which are then combined with the corresponding value features through element-wise multiplication to yield the final context-enhanced token representations. This design fuses the interpretability and uncertainty modeling of fuzzy logic with the representational power of Transformers. Extensive experiments on multiple benchmark datasets demonstrate that FISFormer achieves superior forecasting accuracy, noise robustness, and interpretability compared to state-of-the-art Transformer variants, establishing fuzzy inference as an effective alternative to conventional attention mechanisms.