FISformer: Replacing Self-Attention with a Fuzzy Inference System in Transformer Models for Time Series Forecasting

cs.LG cs.AI Bulent Haznedar, Levent Karacan · Mar 23, 2026
AI Review
Plain-language introduction

FISformer proposes replacing the dot-product self-attention in Transformers with a Sugeno-type Fuzzy Inference System (FIS) for time series forecasting. Instead of computing query-key similarities, the model fuzzifies tokens using learnable Gaussian membership functions, applies fuzzy rules, and defuzzifies to produce interaction weights. The paper suggests this approach captures uncertainty and nonlinearity better than standard attention, reporting state-of-the-art results on benchmarks like ETT, ECL, and Weather.

Critical review
Verdict
Bottom line

The paper presents an interesting hybrid of fuzzy logic and deep learning, but its central claims are weakened by methodological inconsistencies and insufficient experimental rigor. The FIS interaction mechanism does not replace the pairwise token-token attention computation as claimed. Instead, it produces one weight per token and feature dimension, which is multiplied element-wise with the values; this changes the architecture's relational modeling capacity rather than substituting for it. The complexity comparison of $\mathcal{O}(T^2 d)$ against $\mathcal{O}(T d R)$ is misleading because the two mechanisms compute different quantities. The empirical evaluation omits critical baselines such as DLinear (which the cited Zeng et al. showed outperforms many Transformers), trains for only 10 epochs without reporting variance, and claims noise robustness and interpretability without supporting experiments.

“The resulting Fuzzy Interaction Map $\mathbf{F}_{QK}$ contains the inferred relational strengths across all token dimensions... For each dimension, a softmax operation is applied along the token axis to obtain normalized interaction weights”
paper · Section III-B
“training is performed for 10 epochs”
paper · Section IV-B
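
To make the scale of the two complexity terms concrete, the sketch below evaluates the leading terms $T^2 d$ and $T d R$. The sizes `T = 100` and `d = 512` are hypothetical values chosen for illustration and do not come from the paper; `R = 3` matches the three membership functions the paper reports. Constant factors and the cost of the exponentials are ignored.

```python
def attention_ops(T, d):
    """Leading term of the self-attention cost, O(T^2 * d)."""
    return T * T * d


def fis_ops(T, d, R):
    """Leading term of the FIS interaction cost, O(T * d * R)."""
    return T * d * R


# Hypothetical sizes for illustration only (not taken from the paper).
T, d, R = 100, 512, 3
print(attention_ops(T, d))                     # 5120000
print(fis_ops(T, d, R))                        # 153600
print(attention_ops(T, d) / fis_ops(T, d, R))  # the ratio equals T / R
```

The ratio reduces to $T/R$, so the gap grows with the number of tokens. It arises because the FIS path never forms a $T \times T$ pairwise term, which is why the two counts do not describe the same computation.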
What holds up

The core intellectual contribution, embedding a differentiable Sugeno FIS into the Transformer architecture, is novel and technically sound. The fuzzification procedure using learnable Gaussian MFs (Eq. 5), product-based rule firing (Eq. 6), and normalized defuzzification (Eqs. 7-10) forms a coherent, end-to-end differentiable fuzzy reasoning system. The paper correctly identifies that standard dot-product attention produces deterministic scores and that fuzzy logic offers a natural way to model uncertainty. The ablation in Table VII, in which FIS interaction is swapped in for Informer's ProbSparse attention and improves its results, suggests the mechanism may be portable across architectures.

“For the rule generation stage, we adopt a product-based t-norm to compute the activation strength of each fuzzy rule... The firing strength $\pi_{i,j,r}$ is computed as: $\pi_{i,j,r}=\mu_{q}(q_{i,j};r)\cdot\mu_{k}(k_{i,j};r)$”
paper · Section III-B
“we integrated FIS interaction into the Informer architecture... results demonstrate that FIS interaction generalizes well across architectures”
paper · Section IV-D
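
The fuzzify, fire, and defuzzify pipeline described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear (first-order Sugeno) consequent `a*q + b*k + c`, the sharing of one rule set across all feature dimensions, and every parameter value in `RULES` are assumptions made here for brevity. In the paper the membership functions are learnable and defined per feature.

```python
import math


def gaussian_mf(x, center, sigma):
    """Gaussian membership degree of x, a value in (0, 1]."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def sugeno_score(q, k, rules):
    """Fuzzy score for one scalar (q, k) pair.

    Each rule is (q_center, q_sigma, k_center, k_sigma, a, b, c).
    The linear consequent a*q + b*k + c is an assumption of this sketch.
    """
    weighted_sum = 0.0
    firing_total = 0.0
    for qc, qs, kc, ks, a, b, c in rules:
        # Product t-norm: firing strength is the product of memberships.
        firing = gaussian_mf(q, qc, qs) * gaussian_mf(k, kc, ks)
        weighted_sum += firing * (a * q + b * k + c)
        firing_total += firing
    # Normalized defuzzification: firing-weighted average of consequents.
    return weighted_sum / (firing_total + 1e-12)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fis_interaction(Q, K, V, rules):
    """Return (A, O) for T x d inputs given as nested lists."""
    T, d = len(Q), len(Q[0])
    F = [[sugeno_score(Q[i][j], K[i][j], rules) for j in range(d)]
         for i in range(T)]
    A = [[0.0] * d for _ in range(T)]
    for j in range(d):
        column = softmax([F[i][j] for i in range(T)])  # token axis
        for i in range(T):
            A[i][j] = column[i]
    # Element-wise multiplication with the values.
    O = [[A[i][j] * V[i][j] for j in range(d)] for i in range(T)]
    return A, O


# Arbitrary illustrative parameters: R = 3 rules shared by all dimensions.
RULES = [
    (-1.0, 1.0, -1.0, 1.0, 0.5, 0.5, 0.0),
    (0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0),
    (1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.1),
]
Q = [[0.2, -0.5], [1.0, 0.3], [-0.7, 0.8]]
K = [[0.1, 0.4], [-0.3, 0.9], [0.6, -0.2]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
A, O = fis_interaction(Q, K, V, RULES)
# Each column of A sums to 1 because the softmax runs over tokens.
print([round(sum(A[i][j] for i in range(3)), 6) for j in range(2)])  # [1.0, 1.0]
```

Every operation here is a composition of exponentials, products, sums, and a division, which is why the whole path is differentiable with respect to the centers, widths, and consequent coefficients.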
Main concerns

The paper's fundamental claim of replacing 'token-to-token affinities' is misleading. Standard self-attention computes pairwise scores $A \in \mathbb{R}^{T \times T}$, which enable global token interaction via $H' = AV$. FISformer instead produces weights $A \in \mathbb{R}^{T \times d}$ (Eq. 11) that are applied by element-wise multiplication, $O_{i,j} = A_{i,j} \cdot V_{i,j}$ (Eq. 12). The softmax along the token axis normalizes each weight against the other tokens, but no value from one position is ever mixed into another, so the operation acts as feature-wise gating without cross-token aggregation. This is fuzzy feature modulation rather than an attention mechanism, and it removes the Transformer's capacity to aggregate information across positions within a single layer. The complexity comparison (Table IX) is therefore apples-to-oranges: reaching linear complexity $\mathcal{O}(TdR)$ by removing pairwise interactions is a representational limitation, not a computational saving. Furthermore, the paper claims 'superior forecasting accuracy' yet omits comparison with DLinear and other simple baselines that the cited Zeng et al. (2023) showed can outperform complex Transformers. Most critically, the claims of 'noise robustness' and 'interpretability', highlighted in the abstract and contributions, are never experimentally validated: no noise-injection experiments are conducted, and no rule visualization or linguistic interpretation of the learned fuzzy sets is provided.

“$O_{i,j}=A_{i,j}\cdot V_{i,j},\quad\text{for }i=1,2,\ldots,T,\;j=1,2,\ldots,d$”
paper · Section III-B
“superior forecasting accuracy, noise robustness, and interpretability”
paper · Abstract
“Time Complexity: Self-Attention $\mathcal{O}(T^{2}\cdot d)$, FIS Interaction $\mathcal{O}(T\cdot d\cdot R)$”
paper · Table IX
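
The difference between aggregation ($H' = AV$) and element-wise gating ($O_{i,j} = A_{i,j} \cdot V_{i,j}$) can be checked with a small sketch. The weight matrices and values below are arbitrary numbers chosen for illustration. The weights are held fixed while only the values of token 1 are changed, so the sketch isolates how values are combined; it does not model the indirect coupling that the token-axis softmax introduces into the weights themselves.

```python
def attention_mix(A, V):
    """Self-attention style aggregation H' = A V, with A of shape T x T."""
    T, d = len(V), len(V[0])
    return [[sum(A[i][t] * V[t][j] for t in range(T)) for j in range(d)]
            for i in range(T)]


def elementwise_gate(A, V):
    """FIS-style gating O_ij = A_ij * V_ij, with A of shape T x d."""
    return [[a * v for a, v in zip(row_a, row_v)]
            for row_a, row_v in zip(A, V)]


# Arbitrary illustrative weights: rows of A_attn sum to 1 (softmax over keys),
# columns of A_gate sum to 1 (softmax over the token axis).
A_attn = [[0.5, 0.5], [0.25, 0.75]]
A_gate = [[0.5, 0.25], [0.5, 0.75]]
V = [[1.0, 2.0], [3.0, 4.0]]
V_changed = [[1.0, 2.0], [30.0, 40.0]]  # only token 1's values differ

# Token 0's attention output moves when token 1's values change.
print(attention_mix(A_attn, V)[0], attention_mix(A_attn, V_changed)[0])
# [2.0, 3.0] [15.5, 21.0]

# Token 0's gated output is unaffected by token 1's values.
print(elementwise_gate(A_gate, V)[0], elementwise_gate(A_gate, V_changed)[0])
# [0.5, 0.5] [0.5, 0.5]
```

With gating, information about another token's values can reach position 0 only through later layers or through the normalizer of the weights, never through a direct weighted sum of values.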
Evidence and comparison

The experimental protocol raises significant concerns. Training for only 10 epochs with a fixed batch size of 32 (Section IV-B) may be insufficient for convergence on many of these benchmarks, and without standard deviations across multiple runs, statistical significance cannot be assessed. The comparison with FANTF (Table V) shows large improvements, but FANTF's fuzzy mechanism is underspecified in both papers. The paper cites Zeng et al.'s critical finding that 'standard designs may underperform compared to simpler models' yet does not include the DLinear baseline that demonstrated it, which potentially inflates the apparent superiority of FISformer. Finally, the inverted architecture (variate-as-token) adopted from iTransformer means the FIS interaction is applied across variates (channels), not time steps. It therefore models cross-variable rather than temporal dependencies, a distinction that is not clearly communicated when the paper claims general forecasting superiority.

“training is performed for 10 epochs... A fixed batch size of 32 is used across all experiments”
paper · Section IV-B
“Zeng et al. critically evaluate transformer models for time series forecasting and highlight their limitations, showing that standard designs may underperform compared to simpler models”
paper · Section II-A
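
The variance reporting the review asks for amounts to repeating each configuration over several seeds and reporting the mean and standard deviation. The sketch below shows only the reporting step. `run_experiment` is a hypothetical stand-in that returns a synthetic number so the code can run; it is not a result from the paper, and a real version would train and evaluate the model with the given seed.

```python
import random
import statistics


def run_experiment(seed):
    """Hypothetical stand-in for one training run; returns a test MSE.

    A real version would train the model with this seed. Here it only
    returns a synthetic number so that the reporting code can run.
    """
    rng = random.Random(seed)
    return 0.40 + rng.gauss(0.0, 0.01)


def summarize(values):
    """Return (mean, sample standard deviation) of a list of metrics."""
    return statistics.mean(values), statistics.stdev(values)


seeds = [0, 1, 2, 3, 4]
scores = [run_experiment(s) for s in seeds]
mean, std = summarize(scores)
print(f"MSE over {len(seeds)} seeds: {mean:.4f} +/- {std:.4f}")
```

Reporting results in this form lets a reader judge whether a gap between two models is larger than the run-to-run spread.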
Reproducibility

Reproducibility is severely limited. No code repository is provided, and implementation details do not go beyond basic hyperparameter ranges. Critical hyperparameters such as the number of membership functions ($R=3$) appear to be fixed without ablation. Training uses a single GPU with unspecified random seeds. The paper does not report inference-time measurements despite its complexity claims. Table II reports results averaged across prediction lengths, which can hide per-horizon differences in Tables III-IV, so the 'best on 6 of 7' claim should be checked against every horizon and both metrics. The one horizon quoted here does not contradict it: on ETTh1 at 96 steps, FISformer's MSE of 0.378 is lower, and therefore better, than iTransformer's 0.386. Without open-sourced code, exact reproduction of the FIS interaction layer, particularly backpropagation through the Sugeno defuzzification, is impractical.

“we utilize three Gaussian membership functions per feature”
paper · Section III-B
“FISformer achieves the best performance on 6 out of 7 datasets for both MSE and MAE”
paper · Section IV-C
“ETTh1 96: FISformer MSE 0.378, iTransformer MSE 0.386”
paper · Table III
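
The missing inference-time measurements could be collected with a simple wall-clock harness such as the sketch below. `dummy_forward` is a hypothetical placeholder for a model's forward pass, and the repeat and warm-up counts are arbitrary choices. When timing on a GPU, the measured function would also need to wait for the device to finish its work before the clock is read, which this CPU-only sketch does not cover.

```python
import statistics
import time


def time_callable(fn, repeats=5, warmup=1):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def dummy_forward():
    """Hypothetical placeholder for a model's forward pass."""
    return sum(i * i for i in range(10_000))


print(f"median latency: {time_callable(dummy_forward):.6f} s")
```

The median is used rather than the mean so that a single slow run, for example one interrupted by the operating system, does not dominate the reported figure.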
Abstract

Transformers have achieved remarkable progress in time series forecasting, yet their reliance on deterministic dot-product attention limits their capacity to model uncertainty and nonlinear dependencies across multivariate temporal dimensions. To address this limitation, we propose FISFormer, a Fuzzy Inference System-driven Transformer that replaces conventional attention with a FIS Interaction mechanism. In this framework, each query-key pair undergoes a fuzzy inference process for every feature dimension, where learnable membership functions and rule-based reasoning estimate token-wise relational strengths. These FIS-derived interaction weights capture uncertainty and provide interpretable, continuous mappings between tokens. A softmax operation is applied along the token axis to normalize these weights, which are then combined with the corresponding value features through element-wise multiplication to yield the final context-enhanced token representations. This design fuses the interpretability and uncertainty modeling of fuzzy logic with the representational power of Transformers. Extensive experiments on multiple benchmark datasets demonstrate that FISFormer achieves superior forecasting accuracy, noise robustness, and interpretability compared to state-of-the-art Transformer variants, establishing fuzzy inference as an effective alternative to conventional attention mechanisms.
