Towards Multimodal Time Series Anomaly Detection with Semantic Alignment and Condensed Interaction
MindTS tackles multimodal time series anomaly detection by fusing numerical time series with text from two sources: endogenous text (LLM-generated descriptions of patch statistics) and exogenous text (external reports). The core idea is to align these heterogeneous modalities via contrastive learning and filter textual redundancy using an Information Bottleneck-inspired content condenser before cross-modal reconstruction. This matters because real-world anomalies often manifest in contextual text (e.g., policy changes affecting stock prices) that pure numerical models miss.
MindTS introduces a plausible architecture for integrating text and time series, but the paper suffers from reproducibility gaps, unfair baseline comparisons, and overstated claims. While the ablation studies validate the individual components (Figure 3), the main results (Table 1) compare a multimodal method against unimodal baselines without controlling for the information advantage of text access, exaggerating the reported gains. When compared fairly against other multimodal frameworks (Table 2), the improvements are marginal and lack statistical significance testing. The reliance on proprietary LLM APIs for endogenous text generation and the use of custom datasets with undocumented web-crawling procedures further undermine the reproducibility of the claimed state-of-the-art results.
The ablation studies (Figure 3) provide reasonable evidence that both text views (endogenous and exogenous) contribute complementary information, as removing either degrades performance. The content condenser mechanism, which minimizes $I(\mathbf{Z}_{\text{text}}; \mathbf{Z}_{\text{con}})$ via the upper bound in Lemma 1, is theoretically grounded in the Information Bottleneck principle (Tishby et al., 2000). The smoothness loss $\mathcal{L}_{SM}=\frac{1}{N}\sum_{i=1}^{N-1}\sqrt{(\psi_{i+1}-\psi_{i})^{2}}$, in which each square root reduces to $|\psi_{i+1}-\psi_{i}|$ so that the loss is a total-variation penalty on adjacent $\psi_i$ scaled by $1/N$, addresses a genuine concern about temporal discontinuity in the Bernoulli masking process. The paper also correctly notes that filtering before alignment performs worse than filtering after (ablation f), which validates the architectural ordering.
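As a quick check on the smoothness loss quoted above, the following minimal sketch evaluates the formula exactly as written. It assumes `psi` is a plain list holding the per-patch values $\psi_i$ (taken here to be the Bernoulli mask probabilities, which is an interpretation rather than something the review states); the function name is hypothetical and this is not the authors' implementation.

```python
import math


def smoothness_loss(psi):
    """Evaluate L_SM = (1/N) * sum_{i=1}^{N-1} sqrt((psi[i+1] - psi[i])^2).

    `psi` is assumed to be a non-empty sequence of floats, one per patch.
    sqrt(x^2) equals |x|, so this is a total-variation penalty scaled by 1/N.
    """
    n = len(psi)
    if n == 0:
        raise ValueError("psi must contain at least one value")
    total = sum(math.sqrt((psi[i + 1] - psi[i]) ** 2) for i in range(n - 1))
    return total / n


# A constant mask incurs no penalty; an alternating mask is penalized heavily.
flat = smoothness_loss([0.5, 0.5, 0.5, 0.5])
jumpy = smoothness_loss([0.0, 1.0, 0.0, 1.0])
print(flat, jumpy)
```

Because the sum has $N-1$ terms but is divided by $N$, the value is slightly smaller than the mean absolute adjacent difference; for long sequences the gap is negligible.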
The paper makes unfair comparisons by pitting MindTS (which uses both time series and text) against unimodal methods that use only numerical data (Table 1), rendering the 'SOTA' claim misleading. Table 2 shows that against multimodal baselines using MM-TSFLib, the gains are often within 1-3% and lack variance estimates or significance tests. The endogenous text generation relies on LLM prompts for patch statistics (mean, extrema, trend) described only in Appendix G, introducing non-determinism and prohibitive API costs that block reproduction. The datasets (KR, EWJ, MDT) involve custom web-crawling with vague 'domain-specific keywords' and manual filtering for 'factual content' (Appendix A.1), making them impossible to replicate exactly. The claim that exogenous text is 'easy to obtain' is domain-dependent and contradicts the acknowledged challenge that 'external information sources are often scattered.'
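To make the endogenous-text concern concrete, here is a minimal sketch of a deterministic, template-based description of patch statistics (mean, extrema, trend). It is a hypothetical stand-in that a reproduction could use as a non-LLM control, not the prompt from Appendix G. The patching scheme (non-overlapping windows), the trend rule (sign of last minus first value), and all function names are assumptions.

```python
from statistics import mean


def patch_stats(values, patch_len):
    """Split a series into non-overlapping patches and summarize each one.

    Trailing values that do not fill a whole patch are dropped (an assumption).
    """
    if patch_len <= 0:
        raise ValueError("patch_len must be positive")
    stats = []
    for start in range(0, len(values) - patch_len + 1, patch_len):
        patch = values[start:start + patch_len]
        delta = patch[-1] - patch[0]
        # Assumed trend rule: compare the last value with the first.
        trend = "rising" if delta > 0 else "falling" if delta < 0 else "flat"
        stats.append({"mean": mean(patch), "min": min(patch),
                      "max": max(patch), "trend": trend})
    return stats


def describe(stat):
    """Render one patch summary as a fixed-format sentence fragment."""
    return (f"mean {stat['mean']:.2f}, min {stat['min']:.2f}, "
            f"max {stat['max']:.2f}, trend {stat['trend']}")


for s in patch_stats([1, 2, 3, 3, 2, 1], patch_len=3):
    print(describe(s))
```

A template like this yields identical text on every run, so comparing it against the LLM-generated descriptions would show how much of the gain depends on the LLM itself.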
The evidence supports that multimodal information helps anomaly detection, but the magnitude of improvement is overstated. In Table 2, MindTS (74.37 Aff-F on Energy) shows only marginal improvement over Modern* (72.13) and iTrans* (72.49) when all methods use text. The comparison to MM-TSFLib is particularly important because both use the same input modalities; here MindTS wins, but by narrow margins that may not survive statistical testing. The paper reports Aff-F and VUS metrics, which are less common than the point-adjusted F1 or AUC-ROC used in prior work (e.g., DCdetector, Anomaly Transformer), making direct comparison with the broader literature difficult. The claim that performance stems from 'architectural design rather than reliance on specific LLMs' (Table 3) is weakly supported because all tested LLMs (GPT-2, BERT, LLaMA, DeepSeek) provide similar semantic capabilities; a stronger test would compare against non-semantic baselines.
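The kind of significance check that is missing could look like the sketch below: an exact paired sign-flip permutation test on per-seed scores of two methods. The paper reports no per-seed results, so the scores in the example are invented purely for illustration, and the function name is hypothetical.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired scores.

    Returns (mean difference, p-value). Under the null hypothesis the sign of
    each paired difference is arbitrary, so all 2^n sign patterns are
    enumerated. Practical only for small n (a handful of seeds).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point ties.
        if flipped >= observed - 1e-12:
            extreme += 1
    return mean(diffs), extreme / total


# Invented per-seed scores for two methods (NOT values from the paper).
method_a = [0.81, 0.79, 0.83, 0.80, 0.82]
method_b = [0.78, 0.78, 0.79, 0.77, 0.80]
diff, p_value = paired_sign_flip_test(method_a, method_b)
print(f"mean difference = {diff:.3f}, p = {p_value:.4f}")
```

With five seeds there are only 32 sign patterns, so the smallest attainable two-sided p-value is 2/32 = 0.0625. A convincing comparison at the 0.05 level therefore needs at least six paired runs.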
Reproducibility is severely limited. The main text gives no training hyperparameters (learning rate, batch size, epochs), noting only that 'implementation details are presented in Appendix A.4.' The endogenous text generation relies on 'specifically designed prompts' (Section 3.1) relegated to Appendix G, and the LLM outputs are non-deterministic even with temperature control. The datasets require custom web crawling with manual filtering steps ('only factual content is retained') that cannot be exactly replicated. While the code is promised at a GitHub URL, the hardware requirements, training time, and random seeds are omitted. The 'Reproducibility statement' claims 'all experimental results can be reproduced' but offers no protocol or seed information to support this assertion.
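One lightweight way to close part of this gap is to publish a run manifest with every reported number. The sketch below records the seed, the hyperparameters, and a hash of the prompt template. All field names and the placeholder values are assumptions, not settings from the paper, and a real pipeline would also need to seed its numerical and deep learning libraries.

```python
import hashlib
import json
import random


def build_manifest(seed, hyperparams, prompt_template):
    """Seed Python's RNG and return a dictionary describing the run.

    The prompt is stored as a SHA-256 digest so that any change to the
    text-generation prompt is detectable without publishing it inline.
    """
    random.seed(seed)  # only the stdlib RNG; other libraries need their own seeding
    digest = hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()
    return {
        "seed": seed,
        "hyperparams": dict(sorted(hyperparams.items())),
        "prompt_sha256": digest,
    }


def manifest_json(manifest):
    """Serialize the manifest with sorted keys so identical runs match byte for byte."""
    return json.dumps(manifest, sort_keys=True, indent=2)


# Placeholder values for illustration only (not taken from the paper).
manifest = build_manifest(
    seed=0,
    hyperparams={"learning_rate": 1e-3, "batch_size": 32, "epochs": 10},
    prompt_template="Describe the mean, extrema, and trend of this patch.",
)
print(manifest_json(manifest))
```

Storing one such file per seed alongside the results would let readers verify that a rerun used the same configuration and prompt.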
Time series anomaly detection plays a critical role in many dynamic systems. Despite its importance, previous approaches have primarily relied on unimodal numerical data, overlooking the importance of complementary information from other modalities. In this paper, we propose a novel multimodal time series anomaly detection model (MindTS) that focuses on addressing two key challenges: (1) how to achieve semantically consistent alignment across heterogeneous multimodal data, and (2) how to filter out redundant modality information to enhance cross-modal interaction effectively. To address the first challenge, we propose Fine-grained Time-text Semantic Alignment. It integrates exogenous and endogenous text information through cross-view text fusion and a multimodal alignment mechanism, achieving semantically consistent alignment between time and text modalities. For the second challenge, we introduce Content Condenser Reconstruction, which filters redundant information within the aligned text modality and performs cross-modal reconstruction to enable interaction. Extensive experiments on six real-world multimodal datasets demonstrate that the proposed MindTS achieves competitive or superior results compared to existing methods. The code is available at: https://github.com/decisionintelligence/MindTS.