Explainable Semantic Textual Similarity via Dissimilar Span Detection
The paper introduces Dissimilar Span Detection (DSD), a new task aimed at explaining Semantic Textual Similarity (STS) scores by identifying specific text spans that differ in meaning between sentence pairs. To enable this research, the authors release the Span Similarity Dataset (SSD), containing 1,000 semi-automatically annotated samples validated by human annotators. They evaluate a broad range of approaches—including LIME, SHAP, proprietary LLMs, and supervised token classifiers—and find that while LLMs achieve the highest performance, the task remains challenging even for state-of-the-art models, with potential applications in paraphrase detection and fact-checking.
The paper presents a well-motivated contribution to explainable NLP, introducing a novel task (DSD) and a carefully constructed dataset (SSD) with substantial inter-annotator agreement ($\kappa=0.87$ for span labels). The experimental design is comprehensive, comparing model-agnostic explainers, embedding-based methods, and large language models. However, overall performance remains modest: even the best LLM (Claude 3.5 Sonnet) achieves an F1-Global of only 0.750, indicating that the task is far from solved. While the downstream paraphrase detection experiment shows promise (an accuracy improvement of up to 8 points), the authors candidly note that current methods remain unsuitable for real-world deployment.
The dataset construction methodology is rigorous and transparent, employing a semi-automated pipeline combining GPT-3.5-Turbo with human verification to achieve high annotation quality efficiently. The proposed Embedding-DSD method offers a practical trade-off between explainability and performance, outperforming both LIME and SHAP baselines while being orders of magnitude faster (11 minutes vs. 1982 minutes for LIME on the evaluation set). The downstream evaluation on PAWS-Wiki Labeled provides concrete evidence of utility, demonstrating that incorporating DSD can improve paraphrase detection accuracy without requiring model retraining.
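The review does not reproduce the Embedding-DSD algorithm, so the following is only a minimal sketch of one generic occlusion-style idea, not the authors' method: drop each token in turn and flag it when its removal raises the similarity to the other sentence by more than a threshold. The bag-of-words `embed` function is a toy stand-in for a real sentence encoder, and the names `dissimilar_tokens` and `threshold` are hypothetical.

```python
import math
from collections import Counter

def embed(tokens):
    """Toy bag-of-words 'embedding' (assumed stand-in for a real sentence encoder)."""
    return Counter(t.lower() for t in tokens)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def dissimilar_tokens(a, b, threshold=0.01):
    """Flag tokens of `a` whose removal raises similarity to `b` by more than `threshold`."""
    base = cosine(embed(a), embed(b))
    flagged = []
    for i, tok in enumerate(a):
        occluded = a[:i] + a[i + 1:]  # sentence `a` with token i removed
        gain = cosine(embed(occluded), embed(b)) - base
        if gain > threshold:
            flagged.append(tok)
    return flagged

a = "the cat sat on the mat".split()
b = "the dog sat on the mat".split()
print(dissimilar_tokens(a, b))  # ['cat']
```

With a real encoder the similarity gains would be much smaller and noisier than in this toy case, which is consistent with the small per-model thresholds the review mentions.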
The evaluation on SemEval-2016 Task 2 is explicitly acknowledged as a lower bound because the dataset only marks 'opposite' (OPPO) spans as dissimilar, meaning models are penalized for correctly identifying other valid dissimilarities not labeled in the gold standard. This creates an unfair comparison where the ground truth itself is incomplete for the task definition. Additionally, the task relies on the concept of 'common semantic function' between spans, but the paper provides limited operational criteria beyond the examples shown, potentially limiting reproducibility of the annotation scheme. The authors also acknowledge that embedding-based models show a 'general lack of sensitivity... towards localized textual changes,' suggesting the core STS models underlying many approaches may be fundamentally ill-suited for fine-grained dissimilarity detection.
The evidence supports the central claim that DSD is a difficult task, with even top-tier LLMs struggling to exceed 0.75 F1 on a simplified sentence-level benchmark. The comparison between explanation methods is internally consistent, showing Embedding-DSD (F1-Global 0.547 with text-embedding-004) outperforming the SHAP (0.366) and LIME (0.463) variants. However, the comparison across model scales is not normalized for computational cost or carbon footprint: comparing 22.7M-parameter Sentence Transformers against billion-parameter proprietary APIs without discussing inference cost or latency trade-offs makes it difficult to assess practical viability. Furthermore, the No-DSD baseline achieves 0.429 F1-Global simply by predicting no differences, indicating significant class imbalance that complicates interpretation of absolute scores.
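To see why a predictor that never flags a difference can still score well above zero, consider the sketch below. It assumes, purely for illustration, that the score is a macro-averaged F1 over two token classes; the review does not define F1-Global, and the 20% share of dissimilar tokens is a made-up figure, not the SSD label distribution.

```python
def f1(gold, pred, positive):
    """F1 score for one class, treating `positive` as the target label."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores (0 = similar, 1 = dissimilar)."""
    return (f1(gold, pred, 0) + f1(gold, pred, 1)) / 2

gold = [1] * 20 + [0] * 80  # hypothetical: 20% of tokens are dissimilar
pred = [0] * 100            # baseline that never flags a difference
print(round(macro_f1(gold, pred), 3))  # 0.444
```

Under these assumptions the trivial baseline earns all of its score from the majority class, so absolute scores are best read relative to that floor.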
The paper demonstrates strong open-science practices: the source code and dataset are publicly released under permissive licenses (GPL v3.0 and CC BY-SA 4.0) with a provided URL. Experimental hyperparameters for fine-tuning are clearly stated (learning rate $\eta=5\cdot10^{-5}$, weight decay $5\cdot10^{-3}$, batch size 8) alongside the 5-fold cross-validation protocol. However, exact reproduction of LLM results may be challenging due to reliance on proprietary APIs (OpenAI, Anthropic) subject to version drift and rate limiting. Additionally, the dissimilarity thresholds for unsupervised methods appear to be tuned per-model (ranging from 0.001 to 0.030), and while Appendix C provides pseudocode for Embedding-DSD, the exact threshold selection criteria are not fully automated or described.
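The reported protocol can be sketched as a small configuration plus a 5-fold split over the 1,000 SSD samples. Only the learning rate, weight decay, batch size, and fold count come from the review; the seed, the shuffling strategy, and the helper name `k_fold_indices` are assumptions for illustration.

```python
import random

# Values quoted in the review; everything else here is an illustrative assumption.
CONFIG = {"learning_rate": 5e-5, "weight_decay": 5e-3, "batch_size": 8, "n_folds": 5}

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # hypothetical seed for reproducibility
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

splits = list(k_fold_indices(1000, CONFIG["n_folds"]))
for train, test in splits:
    print(len(train), len(test))  # 800 200 for each of the five folds
```

Each sample lands in exactly one test fold, so every fine-tuned model is evaluated on data it did not see during training.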
Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.