Explainable Semantic Textual Similarity via Dissimilar Span Detection

cs.CL Diego Miguel Lozano, Daryna Dementieva, Alexander Fraser · Mar 22, 2026
AI Review
Plain-language introduction

The paper introduces Dissimilar Span Detection (DSD), a new task aimed at explaining Semantic Textual Similarity (STS) scores by identifying specific text spans that differ in meaning between sentence pairs. To enable this research, the authors release the Span Similarity Dataset (SSD), containing 1,000 semi-automatically annotated samples validated by human annotators. They evaluate a broad range of approaches—including LIME, SHAP, proprietary LLMs, and supervised token classifiers—and find that while LLMs achieve the highest performance, the task remains challenging even for state-of-the-art models, with potential applications in paraphrase detection and fact-checking.

Critical review
Verdict
Bottom line

The paper presents a well-motivated contribution to explainable NLP, introducing a novel task (DSD) and a carefully constructed dataset (SSD) with substantial inter-annotator agreement ($\kappa=0.87$ for span labels). The experimental design is comprehensive, comparing model-agnostic explainers, embedding-based methods, and large language models. However, overall performance remains modest: even the best LLM (Claude 3.5 Sonnet) achieves an F1-Global of only 0.750, which indicates that the task is difficult for current models. While the downstream paraphrase detection experiment shows promise (an accuracy improvement of up to 8 points), the authors candidly note that current methods remain unsuitable for real-world deployment.

“While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task.”
Lozano et al., Abstract
“LLM Claude 3.5 Sonnet ... 0.750”
Lozano et al., Table 4
What holds up

The dataset construction methodology is rigorous and transparent, employing a semi-automated pipeline that combines GPT-3.5-Turbo with human verification to achieve high annotation quality efficiently. The proposed Embedding-DSD method offers a practical trade-off between explainability and performance. With the same all-mpnet-base-v2 backbone it scores marginally above LIME (F1-Global 0.469 vs. 0.463) while being roughly 175 times faster (about 11 minutes vs. 1982 minutes for LIME on the evaluation set), and it also outperforms the SHAP baselines. The downstream evaluation on PAWS-Wiki Labeled provides concrete evidence of utility, demonstrating that incorporating DSD can improve paraphrase detection accuracy without requiring model retraining.

“annotating 100 samples this way takes between 1 and 2 hours”
Lozano et al., Section 3.2
“Embedding all-mpnet-base-v2 ... 0.469 ... 11.28 ... LIME all-mpnet-base-v2 (0.001) ... 0.463 ... 1981.81”
Lozano et al., Table 4
“the introduction of DSD improves accuracy up to 8 points”
Lozano et al., Section 5
Main concerns

The evaluation on SemEval-2016 Task 2 is explicitly acknowledged as a lower bound because the dataset only marks 'opposite' (OPPO) spans as dissimilar, meaning models are penalized for correctly identifying other valid dissimilarities not labeled in the gold standard. This creates an unfair comparison where the ground truth itself is incomplete for the task definition. Additionally, the task relies on the concept of 'common semantic function' between spans, but the paper provides limited operational criteria beyond the examples shown, potentially limiting reproducibility of the annotation scheme. The authors also acknowledge that embedding-based models show a 'general lack of sensitivity... towards localized textual changes,' suggesting the core STS models underlying many approaches may be fundamentally ill-suited for fine-grained dissimilarity detection.

“The results for the SemEval-2016 data must be taken as a lower bound, since we only considered spans labeled as opposite.”
Lozano et al., Section 5
“methods that rely on embedding models show the general lack of sensitivity of such models towards localized textual changes”
Lozano et al., Section 5
Evidence and comparison

The evidence supports the central claim that DSD is a difficult task: even the strongest LLM tops out at 0.750 F1-Global on a simplified sentence-level benchmark. The comparison between explanation methods is internally consistent, showing Embedding-DSD (F1-Global 0.547 with text-embedding-004) outperforming the SHAP (0.366) and LIME (0.463) variants. However, the comparison across model scales is not normalized for computational cost or carbon footprint. Comparing 22.7M-parameter Sentence Transformers against billion-parameter proprietary APIs without discussing inference cost or latency trade-offs makes it difficult to assess practical viability. Furthermore, the No-DSD baseline achieves 0.429 F1-Global simply by predicting no differences, which indicates significant class imbalance and complicates the interpretation of absolute scores.

“Embedding text-embedding-004 (0.005) ... 0.547 ... SHAP all-MiniLM-L6-v2 (0.010) ... 0.366 ... LIME all-mpnet-base-v2 (0.001) ... 0.463”
Lozano et al., Table 4
“No-DSD ... 0.429 ... Naive-DSD ... 0.311”
Lozano et al., Table 4
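To make the class-imbalance point concrete, the following minimal sketch shows how a predictor that never flags a dissimilar token can still score well above zero when F1 is averaged over both classes. It assumes a macro-averaged, token-level F1; the excerpts quoted here do not show how the paper defines F1-Global, so this is an illustration of the general effect and not a reconstruction of the paper's metric. The 20/80 label split and the names `per_class_f1` and `macro_f1` are made up for the example and are not statistics or code from SSD.

```python
def per_class_f1(gold, pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1(gold, pred, label) for label in labels) / len(labels)


# Toy token labels (hypothetical): 1 = dissimilar, 0 = similar.
gold = [1] * 20 + [0] * 80

# Trivial baseline that never predicts a dissimilar token.
all_negative = [0] * len(gold)

print(f"dissimilar-class F1: {per_class_f1(gold, all_negative, 1):.3f}")
print(f"similar-class F1:    {per_class_f1(gold, all_negative, 0):.3f}")
print(f"macro F1:            {macro_f1(gold, all_negative):.3f}")
```

Under these assumptions the trivial predictor gets zero F1 on the dissimilar class but a high F1 on the majority class, so the averaged score sits far from zero. Reporting the dissimilar-class F1 alongside any averaged figure makes this kind of floor visible.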
Reproducibility

The paper demonstrates strong open-science practices: the source code and dataset are publicly released under permissive licenses (GPL v3.0 and CC BY-SA 4.0) with a provided URL. Experimental hyperparameters for fine-tuning are clearly stated (learning rate $\eta=5\cdot10^{-5}$, weight decay $5\cdot10^{-3}$, batch size 8) alongside the 5-fold cross-validation protocol. However, exact reproduction of LLM results may be challenging due to reliance on proprietary APIs (OpenAI, Anthropic) subject to version drift and rate limiting. Additionally, the dissimilarity thresholds for unsupervised methods appear to be tuned per-model (ranging from 0.001 to 0.030), and while Appendix C provides pseudocode for Embedding-DSD, the exact threshold selection criteria are not fully automated or described.

“The source code and the dataset are publicly available at https://dmlls.github.io/dissimilar-span-detection, licensed under GPL v3.0 and CC BY-SA 4.0”
Lozano et al., Section 1
“We use cross-entropy loss, with a learning rate $\eta=5\cdot10^{-5}$, a weight decay of $5\cdot10^{-3}$, and a batch size of 8”
Lozano et al., Section 4.1
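One way the threshold-selection gap could be closed is a simple sweep on a held-out development split, sketched below. This is not the procedure from the paper or its Appendix C. The function names, the per-span dissimilarity scores, and the gold labels are all hypothetical, and the candidate grid only borrows the 0.001 to 0.030 range mentioned above.

```python
def binary_f1(gold, pred):
    """F1 for the positive (dissimilar) class over boolean labels."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_threshold(scores, gold, candidates):
    """Return (threshold, F1) for the best candidate; ties keep the earliest one."""
    best_threshold, best_f1 = None, -1.0
    for threshold in candidates:
        # A span is flagged as dissimilar when its score reaches the threshold.
        pred = [score >= threshold for score in scores]
        f1 = binary_f1(gold, pred)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


# Hypothetical development-split data: one dissimilarity score per span.
dev_scores = [0.000, 0.002, 0.004, 0.012, 0.020, 0.031]
dev_gold = [False, False, False, True, True, True]
candidates = [0.001, 0.005, 0.010, 0.030]

threshold, f1 = select_threshold(dev_scores, dev_gold, candidates)
print(f"selected threshold={threshold}, dev F1={f1:.3f}")
```

Publishing the candidate grid, the split used for selection, and the tie-breaking rule would let others reproduce the per-model thresholds. Selecting the threshold on the evaluation set itself would inflate the reported scores, so the split matters as much as the sweep.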
Abstract

Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.
