Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
91 papers in cs.CL
Trending mixes fresh papers with community signal.
cs.LG cs.AI cs.CL Guhao Feng, Shengjie Luo, Kai Hua et al. · Apr 7, 2026

This paper addresses the static nature of Large Language Models that prevents dynamic adaptation to streaming contexts. The authors introduce In-Place Test-Time Training, which repurposes existing MLP down-projection matrices as “fast weights” that update during inference via a Next-Token Prediction (NTP)-aligned objective. Unlike prior TTT methods that require architectural changes, this approach enables “drop-in” enhancement of pretrained models without retraining from scratch.

The static "train then deploy" paradigm fundamentally limits Large Language Models (LLMs) from dynamically adapting their weights in response to continuous streams of new information inherent in real-world tasks. Test-Time Training (TTT) offers a compelling alternative by updating a subset of model parameters (fast weights) at inference time, yet its potential in the current LLM ecosystem is hindered by critical barriers including architectural incompatibility, computational inefficiency and misaligned fast weight objectives for language modeling. In this work, we introduce In-Place Test-Time Training (In-Place TTT), a framework that seamlessly endows LLMs with Test-Time Training ability. In-Place TTT treats the final projection matrix of the ubiquitous MLP blocks as its adaptable fast weights, enabling a "drop-in" enhancement for LLMs without costly retraining from scratch. Furthermore, we replace TTT's generic reconstruction objective with a tailored, theoretically-grounded objective explicitly aligned with the Next-Token-Prediction task governing autoregressive language modeling. This principled objective, combined with an efficient chunk-wise update mechanism, results in a highly scalable algorithm compatible with context parallelism. Extensive experiments validate our framework's effectiveness: as an in-place enhancement, it enables a 4B-parameter model to achieve superior performance on tasks with contexts up to 128k, and when pretrained from scratch, it consistently outperforms competitive TTT-related approaches. Ablation study results further provide deeper insights on our design choices. Collectively, our results establish In-Place TTT as a promising step towards a paradigm of continual learning in LLMs.
cs.AI cs.CL Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan et al. · Mar 31, 2026

GISTBench evaluates whether LLMs can accurately extract user interests from behavioral interaction histories in recommendation systems. Unlike traditional benchmarks that optimize for item prediction accuracy, it verifies if predicted interests are actually grounded in engagement signals using two novel metrics: Interest Groundedness ($IG$) and Interest Specificity ($IS$). The authors find that current LLMs struggle primarily with recall—discovering all verifiable interests—rather than hallucination, revealing critical bottlenecks in evidence counting across heterogeneous signal types.

We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
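To make the precision and recall decomposition concrete, here is a minimal sketch that treats Interest Groundedness as ordinary set precision and recall. That reading is an assumption: the abstract does not give the benchmark's exact formulas, and the function name `groundedness` and the example interests are hypothetical.

```python
def groundedness(predicted, verified):
    """Set-based precision and recall of predicted interests.

    predicted: interests the model proposed for a user.
    verified: interests that the engagement history actually supports.
    NOTE: plain set precision/recall is an assumed stand-in for the
    benchmark's own Interest Groundedness definition.
    """
    predicted, verified = set(predicted), set(verified)
    grounded = predicted & verified
    precision = len(grounded) / len(predicted) if predicted else 0.0
    recall = len(grounded) / len(verified) if verified else 0.0
    return precision, recall


# Hypothetical user: nothing is hallucinated, but two interests are missed.
precision, recall = groundedness(
    predicted={"cooking", "football"},
    verified={"cooking", "football", "gardening", "chess"},
)
print(precision, recall)
```

A profile like this one, with high precision and low recall, penalizes missing coverage rather than hallucinated categories, which is why the two components are reported separately.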
cs.AI cs.CL v2 Alex L. Zhang, Tim Kraska, Omar Khattab · Dec 31, 2025

Recursive Language Models (RLMs) tackle the long-context problem by treating prompts as external environment variables that an LLM can programmatically manipulate through a REPL. Instead of feeding long prompts directly into the neural network, RLMs use symbolic code execution to decompose, filter, and recursively invoke sub-models over prompt snippets. This allows processing inputs up to 10M+ tokens—two orders of magnitude beyond typical context windows—while maintaining strong performance on complex aggregation tasks.

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context scaffolds across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first natively recursive language model. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.
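The recursive idea in the abstract above can be sketched in a few lines. This is a deliberate simplification and not the authors' system: a real RLM lets the model write code to decide how to examine and split the prompt, whereas this sketch hard-codes a binary split and a sum. The names `recursive_answer` and `toy_model` are hypothetical, and `toy_model` is a stand-in for an LLM call with a small context window.

```python
def recursive_answer(prompt, query, model, window=50):
    """Answer `query` over `prompt`, recursing when the prompt is too long.

    `model(text, query)` stands in for an LLM call that can only handle
    `window` characters of text at a time. The fixed halving strategy and
    the additive aggregation are assumptions made for this sketch.
    """
    if len(prompt) <= window:
        return model(prompt, query)
    mid = len(prompt) // 2
    left = recursive_answer(prompt[:mid], query, model, window)
    right = recursive_answer(prompt[mid:], query, model, window)
    return left + right  # aggregate the sub-answers


def toy_model(text, query):
    """Toy 'model': counts occurrences of a single character."""
    assert len(text) <= 50, "the toy model never sees more than its window"
    return text.count(query)


long_prompt = "ax" * 200  # 400 characters, eight times the window
print(recursive_answer(long_prompt, "x", toy_model))
```

Counting a single character is chosen because it aggregates exactly under splitting; tasks whose evidence can straddle a split boundary would need a smarter decomposition.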
cs.LG cs.CL Xixi Wu, Qianguo Sun, Ruiyang Zhang et al. · Mar 23, 2026

This paper tackles the challenge of scaling reinforcement learning for long-horizon tool-using agents, where LLMs must orchestrate dozens of tool calls to satisfy multifaceted constraints. The authors propose STAR, a post-training pipeline that decomposes the RL design space across five axes—reward shaping, model scaling, data composition, algorithm selection, and environmental stability—to derive a practical, scale-aware recipe for training.

Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.
cs.CL Zehua Pei, Hui-Ling Zhen, Weizhe Lin et al. · Mar 23, 2026

Diffusion Language Models (DLMs) train with a static single-step masked prediction objective but infer via multi-step progressive denoising, creating a train-inference mismatch that compounds errors. MemDLM bridges this gap through Bi-level Optimization: an inner loop updates fast weights (Parametric Memory) to capture local trajectory experience, while an outer loop conditions the base model on this memory. The approach yields faster convergence, lower exposure bias, and substantial gains on long-context needle-in-a-haystack tasks, with an optional inference-time adaptation that acts as an emergent in-weight retrieval mechanism.

Diffusion Language Models (DLMs) offer attractive advantages over Auto-Regressive (AR) models, such as full-attention parallel decoding and flexible generation. However, they suffer from a notable train-inference mismatch: DLMs are trained with a static, single-step masked prediction objective, but deployed through a multi-step progressive denoising trajectory. We propose MemDLM (Memory-Enhanced DLM), which narrows this gap by embedding a simulated denoising process into training via Bi-level Optimization. An inner loop updates a set of fast weights, forming a Parametric Memory that captures the local trajectory experience of each sample, while an outer loop updates the base model conditioned on this memory. By offloading memorization pressure from token representations to parameters, MemDLM yields faster convergence and lower training loss. Moreover, the inner loop can be re-enabled at inference time as an adaptation step, yielding additional gains on long-context understanding. We find that, when activated at inference time, this Parametric Memory acts as an emergent in-weight retrieval mechanism, helping MemDLM further reduce token-level attention bottlenecks on challenging Needle-in-a-Haystack retrieval tasks. Code: https://github.com/JarvisPei/MemDLM.
cs.CL Edward Phillips, Fredrik K. Gustafsson, Sean Wu et al. · Mar 22, 2026

Selective prediction systems in LLMs abstain from answering uncertain questions to mitigate hallucination harms in high-stakes domains. This paper identifies a critical failure mode of entropy-based uncertainty quantification: the 'confidently wrong' regime where models produce low-entropy hallucinations. The authors propose combining entropy signals with correctness probes using logistic regression, and advocate for deployment-facing metrics—E-AURC and TCE—over AUROC to ensure systems can reliably operate at strict safety thresholds.

Selective prediction systems can mitigate harms resulting from language model hallucinations by abstaining from answering in high-risk cases. Uncertainty quantification techniques are often employed to identify such cases, but are rarely evaluated in the context of the wider selective prediction policy and its ability to operate at low target error rates. We identify a model-dependent failure mode of entropy-based uncertainty methods that leads to unreliable abstention behaviour, and address it by combining entropy scores with a correctness probe signal. We find that across three QA benchmarks (TriviaQA, BioASQ, MedicalQA) and four model families, the combined score generally improves both the risk-coverage trade-off and calibration performance relative to entropy-only baselines. Our results highlight the importance of deployment-facing evaluation of uncertainty methods, using metrics that directly reflect whether a system can be trusted to operate at a stated risk level.
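The risk-coverage trade-off mentioned in the abstract above can be illustrated with a small sketch: keep only the most confident fraction of answers and measure the error rate on what was kept. All scores below are made up for illustration, and the "combined" scores are simply assumed; the sketch does not implement the paper's combination of entropy and probe signals.

```python
def risk_at_coverage(confidences, correct, coverage):
    """Error rate on the most confident `coverage` fraction of questions.

    The system answers only the top-ranked questions and abstains on the rest.
    """
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    kept = ranked[: max(1, round(coverage * len(ranked)))]
    errors = sum(1 for _, is_correct in kept if not is_correct)
    return errors / len(kept)


# Hypothetical data: the first answer is wrong but has very low entropy,
# so an entropy-only confidence score ranks it first.
correct = [False, True, True, True, False, False]
entropy_confidence = [0.99, 0.95, 0.90, 0.60, 0.55, 0.30]
# Assumed combined score in which a correctness signal demotes that answer.
combined_confidence = [0.40, 0.95, 0.90, 0.60, 0.55, 0.30]

risk_entropy = risk_at_coverage(entropy_confidence, correct, coverage=0.5)
risk_combined = risk_at_coverage(combined_confidence, correct, coverage=0.5)
print(risk_entropy, risk_combined)
```

A single confidently wrong answer is enough to put a floor under the achievable risk at low coverage, which is the regime a strict error target cares about.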
cs.CL Haroun Elleuch, Ryan Whetten, Salima Mdhaffar et al. · Mar 23, 2026

Ara-BEST-RQ introduces dedicated self-supervised speech models for Arabic dialects. The authors curate 5,640 hours of Creative Commons Arabic speech covering 20 dialects and train Conformer-based BEST-RQ models up to 600M parameters. Their 300M model achieves state-of-the-art dialect identification performance using fewer parameters than competing Whisper-based systems. This work helps close the gap for underrepresented Arabic dialects in speech technology.

We present Ara-BEST-RQ, a family of self-supervised learning (SSL) models specifically designed for multi-dialectal Arabic speech processing. Leveraging 5,640 hours of crawled Creative Commons speech and combining it with publicly available datasets, we pre-train conformer-based BEST-RQ models up to 600M parameters. Our models are evaluated on dialect identification (DID) and automatic speech recognition (ASR) tasks, achieving state-of-the-art performance on the former while using fewer parameters than competing models. We demonstrate that family-targeted pre-training on Arabic dialects significantly improves downstream performance compared to multilingual or monolingual models trained on non-Arabic data. All models, code, and pre-processed datasets will be publicly released to support reproducibility and further research in Arabic speech technologies.
cs.CL cs.SD Abner Hernandez, Eunjung Yeo, Kwanghee Choi et al. · Mar 23, 2026

Cross-lingual dysarthria detection in Parkinson's disease is hampered by language-dependent structure in self-supervised speech representations that confounds pathology classification. This paper proposes a centroid-based 'language shift' (LS) that aligns source-language embeddings toward target-language distributions using only healthy control speech, enabling zero-shot transfer without model retraining. The approach addresses the critical data scarcity in clinical speech applications while aiming to disentangle linguistic variation from motor impairment markers.

The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson's disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.
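One plausible reading of the centroid-based shift described above is a simple translation: subtract the source-language centroid and add the target-language centroid, both estimated from healthy-control speech only. The sketch below implements that reading with plain Python lists. The exact formula is an assumption, since the abstract does not state it, and the function names and two-dimensional embeddings are hypothetical.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def language_shift(embedding, source_controls, target_controls):
    """Translate a source-language embedding toward the target language.

    Assumed form: x' = x - mu_source + mu_target, with both centroids
    computed from healthy-control embeddings.
    """
    mu_source = centroid(source_controls)
    mu_target = centroid(target_controls)
    return [x - s + t for x, s, t in zip(embedding, mu_source, mu_target)]


# Toy 2-D embeddings standing in for self-supervised speech representations.
source_controls = [[1.0, 0.0], [3.0, 0.0]]   # centroid [2.0, 0.0]
target_controls = [[0.0, 4.0], [0.0, 6.0]]   # centroid [0.0, 5.0]
patient_embedding = [2.5, 1.0]

shifted = language_shift(patient_embedding, source_controls, target_controls)
print(shifted)
```

Because the offset is the same for every source-language embedding, the shift preserves the differences between speakers within a language, which is the property a downstream pathology classifier would rely on.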
eess.AS cs.CL cs.SD Xi Xuan, Wenxin Zhang, Zhiyu Li et al. · Mar 23, 2026

This paper tackles the problem of speaker traits entangling with synthesis source information in speech deepfake source verification. The authors propose a Speaker-Disentangled Metric Learning (SDML) framework that combines Chebyshev polynomial approximations for gradient stability with Riemannian geometry (hyperbolic space) to separate speaker identity from source generator artifacts. Evaluated on four new cross-protocols using the MLAAD benchmark, the method aims to prevent models from relying on speaker shortcuts when verifying synthetic speech origins.

Speech deepfake source verification systems aim to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker-disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages Chebyshev polynomials to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on the MLAAD benchmark, evaluated under four newly proposed protocols designed for source-speaker disentanglement scenarios, demonstrate the effectiveness of the SDML framework. The code, evaluation protocols and demo website are available at https://github.com/xxuan-acoustics/RiemannSD-Net.
cs.CL cs.LG Chi Zhang, Xixi Hu, Bo Liu et al. · Mar 23, 2026

Parallel decoding promises faster text generation than autoregressive models but historically sacrifices quality due to simplified conditional independence assumptions. This paper introduces Gumbel Distillation, which leverages the Gumbel-Max trick to create a deterministic mapping from latent noise to teacher outputs, effectively providing the parallel student a blueprint for joint token distributions. By conditioning on Gumbel noise rather than relying on naive factorization, the method narrows the quality-efficiency gap, delivering substantial improvements across masked diffusion and multi-token prediction architectures.

The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and 10.5% in generative perplexity over MDLM trained on OpenWebText dataset. Code available at https://github.com/hxixixh/gumbel-distill.
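The Gumbel-Max trick that the abstract above builds on is standard and easy to show in isolation: adding independent Gumbel noise to the logits and taking the argmax yields a sample from the softmax distribution, and for a fixed noise vector the chosen token is fully determined. The sketch below shows only that trick, not the distillation method itself; the function names and logits are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel variate via -log(-log(u)), u in (0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, which would break log()
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max(logits, noise):
    """Index of the largest noise-perturbed logit."""
    scores = [logit + g for logit, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
logits = [2.0, 1.0, 0.0]

# A fixed noise vector maps to exactly one token.
noise = [sample_gumbel(rng) for _ in logits]
token = gumbel_max(logits, noise)
print(token, gumbel_max(logits, noise) == token)

# Fresh noise each time gives samples distributed like softmax(logits).
draws = [gumbel_max(logits, [sample_gumbel(rng) for _ in logits]) for _ in range(20000)]
print(draws.count(0) / len(draws))
```

The determinism given the noise is what makes the mapping from noise to teacher tokens well defined: the randomness lives entirely in the noise vector.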
cs.CL cs.DB Lars Vogt · Mar 23, 2026

The paper tackles the 'semantic parsing burden'—the effort required to translate natural language into structured RDF/OWL representations for knowledge graphs. It proposes the Semantic Ladder, a five-level framework ($L_1$ to $L_5$) enabling progressive formalization from raw text snippets to higher-order logic. By introducing Rosetta Statements as semantic anchors and emphasizing modular semantic units, the work aims to lower barriers to knowledge graph construction while maintaining semantic continuity.

Semantic data and knowledge infrastructures must reconcile two fundamentally different forms of representation: natural language, in which most knowledge is created and communicated, and formal semantic models, which enable machine-actionable integration, interoperability, and reasoning. Bridging this gap remains a central challenge, particularly when full semantic formalization is required at the point of data entry. Here, we introduce the Semantic Ladder, an architectural framework that enables the progressive formalization of data and knowledge. Building on the concept of modular semantic units as identifiable carriers of meaning, the framework organizes representations across levels of increasing semantic explicitness, ranging from natural language text snippets to ontology-based and higher-order logical models. Transformations between levels support semantic enrichment, statement structuring, and logical modelling while preserving semantic continuity and traceability. This approach enables the incremental construction of semantic knowledge spaces, reduces the semantic parsing burden, and supports the integration of heterogeneous representations, including natural language, structured semantic models, and vector-based embeddings. The Semantic Ladder thereby provides a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures.
cs.CL Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish · Mar 23, 2026

As consumer-grade EEG headphones enter the market, a critical question emerges: can language models adapt to your specific neural signature? This paper demonstrates that frozen LLMs already contain person-specific linear directions in their activation spaces that predict individual brain activity during reading, achieving a ninefold improvement over population averages. The findings suggest that deep neural networks encode stable, individual cognitive fingerprints that could enable future brain-computer interfaces to personalize AI to the user wearing the headset.

Consumer-grade EEG is entering everyday devices, from earbuds to headbands, raising the question of whether language models can be adapted to individual neural responses. We test this by asking whether frozen LLM representations encode person-specific EEG signals, directions in activation space that predict one person's brain activity but not another's. Using word-level EEG from 30 participants reading naturalistic sentences (ZuCo corpus), we train a separate linear probe for each person, mapping hidden states from a frozen Qwen 2.5 7B to that individual's EEG power. Person-specific probes outperform a single population probe on every EEG feature tested; for high-gamma power, the person-specific probe achieves rho = 0.183, a ninefold improvement over the population probe (rho = 0.020, p < 10^-4). A negative control, fixation count, shows no person-specific advantage (p = 0.360); fixation count reflects word length and frequency rather than individual cognition. The individual directions are temporally stable (split-half cosine = 0.824), non-transferable across people (self rho = 0.369 vs. other rho = 0.143, p < 10^-19), and distinct from the shared population signal: person-specific probes retain predictive power after the population component is removed. The person-specific signal concentrates in the model's deep layers, rising consistently with depth and peaking at Layer 24 of 28. The results are consistent across architectures (LLaMA 3.1 8B) and survive word-level confound controls. Frozen language models contain stable, person-specific neural directions in their deep layers, providing a geometric foundation for EEG-driven personalization.
cs.CL cs.CY Bros Victor, Dufraisse Evan, Popescu Adrian et al. · Mar 23, 2026

This paper analyzes temporal dynamics in Swiss digital news across French, German, and Italian language regions using a triangulated methodology that combines quantitative NLP with qualitative interpretation. The authors process 1.7 million articles to study how different event types—Brexit, Swiss Wolf, Christmas, and the British Royal Family—are covered across linguistic boundaries, introducing domestication profiles and proximity salience ratios to quantify cultural proximity effects.

Analyzing news coverage in multilingual societies can offer valuable insights into the dynamics of public discourse and the development of collective narratives, yet comprehensive studies that account for linguistic and cultural diversity within national media ecosystems remain limited, particularly in complex contexts such as Switzerland. This paper studies temporal trends in Swiss digital media across the country's three main linguistic regions, French, German, and Italian, using a triangulated methodology that combines quantitative analyses with qualitative insights. We collected and processed over 1.7 million news articles, applying lexical metrics, named entity recognition and Wikidata-based linking, targeted sentiment analysis, and consensus-based change-point detection. To enable principled cross-language comparisons and to connect to theories of domestication and cultural proximity, we derive domestication profiles together with a proximity salience ratio. Our analysis spans thematic, recurrent, and singular events. By integrating quantitative data with qualitative interpretation, we provide new insights into the dynamics of Swiss digital media and demonstrate the usefulness of triangulation in media studies. The findings reveal distinct temporal patterns and highlight how linguistic and cultural contexts influence reporting. Our approach offers a framework applicable to other multilingual or culturally diverse media environments, contributing to a deeper understanding of how news is shaped by linguistic and cultural factors.
cs.CL cs.MA Mohamed Sobhi Jabal, Jikai Zhang et al. · Mar 23, 2026

This paper tackles the challenge of automating BT-RADS (Brain Tumor Reporting and Data System) classification for post-treatment glioma MRI surveillance. BT-RADS requires integrating complex information: volumetric tumor changes, medication effects (steroids, bevacizumab), and radiation timing. The authors propose an end-to-end pipeline combining CNN-based tumor segmentation with a multi-agent LLM system to extract clinical variables from unstructured notes and apply algorithmic scoring logic. This matters because manual BT-RADS scoring is error-prone, with prior studies showing substantial inter-reader variability and inconsistent application of clinical context.

The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
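The headline accuracy figures above can be checked with a short calculation. The sketch below recomputes both proportions and a 95% Wilson score interval for each. Using the Wilson method is an assumption: the abstract does not say how its intervals were computed, although this method gives the same bounds to one decimal place.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


n_exams = 492  # examinations meeting inclusion criteria
for label, n_correct in [("system", 374), ("initial clinical assessment", 283)]:
    low, high = wilson_interval(n_correct, n_exams)
    print(f"{label}: {n_correct / n_exams:.1%} (95% CI {low:.1%} to {high:.1%})")

# Difference in accuracy, in percentage points.
print(f"difference: {(374 - 283) / n_exams:.1%}")
```

The difference of 91 examinations out of 492 corresponds to the reported gain of 18.5 percentage points.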
cs.CL cs.CV Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta et al. · Mar 22, 2026

BanglaVerse introduces a culturally grounded benchmark evaluating vision-language models on Bengali culture across standard Bangla, four historically linked languages, and five regional dialects. Built from 1,152 manually curated images expanded to ~32.3K artifacts, the work reveals that standard Bangla evaluation substantially overestimates model capabilities compared to dialectal settings. The core finding—that missing cultural knowledge, not visual grounding alone, is the primary bottleneck—challenges conventional multimodal evaluation practices for underrepresented languages.

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.
cs.CL Max Upravitelev, Veronika Solopova, Jing Yang et al. · Mar 23, 2026

Narrative similarity is inherently interpretive—different valid readings can yield divergent judgments, challenging benchmarks that encode single ground truths. This paper proposes embracing multiperspectivity by ensembling 31 LLM personas, ranging from literary critics to lay characters, to predict which of two stories is more similar to an anchor. The approach leverages Condorcet Jury Theorem-like dynamics to improve accuracy, achieving 0.705 on SemEval-2026 Task 4 while revealing that diverse practitioner perspectives yield better ensemble gains despite lower individual performance.

Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
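The Condorcet Jury Theorem referenced above has a simple closed form in its idealized version: if each of n voters is independently correct with probability p, the probability that a strict majority is correct is a binomial tail sum. The sketch below computes it alongside a plain majority vote. The value p = 0.6 is illustrative and not taken from the paper, and full independence is an idealization; the abstract itself speaks of weakened independence and correlated errors.

```python
from collections import Counter
from math import comb


def majority_accuracy(n_voters, p):
    """P(strict majority correct) for n independent voters of accuracy p."""
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


def majority_vote(votes):
    """Most common label among the votes (ties broken by first seen)."""
    return Counter(votes).most_common(1)[0][0]


# Illustrative per-persona accuracy; not a figure from the paper.
for n in (1, 3, 11, 31):
    print(n, round(majority_accuracy(n, 0.6), 3))

print(majority_vote(["story A", "story B", "story A"]))
```

When errors are correlated, the gain from adding voters is smaller than this formula suggests, which is consistent with the abstract's observation that personas with less correlated errors yield larger ensemble gains.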
cs.CL Tae-Eun Song · Mar 23, 2026

This paper introduces Cross-Context Verification (CCV), a black-box method for detecting LLM benchmark contamination by solving the same coding problem $N$ times in isolated sessions and measuring solution diversity. The key insight is that memorized solutions are deterministic while genuine reasoning produces natural variation. The paper pairs this with Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that uses strict information restriction to prevent confirmation bias. As coding benchmarks face credibility crises from solution leakage, this work targets the urgent need to distinguish reasoning from recall in SWE-bench evaluations.

LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Context Verification (CCV), a black-box method that solves the same benchmark problem in N independent sessions and measures solution diversity, combined with the Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that prevents confirmation bias through intentional information restriction across specialized analytical roles. On 9 SWE-bench Verified problems (45 trials, Claude Opus 4.6, temperature 0), CCV achieves perfect separation between contaminated and genuine reasoning (Mann-Whitney U = 0, p ≈ 0.012, r = 1.0). Key findings: (1) contamination is binary: models either recall perfectly or not at all; (2) reasoning absence is a perfect discriminator; (3) 33% of prior contamination labels are false positives; (4) HCCA's independent analysis structure discovers contamination-flaw composite cases that single-analyst approaches miss. A pilot experiment extending HCCA to multi-stage verification (Worker to Verifier to Director) yields a negative result (100% sycophantic confirmation), providing further evidence that information restriction, not structural complexity, is the key mechanism. We release all code and data.
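As a rough illustration of measuring solution diversity across N independent sessions, the sketch below reports the fraction of distinct solutions after whitespace normalization. This particular measure is an assumption: the abstract does not specify how CCV quantifies diversity, and the function name and example solutions are hypothetical.

```python
def solution_diversity(solutions):
    """Fraction of distinct solutions among N independent attempts.

    Returns 1/N when every attempt is identical and 1.0 when all differ.
    Whitespace is normalized so that formatting alone does not count
    as a different solution (a choice made for this sketch).
    """
    normalized = [" ".join(solution.split()) for solution in solutions]
    return len(set(normalized)) / len(normalized)


# Hypothetical outputs from five isolated sessions on the same problem.
recalled = ["return a + b"] * 5
reasoned = [
    "return a + b",
    "return sum([a, b])",
    "total = a + b\nreturn total",
    "return b + a",
    "return operator.add(a, b)",
]

print(solution_diversity(recalled), solution_diversity(reasoned))
```

A token-level or edit-distance similarity would be a finer-grained alternative to exact matching; the distinct-fraction version is used here only because it is the simplest thing that separates the two toy cases.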
cs.CL cs.CY Bros Victor, Barbini Matilde, Gerard Patrick et al. · Mar 23, 2026

This paper investigates how interrogative stances function as markers of voice and power in French-language digital news. Analyzing over 1.2 million articles from 24 outlets (2023–2024) through a mixed-methods pipeline combining LLM pseudo-labeling and qualitative annotation, the authors operationalize pragmatic concepts like answerhood and dialogicity at scale. The study reveals that questions are sparse but structurally significant, predominantly serving framing functions rather than information-seeking, and centering elite actors over diffuse publics.

Interrogatives in news discourse have been examined in linguistics and conversation analysis, but mostly in broadcast interviews and relatively small, often English-language corpora, while large-scale computational studies of news rarely distinguish interrogatives from declaratives or differentiate their functions. This paper brings these strands together through a mixed-methods study of the "Politics of Questions" in contemporary French-language digital news. Using over one million articles published between January 2023 and June 2024, we automatically detect interrogative stances, approximate their functional types, and locate textual answers when present, linking these quantitative measures to a qualitatively annotated subcorpus grounded in semantic and pragmatic theories of questions. Interrogatives are sparse but systematically patterned: they mainly introduce or organize issues, with most remaining cases being information-seeking or echo-like, while explicitly leading or tag questions are rare. Although their density and mix vary across outlets and topics, our heuristic suggests that questions are overwhelmingly taken up within the same article and usually linked to a subsequent answer-like span, most often in the journalist's narrative voice and less often through quoted speech. Interrogative contexts are densely populated with named individuals, organizations, and places, whereas publics and broad social groups are mentioned much less frequently, suggesting that interrogative discourse tends to foreground already prominent actors and places and thus exhibits strong personalization. We show how interrogative stance, textual uptake, and voice can be operationalized at corpus scale, and argue that combining computational methods with pragmatic and sociological perspectives can help account for how questioning practices structure contemporary news discourse.
cs.CL Neeladri Bhuiya, Shib Sankar Dasgupta, Andrew McCallum et al. · Mar 22, 2026

Prompt2Box addresses the limitation that vector embeddings of LLM prompts conflate topical similarity with specificity, making it difficult to distinguish whether a model fails at a broad topic or only at its most constrained variants. The core idea is to embed prompts into a box embedding space where the geometric volume encodes specificity—smaller boxes indicate more constraints—and containment represents entailment relations. This geometric re-framing enables more accurate hierarchical clustering and finer-grained weakness analysis across 17 different language models.

To discover the weaknesses of LLMs, researchers often embed prompts into a vector space and cluster them to extract insightful patterns. However, vector embeddings primarily capture topical similarity. As a result, prompts that share a topic but differ in specificity, and consequently in difficulty, are often represented similarly, making fine-grained weakness analysis difficult. To address this limitation, we propose PROMPT2BOX, which embeds prompts into a box embedding space using a trained encoder. The encoder, trained on existing and synthesized datasets, outputs box embeddings that capture not only semantic similarity but also specificity relations between prompts (e.g., "writing an adventure story" is more specific than "writing a story"). We further develop a novel dimension reduction technique for box embeddings to facilitate dataset visualization and comparison. Our experiments demonstrate that box embeddings consistently capture prompt specificity better than vector baselines. On the downstream task of creating hierarchical clustering trees for 17 LLMs from the UltraFeedback dataset, PROMPT2BOX can identify 8.9% more LLM weaknesses than vector baselines and achieves an approximately 33% stronger correlation between hierarchical depth and instruction specificity.
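To make the geometry concrete, here is a minimal sketch of axis-aligned boxes in which a smaller volume stands for a more specific prompt and containment stands for "is more specific than". It uses hard boxes with hand-picked corners purely for illustration. The `Box` class, its method names, and the coordinates are assumptions, and the trained encoder and any smoothed volume the authors may use are not modeled.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by its lower and upper corners."""
    lower: tuple
    upper: tuple

    def volume(self):
        # Product of side lengths; an inverted side counts as zero.
        return prod(max(u - l, 0.0) for l, u in zip(self.lower, self.upper))

    def contains(self, other):
        # True when `other` lies inside this box along every dimension.
        return all(
            sl <= ol and ou <= su
            for sl, su, ol, ou in zip(
                self.lower, self.upper, other.lower, other.upper
            )
        )


# Hypothetical 2-D boxes for the two example prompts from the abstract.
story = Box(lower=(0.0, 0.0), upper=(4.0, 4.0))      # "writing a story"
adventure = Box(lower=(1.0, 1.0), upper=(2.0, 3.0))  # "writing an adventure story"

more_specific = story.contains(adventure) and adventure.volume() < story.volume()
```

In this toy setup the general prompt's box has volume 16.0 and the specific prompt's box has volume 2.0, and the former contains the latter, so `more_specific` is `True`.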
cs.CL Smitha Muthya Sudheendra, Jaideep Srivastava · Mar 22, 2026

Human annotation for subjective NLP tasks suffers from high inter-annotator disagreement. This paper introduces ReasonAlign, a protocol that exposes annotators to LLM-generated reasoning explanations (but not predicted labels) between two annotation passes. The goal is to test whether reasoning scaffolds improve annotation consistency without the anchoring bias typical of suggestion-based systems.

Human annotation is central to NLP evaluation, yet subjective tasks often exhibit substantial variability across annotators. While large language models (LLMs) can provide structured reasoning to support annotation, their influence on human annotation behavior remains unclear. We introduce ReasonAlign, a reasoning-based annotation scaffold that exposes LLM-generated explanations while withholding predicted labels. We frame this as a controlled study of how reasoning affects human annotation behavior, rather than a full evaluation of annotation accuracy. Using a two-pass protocol inspired by Delphi-style revision, annotators first label instances independently and then revise their decisions after viewing model-generated reasoning. We evaluate the approach on sentiment classification and opinion detection tasks, analyzing changes in inter-annotator agreement and revision behavior. To quantify these effects, we introduce the Annotator Effort Proxy (AEP), a metric capturing the proportion of labels revised after exposure to reasoning. Our results show that exposure to reasoning is associated with increased agreement alongside minimal revision, suggesting that reasoning primarily helps resolve ambiguous cases without inducing widespread changes. These findings provide insight into how reasoning explanations shape annotation consistency and highlight reasoning-based scaffolds as a practical mechanism for supporting human-AI annotation workflows.
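The Annotator Effort Proxy is described as the proportion of labels revised after exposure to reasoning, which can be sketched as a simple comparison of the two annotation passes. The function name, argument names, and example labels below are hypothetical, and the paper may aggregate over annotators or instances differently.

```python
def annotator_effort_proxy(first_pass, second_pass):
    """Share of labels that changed between the two annotation passes.

    Assumption: both passes cover the same instances in the same order.
    """
    if len(first_pass) != len(second_pass):
        raise ValueError("passes must cover the same instances")
    if not first_pass:
        raise ValueError("need at least one label")
    revised = sum(a != b for a, b in zip(first_pass, second_pass))
    return revised / len(first_pass)


# Hypothetical sentiment labels from one annotator (not the paper's data).
before = ["pos", "neg", "neu", "pos", "neg"]
after = ["pos", "neg", "pos", "pos", "neg"]

aep = annotator_effort_proxy(before, after)
```

Here one of five labels changes, so `aep` is 0.2. A low value alongside higher agreement would match the pattern the abstract reports, in which reasoning resolves a few ambiguous cases without widespread revision.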