This paper addresses the static nature of Large Language Models that prevents dynamic adaptation to streaming contexts. The authors introduce In-Place Test-Time Training, which repurposes existing MLP down-projection matrices as “fast weights” that update during inference via a Next-Token Prediction (NTP)-aligned objective. Unlike prior TTT methods that require architectural changes, this approach enables “drop-in” enhancement of pretrained models without retraining from scratch.
GISTBench evaluates whether LLMs can accurately extract user interests from behavioral interaction histories in recommendation systems. Unlike traditional benchmarks that measure item prediction accuracy, it checks whether predicted interests are actually grounded in engagement signals, using two novel metrics: Interest Groundedness ($IG$) and Interest Specificity ($IS$). The authors find that current LLMs struggle primarily with recall (discovering all verifiable interests) rather than with hallucination, revealing critical bottlenecks in evidence counting across heterogeneous signal types.
Recursive Language Models (RLMs) tackle the long-context problem by treating prompts as external environment variables that an LLM can programmatically manipulate through a REPL. Instead of feeding long prompts directly into the neural network, RLMs use symbolic code execution to decompose, filter, and recursively invoke sub-models over prompt snippets. This allows processing inputs up to 10M+ tokens—two orders of magnitude beyond typical context windows—while maintaining strong performance on complex aggregation tasks.
This paper tackles the challenge of scaling reinforcement learning for long-horizon tool-using agents, where LLMs must orchestrate dozens of tool calls to satisfy multifaceted constraints. The authors propose STAR, a post-training pipeline that decomposes the RL design space across five axes—reward shaping, model scaling, data composition, algorithm selection, and environmental stability—to derive a practical, scale-aware recipe for training.
Diffusion Language Models (DLMs) train with a static single-step masked prediction objective but infer via multi-step progressive denoising, creating a train-inference mismatch that compounds errors. MemDLM bridges this gap through Bi-level Optimization: an inner loop updates fast weights (Parametric Memory) to capture local trajectory experience, while an outer loop conditions the base model on this memory. The approach yields faster convergence, lower exposure bias, and substantial gains on long-context needle-in-a-haystack tasks, with an optional inference-time adaptation that acts as an emergent in-weight retrieval mechanism.
Selective prediction systems in LLMs abstain from answering uncertain questions to mitigate hallucination harms in high-stakes domains. This paper identifies a critical failure mode of entropy-based uncertainty quantification: the 'confidently wrong' regime where models produce low-entropy hallucinations. The authors propose combining entropy signals with correctness probes using logistic regression, and advocate for deployment-facing metrics—E-AURC and TCE—over AUROC to ensure systems can reliably operate at strict safety thresholds.
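To make the combination of signals concrete, the following is a minimal sketch of selective prediction with a logistic score over an entropy value and a correctness-probe value. The function names, the hand-set weights, and the toy data are all hypothetical; in the paper the weights would be fitted by logistic regression, and its exact features may differ.

```python
import math
from typing import List, Tuple


def combined_confidence(entropy: float, probe_score: float,
                        w_entropy: float = -1.0, w_probe: float = 2.0,
                        bias: float = 0.0) -> float:
    """Logistic combination of two signals (weights are illustrative, not fitted)."""
    z = w_entropy * entropy + w_probe * probe_score + bias
    return 1.0 / (1.0 + math.exp(-z))


def selective_metrics(confidences: List[float], correct: List[bool],
                      threshold: float) -> Tuple[float, float]:
    """Return (coverage, selective risk) when abstaining below the threshold."""
    answered = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(correct)
    risk = 1.0 - sum(answered) / len(answered) if answered else 0.0
    return coverage, risk


# Toy examples: (entropy, probe score, answer was correct).
# The second item is "confidently wrong": low entropy but a low probe score.
examples = [(0.1, 0.9, True), (0.1, 0.1, False), (1.5, 0.5, False), (0.3, 0.8, True)]
confidences = [combined_confidence(e, p) for e, p, _ in examples]
labels = [ok for _, _, ok in examples]
coverage, risk = selective_metrics(confidences, labels, threshold=0.6)
```

In this toy setup, entropy alone would treat the second item as safe to answer, while the combined score falls below the threshold and the system abstains.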
Ara-BEST-RQ introduces dedicated self-supervised speech models for Arabic dialects. The authors curate 5,640 hours of Creative Commons Arabic speech covering 20 dialects and train Conformer-based BEST-RQ models up to 600M parameters. Their 300M model achieves state-of-the-art dialect identification performance using fewer parameters than competing Whisper-based systems. This work helps close the gap for underrepresented Arabic dialects in speech technology.
Cross-lingual dysarthria detection in Parkinson's disease is hampered by language-dependent structure in self-supervised speech representations that confounds pathology classification. This paper proposes a centroid-based 'language shift' (LS) that aligns source-language embeddings toward target-language distributions using only healthy control speech, enabling zero-shot transfer without model retraining. The approach addresses the critical data scarcity in clinical speech applications while aiming to disentangle linguistic variation from motor impairment markers.
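One plausible reading of a centroid-based shift, sketched under stated assumptions: subtract the centroid of source-language healthy-control embeddings and add the centroid of target-language healthy-control embeddings. The function names and the 2-D toy vectors are hypothetical, and the paper's exact formulation may differ.

```python
from typing import List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def language_shift(embedding: Vector, source_hc: List[Vector],
                   target_hc: List[Vector]) -> Vector:
    """Move an embedding by the offset between healthy-control centroids."""
    src = centroid(source_hc)
    tgt = centroid(target_hc)
    return [x - s + t for x, s, t in zip(embedding, src, tgt)]


# Toy healthy-control embeddings for each language (illustrative values only).
source_hc = [[1.0, 2.0], [3.0, 4.0]]   # centroid is [2.0, 3.0]
target_hc = [[0.0, 0.0], [2.0, 2.0]]   # centroid is [1.0, 1.0]
shifted = language_shift([2.5, 3.5], source_hc, target_hc)
```

Because only healthy-control speech defines the two centroids, the offset is meant to capture language differences without touching pathology labels.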
This paper tackles the problem of speaker traits entangling with synthesis source information in speech deepfake source verification. The authors propose a Speaker-Disentangled Metric Learning (SDML) framework that combines Chebyshev polynomial approximations for gradient stability with Riemannian geometry (hyperbolic space) to separate speaker identity from source generator artifacts. Evaluated on four new cross-protocols using the MLAAD benchmark, the method aims to prevent models from relying on speaker shortcuts when verifying synthetic speech origins.
Parallel decoding promises faster text generation than autoregressive models but historically sacrifices quality due to simplified conditional independence assumptions. This paper introduces Gumbel Distillation, which leverages the Gumbel-Max trick to create a deterministic mapping from latent noise to teacher outputs, effectively providing the parallel student a blueprint for joint token distributions. By conditioning on Gumbel noise rather than relying on naive factorization, the method narrows the quality-efficiency gap, delivering substantial improvements across masked diffusion and multi-token prediction architectures.
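The Gumbel-Max trick that the method builds on can be sketched as follows: adding independent Gumbel(0, 1) noise to the logits and taking the argmax yields a sample from the softmax distribution, and the token becomes a deterministic function of the noise. This is only the basic trick, not the distillation procedure itself; the helper names and toy logits are hypothetical.

```python
import math
import random
from typing import List


def sample_gumbel(rng: random.Random) -> float:
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def gumbel_max(logits: List[float], noise: List[float]) -> int:
    """Return the index of the largest noise-perturbed logit."""
    perturbed = [l + g for l, g in zip(logits, noise)]
    return max(range(len(perturbed)), key=perturbed.__getitem__)


rng = random.Random(0)
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]
noise = [sample_gumbel(rng) for _ in logits]

# With the noise held fixed, the chosen token is fully determined.
token_a = gumbel_max(logits, noise)
token_b = gumbel_max(logits, noise)
```

Holding the noise fixed is what lets a student be conditioned on the same latent input that produced the teacher's token.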
The paper tackles the 'semantic parsing burden'—the effort required to translate natural language into structured RDF/OWL representations for knowledge graphs. It proposes the Semantic Ladder, a five-level framework ($L_1$ to $L_5$) enabling progressive formalization from raw text snippets to higher-order logic. By introducing Rosetta Statements as semantic anchors and emphasizing modular semantic units, the work aims to lower barriers to knowledge graph construction while maintaining semantic continuity.
As consumer-grade EEG headphones enter the market, a critical question emerges: can language models adapt to your specific neural signature? This paper demonstrates that frozen LLMs already contain person-specific linear directions in their activation spaces that predict individual brain activity during reading, achieving a ninefold improvement over population averages. The findings suggest that deep neural networks encode stable, individual cognitive fingerprints that could enable future brain-computer interfaces to personalize AI to the user wearing the headset.
This paper analyzes temporal dynamics in Swiss digital news across French, German, and Italian language regions using a triangulated methodology that combines quantitative NLP with qualitative interpretation. The authors process 1.7 million articles to study how different event types—Brexit, Swiss Wolf, Christmas, and the British Royal Family—are covered across linguistic boundaries, introducing domestication profiles and proximity salience ratios to quantify cultural proximity effects.
This paper tackles the challenge of automating BT-RADS (Brain Tumor Reporting and Data System) classification for post-treatment glioma MRI surveillance. BT-RADS requires integrating complex information: volumetric tumor changes, medication effects (steroids, bevacizumab), and radiation timing. The authors propose an end-to-end pipeline combining CNN-based tumor segmentation with a multi-agent LLM system to extract clinical variables from unstructured notes and apply algorithmic scoring logic. This matters because manual BT-RADS scoring is error-prone, with prior studies showing substantial inter-reader variability and inconsistent application of clinical context.
BanglaVerse introduces a culturally grounded benchmark evaluating vision-language models on Bengali culture across standard Bangla, four historically linked languages, and five regional dialects. Built from 1,152 manually curated images expanded to ~32.3K artifacts, the work reveals that standard Bangla evaluation substantially overestimates model capabilities compared to dialectal settings. The core finding—that missing cultural knowledge, not visual grounding alone, is the primary bottleneck—challenges conventional multimodal evaluation practices for underrepresented languages.
Narrative similarity is inherently interpretive—different valid readings can yield divergent judgments, challenging benchmarks that encode single ground truths. This paper proposes embracing multiperspectivity by ensembling 31 LLM personas, ranging from literary critics to lay characters, to predict which of two stories is more similar to an anchor. The approach leverages Condorcet Jury Theorem-like dynamics to improve accuracy, achieving 0.705 on SemEval-2026 Task 4 while revealing that diverse practitioner perspectives yield better ensemble gains despite lower individual performance.
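The classical Condorcet Jury Theorem gives a rough intuition for why ensembling helps. The sketch below computes the probability that a strict majority of independent voters is correct; the per-voter accuracy of 0.6 is a made-up value, and real LLM personas are unlikely to be independent, so this is an idealized illustration rather than the paper's analysis.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Each voter is correct with probability p; n must be odd to avoid ties.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd")
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


single = majority_accuracy(0.6, 1)     # one voter
small = majority_accuracy(0.6, 3)      # three voters
ensemble = majority_accuracy(0.6, 31)  # 31 voters, matching the persona count
```

Under the independence assumption, majority accuracy grows with ensemble size whenever each voter is right more often than not.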
This paper introduces Cross-Context Verification (CCV), a black-box method for detecting LLM benchmark contamination: the model solves the same coding problem $N$ times in isolated sessions, and the diversity of its solutions is measured. The key insight is that memorized solutions are deterministic while genuine reasoning produces natural variation. The paper pairs this with Hierarchical Cross-Context Architecture (HCCA), a multi-agent analysis framework that uses strict information restriction to prevent confirmation bias. As coding benchmarks face credibility crises from solution leakage, this work targets the urgent need to distinguish reasoning from recall in SWE-bench evaluations.
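As a minimal sketch of the diversity idea, the code below scores a set of solutions by their mean pairwise textual dissimilarity. The choice of `difflib.SequenceMatcher` and the function name are assumptions made for illustration; the paper's actual diversity measure is not specified here and may be quite different.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def solution_diversity(solutions: List[str]) -> float:
    """Mean pairwise dissimilarity (1 - similarity ratio) across solutions."""
    pairs = list(combinations(solutions, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


# Hypothetical outputs from repeated, isolated sessions.
memorized = ["def add(a, b):\n    return a + b"] * 3
varied = [
    "def add(a, b):\n    return a + b",
    "def add(x, y):\n    total = x + y\n    return total",
    "add = lambda first, second: sum((first, second))",
]

memorized_score = solution_diversity(memorized)
varied_score = solution_diversity(varied)
```

Identical outputs across sessions score zero, which is the pattern the method associates with recall rather than reasoning.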
This paper investigates how interrogative stances function as markers of voice and power in French-language digital news. Analyzing over 1.2 million articles from 24 outlets (2023–2024) through a mixed-methods pipeline combining LLM pseudo-labeling and qualitative annotation, the authors operationalize pragmatic concepts like answerhood and dialogicity at scale. The study reveals that questions are sparse but structurally significant, predominantly serving framing functions rather than information-seeking, and centering elite actors over diffuse publics.
Prompt2Box addresses the limitation that vector embeddings of LLM prompts conflate topical similarity with specificity, making it difficult to distinguish whether a model fails at a broad topic or only at its most constrained variants. The core idea is to embed prompts into a box embedding space where the geometric volume encodes specificity—smaller boxes indicate more constraints—and containment represents entailment relations. This geometric re-framing enables more accurate hierarchical clustering and finer-grained weakness analysis across 17 different language models.
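The geometry can be illustrated with hard axis-aligned boxes, as in the sketch below: volume stands in for specificity and containment for entailment. The `Box` class and the 2-D coordinates are hypothetical, and learned box embeddings typically use soft, differentiable versions of these operations.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    lower: Tuple[float, ...]  # minimum corner
    upper: Tuple[float, ...]  # maximum corner

    def volume(self) -> float:
        """Product of side lengths; an inverted side counts as zero."""
        result = 1.0
        for lo, hi in zip(self.lower, self.upper):
            result *= max(hi - lo, 0.0)
        return result

    def contains(self, other: "Box") -> bool:
        """True if `other` lies entirely inside this box."""
        return all(
            s_lo <= o_lo and o_hi <= s_hi
            for s_lo, s_hi, o_lo, o_hi in zip(
                self.lower, self.upper, other.lower, other.upper
            )
        )


# A broad prompt and a more constrained variant (illustrative coordinates).
broad = Box(lower=(0.0, 0.0), upper=(4.0, 4.0))
narrow = Box(lower=(1.0, 1.0), upper=(2.0, 3.0))
```

Here the constrained prompt has the smaller volume and sits inside the broad one, so a failure on `narrow` alone can be told apart from a failure on the whole `broad` topic.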
Human annotation for subjective NLP tasks suffers from high inter-annotator disagreement. This paper introduces ReasonAlign, a protocol that exposes annotators to LLM-generated reasoning explanations (but not predicted labels) between two annotation passes. The goal is to test whether reasoning scaffolds improve annotation consistency without the anchoring bias typical of suggestion-based systems.