Psychiatric symptom identification from social media requires expensive expert annotation and suffers from inconsistent labeling across platforms. SynSym addresses this by using GPT-4o to generate synthetic training data across four stages: symptom concept expansion, dual-style (clinical/colloquial) expression generation, clinically-grounded multi-symptom composition, and LLM-based quality filtering. The framework produces 18,254 samples covering 14 DSM-5 symptoms, enabling models to match real-data performance and generalize across diverse social media platforms.
Audio-enabled large language models promise to democratize AI access for users with disabilities or limited literacy, but voice interfaces introduce immutable paralinguistic cues—pitch, timbre, prosody—that carry demographic signals. This paper demonstrates that state-of-the-art audio LLMs systematically discriminate based on speaker voice, assigning gender-stereotyped adjectives and professions solely from acoustic features. Crucially, the authors show that voice inputs amplify bias beyond text-only baselines, with models exhibiting stronger stereotypical associations when processing speech than when processing equivalent text with gendered name cues. The study establishes a causal link via pitch manipulation experiments and surveys 1,000 users to reveal that those who would benefit most from voice accessibility are often most hesitant about the attendant privacy and discrimination risks.
SLURP-TN introduces a Spoken Language Understanding (SLU) dataset for Tunisian Arabic, a low-resource dialect. The authors translate and record six domains from the English SLURP corpus with 55 speakers across 18 geographic regions, emphasizing gender balance and code-switching phenomena. The dataset provides approximately five hours of audio across three acoustic conditions (clean, noisy, headphone) to enable robust benchmarking of ASR and SLU systems for dialectal Arabic.
MemAPO addresses a critical limitation in automatic prompt optimization (APO): existing methods frame optimization as an isolated search for task-specific prompts, preventing knowledge reuse across tasks. The paper proposes reframing APO as a continual experience accumulation process using a dual-memory mechanism—Correct-Template Memory ($\mathcal{E}_{\mathrm{CTM}}$) for successful strategies and Error-Pattern Memory ($\mathcal{E}_{\mathrm{EPM}}$) for failure modes—that enables cross-task generalization while reducing optimization costs by approximately 57% compared to strong baselines.
The paper introduces Dissimilar Span Detection (DSD), a new task aimed at explaining Semantic Textual Similarity (STS) scores by identifying specific text spans that differ in meaning between sentence pairs. To enable this research, the authors release the Span Similarity Dataset (SSD), containing 1,000 semi-automatically annotated samples validated by human annotators. They evaluate a broad range of approaches—including LIME, SHAP, proprietary LLMs, and supervised token classifiers—and find that while LLMs achieve the highest performance, the task remains challenging even for state-of-the-art models, with potential applications in paraphrase detection and fact-checking.
Large language models have historically lagged behind specialized encoder-decoder MT systems, but their superior context modeling makes them natural candidates for document-level translation. This paper tackles two key obstacles: the scarcity of high-quality document-level parallel corpora and LLM tendencies toward hallucinations and omissions. The authors generate synthetic document-level data from summarization corpora via LLM augmentation and filter it using sacreBLEU, COMET, and LaBSE cosine similarity. They then fine-tune in two stages: first on sentence-level data, then on the filtered document-level corpus.
This paper distinguishes different forms of reasoning by the structural properties they demand from underlying representational systems. The core insight is that deduction requires four specific properties (operability, consistency, structural preservation, and compositionality) that cannot be secured through mere statistical scaling. This has significant implications for AI systems and cognitive science, providing a principled boundary between reasoning that can rely on associative approximations versus reasoning requiring structural guarantees.
This paper attacks the expensive problem of annotating NLP test sets by importing Active Testing (AT) from computer vision into language tasks. Given a labeling budget $B$, the goal is to select a subset $X_A$ that minimizes the estimation error $|M(X_F) - M(X_A)|$ between full and sampled test-set metrics, potentially cutting annotation costs by up to 95% while keeping prediction error under 1%. The core mechanism couples importance-weighted unbiased estimators with acquisition strategies (including a novel Agreement strategy based on attention-head disagreement) and an adaptive stopping criterion that removes the need to pre-specify the budget.
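The importance-weighting idea can be illustrated with a minimal sketch. It assumes test points are drawn with replacement from a proposal distribution `q` over the pool and uses a plain Horvitz-Thompson-style reweighting; the function name, the toy losses, and the proposal are hypothetical and are not taken from the paper, whose exact estimator may differ.

```python
import random

def importance_weighted_mean(losses, q, budget, rng):
    """Estimate the full-pool mean loss from `budget` labelled draws.

    losses: per-example losses (looked up only for sampled indices,
            mimicking the act of paying for a label)
    q:      proposal probabilities over the pool (positive, summing to 1)
    """
    n = len(losses)
    sampled = rng.choices(range(n), weights=q, k=budget)
    # Reweighting each draw by 1 / (n * q_i) makes the estimate unbiased
    # for the mean over all n examples.
    return sum(losses[i] / (n * q[i]) for i in sampled) / budget

# Toy pool: when q is proportional to the loss, every draw contributes the
# same value, so the estimate equals the true mean (0.5) for any budget.
losses = [0.1, 0.4, 0.5, 1.0]
q = [0.05, 0.2, 0.25, 0.5]
estimate = importance_weighted_mean(losses, q, budget=3, rng=random.Random(0))
```

The toy case shows why the acquisition strategy matters: the closer the proposal tracks the true per-example loss, the lower the variance of the estimate at a fixed labeling budget.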
SecureBreak introduces a response-level safety dataset designed to detect harmful LLM outputs that bypass alignment mechanisms. Unlike existing benchmarks that classify prompts, this work focuses on binary classification of generated responses (safe vs. unsafe) across 3,059 samples from multiple model families including Llama, Qwen, Gemma, and Mistral. The core value proposition is providing a 'last-line defense' layer for post-generation filtering and supervisory signals to guide security re-alignment, addressing the growing threat of jailbreak attacks.
The paper addresses the brittleness of in-context learning (ICL) to example ordering, an intractable $n!$ search problem. It proposes PLR, which reframes discrete permutation search as learning a Plackett-Luce distribution that concentrates probability mass on high-performing orderings. Using Gumbel perturb-and-sort for efficient sampling, PLR optimizes task-level metrics directly without requiring finite label spaces, extending naturally to open-ended reasoning tasks like mathematical problem solving.
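Gumbel perturb-and-sort can be sketched in a few lines. The sketch assumes each in-context example carries a learnable score (a log-weight); adding independent Gumbel(0, 1) noise to the scores and sorting in descending order yields one sample from the corresponding Plackett-Luce distribution. The function name and the scores are illustrative, and the training loop that updates the scores is omitted.

```python
import math
import random

def sample_ordering(scores, rng):
    """Draw one permutation from a Plackett-Luce distribution.

    scores: log-weights, one per example; higher means earlier on average.
    """
    perturbed = []
    for s in scores:
        u = rng.random()
        while u == 0.0:  # random() may return 0.0, which would break log()
            u = rng.random()
        # Gumbel(0, 1) noise is -log(-log(U)) for U uniform on (0, 1).
        perturbed.append(s - math.log(-math.log(u)))
    # Sorting the perturbed scores in descending order gives the sample.
    return sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)

rng = random.Random(0)
ordering = sample_ordering([0.0, 1.0, 2.0, 3.0], rng)  # a permutation of 0..3
```

Sampling this way costs one sort per ordering, which avoids enumerating the $n!$ permutations explicitly.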
This paper introduces TaigiSpeech, the first intent recognition dataset for Taiwanese Hokkien—a low-resource language spoken by 65% of Taiwanese elders. With 3,000+ utterances from 21 elderly speakers across emergency and smart-home scenarios, it addresses a critical gap in speech technology for aging populations. The authors also propose keyword-based and audio-visual mining strategies to bootstrap training data from unlabeled video sources.
Symbolic regression search spaces suffer from structural redundancy: expression DAGs with $k$ internal nodes admit $\Theta(k!)$ distinct node-numberings that encode the same mathematical expression. This paper proposes IsalSR, a representation framework that computes a pruned canonical string (a complete labeled-DAG isomorphism invariant) to collapse all equivalent forms into a single canonical representation. The approach promises to shrink the effective search space by a factor of $O(k!)$ and can be integrated into any existing SR algorithm as a preprocessing step.
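The redundancy can be made concrete with a deliberately simplified stand-in. This is not IsalSR's pruned canonical string: it is a recursive canonical form that ignores node numbers and sorts the operands of commutative operators, which is enough to show two differently numbered DAGs collapsing to one string. The DAG encoding and all names are hypothetical.

```python
COMMUTATIVE = {"+", "*"}

def canonical(dag, node):
    """Return a string that depends on labels and structure, not node ids.

    dag: maps node id -> (label, list of child ids)
    """
    label, children = dag[node]
    parts = [canonical(dag, child) for child in children]
    if label in COMMUTATIVE:
        parts.sort()  # operand order is irrelevant for + and *
    return label if not parts else f"{label}({','.join(parts)})"

# The same expression, x*y + x, under two different node numberings
# and operand orders.
dag_a = {0: ("x", []), 1: ("y", []), 2: ("*", [0, 1]), 3: ("+", [2, 0])}
dag_b = {3: ("x", []), 0: ("y", []), 1: ("*", [0, 3]), 2: ("+", [3, 1])}

same = canonical(dag_a, 3) == canonical(dag_b, 2)  # True: both give "+(*(x,y),x)"
```

A search procedure that deduplicates candidates on such a string visits each expression once instead of once per numbering.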
TIDE is a post-training early exit system for autoregressive LLMs that trains lightweight router MLPs to predict which tokens can safely exit at intermediate layers. The key idea is using cosine similarity between checkpoint hidden states and final layer outputs as a convergence signal, eliminating the need for costly model retraining. Unlike prior early-exit methods that require training from scratch or use unreliable confidence heuristics, TIDE claims to work with any HuggingFace causal LM while preserving KV cache integrity and achieving up to 8.1% throughput improvement.
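A minimal sketch of the convergence signal follows. It assumes the signal is computed offline, with the final-layer state available, for example to produce training targets for the routers. The threshold value, the layer indices, and the function names are assumptions made for illustration and do not come from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def earliest_converged_layer(checkpoint_states, final_state, threshold=0.98):
    """Return the first checkpoint layer whose hidden state is already close
    to the final-layer state, or None if no checkpoint qualifies.

    checkpoint_states: list of (layer_index, hidden_state), shallow to deep
    threshold: hypothetical similarity cut-off
    """
    for layer, hidden in checkpoint_states:
        if cosine_similarity(hidden, final_state) >= threshold:
            return layer
    return None

# Toy 2-d hidden states for a single token.
states = [(8, [1.0, 0.0]), (16, [1.0, 1.0]), (24, [1.0, 2.0])]
exit_layer = earliest_converged_layer(states, final_state=[1.0, 2.0])  # 24
```

At inference time the final-layer state is not available before the exit decision, which is why a router has to predict convergence from the intermediate state alone.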
Developing optimized CUDA kernels is critical for generative AI but remains challenging even for human experts. This paper introduces DRTriton, a framework that trains a 7B-parameter LLM to convert PyTorch code into efficient Triton kernels using exclusively synthetic data. The approach combines a constraint satisfaction algorithm for program generation (CSP-DAG), curriculum reinforcement learning with decoupled rewards (DRPO), and test-time search, achieving 92% speedup on KernelBench Level 2 compared to 23% for GPT-5.2.
This paper presents a large-scale comparative study of memorization across six open LLM families (Pythia, OLMo1/2/3, OpenLLaMA, StarCoder) ranging from 1B to 32B parameters. By analyzing both statistical patterns and internal mechanisms (attention heads, layer decoding), it identifies universal behaviors—such as log-linear scaling of memorization rates with model size and high compressibility of memorized sequences—while revealing family-specific signatures in memorization structure. The work bridges isolated findings from single-model studies to establish general principles of how transformers memorize training data.
EvoIdeator addresses the challenge of iteratively refining scientific research ideas using LLMs by bridging the gap between scalar RL rewards and coarse language feedback. The core innovation is a dual-signal approach combining lexicographic rewards with checklist-grounded, span-level language feedback integrated directly into the RL training loop using Dr. GRPO. This allows a 4B parameter model to outperform larger frontier models like Gemini 3 Flash and DeepSeek-V3.2 on scientific rigor criteria.
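A lexicographic reward ranks candidates by a primary criterion and consults later criteria only to break ties. The sketch below assumes a two-part reward tuple; the criteria, their order, and the values are hypothetical and are not the paper's reward definition.

```python
def pick_best(candidates):
    """Return the candidate whose reward tuple is lexicographically largest.

    Python compares tuples element by element, so the first entry dominates
    and the second only matters when the first entries are equal.
    """
    return max(candidates, key=lambda c: c["reward"])

candidates = [
    {"id": "a", "reward": (0, 0.9)},  # fails the primary criterion
    {"id": "b", "reward": (1, 0.2)},
    {"id": "c", "reward": (1, 0.5)},  # passes it with the better tie-breaker
]
best = pick_best(candidates)  # the candidate with id "c"
```

The practical effect is that a high secondary score can never compensate for a failure on the primary criterion.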
Medical text summarization helps clinicians process millions of biomedical articles, but fine-tuning large language models demands prohibitive resources. This paper compares Low-Rank Adaptation (LoRA), Prompt Tuning, and full fine-tuning across Flan-T5-Small, Base, and Large on PubMed summarization. The counter-intuitive finding is that updating fewer than 1% of parameters via LoRA consistently outperforms full fine-tuning, suggesting that low-rank constraints provide effective regularization.
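The parameter arithmetic behind LoRA is easy to check for a single weight matrix. For a weight of shape `d_out x d_in`, LoRA trains two rank-`r` factors of shapes `d_out x r` and `r x d_in`. The dimensions below are illustrative and are not the configurations used in the paper; the whole-model fraction also depends on which matrices are adapted.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Fraction of a single weight matrix's parameters that LoRA trains."""
    full_params = d_out * d_in
    lora_params = rank * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return lora_params / full_params

# Illustrative square projection with a small rank.
fraction = lora_trainable_fraction(d_out=1024, d_in=1024, rank=4)  # 0.0078125
```

At this rank the adapter holds under 1% of the matrix's parameters, which is the regime the comparison is about.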
Time toxicity—the cumulative healthcare contact days imposed by clinical trial participation—is an important patient-centric metric buried in dense Schedule of Assessments (SoA) tables. This work proposes TimeTox, a Gemini-based LLM pipeline that extracts time toxicity from protocol PDFs at scale, comparing a single-pass architecture against a two-stage structure-then-count approach. The authors deploy their system on 644 real-world oncology protocols and find that synthetic benchmark accuracy is a poor predictor of real-world reliability, a lesson critical for clinical NLP deployment.
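Once a Schedule of Assessments has been turned into structured data, the counting step reduces to a set union. The sketch assumes a hypothetical representation in which each assessment maps to the study days on which it occurs; the extraction from PDFs, which is the hard part, is not shown.

```python
def contact_days(schedule):
    """Count distinct study days with at least one scheduled assessment.

    schedule: maps assessment name -> list of study days (hypothetical format)
    """
    days = set()
    for visit_days in schedule.values():
        days.update(visit_days)  # several assessments on one day count once
    return len(days)

schedule = {
    "labs": [1, 8, 15],
    "infusion": [1, 15],
    "imaging": [22],
}
total = contact_days(schedule)  # 4 distinct days: 1, 8, 15, 22
```

Counting days rather than assessments is what makes the metric patient-centric: two procedures on the same visit cost the patient one day, not two.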
This paper introduces SemEval-2026 Task 12, Abductive Event Reasoning (AER), a shared task requiring systems to identify the most plausible direct cause of a target event from noisy multi-document evidence. The task is cast as an evidence-grounded multiple-choice benchmark with multiple correct answers allowed, capturing challenges like distributed evidence, indirect background factors, and semantically related distractors. With 122 participants and 518 submissions, it represents a significant community effort to benchmark real-world causal reasoning in long-context settings.
ROM tackles overthinking in Large Reasoning Models, where models generate redundant reasoning after reaching correct answers. The core idea is a lightweight streaming detector (an 8.13M-parameter head attached to late-layer hidden states of a frozen LLM) that predicts overthinking probability token by token and triggers early stopping. The approach matters because it promises a 47% token reduction without retraining the full model. We find the method empirically effective but note concerns regarding data scaling limits and labeling costs.
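The stopping rule can be sketched independently of the detector itself. The sketch assumes the head emits one overthinking probability per generated token and that generation halts after several consecutive high-probability tokens; the threshold and the patience window are assumptions added for illustration, not parameters reported in the paper.

```python
def stop_index(overthinking_probs, threshold=0.9, patience=3):
    """Return the token index at which generation would be cut off,
    or None if the stream never triggers a stop.

    overthinking_probs: per-token probabilities from the detector head
    patience: consecutive above-threshold tokens required (hypothetical)
    """
    run = 0
    for t, p in enumerate(overthinking_probs):
        run = run + 1 if p >= threshold else 0  # an isolated spike resets
        if run >= patience:
            return t
    return None

probs = [0.10, 0.95, 0.20, 0.92, 0.93, 0.97, 0.99]
cut = stop_index(probs)  # 5: tokens 3, 4 and 5 are all above the threshold
```

Requiring a short run instead of a single spike trades a few extra tokens for fewer premature stops.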