Feed - arxlens

Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison

cs.CL Caio Vicentino · Mar 23, 2026

This paper addresses a key gap in language model research by conducting the first tightly controlled comparison between autoregressive (AR) and masked diffusion language models (MDLM). The author trains both models on identical data (50M tokens from TinyStories), identical compute budget (20K steps, batch size 32), and identical hardware (NVIDIA H100), isolating the generation paradigm as the sole variable. The work is significant because prior studies compared these paradigms at different scales or with different datasets, making it impossible to attribute observed differences to the core architectural distinction itself.

We present a controlled empirical comparison between autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), identical compute budget (20,000 steps, batch size 32, sequence length 512), and identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, while MDLM converges more slowly and is still improving at step 20,000, suggesting different compute-optimal training regimes. Third, quantitative diversity analysis over 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU), at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.
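
The diversity statistics named in the abstract (Distinct-n, the share of samples starting with the same word, and unique 5-word openings) can be approximated with a few lines of Python. This is a minimal sketch, not the author's released code: whitespace tokenization, the function names, and the toy samples are all assumptions made for illustration.

```python
from collections import Counter


def distinct_n(samples, n):
    """Ratio of unique n-grams to total n-grams across all samples."""
    ngrams = []
    for text in samples:
        tokens = text.split()  # whitespace tokenization is an assumption
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def top_first_word_share(samples):
    """Fraction of samples that begin with the most common first word."""
    firsts = [s.split()[0] for s in samples if s.split()]
    if not firsts:
        return 0.0
    return Counter(firsts).most_common(1)[0][1] / len(firsts)


def unique_opening_share(samples, k=5):
    """Number of distinct k-word openings divided by the number of samples."""
    openings = [tuple(s.split()[:k]) for s in samples]
    return len(set(openings)) / len(openings) if openings else 0.0


# Toy samples for illustration only; not outputs from either model.
samples = [
    "Once upon a time there was a cat",
    "Once upon a time there was a dog",
    "Lily found a shiny red ball",
]
print(distinct_n(samples, 2))
print(top_first_word_share(samples))
print(unique_opening_share(samples))
```

A low Distinct-n together with a high first-word share would correspond to the "fluent but repetitive" pattern reported for AR, although the paper's exact definitions may differ from this sketch.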

Probing How Scalable Table Data Enhances General Long-Context Reasoning

cs.CL Huaibing Xie, Guoliang Zhao, Yang Liu et al. · Mar 23, 2026

The paper tackles the challenge of enhancing long-context reasoning in Large Language Models (LLMs), a critical capability as real-world tasks grow more complex. It proposes structured table data as a solution, showing via mutual information analysis that tables possess periodic non-vanishing dependencies, unlike natural language, whose dependencies decay polynomially. This makes tables well suited to training long-context reasoning. The authors present TableLong, a scalable pipeline for synthesizing diverse, verifiable table data for reinforcement learning, showing significant performance gains across benchmarks.

As real-world tasks grow increasingly complex, long-context reasoning has become a core capability for Large Language Models (LLMs). However, few studies explore which data types are effective for long-context reasoning and why. We find that structured table data with periodic structures shows strong potential for long-context reasoning. Motivated by this observation, we mathematically analyze tabular dependency structures using mutual information, revealing periodic non-vanishing dependencies in table data. Furthermore, we systematically analyze the capabilities of structured table data, conduct relevant scaling experiments, and validate its underlying mechanisms for enhancing long-context reasoning, yielding several meaningful insights. Leveraging these insights, we propose a simple yet scalable pipeline (TableLong) for synthesizing high-quality, diverse, and verifiable structured table data to boost long-context reasoning via RL. Extensive experimental results demonstrate that table data significantly enhances the long-context reasoning capability of LLMs across multiple long-context benchmarks (+8.24% on average), and even improves performance on out-of-domain benchmarks (+8.06% on average). We hope that our insights provide practical guidance for effective post-training data to enhance long-context reasoning in LLMs.

SqueezeComposer: Temporal Speed-up is A Simple Trick for Long-form Music Composing

eess.AS cs.CL cs.SD Jianyi Chen, Rongxiu Zhong, Shilei Zhang et al. · Mar 22, 2026

This paper proposes SqueezeComposer, a long-form music generation framework that tackles computational constraints by applying temporal speed-up (e.g., 2×, 4×, 8×) to compress audio sequences before generation. The core idea is to generate music in an accelerated domain using diffusion models, then restore it to normal speed, theoretically enabling models to produce 10+ minute compositions with fixed memory budgets. The approach is tested on continuation, completion, and singing accompaniment tasks.

Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (sped-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.

Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch

cs.CL Stella Eva Tsiapali, Cong-Thanh Do, Kate Knill · Mar 23, 2026

Cross-tokenizer knowledge distillation faces a fundamental alignment challenge when Teacher and Student models use different vocabularies. This paper analyzes DSKD-CMA, the state-of-the-art method for this setting, through manual chunk alignment probes and reveals that its cross-model attention mechanism captures coarse chunk structures but suffers from noisy localization with repeated tokens. Building on this insight, the authors propose DSKD-CMA-GA, which uses generative adversarial key-query matching to align distributions between models, achieving modest improvements in ROUGE-L scores that narrow the gap between cross-tokenizer and same-tokenizer distillation.

Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

cs.CV cs.AI cs.CL Haichao Zhang, Yijiang Li, Shwai He et al. · Mar 23, 2026

ThinkJEPA addresses the limitation of JEPA-style latent world models that rely on short, densely sampled windows, which bias predictions toward local dynamics while missing long-horizon semantics. The paper proposes a dual-temporal architecture combining a dense-frame V-JEPA branch for fine-grained motion with a sparsely sampled VLM "thinker" branch that provides semantic guidance via multi-layer feature pyramids. This matters because it attempts to marry the physical consistency of latent world models with the general knowledge of vision-language models for robust trajectory forecasting.

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

Optimizing Multi-Agent Weather Captioning via Text Gradient Descent: A Training-Free Approach with Consensus-Aware Gradient Fusion

cs.CL Shixu Liu · Mar 23, 2026

Weather captioning—generating natural language descriptions from meteorological time series—sits at the intersection of time-series analysis and domain-specific NLG. This paper proposes WeatherTGD, a training-free framework that treats caption refinement as gradient descent in text space: three specialized LLM agents (Statistical, Physics, Meteorology) output textual gradients that are fused via a consensus-aware mechanism and applied iteratively to improve an initial caption. The approach aims to bridge the gap between numerical forecasting and human-interpretable explanations without any model fine-tuning.

Generating interpretable natural language captions from weather time series data remains a significant challenge at the intersection of meteorological science and natural language processing. While recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in time series forecasting and analysis, existing approaches either produce numerical predictions without human-accessible explanations or generate generic descriptions lacking domain-specific depth. We introduce WeatherTGD, a training-free multi-agent framework that reinterprets collaborative caption refinement through the lens of Text Gradient Descent (TGD). Our system deploys three specialized LLM agents including a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert that generate domain-specific textual gradients from weather time series observations. These gradients are aggregated through a novel Consensus-Aware Gradient Fusion mechanism that extracts common signals while preserving unique domain perspectives. The fused gradients then guide an iterative refinement process analogous to gradient descent, where each LLM-generated feedback signal updates the caption toward an optimal solution. Experiments on real-world meteorological datasets demonstrate that WeatherTGD achieves significant improvements in both LLM-based evaluation and human expert evaluation, substantially outperforming existing multi-agent baselines while maintaining computational efficiency through parallel agent execution.

TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression

cs.CL Li Wang, Yandong Wang, Xin Yu et al. · Mar 23, 2026

TAMTRL addresses the temporal credit assignment problem in multi-turn RL for long-context document processing. When LLMs process documents chunk-by-chunk with memory updates, standard outcome-only rewards cannot distinguish good from bad intermediate memory updates. The paper proposes using the model itself as a teacher: during training, it provides the model with filtered (relevant-only) chunks and uses the normalized token probabilities of the generated memory as turn-level rewards. This avoids expensive rollouts or external judges while providing fine-grained supervision for each turn.

The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model's context window limit, the entire context cannot be processed in a single pass, making chunk-wise processing necessary. This requires multiple turns to read different chunks and update memory. Yet supervision is typically provided only by the final outcome, which makes it difficult to evaluate the quality of memory updates at each turn in the multi-turn training setting and introduces a temporal credit assignment challenge. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at https://anonymous.4open.science/r/TAMTRL-F1F8.

Retrieving Climate Change Disinformation by Narrative

cs.CL Max Upravitelev, Veronika Solopova, Charlott Jakob et al. · Mar 23, 2026

This paper reframes climate disinformation detection from classification to retrieval, treating narrative core messages as queries to rank corpus texts without fixed taxonomies. The authors propose SpecFi, which generates hypothetical documents using community summaries from graph-based detection (NodeRAG) as few-shot examples. The approach achieves a MAP of 0.505 on CARDS and remains robust on high-variance narratives that cripple standard baselines.

Detecting climate disinformation narratives typically relies on fixed taxonomies, which do not accommodate emerging narratives. Thus, we re-frame narrative detection as a retrieval task: given a narrative's core message as a query, rank texts from a corpus by alignment with that narrative. This formulation requires no predefined label set and can accommodate emerging narratives. We repurpose three climate disinformation datasets (CARDS, Climate Obstruction, climate change subset of PolyNarrative) for retrieval evaluation and propose SpecFi, a framework that generates hypothetical documents to bridge the gap between abstract narrative descriptions and their concrete textual instantiations. SpecFi uses community summaries from graph-based community detection as few-shot examples for generation, achieving a MAP of 0.505 on CARDS without access to narrative labels. We further introduce narrative variance, an embedding-based difficulty metric, and show via partial correlation analysis that standard retrieval degrades on high-variance narratives (BM25 loses 63.4% of MAP), while SpecFi-CS remains robust (32.7% loss). Our analysis also reveals that unsupervised community summaries converge on descriptions close to expert-crafted taxonomies, suggesting that graph-based methods can surface narrative structure from unlabeled text.
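
For readers unfamiliar with the MAP figure quoted above, the following sketch shows the standard definition of mean average precision over ranked retrieval results. It illustrates the metric only, not the SpecFi evaluation code; the document IDs and the two toy queries are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: mean of precision@k at each relevant hit."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per narrative query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical rankings for two narrative queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # relevant texts at ranks 1 and 3
    (["d5", "d6", "d7"], {"d6"}),              # relevant text at rank 2
]
print(mean_average_precision(runs))
```

Because each query contributes equally to the mean, a few hard, high-variance narratives can pull MAP down sharply, which is consistent with the degradation the abstract reports for BM25.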

Reading Between the Lines: How Electronic Nonverbal Cues shape Emotion Decoding

cs.CL cs.HC Taara Kumar, Kokil Jaidka · Mar 22, 2026

This paper investigates how users decode emotions in text-based communication through electronic nonverbal cues (eNVCs)—orthographic signals like elongation, punctuation, and emojis that approximate paralinguistic features. The authors propose a taxonomy grounded in nonverbal communication theory (kinesics and paralinguistics) and test it across three complementary studies: a content analysis developing a regex detection toolkit, a within-subjects experiment manipulating eNVC presence and sarcasm ($n=513$), and focus groups exploring interpretive strategies. The work identifies sarcasm as a critical boundary condition where eNVCs fail to aid interpretation and provides an open-source Python/R package for automated cue detection.

As text-based computer-mediated communication (CMC) increasingly structures everyday interaction, a central question re-emerges with new urgency: How do users reconstruct nonverbal expression in environments where embodied cues are absent? This paper provides a systematic, theory-driven account of electronic nonverbal cues (eNVCs), the textual analogues of kinesics, vocalics, and paralinguistics, in public microblog communication. Across three complementary studies, we advance conceptual, empirical, and methodological contributions. Study 1 develops a unified taxonomy of eNVCs grounded in foundational nonverbal communication theory and introduces a scalable Python toolkit for their automated detection. Study 2, a within-subject survey experiment, offers controlled causal evidence that eNVCs substantially improve emotional decoding accuracy and lower perceived ambiguity, while also identifying boundary conditions, such as sarcasm, under which these benefits weaken or disappear. Study 3, through focus group discussions, reveals the interpretive strategies users employ when reasoning about digital prosody, including drawing meaning from the absence of expected cues and defaulting toward negative interpretations in ambiguous contexts. Together, these studies establish eNVCs as a coherent and measurable class of digital behaviors, refine theoretical accounts of cue richness and interpretive effort, and provide practical tools for affective computing, user modeling, and emotion-aware interface design. The eNVC detection toolkit is available as a Python and R package at https://github.com/kokiljaidka/envc.
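
To give a feel for what regex-based cue detection can look like, here is a rough sketch covering three cue types mentioned in the summary (elongation, punctuation, emojis). The patterns and names below are assumptions made for illustration; they are not the rules or the API of the published eNVC toolkit.

```python
import re

# Hypothetical patterns for illustration; not the published toolkit's rules.
CUE_PATTERNS = {
    # A word character repeated three or more times, e.g. "sooo".
    "elongation": re.compile(r"(\w)\1{2,}"),
    # Runs of two or more exclamation or question marks, e.g. "!!!" or "?!".
    "repeated_punctuation": re.compile(r"[!?]{2,}"),
    # A rough emoji range; real emoji coverage is broader than this.
    "emoji": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
}


def detect_cues(text):
    """Count matches for each cue type in a message."""
    return {name: len(p.findall(text)) for name, p in CUE_PATTERNS.items()}


message = "I am sooo happy!!! \U0001F600"
print(detect_cues(message))
print(detect_cues("See you at noon."))
```

Counts like these could serve as features for the kind of content analysis described in Study 1, though a production detector would need far more careful handling of Unicode and context.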

Semantic Shift: the Fundamental Challenge in Text Embedding and Retrieval

cs.CL cs.IR Hang Gao, Dimitris N. Metaxas · Mar 22, 2026

This paper identifies "semantic shift"—the intrinsic evolution of meaning within a text—as the root cause of embedding pathologies like anisotropy and length-induced collapse. The authors argue that pooling-based aggregation forces "semantic smoothing," where diverse sentences compromise into a diluted representation. They formalize semantic shift as the product of local evolution and global dispersion ($\mathrm{Shift}(k) = \mathrm{Local}(k) \cdot \mathrm{Disp}(k)$), showing through controlled concatenation experiments that it predicts embedding concentration and retrieval degradation better than text length alone. The work reframes geometric pathologies not as inherent model defects but as consequences of content structure interacting with pooling mechanics.

Transformer-based embedding models rely on pooling to map variable-length text into a single vector, enabling efficient similarity search but also inducing well-known geometric pathologies such as anisotropy and length-induced embedding collapse. Existing accounts largely describe what these pathologies look like, yet provide limited insight into when and why they harm downstream retrieval. In this work, we argue that the missing causal factor is semantic shift: the intrinsic, structured evolution and dispersion of semantics within a text. We first present a theoretical analysis of semantic smoothing in Transformer embeddings: as the semantic diversity among constituent sentences increases, the pooled representation necessarily shifts away from every individual sentence embedding, yielding a smoothed and less discriminative vector. Building on this foundation, we formalize semantic shift as a computable measure integrating local semantic evolution and global semantic dispersion. Through controlled experiments across corpora and multiple embedding models, we show that semantic shift aligns closely with the severity of embedding concentration and predicts retrieval degradation, whereas text length alone does not. Overall, semantic shift offers a unified and actionable lens for understanding embedding collapse and for diagnosing when anisotropy becomes harmful.

Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models

cs.CL Abdul-Salem Beibitkhan · Mar 22, 2026

This paper investigates whether cross-lingual transfer (CLT)—prompting models to translate queries to English, reason in English, then translate answers back—can bridge the performance gap for low-resource languages. The authors benchmark eight LLMs across 2,000 responses in Kazakh and Mongolian, finding that CLT selectively benefits bilingual models (+2.2–4.3pp) but not English-first architectures, while revealing a concerning "fluency illusion" where models appear fluent in LRLs while producing less accurate content.

We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally grounded categories, we evaluate 2,000 responses on accuracy, fluency, and completeness. We find a consistent performance gap of 13.8 to 16.7 percentage points between English and low-resource language conditions, with models maintaining surface-level fluency while producing significantly less accurate content. Cross-lingual transfer (prompting models to reason in English before translating back) yields selective gains for bilingual architectures (+2.2pp to +4.3pp) but provides no benefit to English-dominant models. Our results demonstrate that current LLMs systematically underserve low-resource language communities, and that effective mitigation strategies are architecture-dependent rather than universal.

Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks

cs.CL Navya Mehrotra, Adam Visokay, Kristina Gligorić · Mar 22, 2026

LLM annotations encode some human perspectives better than others, especially in subjective tasks where demographic background shapes judgments. This paper introduces Perspective-Driven Inference (PDI), a statistical framework that treats the distribution of group-specific annotations as a vector estimand $\theta^* = (\theta^*_{g_1}, \dots, \theta^*_{g_K})$ and adaptively allocates limited human labels to groups where LLM proxies are least reliable. The core contribution is an error-predictor-driven sampling rule that improves estimation accuracy for harder-to-model demographics while maintaining valid frequentist coverage.

Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.

Conspiracy Frame: a Semiotically-Driven Approach for Conspiracy Theories Detection

cs.CL Heidi Campana Piva, Shaina Ashraf, Maziar Kianimoghadam Jouneghani et al. · Mar 22, 2026

The paper proposes the Conspiracy Frame, a semiotic and frame-semantic representation of conspiratorial narratives with five elements (plan, secret, in-group, out-group, call-to-action), and introduces Con.Fra., a span-annotated Telegram corpus. The core hypothesis is that injecting FrameNet-derived semantic frames into LLM prompts will improve conspiracy detection and explainability. Results show that while frame-guided prompting achieves comparable classification scores to few-shot learning, it does not consistently outperform it, though it reveals interesting abstract semantic patterns.

Conspiracy theories are anti-authoritarian narratives that lead to social conflict, impacting how people perceive political information. To help in understanding this issue, we introduce the Conspiracy Frame: a fine-grained semantic representation of conspiratorial narratives derived from frame-semantics and semiotics, which spawned the Conspiracy Frames (Con.Fra.) dataset: a corpus of Telegram messages annotated at span-level. The Conspiracy Frame and Con.Fra. dataset contribute to the implementation of a more generalizable understanding and recognition of conspiracy theories. We observe the ability of LLMs to recognize this phenomenon in-domain and out-of-domain, investigating the role that frames may have in supporting this task. Results show that, while the injection of frames in an in-context approach does not lead to a clear increase in performance, it has potential; the mapping of annotated spans with FrameNet shows abstract semantic patterns (e.g., 'Kinship', 'Ingest_substance') that potentially pave the way for a more semantically- and semiotically-aware detection of conspiratorial narratives.

On the Challenges and Opportunities of Learned Sparse Retrieval for Code

cs.IR cs.CL Simon Lupart, Maxime Louis, Thibault Formal et al. · Mar 23, 2026

Code retrieval currently relies on dense embeddings, but this paper proposes SPLADE-Code, the first large-scale learned sparse retrieval (LSR) family for code search (600M–8B parameters). The authors address unique challenges including subword fragmentation, semantic gaps between natural language and code, and latency issues from long code documents. Their lightweight single-stage training achieves 75.4 nDCG@10 on MTEB Code under 1B parameters (state-of-the-art for that size) and 79.0 with 8B parameters, while enabling sub-millisecond retrieval via inverted indices.

Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.
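
The latency argument rests on inverted indices: with sparse vectors, scoring only touches the posting lists of terms that appear in the query. The sketch below shows that general mechanism under simple assumptions. The term weights are hand-written placeholders (a learned sparse model would produce them), and none of the names come from SPLADE-Code itself.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def search(index, query_vector, top_k=10):
    """Score documents by sparse dot product, touching only matching postings."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical term weights; a real LSR model would produce these.
docs = {
    "quick_sort.py": {"sort": 1.5, "pivot": 1.0, "list": 0.5},
    "read_file.py": {"file": 1.5, "open": 1.0, "read": 1.0},
}
# "sort" and "list" stand in for expansion terms absent from the literal query.
query = {"order": 0.5, "sort": 1.0, "list": 0.5}
index = build_inverted_index(docs)
results = search(index, query)
print(results)
```

In this toy case the literal query term "order" matches nothing, and the document is found only through the expansion terms, which mirrors the abstract's point that learned expansion bridges lexical and semantic matching.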

Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles

cs.CL Adi Gabay, Gabriel Stanovsky, Liat Peterfreund · Mar 22, 2026

This paper tackles the challenge of evaluating whether large language models perform genuine epistemic reasoning—reasoning about knowledge and partial observations in multi-agent systems—or simply rely on memorization of classic puzzles like the Muddy Children problem. The authors persuasively argue that memorization is better understood as a special case of reduction, where models map new instances to known problems. They introduce a reduction ladder with progressively modified puzzle variants to distinguish reductive from epistemic reasoning, finding that while some models succeed through reduction, all struggle when true epistemic reasoning is required. The work reframes how we interpret LLM performance on canonical reasoning benchmarks and highlights that strong accuracy on classic puzzles may mask a lack of genuine reasoning capability.

Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents' knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.

Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models

cs.CL cs.LG Jinghan Cao, Yu Ma, Xinjin Li et al. · Mar 22, 2026

This paper addresses the high computational cost of deploying Large Language Models (LLMs) in resource-constrained environments by introducing the Performance-Efficiency Ratio (PER), a novel metric that integrates accuracy, throughput, memory, and latency via geometric mean normalization. The authors evaluate 16 open-source language models ranging from 0.5B to 72B parameters across five NLP tasks (IMDB, HellaSwag, ARC-Easy, SQuAD 2.0, and GSM8K), concluding that small models (0.5–3B parameters) consistently achieve superior PER scores compared to their larger counterparts.

Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization. Our systematic evaluation reveals that small models (0.5–3B parameters) achieve superior PER scores across all five tasks. These findings establish quantitative foundations for deploying small models in production environments prioritizing inference efficiency over marginal accuracy gains.
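
The abstract does not spell out the PER formula, so the following is only a minimal sketch of one way a geometric-mean efficiency score over accuracy, throughput, memory, and latency could be computed. The normalization against the best model in the pool, the equal weighting, the dictionary keys, and the function names are all assumptions made for illustration, not the authors' definition.

```python
from typing import Dict, List


def normalize(values: List[float], higher_is_better: bool = True) -> List[float]:
    """Scale positive values to (0, 1], where 1.0 marks the best model in the pool."""
    if higher_is_better:
        best = max(values)
        return [v / best for v in values]
    best = min(values)
    return [best / v for v in values]


def per_scores(models: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Hypothetical PER: geometric mean of four normalized metrics per model.

    Each model entry is assumed to hold strictly positive values under the
    keys 'accuracy', 'throughput', 'memory_gb', and 'latency_ms'.
    """
    names = list(models)
    acc = normalize([models[n]["accuracy"] for n in names])
    thr = normalize([models[n]["throughput"] for n in names])
    # Memory and latency are costs, so lower values score higher.
    mem = normalize([models[n]["memory_gb"] for n in names], higher_is_better=False)
    lat = normalize([models[n]["latency_ms"] for n in names], higher_is_better=False)
    return {
        n: (a * t * m * l) ** 0.25
        for n, a, t, m, l in zip(names, acc, thr, mem, lat)
    }
```

Under this sketch, a model that trails slightly on accuracy but leads on the three cost-related metrics ends up with the higher score, which is the kind of trade-off the abstract describes.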



TiCo: Time-Controllable Training for Spoken Dialogue Models

cs.CL cs.AI eess.AS Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu et al. · Mar 23, 2026

TiCo tackles a critical gap in spoken dialogue models: the inability to control response duration, which is essential for time-constrained scenarios like driving assistants or emergency healthcare. Unlike text length control, speech duration depends on complex factors including phonetics, prosody, and speaking rate. The paper proposes Spoken Time Markers (STMs)—special tokens like <15.0 seconds> inserted during generation—to enable real-time temporal awareness. Using a two-stage post-training framework (self-generated supervised fine-tuning followed by reinforcement learning with verifiable rewards), TiCo equips models to estimate elapsed time and adjust content dynamically to meet target durations.

We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.



DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing

cs.CL Nasser-Eddine Monir, Zakaria Baou · Mar 23, 2026

DATASHI is a new parallel corpus for Tashlhiyt, a critically under-resourced Amazigh language spoken by millions in Morocco but lacking standardized digital resources. The paper introduces 5,000 English–Tashlhiyt sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, designed to benchmark orthography normalization. Using this corpus, the authors evaluate five state-of-the-art LLMs (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) on the normalization task, finding that even the best model (Gemini-2.5-Pro) achieves only moderate accuracy (35.5% WER) and struggles with gemination and emphatic consonants.

DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for Amazigh languages. It contains 5,000 sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, enabling systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks (such as tokenization, translation, and normalization) and also serves as a foundation for read-speech data collection and multimodal alignment. Comprehensive evaluations with state-of-the-art Large Language Models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the lowest word and character-level error rates and exhibiting robust cross-lingual generalization. A fine-grained analysis of edit operations (deletions, substitutions, and insertions) across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further highlights model-specific sensitivities to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.
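
For readers unfamiliar with the word-level error rate mentioned above, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Whitespace tokenization and the function name are assumptions; the paper's exact tokenization and scoring setup are not given in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With placeholder tokens, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 0.5. The character-level rate follows the same recurrence applied to characters instead of words.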



BHDD: A Burmese Handwritten Digit Dataset

cs.CV cs.CL Swan Htet Aung, Hein Htet, Htoo Say Wah Khaing et al. · Mar 23, 2026

This paper introduces BHDD, the first public benchmark dataset for handwritten Burmese digits. Myanmar script's distinctive circular letterforms—originally developed for writing on palm leaves—create recognition challenges distinct from Latin digits, with pairs like 0 and 1 differing only by whether a circle is closed. The authors release 87,561 verified images (28×28 grayscale, MNIST-compatible format) from over 150 contributors, with writer-independent train/test splits and baseline models reaching up to 99.83% accuracy.

We introduce the Burmese Handwritten Digit Dataset (BHDD), a collection of 87,561 grayscale images of handwritten Burmese digits in ten classes. Each image is 28x28 pixels, following the MNIST format. The training set has 60,000 samples split evenly across classes; the test set has 27,561 samples with class frequencies as they arose during collection. Over 150 people of different ages and backgrounds contributed samples. We analyze the dataset's class distribution, pixel statistics, and morphological variation, and identify digit pairs that are easily confused due to the round shapes of the Myanmar script. Simple baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation) reach 99.40%, 99.75%, and 99.83% test accuracy respectively. BHDD is available under CC BY-SA 4.0 at https://github.com/baseresearch/BHDD



Graph Fusion Across Languages using Large Language Models

cs.CL cs.IR Kaung Myat Kyaw, Khush Agarwal, Jonathan Chan · Mar 22, 2026

This paper addresses cross-lingual knowledge graph fusion, where heterogeneous KGs in different languages must be unified without expensive manually-curated seed alignments. The core idea is to use Large Language Models as a universal semantic bridge by linearizing graph triplets into natural language sequences and sequentially agglomerating multiple graphs. This matters because it promises zero-shot alignment capability for low-resource languages where traditional embedding-based methods fail due to lack of training data.

Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.
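
The linearization and sequential agglomeration described above can be illustrated with a rough sketch. It is not the authors' implementation: the `align` callable is a hypothetical stand-in for the LLM step, and it is assumed here to return a mapping from names in the candidate graph to names in the fused graph. The function names, the one-triplet-per-line layout, and the set-based graph representation are likewise assumptions.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def linearize(triples: Iterable[Triple]) -> str:
    """Render each triplet as a '[head] [relation] [tail]' line of plain text."""
    return "\n".join(f"{h} {r} {t}" for h, r, t in sorted(triples))


def fuse(
    graphs: List[Set[Triple]],
    align: Callable[[str, str], Dict[str, str]],
) -> Set[Triple]:
    """Fold graphs into one fused graph, one candidate at a time.

    `align` receives the linearized fused graph and the linearized candidate
    and returns {candidate_name: fused_name} for entities and relations it
    judges equivalent (hypothetical interface for the LLM call).
    """
    fused: Set[Triple] = set(graphs[0])
    for candidate in graphs[1:]:
        mapping = align(linearize(fused), linearize(candidate))
        for h, r, t in candidate:
            # Names without a mapping are kept as-is and enter the fused graph.
            fused.add((mapping.get(h, h), mapping.get(r, r), mapping.get(t, t)))
    return fused
```

Passing `align` as a parameter keeps the sketch runnable without any model: a lookup table or a lambda can stand in for the LLM during testing.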

