This paper addresses a key gap in language model research by conducting the first tightly controlled comparison between autoregressive (AR) and masked diffusion language models (MDLM). The author trains both models on identical data (50M tokens from TinyStories), identical compute budget (20K steps, batch size 32), and identical hardware (NVIDIA H100), isolating the generation paradigm as the sole variable. The work is significant because prior studies compared these paradigms at different scales or with different datasets, making it impossible to attribute observed differences to the core architectural distinction itself.
The paper tackles the challenge of enhancing long-context reasoning in Large Language Models (LLMs), a critical capability as real-world tasks grow more complex. It proposes structured table data as a solution, demonstrating mathematically via mutual information analysis that tables exhibit periodic, non-vanishing dependencies, whereas dependencies in natural language decay polynomially, which makes tables well suited to training long-context reasoning. The authors present TableLong, a scalable pipeline for synthesizing diverse, verifiable table data for reinforcement learning, and report significant performance gains across benchmarks.
This paper proposes SqueezeComposer, a long-form music generation framework that tackles computational constraints by applying temporal speed-up (e.g., 2×, 4×, 8×) to compress audio sequences before generation. The core idea is to generate music in an accelerated domain using diffusion models, then restore it to normal speed, theoretically enabling models to produce 10+ minute compositions with fixed memory budgets. The approach is tested on continuation, completion, and singing accompaniment tasks.
Cross-tokenizer knowledge distillation faces a fundamental alignment challenge when Teacher and Student models use different vocabularies. This paper analyzes DSKD-CMA, the state-of-the-art method for this setting, through manual chunk alignment probes and reveals that its cross-model attention mechanism captures coarse chunk structures but suffers from noisy localization with repeated tokens. Building on this insight, the authors propose DSKD-CMA-GA, which uses generative adversarial key-query matching to align distributions between models, achieving modest improvements in ROUGE-L scores that narrow the gap between cross-tokenizer and same-tokenizer distillation.
ThinkJEPA addresses the limitation of JEPA-style latent world models that rely on short, densely sampled windows, which bias predictions toward local dynamics while missing long-horizon semantics. The paper proposes a dual-temporal architecture combining a dense-frame V-JEPA branch for fine-grained motion with a sparsely sampled VLM "thinker" branch that provides semantic guidance via multi-layer feature pyramids. This matters because it attempts to marry the physical consistency of latent world models with the general knowledge of vision-language models for robust trajectory forecasting.
Weather captioning—generating natural language descriptions from meteorological time series—sits at the intersection of time-series analysis and domain-specific NLG. This paper proposes WeatherTGD, a training-free framework that treats caption refinement as gradient descent in text space: three specialized LLM agents (Statistical, Physics, Meteorology) output textual gradients that are fused via a consensus-aware mechanism and applied iteratively to improve an initial caption. The approach aims to bridge the gap between numerical forecasting and human-interpretable explanations without any model fine-tuning.
TAMTRL addresses the temporal credit assignment problem in multi-turn RL for long-context document processing. When LLMs process documents chunk-by-chunk with memory updates, standard outcome-only rewards cannot distinguish good from bad intermediate memory updates. The paper proposes using the model itself as a teacher: during training, it provides the model with filtered (relevant-only) chunks and uses the normalized token probabilities of the generated memory as turn-level rewards. This avoids expensive rollouts or external judges while providing fine-grained supervision for each turn.
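The turn-level reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the "normalized token probability" is the geometric mean of per-token probabilities (the exponential of the mean log-probability), and the function name `turn_reward` is hypothetical.

```python
import math

def turn_reward(token_logprobs):
    """Length-normalized probability of one generated memory update.

    token_logprobs: natural-log probabilities of each memory token, scored by
    the same model when it is shown only the relevant chunks (teacher pass).
    Geometric-mean normalization is an assumption made for this sketch.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# Toy log-probabilities for two turns (illustrative values only).
turn_logprobs = [
    [math.log(0.9), math.log(0.8)],                 # memory the teacher pass finds likely
    [math.log(0.2), math.log(0.1), math.log(0.3)],  # memory the teacher pass finds unlikely
]
rewards = [turn_reward(lp) for lp in turn_logprobs]
```

Normalizing by length keeps long memory updates from being penalized simply for containing more tokens, so each turn receives a comparable reward.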
This paper reframes climate disinformation detection from classification to retrieval, treating narrative core messages as queries to rank corpus texts without fixed taxonomies. The authors propose SpecFi, which generates hypothetical documents using community summaries from graph-based detection (NodeRAG) as few-shot examples. The approach achieves a MAP of 0.505 on CARDS and demonstrates robustness to the high narrative variance that cripples standard baselines.
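For readers unfamiliar with the metric reported above, the following is a small sketch of the standard mean average precision (MAP) computation over ranked retrieval results. The document identifiers are placeholders, and the sketch assumes each ranking contains no duplicate documents.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query, given a ranked list of document ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy queries with placeholder ids.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["a", "b"], {"b"}),
]
map_score = mean_average_precision(runs)
```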
This paper investigates how users decode emotions in text-based communication through electronic nonverbal cues (eNVCs)—orthographic signals like elongation, punctuation, and emojis that approximate paralinguistic features. The authors propose a taxonomy grounded in nonverbal communication theory (kinesics and paralinguistics) and test it across three complementary studies: a content analysis developing a regex detection toolkit, a within-subjects experiment manipulating eNVC presence and sarcasm ($n=513$), and focus groups exploring interpretive strategies. The work identifies sarcasm as a critical boundary condition where eNVCs fail to aid interpretation and provides an open-source Python/R package for automated cue detection.
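As a rough illustration of what regex-based cue detection might look like, here is a minimal sketch. The patterns and the names `ENVC_PATTERNS` and `count_envcs` are assumptions for illustration only; they are not taken from the authors' package, and the emoji ranges cover only part of Unicode.

```python
import re

# Hypothetical patterns for a few cue types named in the summary.
ENVC_PATTERNS = {
    # A letter repeated three or more times inside a word, e.g. "sooooo".
    "elongation": re.compile(r"\b\w*([^\W\d_])\1{2,}\w*\b"),
    # Runs of two or more exclamation or question marks.
    "repeated_punctuation": re.compile(r"[!?]{2,}"),
    # Words of three or more capital letters.
    "all_caps_word": re.compile(r"\b[A-Z]{3,}\b"),
    # A partial emoji range (pictographs plus miscellaneous symbols).
    "emoji": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
}

def count_envcs(text):
    """Return how many times each cue type occurs in the text."""
    return {
        name: sum(1 for _ in pattern.finditer(text))
        for name, pattern in ENVC_PATTERNS.items()
    }

counts = count_envcs("I am sooooo happy!!! \U0001F600 REALLY")
```

Counting with `finditer` rather than `findall` avoids the capture group in the elongation pattern changing what is returned.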
This paper identifies "semantic shift"—the intrinsic evolution of meaning within a text—as the root cause of embedding pathologies like anisotropy and length-induced collapse. The authors argue that pooling-based aggregation forces "semantic smoothing," where diverse sentences compromise into a diluted representation. They formalize semantic shift as the product of local evolution and global dispersion ($\mathrm{Shift}(k) = \mathrm{Local}(k) \cdot \mathrm{Disp}(k)$), showing through controlled concatenation experiments that it predicts embedding concentration and retrieval degradation better than text length alone. The work reframes geometric pathologies not as inherent model defects but as consequences of content structure interacting with pooling mechanics.
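The product form $\mathrm{Shift}(k) = \mathrm{Local}(k) \cdot \mathrm{Disp}(k)$ can be made concrete with a small sketch. The summary does not define the two factors, so the definitions below are assumptions: Local is taken as the mean cosine distance between consecutive sentence embeddings, and Disp as the mean cosine distance of each sentence embedding to the centroid. All function names are hypothetical, and embeddings (and their centroid) are assumed to be non-zero.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def local_evolution(embs):
    """Assumed Local(k): mean distance between consecutive sentences."""
    steps = [cosine_distance(a, b) for a, b in zip(embs, embs[1:])]
    return sum(steps) / len(steps) if steps else 0.0

def global_dispersion(embs):
    """Assumed Disp(k): mean distance of each sentence to the centroid."""
    dim = len(embs[0])
    centroid = [sum(e[i] for e in embs) / len(embs) for i in range(dim)]
    return sum(cosine_distance(e, centroid) for e in embs) / len(embs)

def semantic_shift(embs):
    """Shift(k) = Local(k) * Disp(k) over the first k sentence embeddings."""
    return local_evolution(embs) * global_dispersion(embs)

# Toy 2-d "embeddings": a repetitive text versus one that changes topic.
repetitive = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
varied = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

Under these assumed definitions, a text that repeats the same content has zero shift, while a text whose sentences alternate between topics has a positive shift.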
This paper investigates whether cross-lingual transfer (CLT)—prompting models to translate queries to English, reason in English, then translate answers back—can bridge the performance gap for low-resource languages. The authors benchmark eight LLMs across 2,000 responses in Kazakh and Mongolian, finding that CLT selectively benefits bilingual models (+2.2–4.3pp) but not English-first architectures, while revealing a concerning "fluency illusion" where models appear fluent in LRLs while producing less accurate content.
LLM annotations encode some human perspectives better than others, especially in subjective tasks where demographic background shapes judgments. This paper introduces Perspective-Driven Inference (PDI), a statistical framework that treats the distribution of group-specific annotations as a vector estimand $\theta^* = (\theta^*_{g_1}, \dots, \theta^*_{g_K})$ and adaptively allocates limited human labels to groups where LLM proxies are least reliable. The core contribution is an error-predictor-driven sampling rule that improves estimation accuracy for harder-to-model demographics while maintaining valid frequentist coverage.
The paper proposes the Conspiracy Frame, a semiotic and frame-semantic representation of conspiratorial narratives with five elements (plan, secret, in-group, out-group, call-to-action), and introduces Con.Fra., a span-annotated Telegram corpus. The core hypothesis is that injecting FrameNet-derived semantic frames into LLM prompts will improve conspiracy detection and explainability. Results show that while frame-guided prompting achieves comparable classification scores to few-shot learning, it does not consistently outperform it, though it reveals interesting abstract semantic patterns.
Code retrieval currently relies on dense embeddings, but this paper proposes SPLADE-Code, the first large-scale learned sparse retrieval (LSR) family for code search (600M–8B parameters). The authors address unique challenges including subword fragmentation, semantic gaps between natural language and code, and latency issues from long code documents. Their lightweight single-stage training achieves 75.4 nDCG@10 on MTEB Code under 1B parameters (state-of-the-art for that size) and 79.0 with 8B parameters, while enabling sub-millisecond retrieval via inverted indices.
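To show why sparse representations allow fast lookup through inverted indices, here is a minimal sketch of the general technique. It is not SPLADE-Code itself: the term weights are made-up toy values rather than model outputs, and the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to the list of (doc_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index

def search(index, query_vector, top_k=10):
    """Score documents by sparse dot product, visiting only query terms."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy sparse vectors; in a learned sparse retriever these come from the model.
docs = {
    "sort.py": {"sort": 1.5, "list": 0.8},
    "http.py": {"request": 1.2, "url": 0.9, "list": 0.1},
}
index = build_inverted_index(docs)
results = search(index, {"sort": 1.0, "list": 0.5})
```

Only the posting lists for terms present in the query are touched, which is why query cost depends on the number of active terms rather than on the size of the corpus.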
This paper tackles the challenge of evaluating whether large language models perform genuine epistemic reasoning—reasoning about knowledge and partial observations in multi-agent systems—or simply rely on memorization of classic puzzles like the Muddy Children problem. The authors persuasively argue that memorization is better understood as a special case of reduction, where models map new instances to known problems. They introduce a reduction ladder with progressively modified puzzle variants to distinguish reductive from epistemic reasoning, finding that while some models succeed through reduction, all struggle when true epistemic reasoning is required. The work reframes how we interpret LLM performance on canonical reasoning benchmarks and highlights that strong accuracy on classic puzzles may mask a lack of genuine reasoning capability.
This paper addresses the high computational cost of deploying Large Language Models (LLMs) in resource-constrained environments by introducing the Performance-Efficiency Ratio (PER), a novel metric that integrates accuracy, throughput, memory, and latency via geometric mean normalization. The authors evaluate 16 open-source language models ranging from 0.5B to 72B parameters across five NLP tasks (IMDB, HellaSwag, ARC-Easy, SQuAD 2.0, and GSM8K), concluding that small models (0.5–3B parameters) consistently achieve superior PER scores compared to their larger counterparts.
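One way a geometric-mean metric of this kind could be computed is sketched below. The exact normalization used for PER is not given in the summary, so this sketch assumes best-in-pool normalization: accuracy and throughput are divided by the best value among the compared models, while memory and latency use the best (lowest) value divided by the model's own. The field names and all numbers are illustrative, not results from the paper.

```python
import math

def performance_efficiency_ratio(models):
    """Geometric mean of four normalized factors, each in (0, 1].

    models: name -> dict with positive 'accuracy', 'throughput',
    'memory_gb', and 'latency_s' values. Normalization is an assumption.
    """
    best_acc = max(m["accuracy"] for m in models.values())
    best_tput = max(m["throughput"] for m in models.values())
    best_mem = min(m["memory_gb"] for m in models.values())
    best_lat = min(m["latency_s"] for m in models.values())
    scores = {}
    for name, m in models.items():
        factors = [
            m["accuracy"] / best_acc,      # higher is better
            m["throughput"] / best_tput,   # higher is better
            best_mem / m["memory_gb"],     # lower is better, so inverted
            best_lat / m["latency_s"],     # lower is better, so inverted
        ]
        scores[name] = math.prod(factors) ** (1.0 / len(factors))
    return scores

# Made-up toy values for two hypothetical models.
toy_models = {
    "small": {"accuracy": 0.6, "throughput": 100.0, "memory_gb": 2.0, "latency_s": 0.1},
    "large": {"accuracy": 0.8, "throughput": 10.0, "memory_gb": 40.0, "latency_s": 1.0},
}
scores = performance_efficiency_ratio(toy_models)
```

Because the factors are multiplied, a model that is weak on any single axis is pulled down sharply, which is one reason a metric of this shape can favor small models even when their accuracy is lower.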
TiCo tackles a critical gap in spoken dialogue models: the inability to control response duration, which is essential for time-constrained scenarios like driving assistants or emergency healthcare. Unlike text length control, speech duration depends on complex factors including phonetics, prosody, and speaking rate. The paper proposes Spoken Time Markers (STMs)—special tokens like <15.0 seconds> inserted during generation—to enable real-time temporal awareness. Using a two-stage post-training framework (self-generated supervised fine-tuning followed by reinforcement learning with verifiable rewards), TiCo equips models to estimate elapsed time and adjust content dynamically to meet target durations.
DATASHI is a new parallel corpus for Tashlhiyt, a critically under-resourced Amazigh language spoken by millions in Morocco but lacking standardized digital resources. The paper introduces 5,000 English–Tashlhiyt sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, designed to benchmark orthography normalization. Using this corpus, the authors evaluate five state-of-the-art LLMs (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) on the normalization task, finding that even the best model (Gemini-2.5-Pro) reaches only a 35.5% word error rate (WER) and struggles with gemination and emphatic consonants.
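The WER figure above follows the standard definition: the word-level edit distance between a system output and the reference, divided by the number of reference words. A minimal sketch is shown here, using placeholder tokens rather than real Tashlhiyt text and whitespace tokenization as a simplifying assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One substitution and one deletion against a four-word reference.
wer = word_error_rate("a b c d", "a x c")
```

Since WER is an error rate, lower is better, and it can exceed 1.0 when the hypothesis contains many inserted words.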
This paper introduces BHDD, the first public benchmark dataset for handwritten Burmese digits. Myanmar script's distinctive circular letterforms—originally developed for writing on palm leaves—create recognition challenges distinct from Latin digits, with pairs like 0 and 1 differing only by whether a circle is closed. The authors release 87,561 verified images (28×28 grayscale, MNIST-compatible format) from over 150 contributors, with writer-independent train/test splits and baseline models reaching up to 99.83% accuracy.
This paper addresses cross-lingual knowledge graph fusion, where heterogeneous KGs in different languages must be unified without expensive manually-curated seed alignments. The core idea is to use Large Language Models as a universal semantic bridge by linearizing graph triplets into natural language sequences and sequentially agglomerating multiple graphs. This matters because it promises zero-shot alignment capability for low-resource languages where traditional embedding-based methods fail due to lack of training data.