Discrete diffusion models have been limited to simplistic noising schemes like uniform corruption or masking, restricting their ability to leverage semantic structure in large vocabularies. This paper introduces GDDS (Generalized Discrete Diffusion from Snapshots), a framework supporting arbitrary continuous-time Markov chain noising processes via exact uniformization-based sampling and a tractable snapshot-level ELBO. The work reports state-of-the-art results on large-scale language modeling tasks and claims to surpass autoregressive baselines for the first time at this scale.
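To make the idea of uniformization-based sampling concrete, here is a minimal sketch of the textbook uniformization construction for a continuous-time Markov chain. It is not taken from the GDDS paper: the 3-state generator `Q`, the function name `uniformization_sample`, and the horizon `t = 2.0` are illustrative assumptions.

```python
import numpy as np

def uniformization_sample(Q, x0, t, rng):
    """Draw X_t given X_0 = x0 for a CTMC with generator Q (rows sum to 0)."""
    Q = np.asarray(Q, dtype=float)
    lam = float(np.max(-np.diag(Q)))       # uniformization rate >= every exit rate
    if lam == 0.0:                         # every state is absorbing
        return int(x0)
    P = np.eye(Q.shape[0]) + Q / lam       # stochastic matrix of the embedded chain
    n_jumps = rng.poisson(lam * t)         # number of candidate jumps in [0, t]
    x = x0
    for _ in range(n_jumps):
        x = rng.choice(Q.shape[0], p=P[x])  # self-loops absorb the "no real jump" events
    return int(x)

# Hypothetical 3-token generator (assumption, for illustration only).
Q = np.array([[-1.0, 0.7, 0.3],
              [0.2, -0.5, 0.3],
              [0.5, 0.5, -1.0]])
rng = np.random.default_rng(0)
samples = [uniformization_sample(Q, 0, 2.0, rng) for _ in range(2000)]
empirical = np.bincount(samples, minlength=3) / len(samples)
```

The sampler is exact in the sense that no time discretization is involved: the only randomness is the Poisson jump count and the discrete transitions, so `empirical` estimates the first row of $e^{tQ}$ up to Monte Carlo error.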
This paper develops a differential-geometric framework for shallow neural networks that treats predictor classes rather than raw parameters as the fundamental objects. By quotienting out permutation and scaling symmetries on a regular set $\Theta_{\mathrm{reg}}$, the authors define a function-induced metric $g_\theta$ and an effective Hessian that removes spurious curvature degeneracies along symmetry orbits. The work connects implicit bias to quotient-level geometry, with concrete analysis for quadratic-activation models where parameters map explicitly to symmetric matrices $Q(\theta)=\sum_{i=1}^m a_i w_i w_i^\top$.
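The symmetries mentioned for the quadratic-activation case can be checked numerically. The sketch below assumes the predictor $f(x)=\sum_i a_i (w_i^\top x)^2 = x^\top Q(\theta)\,x$ and the rescaling $(a_i, w_i) \mapsto (a_i/c_i^2,\; c_i w_i)$; the helper names and the random test data are illustrative, not from the paper.

```python
import numpy as np

def Q_of_theta(a, W):
    """Q(theta) = sum_i a_i w_i w_i^T, where the rows of W are the w_i."""
    return (W.T * a) @ W

def predict(a, W, x):
    """Quadratic-activation network output: sum_i a_i (w_i . x)^2."""
    return float(a @ (W @ x) ** 2)

rng = np.random.default_rng(0)
m, d = 5, 3
a, W = rng.normal(size=m), rng.normal(size=(m, d))
Q_ref = Q_of_theta(a, W)

# Permutation symmetry: relabel the hidden units.
perm = rng.permutation(m)
Q_perm = Q_of_theta(a[perm], W[perm])

# Scaling symmetry: w_i -> c_i w_i, a_i -> a_i / c_i^2.
c = rng.uniform(0.5, 2.0, size=m)
Q_scaled = Q_of_theta(a / c**2, W * c[:, None])

x = rng.normal(size=d)
```

Both transformed parameter vectors give the same matrix as `Q_ref`, and `predict(a, W, x)` agrees with `x @ Q_ref @ x`, which is why the predictor class rather than the raw parameter vector is the natural object here.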
World models for reinforcement learning learn to simulate environment dynamics, yet what they represent internally remains unclear. This paper probes two architecturally distinct models—IRIS (a discrete token transformer) and DIAMOND (a continuous diffusion UNet)—on Atari Breakout and Pong using linear and MLP probes, causal interventions, and attention analysis to test whether they develop structured, interpretable representations of game state. The core finding is that world models develop approximately linear representations of salient state variables (ball position, score) that are not merely correlated but functionally used during prediction.
DMMRL tackles molecular property prediction by addressing two key challenges: entangled representations that obscure structure-property relationships and naïve multi-modal fusion that ignores inter-modal dependencies. The method uses variational autoencoders to decompose graph, sequence, and geometry features into shared (structure-relevant) and private (modality-specific) latent subspaces, enforcing orthogonality between them. A gated attention mechanism then fuses only the shared representations for downstream prediction.
GRPO training for LLM reasoning suffers from expensive rollouts and from compute wasted on zero-variance prompts, where the sampled answers are either all correct or all wrong. This paper proposes Prompt Replay, an overhead-free online method that buffers and reuses medium-difficulty prompts (pass rate near 0.5) to maximize gradient signal while staying on-policy by regenerating responses. By mixing replayed prompts with fresh samples and controlling reuse via cooldown steps and caps, the method aims to accelerate early training, though it eventually plateaus at baseline performance.
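A small sketch shows why zero-variance prompts carry no signal and why a pass rate near 0.5 is the sweet spot. It uses the standard group-normalized advantage with binary rewards; the `is_replay_candidate` band of 0.25 to 0.75 is a hypothetical threshold, since the summary gives no concrete values.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within one prompt's rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def is_replay_candidate(pass_rate, low=0.25, high=0.75):
    """Hypothetical buffer rule: keep prompts whose pass rate is near 0.5."""
    return low <= pass_rate <= high

all_correct = group_advantages([1, 1, 1, 1])   # every rollout right: no learning signal
mixed = group_advantages([1, 0, 1, 0])         # pass rate 0.5: maximal contrast

# Variance of a binary reward with pass rate p is p * (1 - p).
pass_rates = np.linspace(0.0, 1.0, 11)
reward_variance = pass_rates * (1.0 - pass_rates)
```

With identical rewards the advantages are all zero, so the policy gradient for that prompt vanishes, while `reward_variance` peaks at a pass rate of 0.5.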
This paper tackles the lack of explicit memory mechanisms in transformers by introducing Mixture of Chapters (MoC)—a learned bank of 262K latent memory tokens accessed via cross-attention. To scale memory without prohibitive costs, the authors partition the bank into chapters and route each input sequence to a sparse subset (top-64), reducing complexity from $O(L \cdot N_m)$ to $O(L \cdot k \cdot T)$. The work demonstrates that explicit associative memory can serve as a new axis of scaling, showing improved knowledge retention when transitioning from pretraining to instruction fine-tuning.
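The routing idea can be sketched in a few lines. This is an illustration under stated assumptions, not the MoC architecture: it reads $k$ as the number of selected chapters and $T$ as the number of tokens per chapter (the summary does not define them), uses a mean-pooled sequence vector as the routing query, and omits learned projections. All names and sizes are hypothetical.

```python
import numpy as np

def routed_memory_read(x, memory, chapter_keys, k):
    """x: (L, d) sequence; memory: (C, T, d) chapters; chapter_keys: (C, d)."""
    d = x.shape[1]
    query = x.mean(axis=0)                       # one routing query per sequence (assumption)
    top = np.argsort(-(chapter_keys @ query))[:k]  # indices of the k best-matching chapters
    mem = memory[top].reshape(-1, d)             # (k*T, d) tokens actually attended to
    att = x @ mem.T / np.sqrt(d)                 # (L, k*T) scores instead of (L, C*T)
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)        # softmax over the routed memory tokens
    return att @ mem, top, att

rng = np.random.default_rng(0)
L, d, C, T, k = 5, 4, 16, 8, 3                   # toy sizes, far below the 262K bank
x = rng.normal(size=(L, d))
memory = rng.normal(size=(C, T, d))
chapter_keys = rng.normal(size=(C, d))
out, top, att = routed_memory_read(x, memory, chapter_keys, k)
```

In this toy setting the attention matrix has `L * k * T` entries rather than `L * C * T`, which mirrors the stated reduction from $O(L \cdot N_m)$ to $O(L \cdot k \cdot T)$.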
This paper addresses reward hacking in reward-centric diffusion reinforcement learning (RDRL), where diffusion models exploit non-robust reward models to achieve high scores without actual perceptual quality improvements. The authors propose RSA-FT (Reward Sharpness-Aware Fine-Tuning), which mitigates hacking by flattening the reward landscape through joint perturbations in both image space (adversarial training) and parameter space (Sharpness-Aware Minimization). The method is plug-and-play, compatible with existing RDRL frameworks like ReFL and DRaFT, and shows consistent gains across SD1.5, SDXL, SD3, and Flux backbones.
This paper reframes plasticity loss in deep reinforcement learning as an optimization pathology rather than capacity degradation. The core claim, dubbed the Optimization-Centric Plasticity (OCP) hypothesis, is that parameters become trapped in local optima from previous tasks, which are then poor optima for new tasks. The authors prove that neuron dormancy is mathematically equivalent to zero-gradient states and show that plasticity recovers when tasks differ sufficiently, suggesting that networks retain their capacity but lose plasticity to task-specific optimization landscapes.
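The link between dormancy and zero gradients is easy to see for a ReLU unit. The sketch below covers only one direction (a unit that is inactive on every input receives no gradient) on a hypothetical two-layer regression network; it does not reproduce the paper's equivalence proof, and the forced bias of `-100` is just a device to make one neuron dormant.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
W1, b1, w2 = rng.normal(size=(4, 3)), np.zeros(3), rng.normal(size=3)
b1[0] = -100.0                      # push neuron 0 far below threshold on every input

Z = X @ W1 + b1                     # pre-activations, shape (32, 3)
H = np.maximum(Z, 0.0)              # ReLU activations
err = (H @ w2 - y) / len(y)         # dL/dpred for L = 0.5 * mean((pred - y)^2)

dZ = np.outer(err, w2) * (Z > 0)    # backprop through ReLU: masked where Z <= 0
grad_W1 = X.T @ dZ                  # gradient w.r.t. first-layer weights, shape (4, 3)
grad_b1 = dZ.sum(axis=0)            # gradient w.r.t. first-layer biases
grad_w2 = H.T @ err                 # gradient w.r.t. output weights
```

Column 0 of `grad_W1`, `grad_b1[0]`, and `grad_w2[0]` are all exactly zero, so gradient descent alone cannot revive the dormant unit, while the other two units still receive updates.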
Standard AlphaZero-style tree search for LLM reasoning suffers from a scaling failure: on GSM8K and Game24, accuracy actually drops as the search budget increases beyond moderate levels. This paper introduces ReSCALE, which replaces PUCT selection and Dirichlet noise with Gumbel sampling and Sequential Halving—a best-arm identification technique from multi-armed bandits. The key insight is that root action-selection design is critical for budget-scalable reasoning without any changes to the model or its training.
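For readers unfamiliar with the two ingredients, here is a simplified root-selection sketch combining Gumbel-top-m sampling with Sequential Halving. It is not the ReSCALE implementation: the `evaluate` callback, the `scale` factor (a stand-in for a monotone transform of value estimates), and the toy bandit are all assumptions made for illustration.

```python
import math
import numpy as np

def gumbel_sequential_halving(logits, evaluate, budget, m, rng, scale=1.0):
    """Pick a root action: Gumbel-top-m candidates, then halve them round by round."""
    logits = np.asarray(logits, dtype=float)
    g = rng.gumbel(size=logits.shape[0])
    # Gumbel-top-m: samples m distinct actions from softmax(logits) without replacement.
    candidates = [int(i) for i in np.argsort(-(g + logits))[:m]]
    rounds = max(1, math.ceil(math.log2(m)))
    sums, counts = np.zeros_like(logits), np.zeros_like(logits)
    while len(candidates) > 1:
        # Split the budget evenly across rounds, then across surviving candidates.
        per_arm = max(1, budget // (rounds * len(candidates)))
        for a in candidates:
            for _ in range(per_arm):
                sums[a] += evaluate(a, rng)
                counts[a] += 1
        q = sums / np.maximum(counts, 1)
        candidates.sort(key=lambda a: -(g[a] + logits[a] + scale * q[a]))
        candidates = candidates[: math.ceil(len(candidates) / 2)]  # keep the better half
    return candidates[0]

# Toy bandit (assumption): arm 5 has the highest true value.
true_values = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 0.3, 0.2])

def noisy_rollout(a, rng):
    return true_values[a] + 0.1 * rng.normal()

rng = np.random.default_rng(0)
best = gumbel_sequential_halving(np.zeros(8), noisy_rollout, budget=800, m=8,
                                 rng=rng, scale=50.0)
```

Unlike PUCT, the evaluation budget here is fixed in advance and spent in a planned schedule, so a larger budget simply means more evaluations per surviving candidate in each round.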
ViCLSR adapts supervised contrastive learning (SimCSE-style) to Vietnamese NLU by converting NLI entailment and contradiction pairs into positive and negative training signals. Built on XLM-R Large (550M), the framework improves sentence embeddings for low-resource Vietnamese, reporting gains of +6.97% F1 over PhoBERT on ViNLI and state-of-the-art results across five downstream tasks including fact-checking and machine reading comprehension.
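The way NLI pairs become contrastive signals can be sketched with the supervised SimCSE-style objective, in which each premise is pulled toward its entailment hypothesis and pushed away from all contradiction hypotheses in the batch. The summary does not give ViCLSR's exact loss, so the temperature `tau=0.05`, the function name, and the random stand-in embeddings are assumptions.

```python
import numpy as np

def sup_contrastive_loss(anchor, positive, negative, tau=0.05):
    """InfoNCE over in-batch positives plus hard negatives, using cosine similarity."""
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    a, p, n = unit(anchor), unit(positive), unit(negative)
    logits = np.concatenate([a @ p.T, a @ n.T], axis=1) / tau   # (B, 2B)
    logits = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(a))
    return float(-log_prob[idx, idx].mean())     # target: the i-th positive for anchor i

rng = np.random.default_rng(0)
premise = rng.normal(size=(8, 16))                  # stand-in sentence embeddings
entail = premise + 0.1 * rng.normal(size=(8, 16))   # entailment: close to the premise
contra = rng.normal(size=(8, 16))                   # contradiction: unrelated vector
loss_aligned = sup_contrastive_loss(premise, entail, contra)
loss_swapped = sup_contrastive_loss(premise, contra, entail)
```

When the embeddings already respect the NLI labels the loss is small, and swapping the roles of entailment and contradiction makes it much larger, which is the signal an encoder would be trained on.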
Sonny tackles the compute barrier in medium-range weather forecasting by proposing a hierarchical transformer that trains on a single A40 GPU in 5.5 days. The core idea is a two-stage StepsNet pipeline: a narrow 'slow path' processes large-scale dynamics (U,V,Z,P) first, then a full-width 'fast path' integrates thermodynamics (T,Q). Combined with EMA during training, randomized dynamics forecasting, and pressure-weighted losses, Sonny aims to deliver competitive forecast skill without the TPU/GPU cluster requirements of models like Pangu-Weather or GraphCast.
This paper tackles the tension between local melodic continuity and global structural coherence in symbolic music generation. It proposes a hybrid architecture fusing a Transformer encoder (for global patterns) with an LSTM decoder (for temporal precision), evaluating it against pure LSTM and Transformer baselines using 17 musical quality metrics on 1,000 generated melodies per model. The work matters because it provides systematic evidence that architectural hybridization can reconcile the complementary strengths of memory-based and attention-based models.
The paper addresses federated fine-tuning of Mixture-of-Experts (MoE) based large language models under non-IID data distributions, where direct parameter aggregation causes gating preference misalignment and expert semantic blurring. The proposed FedAlign-MoE framework introduces consistency-based gating distribution alignment using routing consistency weighting ($\omega_i(e) = s_i(e)/\sum_j s_j(e)$) and semantic-aware expert aggregation via region-conditioned gated weights ($\gamma_{i,j}(e)$). This matters because MoE architectures are increasingly vital for scaling LLMs efficiently, yet data heterogeneity across federated clients undermines their specialization benefits.
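The routing consistency weighting $\omega_i(e) = s_i(e)/\sum_j s_j(e)$ normalizes each client's score for an expert across clients. The sketch below computes it and, as an assumption not confirmed by the summary, uses it for a weighted average of per-expert gating vectors; the scores, shapes, and function names are hypothetical, and every expert is assumed to have a positive total score.

```python
import numpy as np

def consistency_weights(s):
    """s[i, e]: consistency score of client i for expert e. Returns omega[i, e]."""
    s = np.asarray(s, dtype=float)
    return s / s.sum(axis=0, keepdims=True)   # normalize over clients, per expert

def aggregate_gates(gates, s):
    """gates[i, e, :]: client i's gating vector for expert e. Returns shape (E, D)."""
    omega = consistency_weights(s)
    return np.einsum("ie,ied->ed", omega, gates)

# Three clients, two experts (illustrative scores).
s = np.array([[0.9, 0.1],
              [0.3, 0.3],
              [0.3, 0.6]])
omega = consistency_weights(s)
rng = np.random.default_rng(0)
gates = rng.normal(size=(3, 2, 4))
aggregated = aggregate_gates(gates, s)
```

Each column of `omega` sums to one, so a client that routes consistently to an expert dominates that expert's aggregate, and clients that already agree are left unchanged by the averaging.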
This paper formalizes the transformer context window as an I/O page and proves that tool-augmented agents with indexed external memory achieve exponential retrieval cost savings over sequential scanning: $\mathcal{O}(\log_b N)$ versus $\Omega(N)$ page reads. The authors validate these predictions experimentally across three content types and identify "parametric memory competition" as a failure mode where models bypass retrieval protocols for familiar content.
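The gap between the two bounds is easy to tabulate. This sketch counts worst-case page reads under a simplified model that is an assumption here: a sequential scan touches all $N$ pages, while a $b$-ary index needs one read per level to single out one page. The branching factor `b = 10` is illustrative.

```python
def index_depth(n_pages, b):
    """Smallest h with b**h >= n_pages: levels of a b-ary index over n_pages."""
    depth, reach = 0, 1
    while reach < n_pages:
        reach *= b
        depth += 1
    return depth

def sequential_reads(n_pages):
    """Worst case for scanning: the target is on the last page."""
    return n_pages

b = 10
table = [(n, sequential_reads(n), index_depth(n, b)) for n in (10**2, 10**4, 10**6)]
for n, scan, indexed in table:
    print(f"N={n:>9,}  scan={scan:>9,}  indexed={indexed}")
```

Growing the store by a factor of $b$ adds only one indexed read, whereas the scan cost grows by the same factor $b$.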
Modern AI services increasingly run across the computing continuum—from cloud to edge devices—yet fault management remains challenging due to resource constraints, noisy telemetry, and cascading failures. This paper proposes NeSy-Edge, a three-layer neuro-symbolic framework that performs local log parsing, causal graph construction, and root-cause analysis on edge nodes, invoking cloud LLMs only when local evidence is insufficient. The core idea is to combine lightweight symbolic caching and prior-constrained causal discovery with selective neural inference, trading off autonomy against accuracy under strict memory budgets ($\sim$1500 MB).
This paper tackles multimodal misinformation detection by distinguishing between harmful and harmless visual content manipulation—a nuance often overlooked by existing methods. The authors propose Havc-m4d, a framework that extracts manipulation and intention features using weakly-supervised positive-unlabeled (PU) learning to overcome the lack of ground-truth manipulation labels. By treating real articles with manipulated visuals as likely harmless and fake articles as potentially harmful, the method introduces intention-aware cues that consistently improve detection across four benchmark datasets.
This paper addresses selection bias (position and label bias) in large language models during discrete-choice tasks like multiple-choice questions and pairwise evaluation. The authors propose Permutation-Aware GRPO (PA-GRPO), which extends Group Relative Policy Optimization by treating different permutations of the same question as a single training group rather than independent instances. The method enforces semantic consistency across permutations through two mechanisms: a cross-permutation advantage that computes rewards relative to the group mean, and a consistency-aware reward that penalizes disagreement across permutations. Experiments across seven benchmarks and three models (Llama-3.1-8B, Qwen3-8B, Qwen3-32B) demonstrate that PA-GRPO reduces selection bias while maintaining accuracy.
This paper addresses the challenge of "intelligent disobedience" in shared autonomy: an assistive AI must sometimes override human commands to prevent harm while remaining helpful. The authors formalize this as the Intelligent Disobedience Game (IDG), a sequential Stackelberg game where a human leader proposes actions and an assistive follower with superior environmental awareness decides whether to obey or intervene. The framework aims to provide the mathematical foundations for training safety-critical assistive systems.
This paper addresses the problem of forecasting outlier events far in advance in time series data, rather than merely detecting immediate anomalies. The authors propose a two-layer framework that first computes outlier scores using standard detection methods, then models the temporal structure of these scores to predict future anomalies. By assuming that outlier occurrences exhibit temporal patterns (e.g., periodicity or delayed dependencies), the method aims to forecast outlier likelihoods without requiring future observations.
This paper investigates why compressing different weight matrices in transformers leads to wildly different outcomes—from negligible impact to 20,000× perplexity increases. The authors map this structural sensitivity across five architectures, revealing that early-layer MLP up-projections are catastrophically fragile while value projections are nearly free to compress. Using Lyapunov stability theory, they explain how residual connections contract errors, and they provide machine-checked formal bounds in Lean 4 to guarantee per-matrix approximation quality.
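As a reference point for what a per-matrix approximation bound can look like, the sketch below assumes compression by truncated SVD, which the summary does not confirm as the paper's method. For that scheme the Eckart-Young theorem gives the Frobenius error exactly from the discarded singular values; the random matrix and the rank of 8 are illustrative.

```python
import numpy as np

def low_rank_compress(W, rank):
    """Best rank-`rank` approximation of W in Frobenius norm (truncated SVD)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))         # stand-in for one weight matrix
rank = 8
W_hat = low_rank_compress(W, rank)

S = np.linalg.svd(W, compute_uv=False)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
# Eckart-Young: the error comes entirely from the discarded singular values.
predicted = np.sqrt((S[rank:] ** 2).sum() / (S ** 2).sum())
```

The measured `rel_err` matches `predicted`, but this number only describes the matrix itself. How much such an error hurts perplexity depends on where the matrix sits in the network, which is the sensitivity the paper maps.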