OrbitStream addresses adaptive 360° video streaming for teleoperation by proposing a training-free framework that combines semantic scene understanding with robust control theory. It formulates viewport prediction as a Gravitational Viewport Prediction (GVP) problem where semantic objects (pedestrians, vehicles) generate potential fields that "attract" user gaze with task-relevant mass, while a Saturation-Based Proportional-Derivative (PD) Controller handles bitrate adaptation. This offers an interpretable, zero-shot alternative to black-box Deep Reinforcement Learning methods for safety-critical systems where deployment constraints prohibit lengthy training.
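As a rough illustration of the saturation-based PD idea, the sketch below adjusts a bitrate from a buffer-level error and clamps the control action. The choice of buffer level as the controlled signal, the gains `kp` and `kd`, the step limit, and the bitrate range are all assumptions made for this example, not details taken from the paper.

```python
def saturate(value, limit):
    """Clamp value to the symmetric range [-limit, limit]."""
    return max(-limit, min(limit, value))


def pd_bitrate_step(bitrate, buffer_s, prev_buffer_s, target_buffer_s=2.0,
                    kp=0.5, kd=0.2, max_step=1.0, min_rate=1.0, max_rate=20.0):
    """One PD update of the bitrate (Mbps) from the buffer error (seconds).

    All parameter names and default values are hypothetical.
    """
    error = buffer_s - target_buffer_s    # positive when the buffer has headroom
    d_error = buffer_s - prev_buffer_s    # per-step change of the error
    # Saturation bounds the control action so one noisy sample cannot
    # cause an arbitrarily large bitrate jump.
    step = saturate(kp * error + kd * d_error, max_step)
    return max(min_rate, min(max_rate, bitrate + step))


print(pd_bitrate_step(10.0, buffer_s=10.0, prev_buffer_s=2.0))
```

The clamp on `step` is what makes the controller's worst-case behavior easy to reason about: whatever the measurement, the bitrate moves by at most `max_step` per update.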
Code retrieval currently relies on dense embeddings. This paper instead proposes SPLADE-Code, the first large-scale learned sparse retrieval (LSR) family for code search (600M–8B parameters). The authors address unique challenges including subword fragmentation, semantic gaps between natural language and code, and latency issues from long code documents. Their lightweight single-stage training achieves 75.4 nDCG@10 on MTEB Code with fewer than 1B parameters (state-of-the-art for that size) and 79.0 with 8B parameters, while enabling sub-millisecond retrieval via inverted indices.
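To make the inverted-index claim concrete, here is a minimal sketch of how sparse term-weight vectors can be scored with a dot product over posting lists. The toy vocabulary, weights, and function names are invented for illustration; a learned sparse model would produce the weights, and nothing here reproduces SPLADE-Code itself.

```python
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index


def search(index, query_vector, top_k=3):
    """Score documents by the sparse dot product with the query."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        # Only documents that share a term with the query are touched.
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical sparse vectors, standing in for model outputs.
docs = {
    "a": {"sort": 1.5, "list": 0.8},
    "b": {"http": 2.0, "request": 1.2},
    "c": {"sort": 0.4, "dict": 1.1},
}
index = build_inverted_index(docs)
results = search(index, {"sort": 1.0, "list": 0.5})
print(results)
```

Because scoring only visits posting lists for the query's nonzero terms, the cost depends on how sparse the vectors are rather than on the collection size alone.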
RAMPAGE addresses discretization bias in Extragradient (EG) methods for variational inequalities by replacing the deterministic midpoint with randomized sampling. The core idea uses uniform sampling to construct an unbiased estimator of the continuous-time flow integral, while RAMPAGE+ leverages antithetic variates to eliminate first-order variance terms. This matters for training GANs and other non-conservative games where EG's $\mathcal{O}(\eta^2)$ bias causes divergence in highly nonlinear regimes.
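The randomized-midpoint idea can be sketched as follows, under stated assumptions: the extrapolation point is drawn uniformly along the Euler segment, and the antithetic variant averages the operator at the paired points `alpha` and `1 - alpha`. The toy operator, step size, and function names are hypothetical, and this is not claimed to be the authors' exact update rule.

```python
import math
import random


def operator(z):
    """Toy strongly monotone operator F(z) = z + A z, with A = [[0, 1], [-1, 0]]."""
    x, y = z
    return (x + y, y - x)


def randomized_step(z, eta, rng, antithetic=False):
    """One extragradient-style step with a uniformly sampled extrapolation point."""
    alpha = rng.random()  # alpha ~ U[0, 1)
    alphas = (alpha, 1.0 - alpha) if antithetic else (alpha,)
    fz = operator(z)
    avg = [0.0, 0.0]
    for a in alphas:
        # Probe the operator at a random point on the segment z -> z - eta * F(z).
        probe = (z[0] - a * eta * fz[0], z[1] - a * eta * fz[1])
        fp = operator(probe)
        avg[0] += fp[0] / len(alphas)
        avg[1] += fp[1] / len(alphas)
    return (z[0] - eta * avg[0], z[1] - eta * avg[1])


def solve(steps=200, eta=0.1, seed=0, antithetic=False):
    """Iterate from (1, 1) and return the final point."""
    rng = random.Random(seed)
    z = (1.0, 1.0)
    for _ in range(steps):
        z = randomized_step(z, eta, rng, antithetic)
    return z


print(math.hypot(*solve()), math.hypot(*solve(antithetic=True)))
```

On this linear toy problem the antithetic pair averages out exactly to the deterministic midpoint, which gives some intuition for why pairing samples can cancel first-order variance.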
This paper tackles the challenge of evaluating whether large language models perform genuine epistemic reasoning—reasoning about knowledge and partial observations in multi-agent systems—or simply rely on memorization of classic puzzles like the Muddy Children problem. The authors persuasively argue that memorization is better understood as a special case of reduction, where models map new instances to known problems. They introduce a reduction ladder with progressively modified puzzle variants to distinguish reductive from epistemic reasoning, finding that while some models succeed through reduction, all struggle when true epistemic reasoning is required. The work reframes how we interpret LLM performance on canonical reasoning benchmarks and highlights that strong accuracy on classic puzzles may mask a lack of genuine reasoning capability.
Medical vision-language models (VLMs) are increasingly evaluated for consistency—the invariance of predictions under paraphrased prompts—as a proxy for clinical reliability. This paper demonstrates that consistency alone is a fundamentally flawed safety metric because models can achieve perfect consistency by learning text shortcuts while completely ignoring the input image. The authors introduce a four-quadrant per-sample taxonomy that jointly evaluates consistency and image reliance, revealing that models optimized for low flip rates often shift samples into a 'Dangerous' quadrant where predictions are stable, accurate, and confident yet unchanged when the image is removed. Their findings expose a critical deployment trap: standard evaluation pipelines risk preferentially selecting models that appear reliable while being decision-invariant to visual evidence.
This paper addresses the high computational cost of deploying Large Language Models (LLMs) in resource-constrained environments by introducing the Performance-Efficiency Ratio (PER), a novel metric that integrates accuracy, throughput, memory, and latency via geometric mean normalization. The authors evaluate 16 open-source language models ranging from 0.5B to 72B parameters across five NLP tasks (IMDB, HellaSwag, ARC-Easy, SQuAD 2.0, and GSM8K), concluding that small models (0.5–3B parameters) consistently achieve superior PER scores compared to their larger counterparts.
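A geometric-mean score of this kind might be computed as in the sketch below. The normalization (each metric divided by the best value in the comparison set, with memory and latency inverted so that lower is better) is an assumption, and the two model entries are made-up numbers for illustration only, not results from the paper.

```python
import math


def performance_efficiency_ratio(models):
    """Geometric mean of four metrics, each normalized to (0, 1].

    The normalization scheme is an assumption for this sketch.
    """
    best_acc = max(m["accuracy"] for m in models.values())
    best_tput = max(m["throughput"] for m in models.values())
    min_mem = min(m["memory_gb"] for m in models.values())
    min_lat = min(m["latency_ms"] for m in models.values())
    scores = {}
    for name, m in models.items():
        parts = [
            m["accuracy"] / best_acc,      # higher is better
            m["throughput"] / best_tput,   # higher is better
            min_mem / m["memory_gb"],      # lower is better, so inverted
            min_lat / m["latency_ms"],     # lower is better, so inverted
        ]
        scores[name] = math.prod(parts) ** (1.0 / len(parts))
    return scores


# Hypothetical measurements, not taken from the paper.
toy = {
    "small": {"accuracy": 0.60, "throughput": 200.0, "memory_gb": 2.0, "latency_ms": 10.0},
    "large": {"accuracy": 0.80, "throughput": 20.0, "memory_gb": 40.0, "latency_ms": 100.0},
}
per = performance_efficiency_ratio(toy)
print(per)
```

A geometric mean penalizes a model that is very weak on any single axis, so a large accuracy advantage cannot fully offset an order-of-magnitude cost in memory or latency.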
Virtual cell modeling aims to simulate cellular responses to drug perturbations in silico, but existing flow-matching models optimize only pixel-level reconstruction and can produce biologically implausible outputs like nuclei outside cytoplasm. CellFluxRL addresses this by post-training the state-of-the-art CellFlux model with reinforcement learning, using seven manually designed reward functions spanning biological function (mode of action), structural validity (nuclear containment), and morphological statistics (size/count). The approach offers a systematic framework for enforcing physical constraints through differentiable optimization, achieving consistent improvements across all biological metrics while maintaining image quality.
TiCo tackles a critical gap in spoken dialogue models: the inability to control response duration, which is essential for time-constrained scenarios like driving assistants or emergency healthcare. Unlike text length control, speech duration depends on complex factors including phonetics, prosody, and speaking rate. The paper proposes Spoken Time Markers (STMs)—special tokens like <15.0 seconds> inserted during generation—to enable real-time temporal awareness. Using a two-stage post-training framework (self-generated supervised fine-tuning followed by reinforcement learning with verifiable rewards), TiCo equips models to estimate elapsed time and adjust content dynamically to meet target durations.
DATASHI is a new parallel corpus for Tashlhiyt, a critically under-resourced Amazigh language spoken by millions in Morocco but lacking standardized digital resources. The paper introduces 5,000 English–Tashlhiyt sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, designed to benchmark orthography normalization. Using this corpus, the authors evaluate five state-of-the-art LLMs (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) on the normalization task, finding that even the best model (Gemini-2.5-Pro) reaches only a 35.5% word error rate (WER) and struggles with gemination and emphatic consonants.
This paper investigates how vision-language models (VLMs) perform spatial reasoning—the binding of objects to spatial relations. It reveals that VLMs rely on two concurrent mechanisms: a dominant one where the vision encoder encodes object layout globally across visual tokens (extending into background regions), and a secondary one where the language model backbone forms ordering representations over object tokens. The finding that enhancing these vision-derived spatial representations improves performance without fine-tuning challenges the prevailing focus on LM backbones and highlights the critical role of vision encoders in multimodal reasoning.
This paper proposes a bold interdisciplinary bridge between holographic string dualities and artificial intelligence, hypothesizing that AI tasks such as language modeling can be viewed as particle trajectory prediction on graphs admitting a holographically dual "string" description. Drawing on the AdS/CFT correspondence, the authors conjecture that word metrics on $S_n$ Cayley graphs correspond to areas under lattice paths in dual planar polygons, verified computationally via their CayleyPy library.
The paper tackles partition-constrained subset selection for 'close-to-submodular' objectives (specifically $\alpha$-weakly DR-submodular and $(\gamma,\beta)$-weakly submodular functions), where existing distorted local-search methods suffer from prohibitive query complexity ($\tilde{O}(1/\epsilon^6)$) and require prior knowledge of structural parameters. The authors propose the Multinoulli Extension (ME), a continuous relaxation that learns multinoulli priors for each partition block, enabling lossless rounding without submodularity assumptions. They develop offline (Multinoulli-SCG) and online (Multinoulli-OSCG/OSGA) algorithms achieving tight approximation guarantees with $O(1/\epsilon^2)$ query complexity and $O(\sqrt{T})$ regret, respectively.
FluidWorld tackles the quadratic cost and lack of spatial inductive bias in Transformer-based world models by replacing self-attention with reaction-diffusion PDEs. The core innovation is using PDE integration itself—governed by a discretized Laplacian and learned reaction terms—as the predictive engine, rather than as a physical simulator. This proof-of-concept demonstrates that at $\sim$800K parameters, such physics-inspired dynamics match or exceed attention and convolutional recurrence on spatial coherence metrics while offering $O(N)$ complexity, though at slower training speeds.
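The kind of update described here can be sketched as one explicit Euler step of a reaction-diffusion equation on a grid. The 5-point stencil, periodic boundaries, step size, and the placeholder `reaction` callable (standing in for a learned term) are assumptions for illustration and do not reproduce FluidWorld's architecture.

```python
def laplacian(u):
    """5-point discrete Laplacian with periodic boundaries."""
    n, m = len(u), len(u[0])
    return [
        [
            u[(i - 1) % n][j] + u[(i + 1) % n][j]
            + u[i][(j - 1) % m] + u[i][(j + 1) % m]
            - 4.0 * u[i][j]
            for j in range(m)
        ]
        for i in range(n)
    ]


def step(u, reaction, diffusivity=1.0, dt=0.2):
    """One explicit Euler step of du/dt = D * laplacian(u) + reaction(u)."""
    lap = laplacian(u)
    return [
        [u[i][j] + dt * (diffusivity * lap[i][j] + reaction(u[i][j]))
         for j in range(len(u[0]))]
        for i in range(len(u))
    ]


# A single spike on a 5x5 grid, with the reaction term switched off.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
diffused = step(grid, reaction=lambda v: 0.0)
print(diffused[2])
```

Each cell reads a fixed number of neighbors, so one step costs time linear in the number of cells, which is where the $O(N)$ scaling comes from.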
This paper introduces BHDD, the first public benchmark dataset for handwritten Burmese digits. Myanmar script's distinctive circular letterforms—originally developed for writing on palm leaves—create recognition challenges distinct from Latin digits, with pairs like 0 and 1 differing only by whether a circle is closed. The authors release 87,561 verified images (28×28 grayscale, MNIST-compatible format) from over 150 contributors, with writer-independent train/test splits and baseline models reaching up to 99.83% accuracy.
This paper addresses cross-lingual knowledge graph fusion, where heterogeneous KGs in different languages must be unified without expensive manually-curated seed alignments. The core idea is to use Large Language Models as a universal semantic bridge by linearizing graph triplets into natural language sequences and sequentially agglomerating multiple graphs. This matters because it promises zero-shot alignment capability for low-resource languages where traditional embedding-based methods fail due to lack of training data.
HamVision proposes using damped harmonic oscillator dynamics as a structured inductive bias for medical image analysis. The core idea is that phase-space decomposition yields three representations, position $q$ (features), momentum $p$ (gradients), and energy $H = \frac{1}{2}|z|^2$ (saliency), that serve both segmentation and classification tasks without modifying the shared bottleneck. This physics-constrained approach aims to replace generic learned transformations with interpretable, dynamics-based feature extraction across diverse medical imaging modalities.
Psychiatric symptom identification from social media requires expensive expert annotation and suffers from inconsistent labeling across platforms. SynSym addresses this by using GPT-4o to generate synthetic training data across four stages: symptom concept expansion, dual-style (clinical/colloquial) expression generation, clinically-grounded multi-symptom composition, and LLM-based quality filtering. The framework produces 18,254 samples covering 14 DSM-5 symptoms, enabling models to match real-data performance and generalize across diverse social media platforms.
This paper addresses a subtle but critical issue in latent diffusion models (LDMs): VAE tokenizers tend to collapse latent variance toward zero to minimize reconstruction error, creating overly compact manifolds that are brittle against sampling perturbations. The authors propose a Variance Expansion (VE) loss that adaptively counteracts this collapse via an inverse-variance term $\mathcal{L}_{\text{var}} = 1/(\sigma^2 + \delta)$, allowing the latent space to absorb stochastic diffusion noise while maintaining reconstruction fidelity. The work achieves state-of-the-art FID 1.18 on ImageNet 256$\times$256 and provides both theoretical grounding and empirical validation across multiple architectures.
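The inverse-variance term can be evaluated directly, as in the minimal sketch below. Averaging over latent dimensions and the default value of `delta` are assumptions; how the term is weighted against the reconstruction loss is not specified here and is left out.

```python
def variance_expansion_loss(variances, delta=1e-3):
    """Mean of 1 / (sigma^2 + delta) over latent dimensions.

    The averaging and the default delta are assumptions for this sketch.
    """
    return sum(1.0 / (v + delta) for v in variances) / len(variances)


collapsed = variance_expansion_loss([1e-4, 2e-4, 1e-4])
healthy = variance_expansion_loss([0.5, 0.8, 1.0])
print(collapsed, healthy)
```

The gradient of $1/(\sigma^2 + \delta)$ with respect to $\sigma^2$ is $-1/(\sigma^2 + \delta)^2$, so the push toward larger variance is strongest exactly when the variance is near zero and fades as the latent spreads out.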
This paper tackles the memory explosion problem in high-rank DoRA fine-tuning. At $d_{in}=8192$ and rank $r=384$, computing the row-wise norm $\|\mathbf{W}+s\mathbf{B}\mathbf{A}\|_{\text{row}}$ via standard materialization consumes ~512 MB per module—prohibitive for large models with hundreds of adapted layers. The authors propose a factored norm decomposition that reduces the computation to $\mathcal{O}(d_{out}r+r^2)$ intermediates plus fused Triton kernels that collapse the composition into a single pass. On 8–32B vision-language models, this yields 1.5–2.0× speedups and up to 77 GB VRAM savings without numerical drift.
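One algebraic expansion consistent with the stated intermediate sizes is $\|\mathbf{w}_i + s\,\mathbf{b}_i\mathbf{A}\|^2 = \|\mathbf{w}_i\|^2 + 2s\,\mathbf{b}_i(\mathbf{W}\mathbf{A}^\top)_i^\top + s^2\,\mathbf{b}_i(\mathbf{A}\mathbf{A}^\top)\mathbf{b}_i^\top$, which needs only a $d_{out}\times r$ and an $r\times r$ intermediate. The pure-Python sketch below checks that identity against the materialized computation on small random matrices. The shapes ($\mathbf{W}$: $d_{out}\times d_{in}$, $\mathbf{B}$: $d_{out}\times r$, $\mathbf{A}$: $r\times d_{in}$) and helper names are assumptions, and the paper's exact factorization and Triton kernels are not reproduced.

```python
import random


def matmul(X, Y):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def row_norms_materialized(W, B, A, s):
    """Reference: builds the full d_out x d_in product B @ A."""
    BA = matmul(B, A)
    return [sum((w + s * d) ** 2 for w, d in zip(w_row, d_row)) ** 0.5
            for w_row, d_row in zip(W, BA)]


def row_norms_factored(W, B, A, s):
    """Same norms using only d_out x r and r x r intermediates."""
    At = transpose(A)
    WAt = matmul(W, At)   # d_out x r
    gram = matmul(A, At)  # r x r
    BG = matmul(B, gram)  # d_out x r
    norms = []
    for w_row, b_row, wat_row, bg_row in zip(W, B, WAt, BG):
        base = sum(w * w for w in w_row)
        cross = sum(b * c for b, c in zip(b_row, wat_row))
        low_rank = sum(b * g for b, g in zip(b_row, bg_row))
        # Guard against tiny negative values from rounding before the sqrt.
        norms.append(max(base + 2.0 * s * cross + s * s * low_rank, 0.0) ** 0.5)
    return norms


rng = random.Random(0)
d_out, d_in, r, s = 4, 6, 2, 0.5
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
reference = row_norms_materialized(W, B, A, s)
factored = row_norms_factored(W, B, A, s)
print(max(abs(a - b) for a, b in zip(reference, factored)))
```

The factored path never forms $\mathbf{B}\mathbf{A}$, so its peak temporary storage scales with $d_{out} r + r^2$ rather than $d_{out} d_{in}$.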
This paper addresses uncertainty quantification (UQ) for distribution-to-distribution flow matching, a setting where models map between well-defined source and target distributions (e.g., unperturbed to drug-treated cell images) rather than noise-to-data. The authors propose Bayesian Stochastic Flow Matching (BSFM), which combines Stochastic Flow Matching (SFM) for capturing aleatoric uncertainty via learnable diffusion terms, with MCD-Antithetic—a scalable Bayesian method using Monte Carlo Dropout and antithetic sampling—to decompose total uncertainty into aleatoric and epistemic components for reliable out-of-distribution (OOD) detection in scientific imaging.
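A common way to split total predictive variance into the two components is the law of total variance, sketched below on nested samples. Treating the outer index as model draws (for example, dropout masks) and the inner index as stochastic samples under a fixed model is an assumption about the setup, and this is the textbook decomposition rather than the specific BSFM estimator.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(samples):
    """Split variance of nested samples into aleatoric and epistemic parts.

    samples[m][k] is the k-th stochastic output under the m-th model draw.
    Returns (aleatoric, epistemic, total) with total = aleatoric + epistemic.
    """
    group_means = [fmean(group) for group in samples]
    # Aleatoric: average spread of outputs under a fixed model draw.
    aleatoric = fmean([pvariance(group) for group in samples])
    # Epistemic: spread of the mean prediction across model draws.
    epistemic = pvariance(group_means)
    return aleatoric, epistemic, aleatoric + epistemic


toy_samples = [[1.0, 3.0], [5.0, 7.0]]
aleatoric, epistemic, total = decompose_uncertainty(toy_samples)
print(aleatoric, epistemic, total)
```

With equal-sized groups the two parts sum exactly to the population variance of all samples pooled together, which gives a quick sanity check on any implementation.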