Feed - arxlens

0

Training-Free Instance-Aware 3D Scene Reconstruction and Diffusion-Based View Synthesis from Sparse Images

cs.CV Jiatong Xia, Lingqiao Liu · Mar 22, 2026

The paper presents a training-free pipeline for reconstructing instance-aware 3D scenes from 10-20 unposed RGB images and rendering novel views using diffusion. It combines MV-DUSt3R for geometry, SAM for 2D segmentation with warping-based cross-view unification, and the See3D diffusion model for inpainting holes in point-cloud projections. The system enables object-level editing by manipulating the point cloud directly, avoiding per-scene optimization.

We introduce a novel, training-free system for reconstructing, understanding, and rendering 3D indoor scenes from a sparse set of unposed RGB images. Unlike traditional radiance field approaches that require dense views and per-scene optimization, our pipeline achieves high-fidelity results without any training or pose preprocessing. The system integrates three key innovations: (1) A robust point cloud reconstruction module that filters unreliable geometry using a warping-based anomaly removal strategy; (2) A warping-guided 2D-to-3D instance lifting mechanism that propagates 2D segmentation masks into a consistent, instance-aware 3D representation; and (3) A novel rendering approach that projects the point cloud into new views and refines the renderings with a 3D-aware diffusion model. Our method leverages the generative power of diffusion to compensate for missing geometry and enhances realism, especially under sparse input conditions. We further demonstrate that object-level scene editing such as instance removal can be naturally supported in our pipeline by modifying only the point cloud, enabling the synthesis of consistent, edited views without retraining. Our results establish a new direction for efficient, editable 3D content generation without relying on scene-specific optimization. Project page: https://jiatongxia.github.io/TID3R/

Read abstractHide abstract

0

OrbitStream: Training-Free Adaptive 360-degree Video Streaming via Semantic Potential Fields

cs.NI cs.CV cs.MM Aizierjiang Aiersilan, Zhangfei Yang · Mar 22, 2026

OrbitStream addresses adaptive 360° video streaming for teleoperation by proposing a training-free framework that combines semantic scene understanding with robust control theory. It formulates viewport prediction as a Gravitational Viewport Prediction (GVP) problem where semantic objects (pedestrians, vehicles) generate potential fields that "attract" user gaze with task-relevant mass, while a Saturation-Based Proportional-Derivative (PD) Controller handles bitrate adaptation. This offers an interpretable, zero-shot alternative to black-box Deep Reinforcement Learning methods for safety-critical systems where deployment constraints prohibit lengthy training.

Adaptive 360{\deg} video streaming for teleoperation faces dual challenges: viewport prediction under uncertain gaze patterns and bitrate adaptation over volatile wireless channels. While data-driven and Deep Reinforcement Learning (DRL) methods achieve high Quality of Experience (QoE), their "black-box" nature and reliance on training data can limit deployment in safety-critical systems. To address this, we propose OrbitStream, a training-free framework that combines semantic scene understanding with robust control theory. We formulate viewport prediction as a Gravitational Viewport Prediction (GVP) problem, where semantic objects generate potential fields that attract user gaze. Furthermore, we employ a Saturation-Based Proportional-Derivative (PD) Controller for buffer regulation. On object-rich teleoperation traces, OrbitStream achieves a 94.7\% zero-shot viewport prediction accuracy without user-specific profiling, approaching trajectory-extrapolation baselines ($\sim$98.5\%). Across 3,600 Monte Carlo simulations on diverse network traces, OrbitStream yields a mean QoE of 2.71. It ranks second among 12 evaluated algorithms, close to the top-performing BOLA-E (2.80) while outperforming FastMPC (1.84). The system exhibits an average decision latency of 1.01 ms with minimal rebuffering events. By providing competitive QoE with interpretability and zero training overhead, OrbitStream demonstrates that physics-based control, combined with semantic modeling, offers a practical solution for 360{\deg} streaming in teleoperation.

Read abstractHide abstract

0

On the Challenges and Opportunities of Learned Sparse Retrieval for Code

cs.IR cs.CL Simon Lupart, Maxime Louis, Thibault Formal et al. · Mar 23, 2026

Code retrieval currently relies on dense embeddings, but this paper proposes SPLADE-Code, the first large-scale learned sparse retrieval (LSR) family for code search (600M–8B parameters). The authors address unique challenges including subword fragmentation, semantic gaps between natural language and code, and latency issues from long code documents. Their lightweight single-stage training achieves 75.4 nDCG@10 on MTEB Code under 1B parameters (state-of-the-art for that size) and 79.0 with 8B parameters, while enabling sub-millisecond retrieval via inverted indices.

Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.

Read abstractHide abstract

0

RAMPAGE: RAndomized Mid-Point for debiAsed Gradient Extrapolation

cs.LG math.OC Abolfazl Hashemi · Mar 23, 2026

RAMPAGE addresses discretization bias in Extragradient (EG) methods for variational inequalities by replacing the deterministic midpoint with randomized sampling. The core idea uses uniform sampling to construct an unbiased estimator of the continuous-time flow integral, while RAMPAGE+ leverages antithetic variates to eliminate first-order variance terms. This matters for training GANs and other non-conservative games where EG's $\mathcal{O}(\eta^2)$ bias causes divergence in highly nonlinear regimes.

A celebrated method for Variational Inequalities (VIs) is Extragradient (EG), which can be viewed as a standard discrete-time integration scheme. With this view in mind, in this paper we show that EG may suffer from discretization bias when applied to non-linear vector fields, conservative or otherwise. To resolve this discretization shortcoming, we introduce RAndomized Mid-Point for debiAsed Gradient Extrapolation (RAMPAGE) and its variance-reduced counterpart, RAMPAGE+ which leverages antithetic sampling. In contrast with EG, both methods are unbiased. Furthermore, leveraging negative correlation, RAMPAGE+ acts as an unbiased, geometric path-integrator that completely removes internal first-order terms from the variance, provably improving upon RAMPAGE. We further demonstrate that both methods enjoy provable $\mathcal{O}(1/k)$ convergence guarantees for a range of problems including root finding under co-coercive, co-hypomonotone, and generalized Lipschitzness regimes. Furthermore, we introduce symmetrically scaled variants to extend our results to constrained VIs. Finally, we provide convergence guarantees of both methods for stochastic and deterministic smooth convex-concave games. Somewhat interestingly, despite being a randomized method, RAMPAGE+ attains purely deterministic bounds for a number of the studied settings.

Read abstractHide abstract

0

Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles

cs.CL Adi Gabay, Gabriel Stanovsky, Liat Peterfreund · Mar 22, 2026

This paper tackles the challenge of evaluating whether large language models perform genuine epistemic reasoning—reasoning about knowledge and partial observations in multi-agent systems—or simply rely on memorization of classic puzzles like the Muddy Children problem. The authors persuasively argue that memorization is better understood as a special case of reduction, where models map new instances to known problems. They introduce a reduction ladder with progressively modified puzzle variants to distinguish reductive from epistemic reasoning, finding that while some models succeed through reduction, all struggle when true epistemic reasoning is required. The work reframes how we interpret LLM performance on canonical reasoning benchmarks and highlights that strong accuracy on classic puzzles may mask a lack of genuine reasoning capability.

Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents' knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.

Read abstractHide abstract

0

Consistent but Dangerous: Per-Sample Safety Classification Reveals False Reliability in Medical Vision-Language Models

cs.CV Binesh Sadanandan, Vahid Behzadan · Mar 22, 2026

Medical vision-language models (VLMs) are increasingly evaluated for consistency—the invariance of predictions under paraphrased prompts—as a proxy for clinical reliability. This paper demonstrates that consistency alone is a fundamentally flawed safety metric because models can achieve perfect consistency by learning text shortcuts while completely ignoring the input image. The authors introduce a four-quadrant per-sample taxonomy that jointly evaluates consistency and image reliance, revealing that models optimized for low flip rates often shift samples into a 'Dangerous' quadrant where predictions are stable, accurate, and confident yet unchanged when the image is removed. Their findings expose a critical deployment trap: standard evaluation pipelines risk preferentially selecting models that appear reliable while being decision-invariant to visual evidence.

Consistency under paraphrase, the property that semantically equivalent prompts yield identical predictions, is increasingly used as a proxy for reliability when deploying medical vision-language models (VLMs). We show this proxy is fundamentally flawed: a model can achieve perfect consistency by relying on text patterns rather than the input image. We introduce a four-quadrant per-sample safety taxonomy that jointly evaluates consistency (stable predictions across paraphrased prompts) and image reliance (predictions that change when the image is removed). Samples are classified as Ideal (consistent and image-reliant), Fragile (inconsistent but image-reliant), Dangerous (consistent but not image-reliant), or Worst (inconsistent and not image-reliant). Evaluating five medical VLM configurations across two chest X-ray datasets (MIMIC-CXR, PadChest), we find that LoRA fine-tuning dramatically reduces flip rates but shifts a majority of samples into the Dangerous quadrant: LLaVA-Rad Base achieves a 1.5% flip rate on PadChest while 98.5% of its samples are Dangerous. Critically, Dangerous samples exhibit high accuracy (up to 99.6%) and low entropy, making them invisible to standard confidence-based screening. We observe a negative correlation between flip rate and Dangerous fraction (r = -0.89, n=10) and recommend that deployment evaluations always pair consistency checks with a text-only baseline: a single additional forward pass that exposes the false reliability trap.

Read abstractHide abstract

0

Task-Specific Efficiency Analysis: When Small Language Models Outperform Large Language Models

cs.CL cs.LG Jinghan Cao, Yu Ma, Xinjin Li et al. · Mar 22, 2026

This paper addresses the high computational cost of deploying Large Language Models (LLMs) in resource-constrained environments by introducing the Performance-Efficiency Ratio (PER), a novel metric that integrates accuracy, throughput, memory, and latency via geometric mean normalization. The authors evaluate 16 open-source language models ranging from 0.5B to 72B parameters across five NLP tasks (IMDB, HellaSwag, ARC-Easy, SQuAD 2.0, and GSM8K), concluding that small models (0.5–3B parameters) consistently achieve superior PER scores compared to their larger counterparts.

Large Language Models achieve remarkable performance but incur substantial computational costs unsuitable for resource-constrained deployments. This paper presents the first comprehensive task-specific efficiency analysis comparing 16 language models across five diverse NLP tasks. We introduce the Performance-Efficiency Ratio (PER), a novel metric integrating accuracy, throughput, memory, and latency through geometric mean normalization. Our systematic evaluation reveals that small models (0.5--3B parameters) achieve superior PER scores across all given tasks. These findings establish quantitative foundations for deploying small models in production environments prioritizing inference efficiency over marginal accuracy gains.

Read abstractHide abstract

0

CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning

cs.LG q-bio.QM Dongxia Wu, Shiye Su, Yuhui Zhang et al. · Mar 23, 2026

Virtual cell modeling aims to simulate cellular responses to drug perturbations in silico, but existing flow-matching models optimize only pixel-level reconstruction and can produce biologically implausible outputs like nuclei outside cytoplasm. CellFluxRL addresses this by post-training the state-of-the-art CellFlux model with reinforcement learning, using seven manually designed reward functions spanning biological function (mode of action), structural validity (nuclear containment), and morphological statistics (size/count). The approach reveals a systematic framework for enforcing physical constraints through differentiable optimization, achieving consistent improvements across all biological metrics while maintaining image quality.

Building virtual cells with generative models to simulate cellular behavior in silico is emerging as a promising paradigm for accelerating drug discovery. However, prior image-based generative approaches can produce implausible cell images that violate basic physical and biological constraints. To address this, we propose to post-train virtual cell models with reinforcement learning (RL), leveraging biologically meaningful evaluators as reward functions. We design seven rewards spanning three categories-biological function, structural validity, and morphological correctness-and optimize the state-of-the-art CellFlux model to yield CellFluxRL. CellFluxRL consistently improves over CellFlux across all rewards, with further performance boosts from test-time scaling. Overall, our results present a virtual cell modeling framework that enforces physically-based constraints through RL, advancing beyond "visually realistic" generations towards "biologically meaningful" ones.

Read abstractHide abstract

0

TiCo: Time-Controllable Training for Spoken Dialogue Models

cs.CL cs.AI eess.AS Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu et al. · Mar 23, 2026

TiCo tackles a critical gap in spoken dialogue models: the inability to control response duration, which is essential for time-constrained scenarios like driving assistants or emergency healthcare. Unlike text length control, speech duration depends on complex factors including phonetics, prosody, and speaking rate. The paper proposes Spoken Time Markers (STMs)—special tokens like <15.0 seconds> inserted during generation—to enable real-time temporal awareness. Using a two-stage post-training framework (self-generated supervised fine-tuning followed by reinforcement learning with verifiable rewards), TiCo equips models to estimate elapsed time and adjust content dynamically to meet target durations.

We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, where controlling response duration can improve interaction quality. However, despite their strong ability to generate natural spoken responses, existing models lack time awareness and struggle to follow duration-related instructions (e.g., "Please generate a response lasting about 15 seconds"). Through an empirical evaluation of both open-source and commercial SDMs, we show that they frequently fail to satisfy such time-control requirements. TiCo addresses this limitation by enabling models to estimate elapsed speaking time during generation through Spoken Time Markers (STM) (e.g., <10.6 seconds>). These markers help the model maintain awareness of time and adjust the remaining content to meet the target duration. TiCo is simple and efficient: it requires only a small amount of data and no additional question-answer pairs, relying instead on self-generation and reinforcement learning. Experimental results show that TiCo significantly improves adherence to duration constraints while preserving response quality.

Read abstractHide abstract

0

DATASHI: A Parallel English-Tashlhiyt Corpus for Orthography Normalization and Low-Resource Language Processing

cs.CL Nasser-Eddine Monir, Zakaria Baou · Mar 23, 2026

DATASHI is a new parallel corpus for Tashlhiyt, a critically under-resourced Amazigh language spoken by millions in Morocco but lacking standardized digital resources. The paper introduces 5,000 English–Tashlhiyt sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, designed to benchmark orthography normalization. Using this corpus, the authors evaluate five state-of-the-art LLMs (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) on the normalization task, finding that even the best model (Gemini-2.5-Pro) achieves only moderate accuracy (35.5% WER) and struggles with gemination and emphatic consonants.

DATASHI is a new parallel English-Tashlhiyt corpus that fills a critical gap in computational resources for Amazigh languages. It contains 5,000 sentence pairs, including a 1,500-sentence subset with expert-standardized and non-standard user-generated versions, enabling systematic study of orthographic diversity and normalization. This dual design supports text-based NLP tasks - such as tokenization, translation, and normalization - and also serves as a foundation for read-speech data collection and multimodal alignment. Comprehensive evaluations with state-of-the-art Large Language Models (GPT-5, Claude-Sonnet-4.5, Gemini-2.5-Pro, Mistral, Qwen3-Max) show clear improvements from zero-shot to few-shot prompting, with Gemini-2.5-Pro achieving the lowest word and character-level error rates and exhibiting robust cross-lingual generalization. A fine-grained analysis of edit operations - deletions, substitutions, and insertions - across phonological classes (geminates, emphatics, uvulars, and pharyngeals) further highlights model-specific sensitivities to marked Tashlhiyt features and provides new diagnostic insights for low-resource Amazigh orthography normalization.

Read abstractHide abstract

0

The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

cs.CV cs.LG Kelly Cui, Nikhil Prakash, Ayush Raina et al. · Mar 23, 2026

This paper investigates how vision-language models (VLMs) perform spatial reasoning—the binding of objects to spatial relations. It reveals that VLMs rely on two concurrent mechanisms: a dominant one where the vision encoder encodes object layout globally across visual tokens (extending into background regions), and a secondary one where the language model backbone forms ordering representations over object tokens. The finding that enhancing these vision-derived spatial representations improves performance without fine-tuning challenges the prevailing focus on LM backbones and highlights the critical role of vision encoders in multimodal reasoning.

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.

Read abstractHide abstract

0

CayleyPy-4: AI-Holography. Towards analogs of holographic string dualities for AI tasks

hep-th cs.AI cs.LG A. Chervov, F. Levkovich-Maslyuk, A. Smolensky et al. · Mar 23, 2026

This paper proposes a bold interdisciplinary bridge between holographic string dualities and artificial intelligence, hypothesizing that AI tasks such as language modeling can be viewed as particle trajectory prediction on graphs admitting a holographically dual "string" description. Drawing on the AdS/CFT correspondence, the authors conjecture that word metrics on $S_n$ Cayley graphs correspond to areas under lattice paths in dual planar polygons, verified computationally via their CayleyPy library.

This is the fourth paper in the CayleyPy project, which applies AI methods to the exploration of large graphs. In this work, we suggest the existence of a new discrete version of holographic string dualities for this setup, and discuss their relevance to AI systems and mathematics. Many modern AI tasks -- such as those addressed by GPT-style language models or RL systems -- can be viewed as direct analogues of predicting particle trajectories on graphs. We investigate this problem for a large family of Cayley graphs, for which we show that surprisingly it admits a dual description in terms of discrete strings. We hypothesize that such dualities may extend to a range of AI systems where they can lead to more efficient computational approaches. In particular, string holographic images of states are proposed as natural candidates for data embeddings, motivated by the "complexity = volume" principle in AdS/CFT. For Cayley graphs of the symmetric group S_n, our results indicate that the corresponding dual objects are flat, planar polygons. The diameter of the graph is equal to the number of integer points inside the polygon scaled by n. Vertices of the graph can be mapped holographically to paths inside the polygon, and the usual graph distances correspond to the area under the paths, thus directly realising the "complexity = volume" paradigm. We also find evidence for continuous CFTs and dual strings in the large n limit. We confirm this picture and other aspects of the duality in a large initial set of examples. We also present new datasets (obtained by a combination of ML and conventional tools) which should be instrumental in establishing the duality for more general cases.

Read abstractHide abstract

0

Multinoulli Extension: A Lossless Continuous Relaxation for Partition-Constrained Subset Selection

cs.LG math.OC Qixin Zhang, Wei Huang, Yan Sun et al. · Mar 23, 2026

The paper tackles partition-constrained subset selection for 'close-to-submodular' objectives—specifically α-weakly DR-submodular and (γ,β)-weakly submodular functions—where existing distorted local-search methods suffer from prohibitive query complexity (˜O(1/ϵ^6)) and require prior knowledge of structural parameters. The authors propose the Multinoulli Extension (ME), a continuous relaxation that learns multinoulli priors for each partition block, enabling lossless rounding without submodularity assumptions. They develop offline (Multinoulli-SCG) and online (Multinoulli-OSCG/OSGA) algorithms achieving tight approximation guarantees with O(1/ϵ^2) query complexity and O(√T) regret, respectively.

Identifying the most representative subset for a close-to-submodular objective while satisfying the predefined partition constraint is a fundamental task with numerous applications in machine learning. However, the existing distorted local-search methods are often hindered by their prohibitive query complexities and the rigid requirement for prior knowledge of difficult-to-obtain structural parameters. To overcome these limitations, we introduce a novel algorithm titled Multinoulli-SCG, which not only is parameter-free, but also can achieve the same approximation guarantees as the distorted local-search methods with significantly fewer function evaluations. More specifically, when the objective function is monotone $\alpha$-weakly DR-submodular or $(\gamma,\beta)$-weakly submodular, our Multinoulli-SCG algorithm can attain a value of $(1-e^{-\alpha})\text{OPT}-\epsilon$ or $(\frac{\gamma^{2}(1-e^{-(\beta(1-\gamma)+\gamma^2)})}{\beta(1-\gamma)+\gamma^2})\text{OPT}-\epsilon$ with only $O(1/\epsilon^{2})$ function evaluations, where OPT denotes the optimal value. The cornerstone of our Multinoulli-SCG algorithm is an innovative continuous-relaxation framework named Multinoulli Extension(ME), which can effectively convert the discrete subset selection problem subject to partition constraints into a solvable continuous maximization focused on learning the optimal multinoulli priors across the concerned partition. In sharp contrast with the well-established multi-linear extension for submodular subset selection, a notable advantage of our proposed ME is its intrinsic capacity to provide a lossless rounding scheme for any set function. Furthermore, based on our proposed ME, we also present two novel online algorithms, namely, Multinoulli-OSCG and Multinoulli-OSGA, for the unexplored online subset selection problems over partition constraints.

Read abstractHide abstract

0

FluidWorld: Reaction-Diffusion Dynamics as a Predictive Substrate for World Models

cs.LG Fabien Polly · Mar 22, 2026

FluidWorld tackles the quadratic cost and lack of spatial inductive bias in Transformer-based world models by replacing self-attention with reaction-diffusion PDEs. The core innovation is using PDE integration itself—governed by a discretized Laplacian and learned reaction terms—as the predictive engine, rather than as a physical simulator. This proof-of-concept demonstrates that at $\sim$800K parameters, such physics-inspired dynamics match or exceed attention and convolutional recurrence on spatial coherence metrics while offering $O(N)$ complexity, though at slower training speeds.

World models learn to predict future states of an environment, enabling planning and mental simulation. Current approaches default to Transformer-based predictors operating in learned latent spaces. This comes at a cost: O(N^2) computation and no explicit spatial inductive bias. This paper asks a foundational question: is self-attention necessary for predictive world modeling, or can alternative computational substrates achieve comparable or superior results? I introduce FluidWorld, a proof-of-concept world model whose predictive dynamics are governed by partial differential equations (PDEs) of reaction-diffusion type. Instead of using a separate neural network predictor, the PDE integration itself produces the future state prediction. In a strictly parameter-matched three-way ablation on unconditional UCF-101 video prediction (64x64, ~800K parameters, identical encoder, decoder, losses, and data), FluidWorld is compared against both a Transformer baseline (self-attention) and a ConvLSTM baseline (convolutional recurrence). While all three models converge to comparable single-step prediction loss, FluidWorld achieves 2x lower reconstruction error, produces representations with 10-15% higher spatial structure preservation and 18-25% more effective dimensionality, and critically maintains coherent multi-step rollouts where both baselines degrade rapidly. All experiments were conducted on a single consumer-grade PC (Intel Core i5, NVIDIA RTX 4070 Ti), without any large-scale compute. These results establish that PDE-based dynamics, which natively provide O(N) spatial complexity, adaptive computation, and global spatial coherence through diffusion, are a viable and parameter-efficient alternative to both attention and convolutional recurrence for world modeling.

Read abstractHide abstract

0

BHDD: A Burmese Handwritten Digit Dataset

cs.CV cs.CL Swan Htet Aung, Hein Htet, Htoo Say Wah Khaing et al. · Mar 23, 2026

This paper introduces BHDD, the first public benchmark dataset for handwritten Burmese digits. Myanmar script's distinctive circular letterforms—originally developed for writing on palm leaves—create recognition challenges distinct from Latin digits, with pairs like 0 and 1 differing only by whether a circle is closed. The authors release 87,561 verified images (28×28 grayscale, MNIST-compatible format) from over 150 contributors, with writer-independent train/test splits and baseline models reaching up to 99.83% accuracy.

We introduce the Burmese Handwritten Digit Dataset (BHDD), a collection of 87,561 grayscale images of handwritten Burmese digits in ten classes. Each image is 28x28 pixels, following the MNIST format. The training set has 60,000 samples split evenly across classes; the test set has 27,561 samples with class frequencies as they arose during collection. Over 150 people of different ages and backgrounds contributed samples. We analyze the dataset's class distribution, pixel statistics, and morphological variation, and identify digit pairs that are easily confused due to the round shapes of the Myanmar script. Simple baselines (an MLP, a two-layer CNN, and an improved CNN with batch normalization and augmentation) reach 99.40%, 99.75%, and 99.83% test accuracy respectively. BHDD is available under CC BY-SA 4.0 at https://github.com/baseresearch/BHDD

Read abstractHide abstract

0

Graph Fusion Across Languages using Large Language Models

cs.CL cs.IR Kaung Myat Kyaw, Khush Agarwal, Jonathan Chan · Mar 22, 2026

This paper addresses cross-lingual knowledge graph fusion, where heterogeneous KGs in different languages must be unified without expensive manually-curated seed alignments. The core idea is to use Large Language Models as a universal semantic bridge by linearizing graph triplets into natural language sequences and sequentially agglomerating multiple graphs. This matters because it promises zero-shot alignment capability for low-resource languages where traditional embedding-based methods fail due to lack of training data.

Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.

Read abstractHide abstract

0

HamVision: Hamiltonian Dynamics as Inductive Bias for Medical Image Analysis

cs.CV cs.LG Mohamed A Mabrok · Mar 22, 2026

HamVision proposes using damped harmonic oscillator dynamics as a structured inductive bias for medical image analysis. The core idea is that phase-space decomposition yields three representations—position $q$ (features), momentum $p$ (gradients), and energy $H = rac{1}{2}|z|^2$ (saliency)—that serve both segmentation and classification tasks without modifying the shared bottleneck. This physics-constrained approach aims to replace generic learned transformations with interpretable, dynamics-based feature extraction across diverse medical imaging modalities.

We present HamVision, a framework for medical image analysis that uses the damped harmonic oscillator, a fundamental building block of signal processing, as a structured inductive bias for both segmentation and classification tasks. The oscillator's phase-space decomposition yields three functionally distinct representations: position~$q$ (feature content), momentum~$p$ (spatial gradients that encode boundary and texture information), and energy $H = \tfrac{1}{2}|z|^2$ (a parameter-free saliency map). These representations emerge from the dynamics, not from supervision, and can be exploited by different task-specific heads without any modification to the oscillator itself. For segmentation, energy gates the skip connections while momentum injects boundary information at every decoder level (HamSeg). For classification, the three representations are globally pooled and concatenated into a phase-space feature vector (HamCls). We evaluate HamVision across ten medical imaging benchmarks spanning five imaging modalities. On segmentation, HamSeg achieves state-of-the-art Dice scores on ISIC\,2018 (89.38\%), ISIC\,2017 (88.40\%), TN3K (87.05\%), and ACDC (92.40\%), outperforming most baselines with only 8.57M parameters. On classification, HamCls achieves state-of-the-art accuracy on BloodMNIST (98.85\%) and PathMNIST (96.65\%), and competitive results on the remaining MedMNIST datasets against MedMamba and MedViT. Diagnostic analysis confirms that the oscillator's momentum consistently encodes an interior$\,{>}\,$boundary$\,{>}\,$exterior gradient for segmentation and that the energy map correlates with discriminative regions for classification, properties that emerge entirely from the Hamiltonian dynamics. Code is available at https://github.com/Minds-R-Lab/hamvision.

Read abstractHide abstract

0

SynSym: A Synthetic Data Generation Framework for Psychiatric Symptom Identification

cs.CL Migyeong Kang, Jihyun Kim, Hyolim Jeon et al. · Mar 23, 2026

Psychiatric symptom identification from social media requires expensive expert annotation and suffers from inconsistent labeling across platforms. SynSym addresses this by using GPT-4o to generate synthetic training data across four stages: symptom concept expansion, dual-style (clinical/colloquial) expression generation, clinically-grounded multi-symptom composition, and LLM-based quality filtering. The framework produces 18,254 samples covering 14 DSM-5 symptoms, enabling models to match real-data performance and generalize across diverse social media platforms.

Psychiatric symptom identification on social media aims to infer fine-grained mental health symptoms from user-generated posts, allowing a detailed understanding of users' mental states. However, the construction of large-scale symptom-level datasets remains challenging due to the resource-intensive nature of expert labeling and the lack of standardized annotation guidelines, which in turn limits the generalizability of models to identify diverse symptom expressions from user-generated text. To address these issues, we propose SynSym, a synthetic data generation framework for constructing generalizable datasets for symptom identification. Leveraging large language models (LLMs), SynSym constructs high-quality training samples by (1) expanding each symptom into sub-concepts to enhance the diversity of generated expressions, (2) producing synthetic expressions that reflect psychiatric symptoms in diverse linguistic styles, and (3) composing realistic multi-symptom expressions, informed by clinical co-occurrence patterns. We validate SynSym on three benchmark datasets covering different styles of depressive symptom expression. Experimental results demonstrate that models trained solely on the synthetic data generated by SynSym perform comparably to those trained on real data, and benefit further from additional fine-tuning with real data. These findings underscore the potential of synthetic data as an alternative resource to real-world annotations in psychiatric symptom modeling, and SynSym serves as a practical framework for generating clinically relevant and realistic symptom expressions.

Read abstractHide abstract

0

Taming Sampling Perturbations with Variance Expansion Loss for Latent Diffusion Models

cs.CV Qifan Li, Xingyu Zhou, Jinhua Zhang et al. · Mar 22, 2026

This paper addresses a subtle but critical issue in latent diffusion models (LDMs): VAE tokenizers tend to collapse latent variance toward zero to minimize reconstruction error, creating overly compact manifolds that are brittle against sampling perturbations. The authors propose a Variance Expansion (VE) loss that adaptively counteracts this collapse via an inverse-variance term $\mathcal{L}_{\text{var}} = 1/(\sigma^2 + \delta)$, allowing the latent space to absorb stochastic diffusion noise while maintaining reconstruction fidelity. The work achieves state-of-the-art FID 1.18 on ImageNet 256$\times$256 and provides both theoretical grounding and empirical validation across multiple architectures.

Latent diffusion models have emerged as the dominant framework for high-fidelity and efficient image generation, owing to their ability to learn diffusion processes in compact latent spaces. However, while previous research has focused primarily on reconstruction accuracy and semantic alignment of the latent space, we observe that another critical factor, robustness to sampling perturbations, also plays a crucial role in determining generation quality. Through empirical and theoretical analyses, we show that the commonly used $\beta$-VAE-based tokenizers in latent diffusion models, tend to produce overly compact latent manifolds that are highly sensitive to stochastic perturbations during diffusion sampling, leading to visual degradation. To address this issue, we propose a simple yet effective solution that constructs a latent space robust to sampling perturbations while maintaining strong reconstruction fidelity. This is achieved by introducing a Variance Expansion loss that counteracts variance collapse and leverages the adversarial interplay between reconstruction and variance expansion to achieve an adaptive balance that preserves reconstruction accuracy while improving robustness to stochastic sampling. Extensive experiments demonstrate that our approach consistently enhances generation quality across different latent diffusion architectures, confirming that robustness in latent space is a key missing ingredient for stable and faithful diffusion sampling.

Read abstractHide abstract

0

Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

cs.LG stat.ML Alexandra Zelenin, Alexandra Zhuravlyova · Mar 23, 2026

This paper tackles the memory explosion problem in high-rank DoRA fine-tuning. At $d_{in}=8192$ and rank $r=384$, computing the row-wise norm $\|\mathbf{W}+s\mathbf{B}\mathbf{A}\|_{\text{row}}$ via standard materialization consumes ~512 MB per module—prohibitive for large models with hundreds of adapted layers. The authors propose a factored norm decomposition that reduces the computation to $\mathcal{O}(d_{out}r+r^2)$ intermediates plus fused Triton kernels that collapse the composition into a single pass. On 8–32B vision-language models, this yields 1.5–2.0× speedups and up to 77 GB VRAM savings without numerical drift.

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.

Read abstractHide abstract

Nothing here yet