Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
482 papers
Trending mixes fresh papers with community signal.
0
cs.CL Carolin Holtermann, Minh Duc Bui, Kaitlyn Zhou et al. · Mar 23, 2026

Audio-enabled large language models promise to democratize AI access for users with disabilities or limited literacy, but voice interfaces introduce immutable paralinguistic cues—pitch, timbre, prosody—that carry demographic signals. This paper demonstrates that state-of-the-art audio LLMs systematically discriminate based on speaker voice, assigning gender-stereotyped adjectives and professions solely from acoustic features. Crucially, the authors show that voice inputs amplify bias beyond text-only baselines, with models exhibiting stronger stereotypical associations when processing speech than when processing equivalent text with gendered name cues. The study establishes a causal link via pitch manipulation experiments and surveys 1,000 users to reveal that those who would benefit most from voice accessibility are often most hesitant about the attendant privacy and discrimination risks.

Hundreds of millions of people rely on large language models (LLMs) for education, work, and even healthcare. Yet these models are known to reproduce and amplify social biases present in their training data. Moreover, text-based interfaces remain a barrier for many, for example, users with limited literacy, motor impairments, or mobile-only devices. Voice interaction promises to expand accessibility, but unlike text, speech carries identity cues that users cannot easily mask, raising concerns about whether accessibility gains may come at the cost of equitable treatment. Here we show that audio-enabled LLMs exhibit systematic gender discrimination, shifting responses toward gender-stereotyped adjectives and occupations solely on the basis of speaker voice, and amplifying bias beyond that observed in text-based interaction. Thus, voice interfaces do not merely extend text models to a new modality but introduce distinct bias mechanisms tied to paralinguistic cues. Complementary survey evidence ($n=1,000$) shows that infrequent chatbot users are the most hesitant about undisclosed attribute inference and the most likely to disengage when such practices are revealed. To demonstrate a potential mitigation strategy, we show that pitch manipulation can systematically regulate gender-discriminatory outputs. Overall, our findings reveal a critical tension in AI development: efforts to expand accessibility through voice interfaces simultaneously create new pathways for discrimination, demanding that fairness and accessibility be addressed in tandem.
0
cs.CL Haroun Elleuch, Salima Mdhaffar, Yannick Estève et al. · Mar 23, 2026

SLURP-TN introduces a Spoken Language Understanding (SLU) dataset for Tunisian Arabic, a low-resource dialect. The authors translate and record six domains from the English SLURP corpus with 55 speakers across 18 geographic regions, emphasizing gender balance and code-switching phenomena. The dataset provides approximately five hours of audio across three acoustic conditions (clean, noisy, headphone) to enable robust benchmarking of ASR and SLU systems for dialectal Arabic.

Spoken Language Understanding (SLU) aims to extract the semantic information from the speech utterance of user queries. It is a core component in a task-oriented dialogue system. With the spectacular progress of deep neural network models and the evolution of pre-trained language models, SLU has obtained significant breakthroughs. However, only a few high-resource languages have taken advantage of this progress due to the absence of SLU resources. In this paper, we seek to mitigate this obstacle by introducing SLURP-TN. This dataset was created by recording 55 native speakers uttering sentences in Tunisian dialect, manually translated from six SLURP domains. The result is an SLU Tunisian dialect dataset that comprises 4165 sentences recorded into around 5 hours of acoustic material. We also develop a number of Automatic Speech Recognition and SLU models exploiting SLURP-TN. The dataset and baseline models are available at: https://huggingface.co/datasets/Elyadata/SLURP-TN.
0
cs.CV cs.AI Junrong Guo, Shancheng Fang, Yadong Qu et al. · Mar 23, 2026

This paper tackles the visual perception gap in automated text layout generation. While existing Multimodal Large Language Models (MLLMs) generate layout code (SVG/JSON) to render text on images, they operate blind to the actual rendered output, producing layouts with overlapping text, poor contrast, or misalignment. The authors propose Visual Feedback Layout Model (VFLM), which closes the loop by rendering generated SVGs and feeding the visual results back to the model for iterative reflection and refinement. The framework uses a two-stage pipeline—cold-start supervised fine-tuning followed by reinforcement learning with GRPO—and introduces a specialized layout reward model trained on fine-grained quality hierarchies. A surprising finding is that simple outcome-based rewards outperform complex process-oriented rewards that explicitly encode step-wise incentives.

Recent advances in Multimodal Large Language Models (MLLMs) have enabled automated generation of structured layouts from natural language descriptions. Existing methods typically follow a code-only paradigm that generates code to represent layouts, which are then rendered by graphic engines to produce final images. However, they are blind to the rendered visual outcome, making it difficult to guarantee readability and aesthetics. In this paper, we identify visual feedback as a critical factor in layout generation and propose Visual Feedback Layout Model (VFLM), a self-improving framework that leverages visual feedback for iterative refinement. VFLM is capable of performing adaptive reflective generation, which leverages visual information to reflect on previous issues and iteratively generates outputs until satisfactory quality is achieved. This is achieved through reinforcement learning with a visually grounded reward model that incorporates OCR accuracy. By rewarding only the final generated outcome, we can effectively stimulate the model's iterative and reflective generative capabilities. Experiments across multiple benchmarks show that VFLM consistently outperforms advanced MLLMs, existing layout models, and code-only baselines, establishing visual feedback as critical for design-oriented MLLMs. Our code and data are available at https://github.com/FolSpark/VFLM.
0
cs.LG Tian Xia · Mar 23, 2026

This paper tackles the Long-to-Short (L2S) model merging problem: combining a base LLM with a long-chain-of-thought reasoning model to preserve accuracy while drastically reducing output length. The core contribution is a theoretical framework proving that merging error is bounded by the per-layer Hessian norm (Proposition 1), which motivates using the diagonal Fisher Information Matrix (FIM) as a data-free proxy for assigning layer-adaptive merging coefficients. The resulting FIM-TIES method achieves state-of-the-art results on 5 of 6 benchmarks without requiring any domain-specific calibration data.

Model merging has emerged as a practical approach to combine capabilities of specialized large language models (LLMs) without additional training. In the Long-to-Short (L2S) scenario, merging a base model with a long-chain-of-thought reasoning model aims to preserve reasoning accuracy while reducing output length. Existing methods rely on Task Arithmetic and its variants, which implicitly assume that model outputs vary linearly with the merging coefficient, an assumption we show is systematically violated in L2S settings. We provide the first theoretical justification for layer-adaptive merging: we prove that merging error is bounded by a term proportional to the per-layer Hessian norm (Proposition 1), and establish that the Fisher Information Matrix (FIM) is a principled, computable proxy for this bound via the Fisher-Hessian equivalence at local optima. Building on this theory, we propose FIM-Merging, which computes diagonal FIM using only random token inputs (no domain-specific calibration data required) and uses it to assign per-layer merging coefficients. On the 7B L2S benchmark, FIM-TIES achieves state-of-the-art performance on five out of six evaluation benchmarks, including a +6.2 point gain on MATH500 over ACM-TIES (90.2 vs. 84.0), while requiring no calibration data. On the 1.5B benchmark, FIM-TIES achieves an average accuracy of 47.3, surpassing the previous best ACM-TIES (43.3) by +3.9 points, while reducing average response length by 91.9% relative to the long-CoT model. Our framework also provides a unified theoretical explanation for why existing layer-adaptive methods such as ACM empirically outperform uniform merging.
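To make "per-layer merging coefficients" concrete, here is a minimal sketch of layer-wise interpolation between two checkpoints. It assumes plain task-arithmetic interpolation (not TIES), and the coefficients are passed in by hand; how FIM-Merging maps diagonal Fisher values to coefficients is not described in the abstract, so that step is deliberately left out. The function name `merge_layerwise` and the toy layer names are hypothetical.

```python
def merge_layerwise(base, reasoning, coeffs):
    """Interpolate two models layer by layer.

    base, reasoning: dict mapping layer name -> list of float weights.
    coeffs: dict mapping layer name -> coefficient (assumed to be given;
            the FIM-based rule that would produce them is not modeled here).
    """
    merged = {}
    for name, base_w in base.items():
        lam = coeffs[name]
        # theta_merged = theta_base + lam * (theta_reasoning - theta_base)
        merged[name] = [b + lam * (r - b) for b, r in zip(base_w, reasoning[name])]
    return merged


# Toy example with two "layers" and different coefficients per layer.
base = {"layer0": [0.0, 1.0], "layer1": [2.0]}
reasoning = {"layer0": [1.0, 3.0], "layer1": [4.0]}
coeffs = {"layer0": 0.5, "layer1": 1.0}
merged = merge_layerwise(base, reasoning, coeffs)
```

Uniform merging is the special case where every layer shares one coefficient; the layer-adaptive setting only changes how `coeffs` is filled in.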
0
cs.CL Guanbao Liang, Yuanchen Bei, Sheng Zhou et al. · Mar 23, 2026

MemAPO addresses a critical limitation in automatic prompt optimization (APO): existing methods frame optimization as an isolated search for task-specific prompts, preventing knowledge reuse across tasks. The paper proposes reframing APO as a continual experience accumulation process using a dual-memory mechanism—Correct-Template Memory ($\mathcal{E}_{\mathrm{CTM}}$) for successful strategies and Error-Pattern Memory ($\mathcal{E}_{\mathrm{EPM}}$) for failure modes—that enables cross-task generalization while reducing optimization costs by approximately 57% compared to strong baselines.

Automatic prompt optimization is a promising approach for adapting large language models (LLMs) to downstream tasks, yet existing methods typically search for a specific prompt specialized to a fixed task. This paradigm limits generalization across heterogeneous queries and prevents models from accumulating reusable prompting knowledge over time. In this paper, we propose MemAPO, a memory-driven framework that reconceptualizes prompt optimization as generalizable and self-evolving experience accumulation. MemAPO maintains a dual-memory mechanism that distills successful reasoning trajectories into reusable strategy templates while organizing incorrect generations into structured error patterns that capture recurrent failure modes. Given a new prompt, the framework retrieves both relevant strategies and failure patterns to compose prompts that promote effective reasoning while discouraging known mistakes. Through iterative self-reflection and memory editing, MemAPO continuously updates its memory, enabling prompt optimization to improve over time rather than restarting from scratch for each task. Experiments on diverse benchmarks show that MemAPO consistently outperforms representative prompt optimization baselines while substantially reducing optimization cost.
0
cs.LG Zakaria Mhammedi, James Cohan · Mar 23, 2026

Hard-exploration problems in RL—such as Montezuma’s Revenge and sparse-reward robotic control—require finding rare trajectories where standard RL fails. This paper argues that using policy optimization to maximize intrinsic rewards is unnecessarily inefficient for mere state coverage. Instead, it proposes Go-With-Uncertainty (GowU), a tree-search method that decouples exploration from exploitation: it uses epistemic uncertainty to drive a Go-With-The-Winner particle population search, then distills discovered trajectories via supervised backward learning. The approach achieves state-of-the-art scores on hard Atari benchmarks with an order of magnitude fewer environment interactions than intrinsic-motivation baselines, and solves high-dimensional continuous-control tasks (Adroit, AntMaze) from pixels without demonstrations.

The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before.
0
astro-ph.CO astro-ph.IM cs.LG Hubert Leterme, Andreas Tersenov, Jalal Fadili et al. · Mar 23, 2026

This paper introduces PnPMass, a plug-and-play framework for weak lensing mass mapping that reconciles reconstruction accuracy with practical deployment constraints of upcoming Stage-IV surveys. The key innovation is a carefully chosen data-fidelity operator that decouples denoiser training from observation-specific noise statistics, enabling a single trained model to handle varying survey conditions without retraining. Coupled with moment-network-based uncertainty quantification and conformal calibration, the method offers fast inference with coverage guarantees, addressing limitations of both end-to-end deep learning and costly MCMC sampling approaches.

Upcoming stage-IV surveys such as Euclid and Rubin will deliver vast amounts of high-precision data, opening new opportunities to constrain cosmological models with unprecedented accuracy. A key step in this process is the reconstruction of the dark matter distribution from noisy weak lensing shear measurements. Current deep learning-based mass mapping methods achieve high reconstruction accuracy, but either require retraining a model for each new observed sky region (limiting practicality) or rely on slow MCMC sampling. Efficient exploitation of future survey data therefore calls for a new method that is accurate, flexible, and fast at inference. In addition, uncertainty quantification with coverage guarantees is essential for reliable cosmological parameter estimation. We introduce PnPMass, a plug-and-play approach for weak lensing mass mapping. The algorithm produces point estimates by alternating between a gradient descent step with a carefully chosen data fidelity term, and a denoising step implemented with a single deep learning model trained on simulated data corrupted by Gaussian white noise. We also propose a fast, sampling-free uncertainty quantification scheme based on moment networks, with calibrated error bars obtained through conformal prediction to ensure coverage guarantees. Finally, we benchmark PnPMass against both model-driven and data-driven mass mapping techniques. PnPMass achieves performance close to that of state-of-the-art deep-learning methods while offering fast inference (converging in just a few iterations) and requiring only a single training phase, independently of the noise covariance of the observations. It therefore combines flexibility, efficiency, and reconstruction accuracy, while delivering tighter error bars than existing approaches, making it well suited for upcoming weak lensing surveys.
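The alternation described above (a gradient step on a data-fidelity term, then a denoising step) can be sketched on a toy problem. This is only an illustration of the generic plug-and-play loop: the forward operator is assumed to be the identity, the data-fidelity term is assumed to be $\tfrac{1}{2}\|x-y\|^2$, and a soft-threshold function stands in for the trained deep denoiser. None of these choices come from PnPMass itself.

```python
import math


def soft_threshold(x, tau):
    """Stand-in denoiser (assumption): shrink each entry toward zero by tau."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in x]


def pnp_iterations(y, denoise, step=0.5, n_iter=20):
    """Generic plug-and-play loop: gradient step, then denoising step."""
    x = list(y)
    for _ in range(n_iter):
        # Gradient of 0.5 * ||x - y||^2 with an identity forward operator.
        grad = [xi - yi for xi, yi in zip(x, y)]
        x = [xi - step * gi for xi, gi in zip(x, grad)]
        # Any denoiser can be plugged in here.
        x = denoise(x)
    return x


y = [0.05, 2.0, -0.02]  # toy noisy observation
x_hat = pnp_iterations(y, lambda v: soft_threshold(v, 0.1))
```

Swapping the lambda for a learned denoiser leaves the loop unchanged, which is the property that lets one trained model be reused across observations.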
0
cs.CV Bahram Mohammadi, Ta Duc Huy, Afrouz Sheikholeslami et al. · Mar 22, 2026

Brain tumor segmentation from MRI scans faces challenges because the three target sub-regions—Whole Tumor (WT), Tumor Core (TC), and Enhancing Tumor (ET)—have ambiguous visual boundaries. This paper proposes TextCSP, a hierarchical framework that integrates radiological reports by replacing the standard single global text embedding with sub-region-aware prompts and a soft cascade decoder that enforces the anatomical hierarchy $ET \subset TC \subset WT$. The method builds on the TextBraTS baseline and achieves modest gains on its paired MRI-text dataset.

Brain tumor segmentation remains challenging because the three standard sub-regions, i.e., whole tumor (WT), tumor core (TC), and enhancing tumor (ET), often exhibit ambiguous visual boundaries. Integrating radiological description texts with imaging has shown promise. However, most multimodal approaches compress a report into a single global text embedding shared across all sub-regions, overlooking their distinct clinical characteristics. We propose TextCSP (text-modulated soft cascade architecture), a hierarchical text-guided framework that builds on the TextBraTS baseline with three novel components: (1) a text-modulated soft cascade decoder that predicts WT->TC->ET in a coarse-to-fine manner consistent with their anatomical containment hierarchy; (2) sub-region-aware prompt tuning, which uses learnable soft prompts with a LoRA-adapted BioBERT encoder to generate specialized text representations tailored for each sub-region; (3) text-semantic channel modulators that convert the aforementioned representations into channel-wise refinement signals, enabling the decoder to emphasize features aligned with clinically described patterns. Experiments on the TextBraTS dataset demonstrate consistent improvements across all sub-regions over state-of-the-art methods, by 1.7% and 6% on the main metrics, Dice and HD95, respectively.
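One simple way to respect the containment hierarchy $ET \subset TC \subset WT$ with soft predictions is to gate each finer region by the coarser one. The sketch below is an assumption about how such a cascade could look for a single voxel; it is not taken from TextCSP, and the function name `soft_cascade` and the logit inputs are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_cascade(z_wt, z_tc, z_et):
    """Per-voxel probabilities that satisfy p_ET <= p_TC <= p_WT.

    Each finer region is multiplied by the probability of the region
    that contains it, so the ordering holds by construction.
    """
    p_wt = sigmoid(z_wt)
    p_tc = p_wt * sigmoid(z_tc)   # TC can only be present inside WT
    p_et = p_tc * sigmoid(z_et)   # ET can only be present inside TC
    return p_wt, p_tc, p_et


p_wt, p_tc, p_et = soft_cascade(2.0, -1.0, 3.0)
```

Because every gate lies in (0, 1), a confident ET logit cannot push the ET probability above the TC or WT probability for the same voxel.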
0
cs.LG Shiyan Hu, Jianxin Jin, Yang Shu et al. · Mar 23, 2026

MindTS tackles multimodal time series anomaly detection by fusing numerical time series with text from two sources: endogenous text (LLM-generated descriptions of patch statistics) and exogenous text (external reports). The core idea is to align these heterogeneous modalities via contrastive learning and filter textual redundancy using an Information Bottleneck-inspired content condenser before cross-modal reconstruction. This matters because real-world anomalies often manifest in contextual text (e.g., policy changes affecting stock prices) that pure numerical models miss.

Time series anomaly detection plays a critical role in many dynamic systems. Despite its importance, previous approaches have primarily relied on unimodal numerical data, overlooking the importance of complementary information from other modalities. In this paper, we propose a novel multimodal time series anomaly detection model (MindTS) that focuses on addressing two key challenges: (1) how to achieve semantically consistent alignment across heterogeneous multimodal data, and (2) how to filter out redundant modality information to enhance cross-modal interaction effectively. To address the first challenge, we propose Fine-grained Time-text Semantic Alignment. It integrates exogenous and endogenous text information through cross-view text fusion and a multimodal alignment mechanism, achieving semantically consistent alignment between time and text modalities. For the second challenge, we introduce Content Condenser Reconstruction, which filters redundant information within the aligned text modality and performs cross-modal reconstruction to enable interaction. Extensive experiments on six real-world multimodal datasets demonstrate that the proposed MindTS achieves competitive or superior results compared to existing methods. The code is available at: https://github.com/decisionintelligence/MindTS.
0
cs.LG cs.CV Alois Bachmann · Mar 23, 2026

dynActivation addresses the rigidity of fixed activation functions by introducing per-layer trainable scalars that interpolate between a base nonlinearity and a linear path. The method adds only two parameters per layer ($\alpha_i$ and $\beta_i$) via $f_i(x) = \text{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, allowing adaptive nonlinearity allocation across depth. Results show strong vision benchmarks (+14% on CIFAR-10), robustness to extreme depth scaling (95%+ accuracy on 75-layer MNIST), and faster convergence (24% AUC reduction), though LLM perplexity gains vanish in long-run training.

This paper proposes dynActivation, a per-layer trainable activation defined as $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, where $\alpha_i$ and $\beta_i$ are lightweight learned scalars that interpolate between the base nonlinearity and a linear path, and $\mathrm{BaseAct}(x)$ represents any ReLU-like function. The static and dynamic ReLU-like variants are then compared across multiple vision tasks, language modeling tasks, and ablation studies. The results suggest that dynActivation variants tend to linearize deep layers while maintaining high performance, which can improve training efficiency by up to $+54\%$ over ReLU. On CIFAR-10, dynActivation(Mish) improves over static Mish by up to $+14.02\%$ on AttentionCNN with an average improvement of $+6.00\%$, and with a $24\%$ convergence-AUC reduction relative to Mish (2120 vs. 2785). In a 1-to-75-layer MNIST depth-scaling study, dynActivation never drops below $95\%$ test accuracy ($95.3$--$99.3\%$), while ReLU collapses below $80\%$ at 25 layers. Under FGSM at $\varepsilon{=}0.08$, dynActivation(Mish) incurs a $55.39\%$ accuracy drop versus $62.79\%$ for ReLU ($7.40\%$ advantage). Transferred to language modeling, a newly proposed dynActGLU variant achieves a $10.3\%$ relative perplexity reduction over SwiGLU at 5620 steps (4.047 vs. 4.514), though the gap vanishes at 34300 steps.
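The formula is easy to check by hand, so a scalar sketch may help. It evaluates $f_i(x) = \mathrm{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$ for fixed values of $\alpha_i$ and $\beta_i$; in the paper these are trained per layer, which this snippet does not attempt. The function names are illustrative, not the author's code.

```python
import math


def relu(x):
    return max(0.0, x)


def mish(x):
    # Mish: x * tanh(softplus(x)). exp() overflows for very large x,
    # so this simple form is only meant for moderate inputs.
    return x * math.tanh(math.log1p(math.exp(x)))


def dyn_activation(x, alpha, beta, base_act=relu):
    """f(x) = BaseAct(x) * (alpha - beta) + beta * x."""
    return base_act(x) * (alpha - beta) + beta * x


# alpha = 1, beta = 0 recovers the base activation.
as_base = dyn_activation(-2.0, alpha=1.0, beta=0.0)
# alpha = beta removes the nonlinear term, leaving the linear path beta * x.
as_linear = dyn_activation(-2.0, alpha=1.0, beta=1.0)
# Values in between blend the two paths.
blended = dyn_activation(-2.0, alpha=1.0, beta=0.5)
```

The two limiting cases show why a layer can "linearize" during training: as $\beta_i$ approaches $\alpha_i$, the nonlinear term is scaled toward zero.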
0
cs.LG Zhipeng Zhang, Zhenjie Yao, Kai Li et al. · Mar 23, 2026

This paper investigates a fundamental failure mode in learning systems: when feedback reliability is unobservable (latent), standard algorithms can converge stably to systematically incorrect solutions while exhibiting normal optimization behavior (decreasing loss, vanishing gradients). The authors formalize this as a scale-dependent identifiability problem—single-step feedback is insufficient to distinguish reliable from biased experience, yet trajectory-level statistics carry separable signals. They propose the Monitor–Trust–Regulator (MTR) framework, which maintains a slow-timescale trust variable inferred from learning dynamics to modulate updates, enabling recovery from persistent bias.

Learning systems are typically optimized by minimizing loss or maximizing reward, assuming that improvements in these signals reflect progress toward the true objective. However, when feedback reliability is unobservable, this assumption can fail, and learning algorithms may converge stably to incorrect solutions. This failure arises because single-step feedback does not reveal whether an experience is informative or persistently biased. When information is aggregated over learning trajectories, however, systematic differences between reliable and unreliable regimes can emerge. We propose a Monitor-Trust-Regulator (MTR) framework that infers reliability from learning dynamics and modulates updates through a slow-timescale trust variable. Across reinforcement learning and supervised learning settings, standard algorithms exhibit stable optimization behavior while learning incorrect solutions under latent unreliability, whereas trust-modulated systems reduce bias accumulation and improve recovery. These results suggest that learning dynamics are not only optimization traces but also a source of information about feedback reliability.
0
cs.CL Diego Miguel Lozano, Daryna Dementieva, Alexander Fraser · Mar 22, 2026

The paper introduces Dissimilar Span Detection (DSD), a new task aimed at explaining Semantic Textual Similarity (STS) scores by identifying specific text spans that differ in meaning between sentence pairs. To enable this research, the authors release the Span Similarity Dataset (SSD), containing 1,000 semi-automatically annotated samples validated by human annotators. They evaluate a broad range of approaches—including LIME, SHAP, proprietary LLMs, and supervised token classifiers—and find that while LLMs achieve the highest performance, the task remains challenging even for state-of-the-art models, with potential applications in paraphrase detection and fact-checking.

Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.
0
cs.CL cs.AI Ireh Kim, Tesia Sker, Chanwoo Kim · Mar 23, 2026

Large language models have historically lagged behind specialized encoder-decoder MT systems, but their superior context modeling makes them natural candidates for document-level translation. This paper tackles two key obstacles: the scarcity of high-quality document-level parallel corpora and LLM tendencies toward hallucinations and omissions. The authors propose a two-stage fine-tuning framework that first generates synthetic document-level data from summarization corpora via LLM augmentation, filters this data using sacreBLEU, COMET, and LaBSE cosine similarity, and then trains models first on sentence-level data before adapting to the filtered document corpus.

In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it with multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
0
cs.LG cs.CR Devashish Chaudhary, Sutharshan Rajasegarar, Shiva Raj Pokhrel et al. · Mar 23, 2026

This paper addresses the challenge of detecting network attacks in IoT environments while preserving data privacy and minimizing communication overhead. The authors propose a federated learning framework using lightweight autoencoders deployed directly on Raspberry Pi edge devices to detect anomalies in real-time through reconstruction error $\mathcal{E}(t)=\|x_{t}-\hat{x}_{t}\|^{2}$. A real-world testbed with ZigBee-enabled sensor nodes was constructed to evaluate the approach against redirection attacks, demonstrating that federated training can match centralized performance while significantly reducing data transmission from 4.5 MB to 378 KB.

The rapid expansion of the Internet of Things (IoT) and its integration with backbone networks have heightened the risk of security breaches. Traditional centralized approaches to anomaly detection, which require transferring large volumes of data to central servers, suffer from privacy, scalability, and latency limitations. This paper proposes a lightweight autoencoder-based anomaly detection framework designed for deployment on resource-constrained edge devices, enabling real-time detection while minimizing data transfer and preserving privacy. Federated learning is employed to train models collaboratively across distributed devices, where local training occurs on edge nodes and only model weights are aggregated at a central server. A real-world IoT testbed using Raspberry Pi sensor nodes was developed to collect normal and attack traffic data. The proposed federated anomaly detection system, implemented and evaluated on the testbed, demonstrates its effectiveness in accurately identifying network attacks. The communication overhead was reduced significantly while achieving comparable performance to the centralized method.
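The two mechanisms mentioned here, scoring by reconstruction error $\mathcal{E}(t)=\|x_{t}-\hat{x}_{t}\|^{2}$ and aggregating only model weights at a server, can be sketched in a few lines. The threshold rule and the sample-weighted averaging (FedAvg-style) are assumptions for illustration; the text above does not say which aggregation rule or threshold the authors use, and all function names are hypothetical.

```python
def reconstruction_error(x, x_hat):
    """Squared L2 distance between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def is_anomaly(x, x_hat, threshold):
    """Flag a sample when its error exceeds a threshold (assumed rule)."""
    return reconstruction_error(x, x_hat) > threshold


def federated_average(client_weights, client_sizes):
    """Sample-weighted mean of client weight vectors (FedAvg-style, assumed)."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


err = reconstruction_error([1.0, 2.0], [1.0, 4.0])
flag = is_anomaly([1.0, 2.0], [1.0, 4.0], threshold=1.0)
global_w = federated_average([[0.0, 2.0], [4.0, 6.0]], client_sizes=[1, 3])
```

Only the weight vectors cross the network in this scheme, which is where the reduction in transmitted data comes from: raw traffic samples stay on the edge device.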
0
cs.RO cs.LG Ruiqi Xian, Jing Liang, He Yin et al. · Mar 23, 2026

GaussianSSC is a two-stage, grid-native approach to semantic scene completion that brings Gaussian-weighted aggregation into a voxel and triplane pipeline instead of maintaining a separate Gaussian set. It reads as a solid incremental contribution: the Gaussian Anchoring mechanism and the two-stage design are supported by ablations, and the comparison with ETFormer and VoxFormer uses the same backbone. The main concerns are the single-frame setting, evaluation on SemanticKITTI only, a missing comparison with GaussianFormer and other Gaussian baselines, and efficiency trade-offs and failure modes that are not fully characterized. Implementation details and hyperparameters are mostly specified, but no code release is mentioned.

We present GaussianSSC, a two-stage, grid-native and triplane-guided approach to semantic scene completion (SSC) that injects the benefits of Gaussians without replacing the voxel grid or maintaining a separate Gaussian set. We introduce Gaussian Anchoring, a sub-pixel, Gaussian-weighted image aggregation over fused FPN features that tightens voxel-image alignment and improves monocular occupancy estimation. We further convert point-like voxel features into a learned per-voxel Gaussian field and refine triplane features via a triplane-aligned Gaussian-Triplane Refinement module that combines local gathering (target-centric) and global aggregation (source-centric). This directional, anisotropic support captures surface tangency, scale, and occlusion-aware asymmetry while preserving the efficiency of triplane representations. On SemanticKITTI (Behley et al., 2019), GaussianSSC improves Stage 1 occupancy by +1.0% Recall, +2.0% Precision, and +1.8% IoU over state-of-the-art baselines, and improves Stage 2 semantic prediction by +1.8% IoU and +0.8% mIoU.
cs.CV cs.AI Nour Alhuda Albashir, Lars Pernickel, Danial Hamoud et al. · Mar 23, 2026

Autonomous vehicles struggle with perception in adverse weather. This paper proposes LRC-WeatherNet, a lightweight fusion network that combines LiDAR, RADAR, and camera data via early BEV fusion and mid-level gating to classify weather conditions in real time. The approach achieves $86.66\%$ accuracy on the MSU-4S dataset with $7.13\,\mathrm{ms}$ inference time, demonstrating that adaptive multi-modal fusion outperforms unimodal baselines, though dataset limitations restrict generalization to rare weather events.

Autonomous vehicles face major perception and navigation challenges in adverse weather such as rain, fog, and snow, which degrade the performance of LiDAR, RADAR, and RGB camera sensors. While each sensor type offers unique strengths, such as RADAR robustness in poor visibility and LiDAR precision in clear conditions, they also suffer distinct limitations when exposed to environmental obstructions. This study proposes LRC-WeatherNet, a novel multi-sensor fusion framework that integrates LiDAR, RADAR, and camera data for real-time classification of weather conditions. By employing both early fusion using a unified Bird's Eye View representation and mid-level gated fusion of modality-specific feature maps, our approach adapts to the varying reliability of each sensor under changing weather. Evaluated on the extensive MSU-4S dataset covering nine weather types, LRC-WeatherNet achieves superior classification performance and computational efficiency, significantly outperforming unimodal baselines in adverse conditions. This work is the first to combine all three modalities for robust, real-time weather classification in autonomous driving. We release our trained models and source code at https://github.com/nouralhudaalbashir/LRC-WeatherNet.
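As a rough illustration of what mid-level gated fusion can look like, the sketch below weights one feature vector per modality with softmax gates and sums them. The abstract does not specify the gate design, so the linear gate, the parameter names `gate_w` and `gate_b`, and the use of flat vectors instead of feature maps are assumptions, not the LRC-WeatherNet architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(features, gate_w, gate_b):
    """Fuse per-modality features with learned reliability gates.

    features : (M, D) array, one D-dim feature vector per modality
               (e.g. LiDAR, RADAR, camera); a simplification of feature maps
    gate_w   : (M * D, M) hypothetical gate weights
    gate_b   : (M,) hypothetical gate biases
    Returns the fused (D,) vector and the (M,) gate values.
    """
    logits = features.reshape(-1) @ gate_w + gate_b  # one logit per modality
    gates = softmax(logits)                          # non-negative, sums to 1
    fused = gates @ features                         # convex combination of modalities
    return fused, gates

# Usage with random stand-in features and untrained gate parameters.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 8))
fused, gates = gated_fusion(feats, rng.normal(size=(24, 3)) * 0.1, np.zeros(3))
```

In a trained network the gate parameters would be learned, so a modality whose features look unreliable in the current weather would receive a small gate and contribute little to the fused vector.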
cs.SD cs.LG Kazuki Matsumoto, Ren Uchida, Kohei Yatabe · Mar 23, 2026

Existing Lipschitz-constrained DNNs don't directly apply to audio amplitude modifiers (AMs) because the complex-valued reconstruction breaks continuity. This paper proves that AMs are generally not Lipschitz continuous, derives sufficient conditions for Lipschitz continuity (Assumption 3), and proposes LipsAM architectures that enforce these bounds via element-wise minimum and ReLU operations. The work matters because it enables certified robust amplitude modification and stabilizes Plug-and-Play algorithms where conventional AMs diverge.

The robustness of deep neural networks (DNNs) can be certified through their Lipschitz continuity, which has made the construction of Lipschitz-continuous DNNs an active research field. However, DNNs for audio processing have not been a major focus due to their poor compatibility with existing results. In this paper, we consider the amplitude modifier (AM), a popular architecture for handling audio signals, and propose its Lipschitz-continuous variants, which we refer to as LipsAM. We prove a sufficient condition for an AM to be Lipschitz continuous and propose two architectures as examples of LipsAM. The proposed architectures were applied to a Plug-and-Play algorithm for speech dereverberation, and their improved stability is demonstrated through numerical experiments.
cs.LG stat.ML Paolo Toccaceli · Mar 23, 2026

This paper addresses conditional distribution estimation for regression by proposing a non-parametric binning approach. Observations sorted by a one-dimensional covariate are partitioned into contiguous bins via dynamic programming, minimizing a closed-form leave-one-out CRPS cost function. The method produces conformal prediction sets with finite-sample marginal coverage guarantees and connects to Venn predictors, offering substantially narrower intervals than standard split-conformal methods on heteroscedastic and bimodal benchmarks.

We propose a method for non-parametric conditional distribution estimation based on partitioning covariate-sorted observations into contiguous bins and using the within-bin empirical CDF as the predictive distribution. Bin boundaries are chosen to minimise the total leave-one-out Continuous Ranked Probability Score (LOO-CRPS), which admits a closed-form cost function with $O(n^2 \log n)$ precomputation and $O(n^2)$ storage; the globally optimal $K$-partition is recovered by a dynamic programme in $O(n^2 K)$ time. Minimising within-sample LOO-CRPS turns out to be inappropriate for selecting $K$ because it results in in-sample optimism, so we instead select $K$ by evaluating test CRPS on an alternating held-out split, which yields a U-shaped criterion with a well-defined minimum. Having selected $K^*$ and fitted the full-data partition, we form two complementary predictive objects: the Venn prediction band and a conformal prediction set based on CRPS as the nonconformity score, which carries a finite-sample marginal coverage guarantee at any prescribed level $\varepsilon$. On real benchmarks against split-conformal competitors (Gaussian split conformal, CQR, and CQR-QRF), the method produces substantially narrower prediction intervals while maintaining near-nominal coverage.
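The partitioning step can be sketched as a small dynamic programme. This is an illustrative sketch, not the paper's code: it computes each bin's LOO-CRPS by brute force rather than with the closed-form cost the abstract mentions, and the function names are hypothetical. It uses the standard identity $\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ for the empirical CDF.

```python
import numpy as np

def crps_ecdf(sample, y):
    """CRPS of the empirical CDF of `sample`, evaluated at observation `y`."""
    sample = np.asarray(sample, dtype=float)
    spread = np.abs(sample[:, None] - sample[None, :]).mean()
    return np.abs(sample - y).mean() - 0.5 * spread

def loo_crps_cost(y_bin):
    """Total leave-one-out CRPS inside one bin (brute force)."""
    m = len(y_bin)
    if m < 2:
        return np.inf  # leaving one out must leave at least one point
    return sum(crps_ecdf(np.delete(y_bin, i), y_bin[i]) for i in range(m))

def optimal_partition(y_sorted, K):
    """Best split of covariate-sorted responses into K contiguous bins.

    Returns (total cost, boundaries), where bin b is y_sorted[bounds[b]:bounds[b+1]].
    """
    y_sorted = np.asarray(y_sorted, dtype=float)
    n = len(y_sorted)
    # cost[i, j] = LOO-CRPS of the bin y_sorted[i:j]
    cost = np.full((n + 1, n + 1), np.inf)
    for i in range(n):
        for j in range(i + 2, n + 1):
            cost[i, j] = loo_crps_cost(y_sorted[i:j])
    # dp[k, j] = minimal cost of covering the first j points with k bins
    dp = np.full((K + 1, n + 1), np.inf)
    back = np.zeros((K + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, n + 1):
            cand = dp[k - 1, :j] + cost[:j, j]
            i = int(np.argmin(cand))
            dp[k, j], back[k, j] = cand[i], i
    # Walk the back-pointers to recover the bin boundaries.
    bounds, j = [n], n
    for k in range(K, 0, -1):
        j = int(back[k, j])
        bounds.append(j)
    return float(dp[K, n]), bounds[::-1]
```

Given the cost table, the recursion itself runs in $O(n^2 K)$ time, matching the complexity quoted above; the brute-force cost table is far slower than the closed form and is only suitable for small $n$.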
cs.LG Weilin Wan, Jingtao Han, Weizhong Zhang et al. · Mar 23, 2026

This paper tackles the combinatorial explosion in Mixture-of-Experts (MoE) architecture design, where traditional scaling laws either add too many variables to fit reliably or isolate MoE components while ignoring global interactions. The authors propose a holistic framework that uses algebraic constraints and a rank-preserving property of the hidden dimension $d$ to collapse the search space from $\mathcal{O}(n^{16})$ to manageable two-phase searches of $\mathcal{O}(n^3)+\mathcal{O}(n^2)$. They derive closed-form scaling laws mapping compute budgets to optimal configurations across $10^{18}$ to $3 \times 10^{20}$ FLOPs, revealing that near-optimal architectural bands widen at larger scales—providing actionable guidance for resource-efficient MoE deployment.

Scaling laws for Large Language Models govern macroscopic resource allocation, yet translating them into precise Mixture-of-Experts (MoE) architectural configurations remains an open problem due to the combinatorially vast design space. Existing MoE scaling studies are constrained by experimental budgets to either augment scaling formulas with extra MoE variables, risking unreliable fits, or fix all non-MoE factors, ignoring global interactions. We propose a reusable framework for holistic MoE architectural optimization that bridges this gap. We first show that FLOPs per token alone is an inadequate fairness metric for MoE models because differing computational densities across layer types can inflate parameters without proportional compute cost, and establish a joint constraint triad of FLOPs per token, active parameters, and total parameters. We then reduce the 16-dimensional architectural search space to two sequential low-dimensional phases through algebraic constraints and a rank-preserving property of the hidden dimension. Validated across hundreds of MoE models spanning six orders of magnitude in compute, our framework yields robust scaling laws that map any compute budget to a complete, optimal MoE architecture. A key finding is that the near-optimal configuration band widens with scale, giving practitioners quantitative flexibility to balance scaling law recommendations against infrastructure constraints.
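One simple way to see why FLOPs per token cannot pin down an MoE configuration on its own is to count the feed-forward experts alone. The sketch below is a back-of-the-envelope calculation under stated assumptions, not the paper's accounting: it ignores attention, the router, embeddings, and biases, assumes two projection matrices per expert, and uses the common approximation of about 2 FLOPs per active weight per token in the forward pass. The function name is hypothetical.

```python
def moe_ffn_budget(d_model, d_ff, n_experts, top_k):
    """Rough parameter and compute count for the expert FFNs of one MoE layer."""
    per_expert = 2 * d_model * d_ff        # up and down projections, no biases
    total_params = n_experts * per_expert  # everything that must be stored
    active_params = top_k * per_expert     # weights actually used per token
    flops_per_token = 2 * active_params    # one multiply and one add per weight
    return {
        "total_params": total_params,
        "active_params": active_params,
        "flops_per_token": flops_per_token,
    }

# Same routing width, eight times as many experts.
small = moe_ffn_budget(d_model=1024, d_ff=4096, n_experts=8, top_k=2)
large = moe_ffn_budget(d_model=1024, d_ff=4096, n_experts=64, top_k=2)
```

Here `small` and `large` have identical FLOPs per token and active parameters, yet `large` stores eight times as many parameters, which is the kind of mismatch a joint constraint on FLOPs per token, active parameters, and total parameters is meant to rule out.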
cs.AI cs.CL Yiling Wu · Mar 23, 2026

This paper distinguishes different forms of reasoning by the structural properties they demand from underlying representational systems. The core insight is that deduction requires four specific properties (operability, consistency, structural preservation, and compositionality) that cannot be secured through mere statistical scaling. This has significant implications for AI systems and cognitive science, providing a principled boundary between reasoning that can rely on associative approximations versus reasoning requiring structural guarantees.

Different types of reasoning impose different structural demands on representational systems, yet no systematic account of these demands exists across psychology, AI, and philosophy of mind. I propose a framework identifying four structural properties of representational systems: operability, consistency, structural preservation, and compositionality. These properties are demanded to different degrees by different forms of reasoning, from induction through analogy and causal inference to deduction and formal logic. Each property excludes a distinct class of reasoning failure. The analysis reveals a principal structural boundary: reasoning types below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied. Scaling statistical learning without structural reorganization is insufficient to cross this boundary, because the structural guarantees required by deductive reasoning cannot be approximated through probabilistic means. Converging evidence from AI evaluation, developmental psychology, and cognitive neuroscience supports the framework at different levels of directness. Three testable predictions are derived, including compounding degradation, selective vulnerability to targeted structural disruption, and irreducibility under scaling. The framework is a necessary-condition account, agnostic about representational format, that aims to reorganize existing debates rather than close them.