Audio-enabled large language models promise to democratize AI access for users with disabilities or limited literacy, but voice interfaces introduce immutable paralinguistic cues—pitch, timbre, prosody—that carry demographic signals. This paper demonstrates that state-of-the-art audio LLMs systematically discriminate based on speaker voice, assigning gender-stereotyped adjectives and professions solely from acoustic features. Crucially, the authors show that voice inputs amplify bias beyond text-only baselines, with models exhibiting stronger stereotypical associations when processing speech than when processing equivalent text with gendered name cues. The study establishes a causal link via pitch manipulation experiments and surveys 1,000 users to reveal that those who would benefit most from voice accessibility are often most hesitant about the attendant privacy and discrimination risks.
SLURP-TN introduces a Spoken Language Understanding (SLU) dataset for Tunisian Arabic, a low-resource dialect. The authors translate and record six domains from the English SLURP corpus with 55 speakers across 18 geographic regions, emphasizing gender balance and code-switching phenomena. The dataset provides approximately five hours of audio across three acoustic conditions (clean, noisy, headphone) to enable robust benchmarking of ASR and SLU systems for dialectal Arabic.
This paper tackles the visual perception gap in automated text layout generation. While existing Multimodal Large Language Models (MLLMs) generate layout code (SVG/JSON) to render text on images, they operate blind to the actual rendered output, producing layouts with overlapping text, poor contrast, or misalignment. The authors propose Visual Feedback Layout Model (VFLM), which closes the loop by rendering generated SVGs and feeding the visual results back to the model for iterative reflection and refinement. The framework uses a two-stage pipeline—cold-start supervised fine-tuning followed by reinforcement learning with GRPO—and introduces a specialized layout reward model trained on fine-grained quality hierarchies. A surprising finding is that simple outcome-based rewards outperform complex process-oriented rewards that explicitly encode step-wise incentives.
This paper tackles the Long-to-Short (L2S) model merging problem: combining a base LLM with a long-chain-of-thought reasoning model to preserve accuracy while drastically reducing output length. The core contribution is a theoretical framework proving that merging error is bounded by the per-layer Hessian norm (Proposition 1), which motivates using the diagonal Fisher Information Matrix (FIM) as a data-free proxy for assigning layer-adaptive merging coefficients. The resulting FIM-TIES method achieves state-of-the-art results on 5 of 6 benchmarks without requiring any domain-specific calibration data.
MemAPO addresses a critical limitation in automatic prompt optimization (APO): existing methods frame optimization as an isolated search for task-specific prompts, preventing knowledge reuse across tasks. The paper proposes reframing APO as a continual experience accumulation process using a dual-memory mechanism—Correct-Template Memory ($\mathcal{E}_{\mathrm{CTM}}$) for successful strategies and Error-Pattern Memory ($\mathcal{E}_{\mathrm{EPM}}$) for failure modes—that enables cross-task generalization while reducing optimization costs by approximately 57% compared to strong baselines.
Hard-exploration problems in RL—such as Montezuma’s Revenge and sparse-reward robotic control—require finding rare trajectories where standard RL fails. This paper argues that using policy optimization to maximize intrinsic rewards is unnecessarily inefficient for mere state coverage. Instead, it proposes Go-With-Uncertainty (GowU), a tree-search method that decouples exploration from exploitation: it uses epistemic uncertainty to drive a Go-With-The-Winner particle population search, then distills discovered trajectories via supervised backward learning. The approach achieves state-of-the-art scores on hard Atari benchmarks with an order of magnitude fewer environment interactions than intrinsic-motivation baselines, and solves high-dimensional continuous-control tasks (Adroit, AntMaze) from pixels without demonstrations.
This paper introduces PnPMass, a plug-and-play framework for weak lensing mass mapping that reconciles reconstruction accuracy with the practical deployment constraints of upcoming Stage-IV surveys. The key innovation is a carefully chosen data-fidelity operator that decouples denoiser training from observation-specific noise statistics, enabling a single trained model to handle varying survey conditions without retraining. Coupled with moment-network-based uncertainty quantification and conformal calibration, the method offers fast inference with coverage guarantees, addressing limitations of both end-to-end deep learning and costly MCMC sampling approaches.
Brain tumor segmentation from MRI scans faces challenges because the three target sub-regions—Whole Tumor (WT), Tumor Core (TC), and Enhancing Tumor (ET)—have ambiguous visual boundaries. This paper proposes TextCSP, a hierarchical framework that integrates radiological reports by replacing the standard single global text embedding with sub-region-aware prompts and a soft cascade decoder that enforces the anatomical hierarchy $ET \subset TC \subset WT$. The method builds on the TextBraTS baseline and achieves modest gains on its paired MRI-text dataset.
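The hierarchy $ET \subset TC \subset WT$ can be illustrated with a minimal sketch. It assumes a multiplicative cascade over per-voxel probabilities, which is one common way to make nested predictions consistent and is not necessarily the decoder TextCSP uses. The function name `soft_cascade` and the inputs `q_wt`, `q_tc`, `q_et` are hypothetical.

```python
def soft_cascade(q_wt, q_tc, q_et):
    """Combine three raw per-voxel probabilities into nested ones.

    Assumption (not from the paper): each inner region's probability is
    the product of its own raw score and the enclosing region's final
    probability, so p_et <= p_tc <= p_wt holds by construction.
    """
    for q in (q_wt, q_tc, q_et):
        if not 0.0 <= q <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    p_wt = q_wt           # Whole Tumor: outermost region
    p_tc = p_wt * q_tc    # Tumor Core: cannot exceed Whole Tumor
    p_et = p_tc * q_et    # Enhancing Tumor: cannot exceed Tumor Core
    return p_wt, p_tc, p_et


if __name__ == "__main__":
    print(soft_cascade(0.9, 0.5, 0.2))
```

Because every factor lies in $[0, 1]$, the ordering holds for any raw scores, which is the sense in which such a cascade is "soft": it constrains probabilities rather than thresholded masks.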
MindTS tackles multimodal time series anomaly detection by fusing numerical time series with text from two sources: endogenous text (LLM-generated descriptions of patch statistics) and exogenous text (external reports). The core idea is to align these heterogeneous modalities via contrastive learning and filter textual redundancy using an Information Bottleneck-inspired content condenser before cross-modal reconstruction. This matters because real-world anomalies often manifest in contextual text (e.g., policy changes affecting stock prices) that pure numerical models miss.
dynActivation addresses the rigidity of fixed activation functions by introducing per-layer trainable scalars that interpolate between a base nonlinearity and a linear path. The method adds only two parameters per layer ($\alpha_i$ and $\beta_i$) via $f_i(x) = \text{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$, allowing adaptive nonlinearity allocation across depth. Reported results include gains on vision benchmarks (+14% on CIFAR-10), robustness to extreme depth scaling (95%+ accuracy on 75-layer MNIST), and faster convergence (24% AUC reduction), though LLM perplexity gains vanish in long-run training.
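The formula $f_i(x) = \text{BaseAct}(x)(\alpha_i - \beta_i) + \beta_i x$ can be evaluated directly. The sketch below is a scalar illustration only: the choice of ReLU as the base activation and the function names are assumptions, and in practice $\alpha_i$ and $\beta_i$ would be trainable parameters inside a deep learning framework.

```python
def relu(x):
    """Example base nonlinearity (an assumption; any BaseAct would do)."""
    return max(0.0, x)


def dyn_activation(x, alpha, beta, base_act=relu):
    """Evaluate f(x) = BaseAct(x) * (alpha - beta) + beta * x for one layer."""
    return base_act(x) * (alpha - beta) + beta * x


if __name__ == "__main__":
    x = -2.0
    # alpha = 1, beta = 0 recovers the base activation.
    print(dyn_activation(x, alpha=1.0, beta=0.0))
    # alpha = beta = 1 removes the nonlinearity and leaves the linear path.
    print(dyn_activation(x, alpha=1.0, beta=1.0))
    # With alpha = 1, intermediate beta blends the two paths.
    print(dyn_activation(x, alpha=1.0, beta=0.5))
```

With $\alpha_i = 1$ the expression reduces to $(1 - \beta_i)\,\text{BaseAct}(x) + \beta_i x$, which makes the interpolation between the nonlinear and linear paths explicit.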
This paper investigates a fundamental failure mode in learning systems: when feedback reliability is unobservable (latent), standard algorithms can converge stably to systematically incorrect solutions while exhibiting normal optimization behavior (decreasing loss, vanishing gradients). The authors formalize this as a scale-dependent identifiability problem—single-step feedback is insufficient to distinguish reliable from biased experience, yet trajectory-level statistics carry separable signals. They propose the Monitor–Trust–Regulator (MTR) framework, which maintains a slow-timescale trust variable inferred from learning dynamics to modulate updates, enabling recovery from persistent bias.
The paper introduces Dissimilar Span Detection (DSD), a new task aimed at explaining Semantic Textual Similarity (STS) scores by identifying specific text spans that differ in meaning between sentence pairs. To enable this research, the authors release the Span Similarity Dataset (SSD), containing 1,000 semi-automatically annotated samples validated by human annotators. They evaluate a broad range of approaches—including LIME, SHAP, proprietary LLMs, and supervised token classifiers—and find that while LLMs achieve the highest performance, the task remains challenging even for state-of-the-art models, with potential applications in paraphrase detection and fact-checking.
Large language models have historically lagged behind specialized encoder-decoder MT systems, but their superior context modeling makes them natural candidates for document-level translation. This paper tackles two key obstacles: the scarcity of high-quality document-level parallel corpora and LLM tendencies toward hallucinations and omissions. The authors propose a two-stage fine-tuning framework. They generate synthetic document-level data from summarization corpora via LLM augmentation and filter it using sacreBLEU, COMET, and LaBSE cosine similarity; they then train models on sentence-level data before adapting them to the filtered document corpus.
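The filtering step can be sketched as a threshold test over three precomputed scores. Everything specific here is an assumption: the threshold values, the rule that all three criteria must pass, and the helper names are hypothetical, and the sketch takes scores and embeddings as plain numbers instead of calling the sacreBLEU, COMET, or LaBSE libraries.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


# Hypothetical thresholds, chosen only for illustration.
THRESHOLDS = {"bleu": 20.0, "comet": 0.75, "labse": 0.80}


def keep_pair(bleu, comet, src_emb, tgt_emb, thresholds=THRESHOLDS):
    """Keep a synthetic document pair only if all three scores pass.

    Assumption: the criteria are combined conjunctively; the summary
    does not say how the three filters interact.
    """
    labse = cosine_similarity(src_emb, tgt_emb)
    return (
        bleu >= thresholds["bleu"]
        and comet >= thresholds["comet"]
        and labse >= thresholds["labse"]
    )


if __name__ == "__main__":
    print(keep_pair(30.0, 0.80, [1.0, 0.0], [1.0, 0.0]))
    print(keep_pair(10.0, 0.80, [1.0, 0.0], [1.0, 0.0]))
```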
This paper addresses the challenge of detecting network attacks in IoT environments while preserving data privacy and minimizing communication overhead. The authors propose a federated learning framework using lightweight autoencoders deployed directly on Raspberry Pi edge devices to detect anomalies in real time through reconstruction error $\mathcal{E}(t)=\|x_{t}-\hat{x}_{t}\|^{2}$. A real-world testbed with ZigBee-enabled sensor nodes was constructed to evaluate the approach against redirection attacks, demonstrating that federated training can match centralized performance while reducing data transmission from 4.5 MB to 378 KB (roughly a 92% reduction).
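The reconstruction error $\mathcal{E}(t)=\|x_{t}-\hat{x}_{t}\|^{2}$ translates directly into code. In the sketch below, the decision rule (flag a sample when the error exceeds a fixed `threshold`) is an assumption, since the summary does not state how the threshold is chosen, and the autoencoder output `x_hat` is passed in as a plain list.

```python
def reconstruction_error(x, x_hat):
    """Squared Euclidean distance between an input and its reconstruction."""
    if len(x) != len(x_hat):
        raise ValueError("input and reconstruction must have equal length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def is_anomaly(x, x_hat, threshold):
    """Flag a sample whose reconstruction error exceeds the threshold.

    The threshold is a hypothetical parameter; it would typically be
    calibrated on errors observed for benign traffic.
    """
    return reconstruction_error(x, x_hat) > threshold


if __name__ == "__main__":
    sample = [1.0, 2.0]
    reconstructed = [1.0, 0.0]
    print(reconstruction_error(sample, reconstructed))
    print(is_anomaly(sample, reconstructed, threshold=3.0))
```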
Sections:
1. Verdict: Overall assessment - solid incremental contribution, hybrid approach is interesting, results are good but limited scope.
2. What holds up: Gaussian anchoring mechanism, two-stage design, ablation studies showing component effectiveness.
3. Main concerns: Single-frame limitation, dataset limitation (only SemanticKITTI), missing comparison with GaussianFormer, efficiency trade-offs not fully characterized, limited discussion of failure modes.
4. Evidence and comparison: Fair comparison with ETFormer/VoxFormer using same backbone, but missing key Gaussian baselines; ablations validate design choices; qualitative results show improvements.
5. Reproducibility: Good implementation details provided, standard dataset, but no code release mentioned; hyperparameters mostly specified.
Autonomous vehicles struggle with adverse weather perception. This paper proposes LRC-WeatherNet, a lightweight fusion network combining LiDAR, RADAR, and camera via early BEV fusion and mid-level gating to classify weather conditions in real-time. The approach achieves $86.66\%$ accuracy on the MSU-4S dataset with $7.13\,\mathrm{ms}$ inference, demonstrating that adaptive multi-modal fusion outperforms unimodal baselines, though dataset limitations restrict generalization to rare weather events.
Existing Lipschitz-constrained DNNs don't directly apply to audio amplitude modifiers (AMs) because the complex-valued reconstruction breaks continuity. This paper proves that AMs are generally not Lipschitz continuous, derives sufficient conditions for Lipschitz continuity (Assumption 3), and proposes LipsAM architectures that enforce these bounds via element-wise minimum and ReLU operations. The work matters because it enables certified robust amplitude modification and stabilizes Plug-and-Play algorithms where conventional AMs diverge.
This paper addresses conditional distribution estimation for regression by proposing a non-parametric binning approach. Observations sorted by a one-dimensional covariate are partitioned into contiguous bins via dynamic programming, minimizing a closed-form leave-one-out CRPS cost function. The method produces conformal prediction sets with finite-sample marginal coverage guarantees and connects to Venn predictors, offering substantially narrower intervals than standard split-conformal methods on heteroscedastic and bimodal benchmarks.
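The partitioning step can be sketched as a dynamic program over contiguous bins. This is an illustration under stated assumptions and not the authors' algorithm: the leave-one-out CRPS is computed by brute force with the standard ensemble formula instead of the paper's closed form, the number of bins is left free, and `min_size` (at least 2, so every held-out point has a non-empty ensemble) is a hypothetical parameter.

```python
def crps_ensemble(ensemble, y):
    """CRPS of the empirical distribution of `ensemble` at observation y."""
    m = len(ensemble)
    spread_to_obs = sum(abs(z - y) for z in ensemble) / m
    internal_spread = sum(abs(a - b) for a in ensemble for b in ensemble)
    return spread_to_obs - internal_spread / (2 * m * m)


def loo_crps_cost(values):
    """Sum of leave-one-out CRPS scores within one bin (brute force)."""
    total = 0.0
    for i, y in enumerate(values):
        rest = values[:i] + values[i + 1:]
        total += crps_ensemble(rest, y)
    return total


def optimal_bins(y_sorted_by_x, min_size=2):
    """Partition responses (already sorted by the covariate) into
    contiguous bins minimizing the total leave-one-out CRPS cost."""
    n = len(y_sorted_by_x)
    if min_size < 2 or n < min_size:
        raise ValueError("need min_size >= 2 and at least min_size points")
    inf = float("inf")
    best = [inf] * (n + 1)   # best[e]: minimal cost of binning the first e points
    prev = [-1] * (n + 1)    # prev[e]: start index of the last bin in that solution
    best[0] = 0.0
    for end in range(min_size, n + 1):
        for start in range(0, end - min_size + 1):
            if best[start] == inf:
                continue
            cost = best[start] + loo_crps_cost(y_sorted_by_x[start:end])
            if cost < best[end]:
                best[end] = cost
                prev[end] = start
    bins = []
    end = n
    while end > 0:           # backtrack through the stored bin boundaries
        start = prev[end]
        bins.append((start, end))
        end = start
    bins.reverse()
    return bins, best[n]


if __name__ == "__main__":
    y = [0.0, 0.1, 0.2, 0.1, 10.0, 10.1, 10.2, 10.1]
    print(optimal_bins(y))
```

The brute-force cost makes this sketch far slower than a closed-form version would be, but it shows why the leave-one-out criterion discourages both very wide bins (poor fit) and very small ones (held-out points are predicted from too few neighbors).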
This paper tackles the combinatorial explosion in Mixture-of-Experts (MoE) architecture design, where traditional scaling laws either add too many variables to fit reliably or isolate MoE components while ignoring global interactions. The authors propose a holistic framework that uses algebraic constraints and a rank-preserving property of the hidden dimension $d$ to collapse the search space from $\mathcal{O}(n^{16})$ to manageable two-phase searches of $\mathcal{O}(n^3)+\mathcal{O}(n^2)$. They derive closed-form scaling laws mapping compute budgets to optimal configurations across $10^{18}$ to $3 \times 10^{20}$ FLOPs, revealing that near-optimal architectural bands widen at larger scales—providing actionable guidance for resource-efficient MoE deployment.
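The scale of the reduction from $\mathcal{O}(n^{16})$ to $\mathcal{O}(n^3)+\mathcal{O}(n^2)$ is easy to check numerically. The sketch below treats the asymptotic expressions as exact counts with unit constants, which is an assumption made purely for illustration; `n` stands for the number of candidate values per design variable.

```python
def search_costs(n):
    """Return (joint, two_phase) configuration counts for n values per variable.

    Assumption: constants hidden by the big-O notation are taken as 1.
    """
    joint = n ** 16               # exhaustive search over all 16 variables
    two_phase = n ** 3 + n ** 2   # the two sequential reduced searches
    return joint, two_phase


if __name__ == "__main__":
    for n in (4, 10):
        joint, two_phase = search_costs(n)
        print(n, joint, two_phase, joint // two_phase)
```

Under that assumption, even $n = 10$ candidate values per variable gives $10^{16}$ joint configurations against 1,100 for the two-phase search.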
This paper distinguishes different forms of reasoning by the structural properties they demand from underlying representational systems. The core insight is that deduction requires four specific properties (operability, consistency, structural preservation, and compositionality) that cannot be secured through mere statistical scaling. This has significant implications for AI systems and cognitive science, providing a principled boundary between reasoning that can rely on associative approximations and reasoning that requires structural guarantees.