Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
482 papers
Trending mixes fresh papers with community signal.
cs.LG cs.AI Bulent Haznedar, Levent Karacan · Mar 23, 2026

FISFormer proposes replacing the dot-product self-attention in Transformers with a Sugeno-type Fuzzy Inference System (FIS) for time series forecasting. Instead of computing query-key similarities, the model fuzzifies tokens using learnable Gaussian membership functions, applies fuzzy rules, and defuzzifies to produce interaction weights. The paper suggests this approach captures uncertainty and nonlinearity better than standard attention, reporting state-of-the-art results on benchmarks like ETT, ECL, and Weather.

Transformers have achieved remarkable progress in time series forecasting, yet their reliance on deterministic dot-product attention limits their capacity to model uncertainty and nonlinear dependencies across multivariate temporal dimensions. To address this limitation, we propose FISFormer, a Fuzzy Inference System-driven Transformer that replaces conventional attention with a FIS Interaction mechanism. In this framework, each query-key pair undergoes a fuzzy inference process for every feature dimension, where learnable membership functions and rule-based reasoning estimate token-wise relational strengths. These FIS-derived interaction weights capture uncertainty and provide interpretable, continuous mappings between tokens. A softmax operation is applied along the token axis to normalize these weights, which are then combined with the corresponding value features through element-wise multiplication to yield the final context-enhanced token representations. This design fuses the interpretability and uncertainty modeling of fuzzy logic with the representational power of Transformers. Extensive experiments on multiple benchmark datasets demonstrate that FISFormer achieves superior forecasting accuracy, noise robustness, and interpretability compared to state-of-the-art Transformer variants, establishing fuzzy inference as an effective alternative to conventional attention mechanisms.
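The interaction step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a zero-order Sugeno system with product firing strengths, a shared hypothetical rule table (`rules`), and a sum over tokens after the element-wise weighting. The function names are invented for this sketch.

```python
import math

def gaussian(x, center, sigma):
    """Gaussian membership degree of x for a fuzzy set (center, sigma)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)

def sugeno_score(q, k, rules):
    """Zero-order Sugeno inference for one scalar query/key pair.

    Each rule is (q_center, k_center, sigma, consequent); this rule
    layout is an assumption made for illustration.
    """
    firing = [gaussian(q, qc, s) * gaussian(k, kc, s) for qc, kc, s, _ in rules]
    total = sum(firing) + 1e-12  # guard against all-zero firing strengths
    return sum(f * rule[3] for f, rule in zip(firing, rules)) / total

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def fis_interaction(queries, keys, values, rules):
    """Per-feature fuzzy scores, softmax over the token axis, then a
    weighted combination with the value features (summed over tokens,
    which is an assumption of this sketch)."""
    n_feat = len(queries[0])
    out = [[0.0] * n_feat for _ in queries]
    for i, q in enumerate(queries):
        for d in range(n_feat):
            scores = [sugeno_score(q[d], k[d], rules) for k in keys]
            weights = softmax(scores)
            out[i][d] = sum(w * v[d] for w, v in zip(weights, values))
    return out

# Hypothetical rule table: same-sign pairs score high, opposite-sign pairs low.
demo_rules = [(-1.0, -1.0, 1.0, 1.0), (1.0, 1.0, 1.0, 1.0),
              (-1.0, 1.0, 1.0, -1.0), (1.0, -1.0, 1.0, -1.0)]
demo_tokens = [[0.5, -0.2], [1.0, 0.3], [-0.7, 0.9]]
demo_out = fis_interaction(demo_tokens, demo_tokens, demo_tokens, demo_rules)
```

In a trained model the centers, widths, and consequents would be learnable parameters; here they are fixed constants so that the data flow is easy to follow.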
cs.CR cs.AI cs.MM Rui Yang Tan, Yujia Hu, Roy Ka-Wei Lee · Mar 23, 2026

This paper exposes a critical vulnerability in Multimodal Large Language Models (MLLMs): safety alignment fails when harmful intent is embedded in structured visual narratives. The authors introduce ComicJailbreak, a benchmark of 1,167 three-panel comics where panels 1–2 establish narrative context and panel 3 contains a blank speech bubble filled with a paraphrased harmful goal. The model is prompted to "complete the comic" by generating the fourth panel. Across 15 state-of-the-art MLLMs, comic-based attacks achieve ensemble success rates exceeding 90% on Gemini-family models and 85%+ on most open-source models—substantially outperforming plain-text and random-image baselines. The work also reveals that existing defenses (AdaShield, Attack as Defense) trigger severe over-refusal on benign prompts, and that automated safety judges are unreliable on sensitive-but-benign content.

Multimodal Large Language Models (MLLMs) extend text-only LLMs with visual reasoning, but also introduce new safety failure modes under visually grounded instructions. We study comic-template jailbreaks that embed harmful goals inside simple three-panel visual narratives and prompt the model to role-play and "complete the comic." Building on JailbreakBench and JailbreakV, we introduce ComicJailbreak, a comic-based jailbreak benchmark with 1,167 attack instances spanning 10 harm categories and 5 task setups. Across 15 state-of-the-art MLLMs (six commercial and nine open-source), comic-based attacks achieve success rates comparable to strong rule-based jailbreaks and substantially outperform plain-text and random-image baselines, with ensemble success rates exceeding 90% on several commercial models. We then show that although existing defense methods are effective against the harmful comics, they induce a high refusal rate on benign prompts. Finally, using automatic judging and targeted human evaluation, we show that current safety evaluators can be unreliable on sensitive but non-harmful content. Our findings highlight the need for safety alignment robust to narrative-driven multimodal jailbreaks.
cs.CV cs.AI Paweł Borsukiewicz, Daniele Lunghi, Melissa Tessa et al. · Mar 23, 2026

Adversarial Camouflage proposes a wearable privacy defense against facial recognition by optimizing simple face paint patterns (stripes or chevrons) to adversarially minimize embedding similarities across multiple recognition models. The core idea is to restrict the attack space to low-dimensional, user-reproducible geometric parameters (color, angle, width) that can be painted onto semantically valid facial regions, enabling protesters and privacy-conscious individuals to evade automated surveillance without specialized equipment.

While the rapid development of facial recognition algorithms has enabled numerous beneficial applications, their widespread deployment has raised significant concerns about the risks of mass surveillance and threats to individual privacy. In this paper, we introduce Adversarial Camouflage as a novel solution for protecting users' privacy. This approach is designed to be efficient and simple to reproduce for users in the physical world. The algorithm starts by defining a low-dimensional pattern space parameterized by color, shape, and angle. Optimized patterns, once found, are projected onto semantically valid facial regions for evaluation. Our method maximizes recognition error across multiple architectures, ensuring high cross-model transferability even against black-box systems. It significantly degrades the performance of all tested state-of-the-art face recognition models during simulations and demonstrates promising results in real-world human experiments, while revealing differences in model robustness and evidence of attack transferability across architectures.
stat.ML cs.LG math.DG Sing-Yuan Yeh, Yi-An Wu, Hau-Tieng Wu et al. · Mar 22, 2026

Vector Diffusion Maps (VDM) capture pairwise connection relationships in complex datasets via the Graph Connection Laplacian, but eigenvalue decomposition costs $O(n^{2.81})$, prohibiting large-scale applications. This paper proposes LA-VDM (Landmark Accelerated VDM), which constrains diffusion through landmark points and introduces a novel two-stage normalization scheme with parameters $\alpha$ and $\beta$ to handle non-uniform sampling densities in both data and landmarks. Under a manifold model with the frame bundle structure, the authors prove that LA-VDM asymptotically converges to the connection Laplacian while reducing complexity to $O(nm^2)$, enabling applications to datasets with millions of points.

We propose a landmark-constrained algorithm, LA-VDM (Landmark Accelerated Vector Diffusion Maps), to accelerate the Vector Diffusion Maps (VDM) framework built upon the Graph Connection Laplacian (GCL), which captures pairwise connection relationships within complex datasets. LA-VDM introduces a novel two-stage normalization that effectively addresses nonuniform sampling densities in both the data and the landmark sets. Under a manifold model with the frame bundle structure, we show that we can accurately recover the parallel transport with landmark-constrained diffusion from a point cloud, and hence asymptotically LA-VDM converges to the connection Laplacian. The performance and accuracy of LA-VDM are demonstrated through experiments on simulated datasets and an application to nonlocal image denoising.
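To get a feel for the complexity figures quoted in the summary of this paper, $O(n^{2.81})$ for the full eigendecomposition versus $O(nm^2)$ with $m$ landmarks, the following back-of-the-envelope sketch compares the two leading terms. The values of `n` and `m` are hypothetical, and big-O notation hides constant factors, so the ratio is only indicative.

```python
def dense_cost(n):
    """Leading-order operation count for the full decomposition, O(n^2.81)."""
    return n ** 2.81

def landmark_cost(n, m):
    """Leading-order operation count with m landmarks, O(n * m^2)."""
    return n * m ** 2

n = 10 ** 6   # hypothetical number of data points
m = 10 ** 3   # hypothetical number of landmarks

ratio = dense_cost(n) / landmark_cost(n, m)
print(f"dense / landmark leading-term ratio: {ratio:.3g}")
```

For these illustrative sizes the leading terms differ by roughly four to five orders of magnitude, which is consistent with the claim that landmark constraints make million-point datasets tractable.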
cs.RO cs.LG cs.MA Ebasa Temesgen, Nathnael Minyelshowa, Lebsework Negash · Mar 22, 2026

This paper proposes a multi-UAV architecture for autonomous precision agriculture that combines centralized mission planning with decentralized execution control. It integrates coverage path planning, battery-aware task allocation, CNN-based image processing, and battery swapping stations to enable end-to-end farm monitoring. The work targets large-scale agricultural operations with minimal human intervention, claiming advantages in fault-tolerance, scalability, and user-friendliness.

The use of unmanned aerial vehicles (UAVs) in precision agriculture has seen a huge increase recently. As such, systems that aim to apply various algorithms on the field need a structured framework of abstractions. This paper defines the various tasks of the UAVs in precision agriculture and models them in an architectural framework. The presented architecture assumes minimal physical intervention in carrying out the defined tasks with multiple coordinated and cooperative UAVs. Various tasks such as image processing, path planning, communication, data acquisition, and field mapping are employed in the architecture to provide an efficient system. In addition, various limitations of applying multi-UAV systems in precision agriculture have been considered in designing the architecture. The architecture provides an autonomous end-to-end solution, spanning mission planning, data acquisition, and image processing, that is highly efficient and can enable farmers to comprehensively deploy UAVs onto their lands. Simulation and field tests show that the architecture offers a number of advantages, including fault tolerance, robustness, and developer- and user-friendliness.
cs.CV cs.AI cs.LG Peter Fasogbon, Ugurcan Budak, Patrice Rondao Alface et al. · Mar 23, 2026

This paper tackles camera-agnostic pruning of 3D Gaussian splats for standardized interchange settings like MPEG I-3DGS, where training images, camera parameters, and gradients are unavailable. The authors propose BetaDescPrune, a one-shot post-training method that computes Hybrid Splat Feature Histogram (HSFH) descriptors to capture local geometric and appearance consistency, then models pruning decisions via Beta-distributed evidence with uncertainty-aware confidence scoring. The core insight is that reliable splat importance can be inferred from intrinsic neighborhood structure alone without rendering supervision.

The pruning of 3D Gaussian splats is essential for reducing their complexity to enable efficient storage, transmission, and downstream processing. However, most of the existing pruning strategies depend on camera parameters, rendered images, or view-dependent measures. This dependency becomes a hindrance in emerging camera-agnostic exchange settings, where splats are shared directly as point-based representations (e.g., .ply). In this paper, we propose a camera-agnostic, one-shot, post-training pruning method for 3D Gaussian splats that relies solely on attribute-derived neighbourhood descriptors. As our primary contribution, we introduce a hybrid descriptor framework that captures structural and appearance consistency directly from the splat representation. Building on these descriptors, we formulate pruning as a statistical evidence estimation problem and introduce a Beta evidence model that quantifies per-splat reliability through a probabilistic confidence score. Experiments conducted on standardized test sequences defined by the ISO/IEC MPEG Common Test Conditions (CTC) demonstrate that our approach achieves substantial pruning while preserving reconstruction quality, establishing a practical and generalizable alternative to existing camera-dependent pruning strategies.
cs.LG stat.CO stat.ME Foo Hui-Mean, Yuan-chin I Chang · Mar 22, 2026

ALMAB-DC unifies Gaussian process active learning, multi-armed bandit scheduling, and asynchronous distributed computing to tackle expensive black-box optimization in sequential experimental design. The framework targets dose-finding, spatial field estimation, and ML/engineering tasks, claiming superior sample efficiency and near-linear parallel speedups up to $K=16$ agents. While the modular architecture and ablation analyses are rigorous, all empirical results derive from calibrated surrogate emulators rather than live systems, substantially limiting external validity.

Sequential experimental design under expensive, gradient-free objectives is a central challenge in computational statistics: evaluation budgets are tightly constrained and information must be extracted efficiently from each observation. We propose ALMAB-DC, a GP-based sequential design framework combining active learning, multi-armed bandits (MAB), and distributed asynchronous computing for expensive black-box experimentation. A Gaussian process surrogate with uncertainty-aware acquisition identifies informative query points; a UCB or Thompson-sampling bandit controller allocates evaluations across parallel workers; and an asynchronous scheduler handles heterogeneous runtimes. We present cumulative regret bounds for the bandit components and characterize parallel scalability via Amdahl's Law. We validate ALMAB-DC on five benchmarks. On the two statistical experimental-design tasks, ALMAB-DC achieves lower simple regret than Equal Spacing, Random, and D-optimal designs in dose-response optimization, and in adaptive spatial field estimation matches the Greedy Max-Variance benchmark while outperforming Latin Hypercube Sampling; at $K=4$ the distributed setting reaches target performance in one-quarter of sequential wall-clock rounds. On three ML/engineering tasks (CIFAR-10 HPO, CFD drag minimization, MuJoCo RL), ALMAB-DC achieves 93.4% CIFAR-10 accuracy (outperforming BOHB by 1.7 pp and Optuna by 1.1 pp), reduces airfoil drag to $C_D = 0.059$ (36.9% below Grid Search), and improves RL return by 50% over Grid Search. All advantages over non-ALMAB baselines are statistically significant under Bonferroni-corrected Mann-Whitney $U$ tests. Distributed execution achieves $7.5\times$ speedup at $K = 16$ agents, consistent with Amdahl's Law.
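As a quick check on what the reported scaling figure implies, Amdahl's Law, $S(K) = 1 / ((1 - p) + p/K)$, can be inverted to recover the parallelizable fraction $p$ from a measured speedup. The sketch below does this for the reported $7.5\times$ at $K = 16$; the function names are illustrative and the calculation assumes the speedup follows Amdahl's Law exactly.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's Law: speedup with `workers` when a fraction p parallelizes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)

def implied_parallel_fraction(speedup, workers):
    """Invert S = 1 / ((1 - p) + p / K) to solve for p."""
    return (1.0 - 1.0 / speedup) * workers / (workers - 1)

p = implied_parallel_fraction(7.5, 16)   # about 0.924
ceiling = 1.0 / (1.0 - p)                # limit of S(K) as K grows, about 13.2

print(f"implied parallel fraction: {p:.3f}")
print(f"asymptotic speedup ceiling: {ceiling:.1f}x")
```

Under this reading, roughly 7.6% of the work is serial, which caps the achievable speedup at about 13x no matter how many agents are added.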
cs.AI Caio Azevedo, Stefano Sabatini, Sascha Hornauer et al. · Mar 23, 2026

Multi-agent trajectory prediction requires models to understand complex future interactions between agents. This paper proposes braid prediction, an auxiliary task where models classify the crossing relationships (below/over/no_crossing) between every pair of agents using shared mode embeddings from a DETR-style decoder. By training jointly to predict these topological braid labels alongside trajectories, the model gains future-interaction awareness with negligible inference overhead.

To safely operate, an autonomous vehicle must know the future behavior of a potentially high number of interacting agents around it, a task often posed as multi-agent trajectory prediction. Many previous attempts to model social interactions and solve the joint prediction task either add extensive computational requirements or rely on heuristics to label multi-agent behavior types. Braid theory, in contrast, provides a powerful exact descriptor of multi-agent behavior by projecting future trajectories into braids that express how trajectories cross with each other over time; a braid then corresponds to a specific mode of coordination between the multiple agents in the future. In past work, braids have been used lightly to reason about interacting agents and restrict the attention window of predicted agents. We show that leveraging more fully the expressivity of the braid representation and using it to condition the trajectories themselves leads to even further gains in joint prediction performance, with negligible added complexity either in training or at inference time. We do so by proposing a novel auxiliary task, braid prediction, done in parallel with the trajectory prediction task. By classifying edges between agents into their correct crossing types in the braid representation, the braid prediction task is able to imbue the model with improved social awareness, which is reflected in joint predictions that more closely adhere to the actual multi-agent behavior. This simple auxiliary task allowed us to obtain significant improvements in joint metrics on three separate datasets. We show how the braid prediction task infuses the model with future intention awareness, leading to more accurate joint predictions. Code is available at github.com/caiocj1/traj-pred-braid-theory.
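To make the pairwise crossing labels concrete, here is a simplified sketch of how ground-truth labels for such an auxiliary task might be derived from future trajectories. It is not the authors' labeling code: it assumes trajectories are projected onto the x-axis, that a crossing is a swap in x-order, and that the y-coordinate at the swap decides over versus below. The label strings follow the summary above; the function names are hypothetical.

```python
from itertools import combinations

def crossing_label(traj_a, traj_b):
    """Label how agent a crosses agent b over a shared time horizon.

    Each trajectory is a list of (x, y) positions at the same timesteps.
    """
    prev_sign = 0
    for (xa, ya), (xb, yb) in zip(traj_a, traj_b):
        diff = xa - xb
        sign = (diff > 0) - (diff < 0)
        if sign == 0:
            continue  # equal projections carry no ordering information
        if prev_sign != 0 and sign != prev_sign:
            # The x-order swapped: decide the crossing type from y.
            return "over" if ya > yb else "below"
        prev_sign = sign
    return "no_crossing"

def braid_labels(trajectories):
    """Crossing label for every unordered pair of agents (i < j)."""
    return {(i, j): crossing_label(trajectories[i], trajectories[j])
            for i, j in combinations(range(len(trajectories)), 2)}

agents = [
    [(0, 1), (1, 1), (2, 1)],   # moves right along y = 1
    [(2, 0), (1, 0), (0, 0)],   # moves left along y = 0
    [(5, 0), (5, 0), (5, 0)],   # stationary, far to the right
]
labels = braid_labels(agents)
# labels == {(0, 1): "over", (0, 2): "no_crossing", (1, 2): "no_crossing"}
```

A model trained with the auxiliary task would predict one such label per agent pair from its mode embeddings, with labels like these serving as the classification targets.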
cs.CV cs.AI Shuxian Zhao, Jie Gui, Baosheng Yu et al. · Mar 23, 2026

SteelDefectX introduces a vision-language dataset for steel defect detection that aggregates 7,778 images from four existing sources with novel coarse-to-fine textual annotations—ranging from class-level defect descriptions to sample-level attributes (shape, size, depth, position, contrast) generated via GPT-4o. The paper establishes a four-task benchmark showing that rich textual supervision improves cross-material transfer, though it reveals a tension where fine-grained annotations unexpectedly hurt few-shot performance.

Steel surface defect detection is essential for ensuring product quality and reliability in modern manufacturing. Current methods often rely on basic image classification models trained on label-only datasets, which limits their interpretability and generalization. To address these challenges, we introduce SteelDefectX, a vision-language dataset containing 7,778 images across 25 defect categories, annotated with coarse-to-fine textual descriptions. At the coarse-grained level, the dataset provides class-level information, including defect categories, representative visual attributes, and associated industrial causes. At the fine-grained level, it captures sample-specific attributes, such as shape, size, depth, position, and contrast, enabling models to learn richer and more detailed defect representations. We further establish a benchmark comprising four tasks, vision-only classification, vision-language classification, few/zero-shot recognition, and zero-shot transfer, to evaluate model performance and generalization. Experiments with several baseline models demonstrate that coarse-to-fine textual annotations significantly improve interpretability, generalization, and transferability. We hope that SteelDefectX will serve as a valuable resource for advancing research on explainable, generalizable steel surface defect detection. The data will be publicly available on https://github.com/Zhaosxian/SteelDefectX.
cs.AI Hunmin Do, Taejun Yoon, Kiyong Jung · Mar 23, 2026

This paper tackles the challenge of multi-party consensus-building in travel planning, where agents must negotiate conflicting subjective preferences rather than converge on objective truths. MIND (Multi-agent Inference for Negotiation Dialogue) introduces a Theory-of-Mind-inspired framework where agents infer hidden preference intensities (willingness scores $w \in [1,10]$) from linguistic cues and dynamically adjust their tone between warmth and toughness. The work matters because it extends Multi-Agent Debate (MAD) from factual domains to social coordination problems requiring compromise.

While Multi-Agent Debate (MAD) research has advanced, its efficacy in coordinating complex stakeholder interests such as travel planning remains largely unexplored. To bridge this gap, we propose MIND (Multi-agent Inference for Negotiation Dialogue), a framework designed to simulate realistic consensus-building among travelers with heterogeneous preferences. Grounded in the Theory of Mind (ToM), MIND introduces a Strategic Appraisal phase that infers opponent willingness (w) from linguistic nuances with 90.2% accuracy. Experimental results demonstrate that MIND outperforms traditional MAD frameworks, achieving a 20.5% improvement in High-w Hit and a 30.7% increase in Debate Hit-Rate, effectively prioritizing high-stakes constraints. Furthermore, qualitative evaluations via LLM-as-a-Judge confirm that MIND surpasses baselines in Rationality (68.8%) and Fluency (72.4%), securing an overall win rate of 68.3%. These findings validate that MIND effectively models human negotiation dynamics to derive persuasive consensus.
cs.CL cs.AI Pengfei Cao, Mingxuan Yang, Yubo Chen et al. · Mar 23, 2026

This paper introduces SemEval-2026 Task 12, Abductive Event Reasoning (AER), a shared task requiring systems to identify the most plausible direct cause of a target event from noisy multi-document evidence. The task is cast as an evidence-grounded multiple-choice benchmark with multiple correct answers allowed, capturing challenges like distributed evidence, indirect background factors, and semantically related distractors. With 122 participants and 518 submissions, it represents a significant community effort to benchmark real-world causal reasoning in long-context settings.

Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER). (The task data is available at https://github.com/sooo66/semeval2026-task12-dataset.git.) The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple-choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.
cs.CV cs.AI Sashuai Zhou, Qiang Zhou, Junpeng Ma et al. · Mar 23, 2026

SpatialReward addresses the persistent problem of spatial inconsistencies in text-to-image generation, where models produce globally plausible images with incorrect object positioning and relationships. The paper proposes a three-stage verifiable reward model that decomposes free-form prompts into structured constraints, verifies object attributes via expert detectors, and employs vision-language chain-of-thought reasoning to assess complex spatial layouts. Integrated into Flow-GRPO reinforcement learning for Stable Diffusion and FLUX, the approach significantly improves spatial consistency while maintaining overall image quality.

Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
stat.ML cs.LG math.ST Yingzhen Yang, Ping Li · Mar 22, 2026

This paper studies nonparametric regression for learning degree-$k_0$ spherical polynomials on the unit sphere $\mathbb{S}^{d-1}$ using over-parameterized two-layer neural networks. The authors propose a novel Gradient Descent with Projection (GDP) algorithm that constrains learning to the top $r_0 = \Theta(d^{k_0})$ eigenspaces of the Neural Tangent Kernel (NTK). The main result establishes a nearly minimax optimal risk bound of order $\log(4/\delta) \cdot \Theta(d^{k_0}/n)$, improving the sample complexity from previous polynomial-in-$1/\varepsilon$ rates to linear $1/\varepsilon$ scaling.

In this paper, we study the problem of learning a low-degree spherical polynomial of degree $k_0 = \Theta(1) \ge 1$ defined on the unit sphere in $\mathbb{R}^d$ by training an over-parameterized two-layer neural network with augmented feature. Our main result is the significantly improved sample complexity for learning such low-degree polynomials. We show that, for any regression risk $\varepsilon \in (0, \Theta(d^{-k_0})]$, an over-parameterized two-layer neural network trained by a novel Gradient Descent with Projection (GDP) requires a sample complexity of $n \asymp \Theta(\log(4/\delta) \cdot d^{k_0}/\varepsilon)$ with probability $1-\delta$ for $\delta \in (0,1)$, in contrast with the representative sample complexity $\Theta(d^{k_0} \max\{\varepsilon^{-2}, \log d\})$. Moreover, such sample complexity is nearly unimprovable since the trained network renders a nearly optimal rate of the nonparametric regression risk of the order $\log(4/\delta) \cdot \Theta(d^{k_0}/n)$ with probability at least $1-\delta$. On the other hand, the minimax optimal rate for the regression risk with a kernel of rank $\Theta(d^{k_0})$ is $\Theta(d^{k_0}/n)$, so that the rate of the nonparametric regression risk of the network trained by GDP is nearly minimax optimal. In the case that the ground truth degree $k_0$ is unknown, we present a novel and provable adaptive degree selection algorithm which identifies the true degree and achieves the same nearly optimal regression rate. To the best of our knowledge, this is the first time that a nearly optimal risk bound is obtained by training an over-parameterized neural network with a popular activation function (ReLU) and algorithmic guarantee for learning low-degree spherical polynomials. Due to the feature learning capability of GDP, our results are beyond the regular Neural Tangent Kernel (NTK) limit.
cs.AI Naoshi Uchihira · Mar 23, 2026

This paper proposes the GenAI SECI model, an update to Nonaka & Takeuchi's classic SECI framework for knowledge creation, designed to leverage generative AI for managing workplace ("Gen-Ba") tacit knowledge. The central innovation is "Digital Fragmented Knowledge"—partial, fragmentary knowledge stored in cyberspace that generative AI aggregates, structures, and recommends to amplify human understanding without requiring full externalization into explicit knowledge. This addresses the urgent problem of transferring expert tacit knowledge as Japan's workforce ages, going beyond conventional KM systems that struggle with the effort required to formalize knowledge.

The emergence of generative AI is bringing about a significant transformation in knowledge management. Generative AI has the potential to address the limitations of conventional knowledge management systems, and it is increasingly being deployed in real-world settings with promising results. Related research is also expanding rapidly. However, much of this work focuses on research and practice related to the management of explicit knowledge. While fragmentary efforts have been made regarding the management of tacit knowledge using generative AI, the modeling and systematization that handle both tacit and explicit knowledge in an integrated manner remain insufficient. In this paper, we propose the "GenAI SECI" model as an updated version of the knowledge creation process (SECI) model, redesigned to leverage the capabilities of generative AI. A defining feature of the "GenAI SECI" model is the introduction of "Digital Fragmented Knowledge", a new concept that integrates explicit and tacit knowledge within cyberspace. Furthermore, a concrete system architecture for the proposed model is presented, along with a comparison with prior research models that share a similar problem awareness and objectives.
cs.CV cs.AI cs.LG Jesper B. Christensen, Ciaran Bench, Spencer A. Thomas et al. · Mar 23, 2026

Ctrl-A addresses automated data augmentation by framing it as a control problem, dynamically adjusting per-operation augmentation strengths via a feedback loop that balances training and validation loss ratios. The method introduces Relative Operation Response (ROR) curves to individually tune transformation distributions without manual initialization or expensive search phases. While it achieves competitive results on CIFAR and SVHN benchmarks with minimal computational overhead (~10% vs. TrivialAugment), the evaluation relies on a modified training setup with extended epochs, raising questions about separability of algorithmic gains from training protocol changes.

We introduce ControlAugment (Ctrl-A), an automated data augmentation algorithm for image-vision tasks, which incorporates principles from control theory for online adjustment of augmentation strength distributions during model training. Ctrl-A eliminates the need for initialization of individual augmentation strengths. Instead, augmentation strength distributions are dynamically, and individually, adapted during training based on a control-loop architecture and what we define as relative operation response curves. Using an operation-dependent update procedure provides Ctrl-A with the potential to suppress augmentation styles that negatively impact model performance, alleviating the need for manually engineering augmentation policies for new image-vision tasks. Experiments on the CIFAR-10, CIFAR-100, and SVHN-core benchmark datasets using the common WideResNet-28-10 architecture demonstrate that Ctrl-A is highly competitive with existing state-of-the-art data augmentation strategies.
cs.LG eess.SP M. Cherifi, Aude Sportisse, Xujia Zhu et al. · Mar 22, 2026

The paper proposes AV-LR, a lightweight amortized variational inference framework for logistic regression with missing covariates that eliminates latent variables entirely. Unlike VAE-based competitors, it directly models the posterior over missing values using a single neural network coupled with a linear classification layer, enabling joint optimization of imputation and prediction. The approach extends naturally to MNAR settings and claims substantial computational speedups over EM-based methods while maintaining comparable statistical accuracy.

Missing covariate data pose a significant challenge to statistical inference and machine learning, particularly for classification tasks like logistic regression. Classical iterative approaches (EM, multiple imputation) are often computationally intensive, sensitive to high missingness rates, and limited in uncertainty propagation. Recent deep generative models based on VAEs show promise but rely on complex latent representations. We propose Amortized Variational Inference for Logistic Regression (AV-LR), a unified end-to-end framework for binary logistic regression with missing covariates. AV-LR integrates a probabilistic generative model with a simple amortized inference network, trained jointly by maximizing the evidence lower bound. Unlike competing methods, AV-LR performs inference directly in the space of missing data without additional latent variables, using a single inference network and a linear layer that jointly estimate regression parameters and the missingness mechanism. AV-LR achieves estimation accuracy comparable to or better than state-of-the-art EM-like algorithms, with significantly lower computational cost. It naturally extends to missing-not-at-random settings by explicitly modeling the missingness mechanism. Empirical results on synthetic and real-world datasets confirm its effectiveness and efficiency across various missing-data scenarios.
0
cs.LG Changchun Li, Ximing Li, Bingjie Zhang et al. · Mar 22, 2026

S2TC-BDD addresses Semi-Supervised Text Classification (SSTC) where pseudo-label accuracy suffers from "margin bias" caused by imbalanced label angle variances between classes. The core idea is to balance deep representation distributions by applying Gaussian linear transformations to Angular Margin (AM) loss, thereby eliminating decision boundary bias during self-training. This matters because it targets a fundamental distribution mismatch in SSL that particularly degrades performance when labeled data is scarce.

Semi-Supervised Text Classification (SSTC) methods mainly work in the spirit of self-training. They initialize the deep classifier by training on labeled texts, and then alternately predict pseudo-labels for unlabeled texts and train the deep classifier on the mixture of labeled and pseudo-labeled texts. Naturally, their performance is largely affected by the accuracy of the pseudo-labels for unlabeled texts. Unfortunately, they often suffer from low accuracy because of the margin bias problem caused by the large difference between the representation distributions of labels in SSTC. To alleviate this problem, we apply the angular margin loss and perform several Gaussian linear transformations to achieve balanced label angle variances, where a label angle variance is the variance of the label angles of texts within the same label. Pseudo-labels become more accurate when all label angle variances are constrained to be balanced; the variances are estimated over both labeled and pseudo-labeled texts during the self-training loops. With this insight, we propose a novel SSTC method, namely Semi-Supervised Text Classification with Balanced Deep representation Distributions (S2TC-BDD). We implement both multi-class classification and multi-label classification versions of S2TC-BDD by introducing some pseudo-labeling tricks and regularization terms. To evaluate S2TC-BDD, we compare it against the state-of-the-art SSTC methods. Empirical results demonstrate the effectiveness of S2TC-BDD, especially when the labeled texts are scarce.
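To make "label angle variance" concrete, here is a minimal sketch. It assumes the label angle of a text is the angle between its representation and the weight vector of its label, as is usual for angular margin losses. The function names and the dictionary layout are hypothetical and not taken from the paper, and the Gaussian linear transformations themselves are not shown.

```python
import math
from collections import defaultdict


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


def label_angle_variances(features, labels, class_weights):
    """Per-label (population) variance of the angles between each text
    representation and the weight vector of its assigned label."""
    angles = defaultdict(list)
    for x, y in zip(features, labels):
        angles[y].append(angle(x, class_weights[y]))
    variances = {}
    for y, vals in angles.items():
        mean = sum(vals) / len(vals)
        variances[y] = sum((a - mean) ** 2 for a in vals) / len(vals)
    return variances
```

A large gap between the per-label variances returned here is the kind of imbalance the abstract associates with margin bias; in a self-training loop, `labels` would mix gold labels and pseudo-labels.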
0
cs.CV cs.AI Yuyang You, Yongzhi Li, Jiahui Li et al. · Mar 23, 2026

Video diffusion models suffer from prohibitive inference costs, but standard image distillation techniques like DMD cause severe oversaturation and temporal collapse when naively extended to video. This work introduces a video-specific distillation framework featuring an adaptive regression loss that dynamically reweights real-data supervision to prevent color artifacts, a temporal variance regularizer to combat static output, and an inference-time frame interpolation module that halves sequence length during high-noise steps to accelerate generation. Applied to Wan2.1, the method enables stable 4-step synthesis with state-of-the-art VBench scores.

Video generation has recently emerged as a central task in the field of generative AI. However, the substantial computational cost inherent in video synthesis makes model distillation a critical technique for efficient deployment. Despite its significance, there is a scarcity of methods specifically designed for video diffusion models. Prevailing approaches often directly adapt image distillation techniques, which frequently lead to artifacts such as oversaturation, temporal inconsistency, and mode collapse. To address these challenges, we propose a novel distillation framework tailored specifically for video diffusion models. Its core innovations include: (1) an adaptive regression loss that dynamically adjusts spatial supervision weights to prevent artifacts arising from excessive distribution shifts; (2) a temporal regularization loss to counteract temporal collapse, promoting smooth and physically plausible sampling trajectories; and (3) an inference-time frame interpolation strategy that reduces sampling overhead while preserving perceptual quality. Extensive experiments and ablation studies on the VBench and VBench2 benchmarks demonstrate that our method achieves stable few-step video synthesis, significantly enhancing perceptual fidelity and motion realism. It consistently outperforms existing distillation baselines across multiple metrics.
0
cs.LG cs.AI cs.CL Xinyan Wang, Xiaogeng Liu, Chaowei Xiao · Mar 23, 2026

ROM tackles overthinking in Large Reasoning Models, where models generate redundant reasoning after reaching correct answers. The core idea is a lightweight streaming detector (an 8.13M-parameter head attached to late-layer hidden states of a frozen LLM) that predicts overthinking probability token-by-token and triggers early stopping. It matters because it promises 47% token reduction without full model retraining. We find the method empirically effective but note concerns regarding data scaling limits and labeling costs.

Large Reasoning Models (LRMs) achieve strong accuracy on challenging tasks by generating long Chain-of-Thought traces, but suffer from overthinking. Even after reaching the correct answer, they continue generating redundant reasoning steps. This behavior increases latency and compute cost and can also lead to answer drift. Existing mitigation methods either require training-heavy backbone modification or rely on hand-crafted heuristics that do not truly capture overthinking patterns. We propose ROM, the first method that formulates overthinking mitigation as a streaming prediction-and-control problem. ROM attaches a lightweight detection head to the late-layer hidden states of a frozen large language model backbone. It monitors tokens in real time and triggers an early transition to the final answer once overthinking is detected. We also introduce token-level supervision based on solution correctness boundaries and a data augmentation strategy that reduces distilled-data bias. Across seven benchmarks, ROM achieves the highest accuracy (93.51%), the shortest responses (1,159 tokens), and the best response efficiency. Compared with the vanilla baseline, it reduces response length by 47.2% and improves efficiency by 121%. These results show that streaming detection is a promising approach to real-time overthinking mitigation.
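The streaming prediction-and-control loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the detector is a stand-in callable that maps the tokens generated so far to an overthinking probability, and the `threshold` and `patience` parameters are hypothetical knobs that the abstract does not mention.

```python
from typing import Callable, Iterable, List, Tuple


def stream_with_early_stop(
    tokens: Iterable[str],
    overthinking_prob: Callable[[List[str]], float],
    threshold: float = 0.5,
    patience: int = 1,
) -> Tuple[List[str], bool]:
    """Consume tokens one at a time and stop once the detector's probability
    stays at or above `threshold` for `patience` consecutive tokens.

    Returns the tokens kept (including the token that triggered the stop)
    and a flag saying whether an early stop happened.
    """
    kept: List[str] = []
    streak = 0
    for tok in tokens:
        kept.append(tok)
        if overthinking_prob(kept) >= threshold:
            streak += 1
        else:
            streak = 0
        if streak >= patience:
            return kept, True  # caller would now switch to the final answer
    return kept, False
```

In a real system the callable would wrap a small head reading hidden states of the frozen backbone; here any function of the token prefix works, which keeps the control logic easy to test in isolation.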
0
cs.AI Shuying Chen, Sen Cui, Zhong Cao · Mar 23, 2026

This paper presents Oph-Guid-RAG, a multimodal retrieval-augmented generation system for ophthalmic clinical decision support. Unlike conventional text-based RAG systems, it retrieves full guideline page images using ColQwen2.5, preserving tables, flowcharts, and layout information without OCR errors. The system introduces a controllable retrieval framework with routing and filtering to selectively introduce external evidence, evaluated on a specialized ophthalmology subset extracted from HealthBench.

In this work, we propose Oph-Guid-RAG, a multimodal visual RAG system for ophthalmology clinical question answering and decision support. We treat each guideline page as an independent evidence unit and directly retrieve page images, preserving tables, flowcharts, and layout information. We further design a controllable retrieval framework with routing and filtering, which selectively introduces external evidence and reduces noise. The system integrates query decomposition, query rewriting, retrieval, reranking, and multimodal reasoning, and provides traceable outputs with guideline page references. We evaluate our method on HealthBench using a doctor-based scoring protocol. On the hard subset, our approach improves the overall score from 0.2969 to 0.3861 (+0.0892, +30.0%) compared to GPT-5.2, and achieves higher accuracy, improving from 0.5956 to 0.6576 (+0.0620, +10.4%). Compared to GPT-5.4, our method achieves a larger accuracy gain of +0.1289 (+24.4%). These results show that our method is more effective on challenging cases that require precise, evidence-based reasoning. Ablation studies further show that reranking, routing, and retrieval design are critical for stable performance, especially under difficult settings. Overall, we show how combining vision-based retrieval with controllable reasoning can improve evidence grounding and robustness in clinical AI applications, while acknowledging that further work is needed for a more complete system.
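The relative gains quoted in the abstract can be reproduced from the absolute scores with a few lines of arithmetic. The helper name `relative_gain` is hypothetical. The GPT-5.4 baseline accuracy is not stated in the abstract, so the last line infers it from the reported gain, assuming the same final accuracy of 0.6576.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Scores on the hard subset, as reported in the abstract (vs. GPT-5.2).
overall_gain = relative_gain(0.2969, 0.3861)
accuracy_gain = relative_gain(0.5956, 0.6576)

# Assumption: the GPT-5.4 baseline is the final accuracy minus the reported gain.
inferred_gpt54_gain = relative_gain(0.6576 - 0.1289, 0.6576)

print(round(overall_gain, 1), round(accuracy_gain, 1), round(inferred_gpt54_gain, 1))
```

Rounded to one decimal place, the three values agree with the +30.0%, +10.4%, and +24.4% figures in the abstract.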