Feed - arxlens

Frequency Switching Mechanism for Parameter-Efficient Multi-Task Learning

cs.CV cs.LG Shih-Wen Liu, Yen-Chang Chen, Wei-Ta Chu et al. · Mar 22, 2026

This paper tackles parameter-efficient multi-task learning (PEFT-MTL), where the challenge is to share parameters across tasks without interference while maintaining the efficiency of methods like LoRA. The core idea is Free Sinewich: it modulates a shared low-rank convolutional adapter (Sine-AWB) using task-specific sinusoidal frequencies generated by a lightweight Clock Net, achieving task specialization without duplicating parameters. This frequency-switching mechanism is inspired by biological oscillatory multiplexing and aims to decorrelate task weights while boosting effective rank.

Multi-task learning (MTL) aims to enable a single model to solve multiple tasks efficiently; however, current parameter-efficient fine-tuning (PEFT) methods remain largely limited to single-task adaptation. We introduce Free Sinewich, a parameter-efficient multi-task learning framework that enables near-zero-cost weight modulation via frequency switching (Free). Specifically, a Sine-AWB (Sinewich) layer combines low-rank factors and convolutional priors into a single kernel, which is then modulated elementwise by a sinusoidal transformation to produce task-specialized weights. A lightweight Clock Net is introduced to produce bounded frequencies that stabilize this modulation during training. Theoretically, sine modulation enhances the rank of low-rank adapters, while frequency separation decorrelates the weights of different tasks. On dense prediction benchmarks, Free Sinewich achieves state-of-the-art performance-efficiency trade-offs (e.g., up to +5.39% improvement over single-task fine-tuning with only 6.53M trainable parameters), offering a compact and scalable paradigm based on frequency-based parameter sharing. Project page: https://casperliuliuliu.github.io/projects/Free-Sinewich/
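
The rank and decorrelation claims can be illustrated with a toy NumPy sketch. It is not the authors' Sine-AWB layer: it uses a plain low-rank matrix with no convolutional prior and no Clock Net, and the sizes and the two frequencies are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared low-rank kernel W = A @ B with rank at most r (hypothetical sizes).
d, r = 32, 2
A = rng.standard_normal((d, r))
B = rng.standard_normal((r, d))
W = A @ B

def modulate(W, omega):
    """Elementwise sinusoidal modulation with a task-specific frequency omega."""
    return np.sin(omega * W)

# Two tasks share W but use different (hypothetical) frequencies.
W_task1 = modulate(W, 1.0)
W_task2 = modulate(W, 3.0)

rank_base = np.linalg.matrix_rank(W)
rank_task1 = np.linalg.matrix_rank(W_task1)

# Cosine similarity between the flattened task-specific weights.
cos = float(W_task1.ravel() @ W_task2.ravel()
            / (np.linalg.norm(W_task1) * np.linalg.norm(W_task2)))
print(rank_base, rank_task1, round(cos, 3))
```

In this toy setting the modulated matrix has a higher numerical rank than the shared factorization, and the two task matrices are not collinear even though no parameters are duplicated.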

Pretrained Video Models as Differentiable Physics Simulators for Urban Wind Flows

cs.LG cs.CE Janne Perini, Rafael Bischof, Moab Arar et al. · Mar 22, 2026

WinDiNet repurposes the LTX-Video latent diffusion transformer as a fast, differentiable surrogate for urban wind flow simulation, addressing the prohibitive cost of time-resolved CFD in design exploration. By fine-tuning the 2B-parameter video model on 10,000 2D incompressible CFD simulations over procedurally generated building layouts, the authors achieve sub-second generation of 112-frame rollouts while enabling end-to-end gradient-based optimization of building positions for pedestrian wind comfort.

Designing urban spaces that provide pedestrian wind comfort and safety requires time-resolved Computational Fluid Dynamics (CFD) simulations, but their current computational cost makes extensive design exploration impractical. We introduce WinDiNet (Wind Diffusion Network), a pretrained video diffusion model that is repurposed as a fast, differentiable surrogate for this task. Starting from LTX-Video, a 2B-parameter latent video transformer, we fine-tune on 10,000 2D incompressible CFD simulations over procedurally generated building layouts. A systematic study of training regimes, conditioning mechanisms, and VAE adaptation strategies, including a physics-informed decoder loss, identifies a configuration that outperforms purpose-built neural PDE solvers. The resulting model generates full 112-frame rollouts in under a second. As the surrogate is end-to-end differentiable, it doubles as a physics simulator for gradient-based inverse optimization: given an urban footprint layout, we optimize building positions directly through backpropagation to improve wind safety as well as pedestrian wind comfort. Experiments on single- and multi-inlet layouts show that the optimizer discovers effective layouts even under challenging multi-objective configurations, with all improvements confirmed by ground-truth CFD simulations.

The Average Relative Entropy and Transpilation Depth determines the noise robustness in Variational Quantum Classifiers

quant-ph cs.LG Aakash Ravindra Shinde, Arianne Meijer - van de Griend, Jukka K. Nurminen · Mar 22, 2026

Variational Quantum Classifiers (VQCs) are typically trained in ideal classical simulations, raising concerns about reproducibility on noisy quantum hardware. This paper proposes that the average relative entropy between class distributions, combined with transpilation depth, predicts noise robustness, and introduces the log-DTSAE metric to forecast accuracy degradation without requiring noisy hardware execution. The authors validate this across thousands of models spanning diverse ansatzes, encodings, and simulated backends from IBM, IQM, and IonQ.

Variational Quantum Algorithms (VQAs) have been extensively researched for applications in Quantum Machine Learning (QML), Optimization, and Molecular simulations. Although designed for Noisy Intermediate-Scale Quantum (NISQ) devices, VQAs are predominantly evaluated classically due to uncertain results on noisy devices and limited resource availability, raising concern over the reproducibility of simulated VQAs on noisy hardware. While prior studies indicate that VQAs may exhibit noise resilience in specific parameterized shallow quantum circuits, there are no definitive measures to establish what defines a shallow circuit or the optimal circuit depth for VQAs on a noisy platform. These challenges extend naturally to Variational Quantum Classification (VQC) algorithms, a subclass of VQAs for supervised learning. In this article, we propose a relative entropy-based metric to verify whether a VQC model would perform similarly on a noisy device as it does on simulations. We establish a strong correlation between the average relative entropy difference in classes, transpilation circuit depth, and their performance difference on a noisy quantum device. Our results further indicate that circuit depth alone is insufficient to characterize shallow circuits. We present empirical evidence to support these assertions across a diverse array of techniques for implementing VQC, datasets, and multiple noisy quantum devices.
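
The relative-entropy ingredient of the metric can be sketched in NumPy. This is only an illustration of averaging the Kullback-Leibler divergence over class pairs: how the paper combines it with transpilation depth is not described here, and the function names and example histograms below are hypothetical.

```python
import numpy as np

def relative_entropy(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def average_relative_entropy(class_dists):
    """Mean KL divergence over all ordered pairs of distinct classes."""
    n = len(class_dists)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    total = sum(relative_entropy(class_dists[i], class_dists[j]) for i, j in pairs)
    return total / len(pairs)

# Hypothetical measurement-outcome histograms for two classes (2-qubit readout).
well_separated = [[0.90, 0.05, 0.03, 0.02], [0.02, 0.03, 0.05, 0.90]]
overlapping = [[0.30, 0.25, 0.25, 0.20], [0.20, 0.25, 0.25, 0.30]]

print(average_relative_entropy(well_separated))
print(average_relative_entropy(overlapping))
```

Classes whose output distributions are far apart give a larger average relative entropy than classes whose distributions overlap.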

Deriving Health Metrics from the Photoplethysmogram: Benchmarks and Insights from MIMIC-III-Ext-PPG

cs.LG eess.SP Mohammad Moulaeifard, Philip J. Aston, Peter H. Charlton et al. · Mar 23, 2026

This paper establishes a comprehensive benchmark for photoplethysmography (PPG)-based clinical prediction using the large-scale MIMIC-III-Ext-PPG dataset, evaluating multi-task learning across arrhythmia classification (13 classes) and physiological regression (blood pressure, heart rate, respiratory rate). The core contribution is demonstrating robust atrial fibrillation detection (AUROC 0.96) with strong cross-dataset generalizability, alongside the first systematic assessment of fine-grained arrhythmia classification from PPG alone. It matters because PPG sensors are ubiquitous in wearables and ICUs, yet standardized, large-scale, multi-task benchmarks have been lacking, hindering meaningful algorithm comparison and clinical deployment.

Photoplethysmography (PPG) is one of the most widely captured biosignals for clinical prediction tasks, yet PPG-based algorithms are typically trained on small-scale datasets of uncertain quality, which hinders meaningful algorithm comparisons. We present a comprehensive benchmark for PPG-based clinical prediction using the MIMIC-III-Ext-PPG dataset, establishing baselines across the full spectrum of clinically relevant applications: multi-class heart rhythm classification, and regression of physiological parameters including respiratory rate (RR), heart rate (HR), and blood pressure (BP). Most notably, we provide the first comprehensive assessment of PPG for general arrhythmia detection beyond atrial fibrillation (AF) and atrial flutter (AFLT), with performance stratified by BP, HR, and demographic subgroups. Using established deep learning architectures, we achieved strong performance for AF detection (AUROC = 0.96) and accurate physiological parameter estimation (RR MAE: 2.97 bpm; HR MAE: 1.13 bpm; SBP/DBP MAE: 16.13/8.70 mmHg). Cross-dataset validation demonstrates excellent generalizability for AF detection (AUROC = 0.97), while clinical subgroup analysis reveals marked performance differences across subgroups by BP, HR, and demographic strata. These variations appear to reflect population-specific waveform differences rather than systematic bias in model behavior. This framework establishes the first integrated benchmark for multi-task PPG-based clinical prediction, demonstrating that PPG signals can effectively support multiple simultaneous monitoring tasks and providing essential baselines for future algorithm development.

Feature Incremental Clustering with Generalization Bounds

math.ST cs.LG stat.TH Jing Zhang, Chenping Hou · Mar 23, 2026

Feature incremental clustering addresses dynamic scenarios where data arrives in expanding feature spaces—such as activity recognition systems that acquire new sensors over time. This paper proposes four k-means-based algorithms (FIC-FT, FIC-DR, FIC-DA, FIC-MR) tailored to different data-access constraints, from full historical access to model-only reuse. The core theoretical contribution establishes generalization error bounds for all four settings, revealing that model reuse (FIC-MR) can achieve a fast $\tilde{\mathcal{O}}(1/n_2)$ convergence rate when the pre-trained model aligns well with the current distribution.

In many learning systems, such as activity recognition systems, as new data collection methods continue to emerge in various dynamic environmental applications, the attributes of instances accumulate incrementally, with data being stored in gradually expanding feature spaces. How to design theoretically guaranteed algorithms to effectively cluster this special type of data stream, commonly referred to as activity recognition, remains unexplored. Compared to traditional scenarios, we will face at least two fundamental questions in this feature incremental scenario. (i) How to design preliminary and effective algorithms to address the feature incremental clustering problem? (ii) How to analyze the generalization bounds for the proposed algorithms and under what conditions do these algorithms provide a strong generalization guarantee? To address these problems, by tailoring the most common clustering algorithm, i.e., $k$-means, as an example, we propose four types of Feature Incremental Clustering (FIC) algorithms corresponding to different situations of data access: Feature Tailoring (FT), Data Reconstruction (DR), Data Adaptation (DA), and Model Reuse (MR), abbreviated as FIC-FT, FIC-DR, FIC-DA, and FIC-MR. Subsequently, we offer a detailed analysis of the generalization error bounds for these four algorithms and highlight the critical factors influencing these bounds, such as the amounts of training data, the complexity of the hypothesis space, the quality of pre-trained models, and the discrepancy of the reconstruction feature distribution. The numerical experiments show the effectiveness of the proposed algorithms, particularly in their application to activity recognition clustering tasks.

Show Me What You Don't Know: Efficient Sampling from Invariant Sets for Model Validation

cs.LG Armand Rousselot, Joran Wendebourg, Ullrich Köthe · Mar 23, 2026

Understanding what representations neural networks discard is crucial for trustworthy ML. This paper proposes methods to sample from invariant sets (fibers) of feature extractors: either by regularizing conditional generative models with a fiber loss, or by guiding pretrained diffusion models via non-linear diffusion trajectory matching (NDTM). The training-free NDTM approach reduces setup time from days to minutes, enabling rapid analysis of model blind spots including medical safety concerns.

The performance of machine learning models is determined by the quality of their learned features. They should be invariant under irrelevant data variation but sensitive to task-relevant details. To visualize whether this is the case, we propose a method to analyze feature extractors by sampling from their fibers -- equivalence classes defined by their invariances -- given an arbitrary representative. Unlike existing work where a dedicated generative model is trained for each feature detector, our algorithm is training-free and exploits a pretrained diffusion or flow-matching model as a prior. The fiber loss -- which penalizes mismatch in features -- guides the denoising process toward the desired equivalence class, via non-linear diffusion trajectory matching. This replaces days of training for invariance learning with a single guided generation procedure at comparable fidelity. Experiments on popular datasets (ImageNet, CheXpert) and model types (ResNet, DINO, BiomedClip) demonstrate that our framework can reveal invariances ranging from very desirable to concerning behaviour. For instance, we show how Qwen-2B places patients with situs inversus (heart on the right side) in the same fiber as typical anatomy.

TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference

cs.LG cs.CL Jaber Jaber, Osama Jaber · Mar 22, 2026

TIDE is a post-training early exit system for autoregressive LLMs that trains lightweight router MLPs to predict which tokens can safely exit at intermediate layers. The key idea is using cosine similarity between checkpoint hidden states and final layer outputs as a convergence signal, eliminating the need for costly model retraining. Unlike prior early-exit methods that require training from scratch or use unreliable confidence heuristics, TIDE claims to work with any HuggingFace causal LM while preserving KV cache integrity and achieving up to 8.1% throughput improvement.

Large language models run every token through every layer, regardless of difficulty. We present TIDE, a post-training system that attaches tiny learned routers at periodic checkpoint layers and, at inference time, selects the earliest layer whose hidden state has converged for each token. TIDE requires no model retraining, works with any HuggingFace causal LM, auto-detects GPU architecture, and supports float32, float16, and bfloat16 through fused CUDA kernels. On an NVIDIA A100 with DeepSeek R1 Distill 8B, TIDE achieves 100% prefill exit rate (5% of tokens exit at layer 11, the remaining at layer 31), reduces prefill latency by 7.2%, and increases single-batch throughput by 6.6%. During autoregressive decoding, 98-99% of tokens exit early while the model correctly solves a multi-step math problem with 95 unique output tokens. On Qwen3 8B (36 layers), throughput improves by 8.1% at batch size 8. Calibration on 2,000 WikiText samples takes under 3 minutes and produces a ~4 MB router checkpoint. The system comprises 1,308 lines of Python and 1,081 lines of CUDA/C++ with 74 passing tests. Code: https://github.com/RightNow-AI/TIDE
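
The convergence signal described above can be sketched as follows. This is not the TIDE implementation: it computes, for a single token, the earliest checkpoint layer whose hidden state is close in cosine similarity to the final layer, which is the kind of target a router could be calibrated against. The function names, the threshold, and the toy hidden states are assumptions for illustration.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def earliest_converged_layer(hidden, checkpoints, threshold=0.9):
    """Return the first checkpoint layer whose hidden state matches the final
    layer (cosine similarity >= threshold), else the final layer index.

    hidden: array of shape [num_layers, d] holding one token's hidden states.
    """
    final = hidden[-1]
    for layer in checkpoints:
        if cosine(hidden[layer], final) >= threshold:
            return layer
    return hidden.shape[0] - 1

# Toy token: hidden states drift linearly from a start vector to the final one.
num_layers = 9
start = np.array([0.0, 1.0, 0.0, 0.0])
final = np.array([1.0, 0.0, 0.0, 0.0])
ts = np.linspace(0.0, 1.0, num_layers)
hidden = np.stack([(1 - t) * start + t * final for t in ts])

print(earliest_converged_layer(hidden, checkpoints=[2, 4, 6], threshold=0.9))
```

A stricter threshold pushes more tokens to the final layer, which trades throughput for fidelity to the full-depth output.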

DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

cs.CL cs.LG Siqi Guo, Ming Lin, Tianbao Yang · Mar 23, 2026

Developing optimized CUDA kernels is critical for generative AI but remains challenging even for human experts. This paper introduces DRTriton, a framework that trains a 7B-parameter LLM to convert PyTorch code into efficient Triton kernels using exclusively synthetic data. The approach combines a constraint satisfaction algorithm for program generation (CSP-DAG), curriculum reinforcement learning with decoupled rewards (DRPO), and test-time search, achieving a speedup on 92% of KernelBench Level 2 tasks, compared to 23% for GPT-5.2.

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent research leverages Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing the engineering effort. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle in this specific task. To address this challenge, we propose DRTriton, a scalable learning framework for training LLMs to convert PyTorch code into highly optimized Triton kernels, which are then compiled to CUDA kernels at runtime. DRTriton consists of three key components: (i) a data synthesis algorithm, CSP-DAG, that guarantees full coverage and unbiased uniform sampling over the operator space with controlled difficulty; (ii) curriculum reinforcement learning with a decoupled reward that efficiently optimizes conversion success rate and inference speed simultaneously; and (iii) a test-time search algorithm that further improves the inference speed of the generated Triton kernels. Notably, despite being trained exclusively on synthetic data, DRTriton generalizes effectively to real-world CUDA kernels that are challenging even for human experts. Experimental results show that DRTriton-7B achieves a speedup on 92% of KernelBench Level 2 tasks, compared to 23% for GPT-5.2 and 19% for Claude-Sonnet-4.5.

ResPrune: Text-Conditioned Subspace Reconstruction for Visual Token Pruning in Large Vision-Language Models

cs.LG Xu Li, Yi Zheng, Yuxuan Liang et al. · Mar 22, 2026

Large Vision-Language Models (LVLMs) suffer from quadratic self-attention costs when processing high-resolution images that generate thousands of visual tokens. ResPrune addresses this by formulating token pruning as a subspace reconstruction problem: it greedily selects tokens that maximize residual energy (the orthogonal component unexplained by the current subset) in the LLM input embedding space. To align selection with user queries, it modulates these residuals by a text relevance score computed via cosine similarity with embedded nouns from the prompt. This yields a training-free, plug-in method that preserves semantic coverage while reducing compute.

Large Vision-Language Models (LVLMs) rely on dense visual tokens to capture fine-grained visual information, but processing all these tokens incurs substantial computational and memory overhead during inference. To address this issue, we propose ResPrune, a training-free visual token pruning framework that enables efficient LVLM inference by selecting a compact yet informative subset of visual tokens. ResPrune formulates visual token pruning as a subspace reconstruction problem and employs a greedy subspace expansion strategy guided by residual energy, allowing it to preserve the geometric structure of the original visual token space. To further incorporate cross modal alignment, the selection process is conditioned on textual relevance, encouraging the retention of tokens that are both informative and instruction-relevant. The proposed method is lightweight and model-agnostic, and can be seamlessly integrated into existing LVLM pipelines without retraining or architectural modifications. Extensive experiments on multiple LVLM backbones, including LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL, demonstrate that ResPrune consistently outperforms existing pruning approaches across a wide range of benchmarks, while achieving effective reductions in computation, memory consumption, and inference latency.
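
The greedy, residual-energy-guided selection can be sketched in NumPy. This is a minimal illustration rather than the ResPrune implementation: the multiplicative way relevance weights the residual energy, the function name, and the toy inputs are all assumptions.

```python
import numpy as np

def greedy_residual_select(tokens, relevance, k):
    """Greedily pick up to k tokens maximizing relevance-weighted residual energy.

    tokens:    [n, d] visual token embeddings.
    relevance: [n] non-negative text-relevance weights (assumed precomputed).
    """
    residual = np.array(tokens, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    chosen = np.zeros(len(residual), dtype=bool)
    selected = []
    for _ in range(k):
        # Residual energy: squared norm of the part not yet spanned by the subset.
        score = relevance * np.sum(residual ** 2, axis=1)
        score[chosen] = -np.inf
        i = int(np.argmax(score))
        norm = np.linalg.norm(residual[i])
        if norm < 1e-12:  # remaining tokens are already reconstructed
            break
        selected.append(i)
        chosen[i] = True
        # Remove the newly spanned direction from every residual (Gram-Schmidt step).
        q = residual[i] / norm
        residual -= np.outer(residual @ q, q)
    return selected

tokens = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.5]]
print(greedy_residual_select(tokens, relevance=[1, 1, 1, 1], k=2))
print(greedy_residual_select(tokens, relevance=[1, 1, 0.1, 1], k=2))
```

After the second token is picked, the first token contributes nothing new because it lies in the same direction, so the next pick goes to a token that expands the spanned subspace. Lowering a token's relevance weight changes which one that is.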

SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models

cs.LG cs.CV Md Kaykobad Reza, Ameya Patil, Edward Ayrapetian et al. · Mar 23, 2026

SSAM tackles the problem of merging independently trained multimodal large language models (e.g., vision-language and audio-language specialists) into a single model capable of processing arbitrary modality combinations without any paired multimodal training data. The core idea is to project language-specific parameter updates (task vectors) onto a shared low-rank subspace identified via SVD, thereby aligning consistent update directions while filtering conflicting ones before merging. This is significant because it offers a training-free alternative to expensive joint multimodal training, achieving state-of-the-art results on four benchmarks.

Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether we can merge them into a single MLLM that can handle multiple modalities. Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination of input modalities. SSAM maintains modality-specific parameter updates separately and identifies a shared low-rank subspace for language-related parameter updates, aligns them within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results demonstrate that aligning models in parameter space provides a scalable and resource-efficient alternative to conventional joint multimodal training.
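
One way to picture merging in a shared low-rank subspace is the NumPy sketch below. It is an assumption-laden illustration, not the SSAM algorithm: it finds the shared subspace from an SVD of the concatenated task vectors, projects each task vector onto it, and averages. The function name, the concatenation step, and the averaging rule are all hypothetical choices.

```python
import numpy as np

def merge_in_shared_subspace(deltas, rank):
    """Project task vectors onto a shared rank-`rank` subspace and average them.

    deltas: list of [d_out, d_in] parameter updates (finetuned minus base).
    """
    # Shared directions: top left singular vectors of the concatenated updates.
    stacked = np.concatenate(deltas, axis=1)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    U_r = U[:, :rank]
    # Keep only the component of each update that lies in the shared subspace.
    aligned = [U_r @ (U_r.T @ d) for d in deltas]
    return sum(aligned) / len(aligned)

rng = np.random.default_rng(0)
base = rng.standard_normal((8, 6))
delta_vision = 0.1 * rng.standard_normal((8, 6))  # hypothetical specialist updates
delta_audio = 0.1 * rng.standard_normal((8, 6))

merged_delta = merge_in_shared_subspace([delta_vision, delta_audio], rank=2)
merged_weights = base + merged_delta
print(merged_weights.shape, np.linalg.matrix_rank(merged_delta, tol=1e-10))
```

Components of each update that fall outside the shared subspace are discarded before averaging, which is the filtering effect the summary describes.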

A Comparative Analysis of LLM Memorization at Statistical and Internal Levels: Cross-Model Commonalities and Model-Specific Signatures

cs.CL cs.LG Bowen Chen, Namgi Han, Yusuke Miyao · Mar 23, 2026

This paper presents a large-scale comparative study of memorization across six open LLM families (Pythia, OLMo1/2/3, OpenLLaMA, StarCoder) ranging from 1B to 32B parameters. By analyzing both statistical patterns and internal mechanisms (attention heads, layer decoding), it identifies universal behaviors—such as log-linear scaling of memorization rates with model size and high compressibility of memorized sequences—while revealing family-specific signatures in memorization structure. The work bridges isolated findings from single-model studies to establish general principles of how transformers memorize training data.

Memorization is a fundamental component of intelligence for both humans and LLMs. However, while LLM performance scales rapidly, our understanding of memorization lags. Due to limited access to the pre-training data of LLMs, most previous studies focus on a single model series, leading to isolated observations among series and making it unclear which findings are general and which are model-specific. In this study, we collect multiple model series (Pythia, OpenLLaMA, StarCoder, OLMo1/2/3) and analyze their shared or unique memorization behavior at both the statistical and internal levels, connecting individual observations while showing new findings. At the statistical level, we reveal that the memorization rate scales log-linearly with model size, and memorized sequences can be further compressed. Further analysis demonstrates a shared frequency and domain distribution pattern for memorized sequences. However, different models also show individual features under the above observations. At the internal level, we find that LLMs can remove certain injected perturbations, while memorized sequences are more sensitive. By decoding middle layers and ablating attention heads, we reveal the general decoding process and shared important heads for memorization. However, the distribution of those important heads differs between families, showing a unique family-level feature. Through bridging various experiments and revealing new findings, this study paves the way for a universal and fundamental understanding of memorization in LLMs.

Closed-form conditional diffusion models for data assimilation

stat.ML cs.LG physics.comp-ph Brianna Binder, Assad Oberai · Mar 22, 2026

This paper proposes a training-free conditional diffusion model for Bayesian filtering in data assimilation. Instead of learning the score function via neural networks, the authors leverage kernel density estimation (KDE) to represent the joint distribution of states and measurements, yielding a closed-form expression for the score that enables analytical sampling from the posterior. The method targets nonlinear, non-Gaussian filtering problems where traditional ensemble Kalman filters (EnKF) make restrictive Gaussian approximations and particle filters suffer from weight degeneracy in small-ensemble regimes.

We propose closed-form conditional diffusion models for data assimilation. Diffusion models use data to learn the score function (defined as the gradient of the log-probability density of a data distribution), allowing them to generate new samples from the data distribution by reversing a noise injection process. While it is common to train neural networks to approximate the score function, we leverage the analytical tractability of the score function to assimilate the states of a system with measurements. To enable the efficient evaluation of the score function, we use kernel density estimation to model the joint distribution of the states and their corresponding measurements. The proposed approach also inherits the capability of conditional diffusion models of operating in black-box settings, i.e., the proposed data assimilation approach can accommodate systems and measurement processes without their explicit knowledge. The ability to accommodate black-box systems combined with the superior capabilities of diffusion models in approximating complex, non-Gaussian probability distributions means that the proposed approach offers advantages over many widely used filtering methods. We evaluate the proposed method on nonlinear data assimilation problems based on the Lorenz-63 and Lorenz-96 systems of moderate dimensionality and nonlinear measurement models. Results show the proposed approach outperforms the widely used ensemble Kalman and particle filters when small to moderate ensemble sizes are used.
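
A closed-form score of this kind can be sketched in NumPy under stated assumptions. The sketch assumes Gaussian kernels on the joint (state, measurement) ensemble and a forward process x_t = alpha * x_0 + sigma * noise applied to the state only; under those assumptions the noised conditional density is a weighted Gaussian mixture whose score is analytic. The function name, bandwidth parameters, and noise parameterization are illustrative, not taken from the paper.

```python
import numpy as np

def conditional_kde_score(x, y, xs, ys, alpha, sigma, h_x, h_y):
    """Score (gradient of log density in x) of the noised KDE posterior p_t(x | y).

    xs: [n, d] ensemble states, ys: [n, m] their measurements.
    """
    var_x = alpha ** 2 * h_x ** 2 + sigma ** 2   # kernel variance after noising
    diff = alpha * xs - x                        # [n, d], mixture means minus x
    # Log-weights: closeness of x to each noised member and of y to its measurement.
    log_w = (-np.sum(diff ** 2, axis=1) / (2 * var_x)
             - np.sum((ys - y) ** 2, axis=1) / (2 * h_y ** 2))
    w = np.exp(log_w - log_w.max())              # stabilized softmax
    w /= w.sum()
    return (w[:, None] * diff).sum(axis=0) / var_x

# Two ensemble members; the measurement y = 0 strongly favors the first one.
xs = np.array([[1.0], [-1.0]])
ys = np.array([[0.0], [5.0]])
score = conditional_kde_score(np.array([0.0]), np.array([0.0]),
                              xs, ys, alpha=1.0, sigma=1.0, h_x=0.1, h_y=0.5)
print(score)
```

Because the score is evaluated directly from the ensemble, no network is trained. The cost per evaluation grows linearly with the ensemble size.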

Mechanisms of Introspective Awareness

cs.LG Uzay Macar, Li Yang, Atticus Wang et al. · Mar 22, 2026

The paper investigates whether large language models possess genuine "introspective awareness"—the ability to detect and identify concept steering vectors injected into their residual stream—or whether this behavior stems from shallow heuristics. Through behavioral experiments and mechanistic analysis on Gemma3-27B, the authors establish that detection maintains 0% false positives across diverse prompts, emerges specifically from post-training rather than pretraining, and relies on distributed MLP computation involving distinct "evidence carrier" and "gate" features. The work suggests models possess latent introspective capacity that default prompting dramatically under-elicits.

Recent work shows that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept, a phenomenon cited as evidence of "introspective awareness." But what mechanisms underlie this capability, and do they reflect genuine introspective circuitry or more shallow heuristics? We investigate these questions in open-source models and establish three main findings. First, introspection is behaviorally robust: detection achieves moderate true positive rates with 0% false positives across diverse prompts. We also find this capability emerges specifically from post-training rather than pretraining. Second, introspection is not reducible to a single linear confound: anomaly detection relies on distributed MLP computation across multiple directions, implemented by evidence carrier and gate features. Third, models possess greater introspective capability than is elicited by default: ablating refusal directions improves detection by 53pp and a trained steering vector by 75pp. Overall, our results suggest that introspective awareness is behaviorally robust, grounded in nontrivial internal anomaly detection, and likely could be substantially improved in future models. Code: https://github.com/safety-research/introspection-mechanisms.

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

cs.LG cs.DC Huamin Chen, Xunzhuo Liu, Bowei He et al. · Mar 22, 2026

This vision paper from the vLLM Semantic Router project proposes the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. The authors synthesize two dozen prior publications into a structured matrix, arguing that workload characteristics, routing policy, and pool architecture are coupled dimensions that must be co-optimized. The paper maps existing work onto a $3\times3$ interaction matrix and proposes twenty-one concrete research directions tiered by maturity.

Over the past year, the vLLM Semantic Router project has released a series of work spanning: (1) core routing mechanisms -- signal-driven routing, context-length pool routing, router performance engineering, policy conflict detection, low-latency embedding models, category-aware semantic caching, user-feedback-driven routing adaptation, hallucination detection, and hierarchical content-safety classification for privacy and jailbreak protection; (2) fleet optimization -- fleet provisioning and energy-efficiency analysis; (3) agentic and multimodal routing -- multimodal agent routing, tool selection, CUA security, and multi-turn context memory and safety; (4) governance and standards -- inference routing protocols and multi-provider API extensions. Each paper tackled a specific problem in LLM inference, but the problems are not independent; for example, fleet provisioning depends on the routing policy, which depends on the workload mix, shifting as organizations adopt agentic and multimodal workloads. This paper distills those results into the Workload-Router-Pool (WRP) architecture, a three-dimensional framework for LLM inference optimization. Workload characterizes what the fleet serves (chat vs. agent, single-turn vs. multi-turn, warm vs. cold, prefill-heavy vs. decode-heavy). Router determines how each request is dispatched (static semantic rules, online bandit adaptation, RL-based model selection, quality-aware cascading). Pool defines where inference runs (homogeneous vs. heterogeneous GPU, disaggregated prefill/decode, KV-cache topology). We map our prior work onto a 3x3 WRP interaction matrix, identify which cells we have covered and which remain open, and propose twenty-one concrete research directions at the intersections, each grounded in our prior measurements, tiered by maturity from engineering-ready to open research.

Calibeating Made Simple

cs.LG cs.AI cs.GT Yurong Chen, Zhiyi Huang, Michael I. Jordan et al. · Mar 23, 2026

The paper studies calibeating—post-processing external forecasts online to minimize cumulative losses while matching an informativeness-based benchmark. Unlike prior work that used loss-specific arguments, the authors reduce calibeating to standard online learning primitives, showing it is minimax-equivalent to regret minimization. This yields optimal rates for general proper losses and improves bounds for simultaneous calibration and calibeating.

We study calibeating, the problem of post-processing external forecasts online to minimize cumulative losses and match an informativeness-based benchmark. Unlike prior work, which analyzed calibeating for specific losses with specific arguments, we reduce calibeating to existing online learning techniques and obtain results for general proper losses. More concretely, we first show that calibeating is minimax-equivalent to regret minimization. This recovers the $O(\log T)$ calibeating rate of Foster and Hart [FH23] for the Brier and log losses and its optimality, and yields new optimal calibeating rates for mixable losses and general bounded losses. Second, we prove that multi-calibeating is minimax-equivalent to the combination of calibeating and the classical expert problem. This yields new optimal multi-calibeating rates for mixable losses, including Brier and log losses, and general bounded losses. Finally, we obtain new bounds for achieving calibeating and calibration simultaneously for the Brier loss. For binary predictions, our result gives the first calibrated algorithm that at the same time also achieves the optimal $O(\log T)$ calibeating rate.

JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization

cs.CV cs.LG Haolun Zheng, Yu He, Tailun Chen et al. · Mar 22, 2026

JANUS addresses jailbreaking of text-to-image models by reframing the discrete prompt search as optimization over a structured distribution. The framework mixes two Gaussian-anchored prompt distributions—one around the target harmful prompt and one around a sanitized 'clean' version—and uses policy gradient on a single scalar mixing parameter $\alpha$ to maximize end-to-end reward. This avoids both proxy-loss optimization and costly LLM-based generators, achieving substantial efficiency gains while exposing weaknesses in current safety pipelines.

Text-to-image (T2I) models such as Stable Diffusion and DALLE remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.

TimeTox: An LLM-Based Pipeline for Automated Extraction of Time Toxicity from Clinical Trial Protocols

cs.CL cs.LG Saketh Vinjamuri, Marielle Fis Loperena, Marie C. Spezia et al. · Mar 22, 2026

Time toxicity—the cumulative healthcare contact days imposed by clinical trial participation—is an important patient-centric metric buried in dense Schedule of Assessments (SoA) tables. This work proposes TimeTox, a Gemini-based LLM pipeline that extracts time toxicity from protocol PDFs at scale, comparing a single-pass architecture against a two-stage structure-then-count approach. The authors deploy their system on 644 real-world oncology protocols and find that synthetic benchmark accuracy is a poor predictor of real-world reliability, a lesson critical for clinical NLP deployment.

Time toxicity, the cumulative healthcare contact days from clinical trial participation, is an important but labor-intensive metric to extract from protocol documents. We developed TimeTox, an LLM-based pipeline for automated extraction of time toxicity from Schedule of Assessments tables. TimeTox uses Google's Gemini models in three stages: summary extraction from full-length protocol PDFs, time toxicity quantification at six cumulative timepoints for each treatment arm, and multi-run consensus via position-based arm matching. We validated against 20 synthetic schedules (240 comparisons) and assessed reproducibility on 644 real-world oncology protocols. Two architectures were compared: single-pass (vanilla) and two-stage (structure-then-count). The two-stage pipeline achieved 100% clinically acceptable accuracy ($\pm$3 days) on synthetic data (MAE 0.81 days) versus 41.5% for vanilla (MAE 9.0 days). However, on real-world protocols, the vanilla pipeline showed superior reproducibility: 95.3% clinically acceptable accuracy (IQR $\leq$ 3 days) across 3 runs on 644 protocols, with 82.0% perfect stability (IQR = 0). The production pipeline extracted time toxicity for 1,288 treatment arms across multiple disease sites. Extraction stability on real-world data, rather than accuracy on synthetic benchmarks, is the decisive factor for production LLM deployment.
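The reproducibility criterion quoted above (IQR of at most 3 days across runs, with IQR = 0 counted as perfect stability) can be sketched as follows. This is a minimal illustration, not the TimeTox implementation: the quartile method (linear interpolation) and the use of the median as the consensus value are assumptions, and `summarize_runs` is a hypothetical helper.

```python
from statistics import median, quantiles

ACCEPTABLE_IQR_DAYS = 3  # threshold quoted in the abstract


def iqr(values):
    """Interquartile range with linear interpolation between data points."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def summarize_runs(contact_days_per_run):
    """Summarize repeated extractions of one arm at one timepoint.

    `contact_days_per_run` holds the contact-day count from each run.
    The median consensus is an assumption made for this sketch.
    """
    spread = iqr(contact_days_per_run)
    return {
        "consensus_days": median(contact_days_per_run),
        "iqr_days": spread,
        "acceptable": spread <= ACCEPTABLE_IQR_DAYS,
        "perfectly_stable": spread == 0,
    }


# Hypothetical counts from three runs, not data from the paper.
stable = summarize_runs([14, 14, 14])
noisy = summarize_runs([5, 20, 30])
```

Here `stable` is flagged as perfectly stable, while `noisy` has an IQR of 12.5 days and fails the 3-day criterion.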

CoRA: Boosting Time Series Foundation Models for Multivariate Forecasting through Correlation-aware Adapter

cs.LG cs.AI Hanyin Cheng, Xingjian Wu, Yang Shu et al. · Mar 23, 2026

Most Time Series Foundation Models treat channels independently and ignore cross-channel correlations, which limits their performance on multivariate forecasting. This paper proposes CoRA (CoRrelation-aware Adapter), a lightweight plug-in that learns three correlation types—dynamic (time-varying), heterogeneous (positive/negative), and partial (sparse)—through a low-rank decomposition and dual contrastive learning. The key insight is that these correlations can be captured during fine-tuning without re-pretraining the foundation model, and with only linear complexity at inference time.

Most existing Time Series Foundation Models (TSFMs) use channel-independent modeling and focus on capturing and generalizing temporal dependencies, while neglecting the correlations among channels or overlooking the different aspects of correlations. However, these correlations play a vital role in multivariate time series forecasting. To address this, we propose a CoRrelation-aware Adapter (CoRA), a lightweight plug-and-play method that requires only fine-tuning with TSFMs and is able to capture different types of correlations, so as to improve forecast performance. Specifically, to reduce complexity, we innovatively decompose the correlation matrix into low-rank Time-Varying and Time-Invariant components. For the Time-Varying component, we further design learnable polynomials to learn dynamic correlations by capturing trends or periodic patterns. To learn positive and negative correlations that appear only among some channels, we introduce a novel dual contrastive learning method that identifies correlations through projection layers, regulated by a Heterogeneous-Partial contrastive loss during training, without introducing additional complexity in the inference stage. Extensive experiments on 10 real-world datasets demonstrate that CoRA can improve TSFMs in multivariate forecasting performance.

SmaAT-QMix-UNet: A Parameter-Efficient Vector-Quantized UNet for Precipitation Nowcasting

cs.LG cs.AI Nikolas Stavrou, Siamak Mehrkanoon · Mar 23, 2026

This paper tackles precipitation nowcasting by enhancing the lightweight SmaAT-UNet architecture with two modifications: a vector-quantization (VQ) bottleneck that discretizes latent representations into a learned codebook, and Mixed Convolution (MixConv) blocks that blend multiple kernel sizes to reduce parameters. The goal is to cut model size for edge deployment while preserving forecast skill at a 30-minute lead time.

Weather forecasting supports critical socioeconomic activities and complements environmental protection, yet operational Numerical Weather Prediction (NWP) systems remain computationally intensive, thus being inefficient for certain applications. Meanwhile, recent advances in deep data-driven models have demonstrated promising results in nowcasting tasks. This paper presents SmaAT-QMix-UNet, an enhanced variant of SmaAT-UNet that introduces two key innovations: a vector quantization (VQ) bottleneck at the encoder-decoder bridge, and mixed kernel depth-wise convolutions (MixConv) replacing selected encoder and decoder blocks. These enhancements both reduce the model's size and improve its nowcasting performance. We train and evaluate SmaAT-QMix-UNet on a Dutch radar precipitation dataset (2016-2019), predicting precipitation 30 minutes ahead. Three configurations are benchmarked: using only VQ, only MixConv, and the full SmaAT-QMix-UNet. Grad-CAM saliency maps highlight the regions influencing each nowcast, while a UMAP embedding of the codewords illustrates how the VQ layer clusters encoder outputs. The source code for SmaAT-QMix-UNet is publicly available on GitHub at https://github.com/nstavr04/MasterThesisSnellius.
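For readers unfamiliar with a VQ bottleneck, the core lookup can be sketched as below: each latent vector is replaced by its nearest entry in a codebook. This is the generic vector-quantization step only, written in plain Python for clarity. The codebook size, latent dimension, and training details of SmaAT-QMix-UNet are not given in the abstract, so the values here are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents, codebook):
    """Map each latent vector to its nearest codeword.

    Returns the chosen codebook indices and the quantized vectors.
    """
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]
    return indices, [codebook[k] for k in indices]


# Hypothetical 2-D codebook and encoder outputs, for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
latents = [[0.2, -0.1], [0.9, 1.2], [3.5, 4.1]]
indices, quantized = quantize(latents, codebook)
```

In a trained model the codebook is learned jointly with the encoder and decoder; this sketch covers only the inference-time lookup.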

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

cs.LG cs.PF Jaber Jaber, Osama Jaber · Mar 22, 2026

GPU kernel optimization is among the most expertise-intensive tasks in ML systems engineering, often requiring weeks of manual tuning per kernel. AutoKernel proposes to automate this via an autonomous agent loop that iteratively edits Triton or CUDA C++ kernels, validates them through a five-stage correctness harness, and keeps or reverts changes based on benchmarked throughput. The system prioritizes kernels by their contribution to total runtime (Amdahl's law) and encodes expert tuning strategies into a six-tier agent playbook. While it demonstrates strong results on memory-bound operations like normalization and softmax, its compute-bound matmul performance remains significantly below vendor library baselines.

Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow-AI/autokernel.
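The Amdahl's-law ranking mentioned above can be illustrated with a short sketch. It assumes a profile of per-kernel runtimes and a uniform hypothetical speedup per kernel; the function names and the profile numbers are illustrative and are not taken from the AutoKernel code base.

```python
def amdahl_speedup(runtime_fraction, kernel_speedup):
    """End-to-end speedup when a kernel that takes `runtime_fraction` of
    total runtime is made `kernel_speedup` times faster (Amdahl's law)."""
    return 1.0 / ((1.0 - runtime_fraction) + runtime_fraction / kernel_speedup)


def rank_by_impact(profile_ms, assumed_kernel_speedup=2.0):
    """Sort kernels by projected end-to-end speedup, largest first."""
    total = sum(profile_ms.values())
    scored = [
        (name, amdahl_speedup(ms / total, assumed_kernel_speedup))
        for name, ms in profile_ms.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical profile in milliseconds per forward pass, not measured data.
profile = {"matmul": 60.0, "softmax": 30.0, "rmsnorm": 10.0}
ranking = rank_by_impact(profile)
```

Under these made-up numbers the matmul kernel ranks first because it dominates runtime. By the same formula, a 5.29x speedup on a kernel that accounts for only 10% of runtime would yield roughly a 1.09x end-to-end gain.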
