Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
482 papers
Trending mixes fresh papers with community signal.
cs.IT cs.LG eess.SP Zijun Qin, Jingxuan Huang, Zesong Fei et al. · Mar 23, 2026

The paper addresses adaptive broadcast of data-intensive sensory streams (e.g., camera/LiDAR) to heterogeneous edge devices with diverse channel conditions and computational budgets. It proposes Nonlinear Transform Rateless Source-Channel Coding (NTRSCC), integrating learned nonlinear transforms with physical-layer Luby Transform (LT) codes to enable receivers to adaptively adjust the number of received symbols and belief propagation iterations. This achieves an explicit, controllable tradeoff between distortion, transmission rate, and decoding complexity—addressing key limitations of fixed-rate DeepJSCC schemes that either underserve capable devices or require costly retransmissions.

In recent years, numerous data-intensive broadcasting applications have emerged at the wireless edge, calling for a flexible tradeoff between distortion, transmission rate, and processing complexity. While deep learning-based joint source-channel coding (DeepJSCC) has been identified as a potential solution to data-intensive communications, most of these schemes are confined to worst-case solutions, lack adaptive complexity, and are inefficient in broadcast settings. To overcome these limitations, this paper introduces nonlinear transform rateless source-channel coding (NTRSCC), a variable-length JSCC framework for broadcast channels based on rateless codes. In particular, we integrate learned source transformations with physical-layer LT codes, develop unequal protection schemes that exploit decoder side information, and devise approximations to enable end-to-end optimization of rateless parameters. Our framework enables heterogeneous receivers to adaptively adjust their received number of rateless symbols and decoding iterations in belief propagation, thereby achieving a controllable tradeoff between distortion, rate, and decoding complexity. Simulation results demonstrate that the proposed method enhances image broadcast quality under stringent communication and processing budgets over heterogeneous edge devices.
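
To make the rateless idea in this abstract concrete, here is a minimal sketch of a textbook LT encoder over small integer blocks: each output symbol XORs a random subset of source blocks, with the subset size drawn from the ideal soliton distribution. This is not the paper's NTRSCC scheme (which operates at the physical layer with learned transforms); the function names, the toy block values, and the choice of the ideal soliton distribution are assumptions made for illustration.

```python
import random

def ideal_soliton(k):
    """Probabilities p[1..k] of the ideal soliton distribution (p[0] is unused)."""
    p = [0.0] * (k + 1)
    p[1] = 1.0 / k
    for d in range(2, k + 1):
        p[d] = 1.0 / (d * (d - 1))
    return p

def lt_encode_symbol(blocks, rng):
    """Produce one rateless symbol: (indices of the XORed blocks, XOR value)."""
    k = len(blocks)
    p = ideal_soliton(k)
    # Draw the degree (how many source blocks are combined).
    degree = rng.choices(range(1, k + 1), weights=p[1:])[0]
    # Pick that many distinct source blocks uniformly at random.
    neighbors = sorted(rng.sample(range(k), degree))
    symbol = 0
    for i in neighbors:
        symbol ^= blocks[i]
    return neighbors, symbol

# Hypothetical toy source: four one-byte blocks.
rng = random.Random(0)
blocks = [0x12, 0x34, 0x56, 0x78]
# The sender can keep emitting symbols; each receiver stops when it has enough.
stream = [lt_encode_symbol(blocks, rng) for _ in range(6)]
```

Because the stream has no fixed length, a receiver with a good channel can stop collecting symbols early while a weaker one keeps listening, which is the property the abstract builds on.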
cs.CV Detao Bai, Shimin Yao, Weixuan Chen et al. · Mar 23, 2026

The paper addresses the problem of identifying "Who said what and when" in multi-speaker video conversations, which current Omni-modal LLMs fail at due to sparse visual sampling (1-2 fps) and "shortcut learning" on visual biases. The authors introduce VR-SDR (Visual-Registered Speaker Diarization and Recognition), a rigorous benchmark that forces models to bind identities from natural language descriptions without visual shortcuts. They propose HumanOmni-Speaker, featuring a Visual Delta Encoder that samples video at 25 fps yet compresses inter-frame motion residuals into only 6 tokens per frame to capture fine-grained visemes while avoiding token explosion.

While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
stat.AP cs.LG Emma Hannula, Jana de Wiljes, Matthew T. Moores et al. · Mar 23, 2026

This paper investigates amortized Bayesian inference (ABI) for estimating coupling parameters in Kuramoto oscillator networks—a nonlinear dynamical system widely used to study synchronization. The authors apply neural posterior estimation via BayesFlow to learn an amortized approximation of the posterior distribution from simulated phase dynamics. While the method succeeds for simple single-parameter networks, the paper's central finding is that it fails for complex multi-node networks due to structural non-identifiability and data inefficiency—making the title's focus on 'limitations' well-earned.

Bayesian inference is a powerful tool for parameter estimation and uncertainty quantification in dynamical systems. However, for nonlinear oscillator networks such as Kuramoto models, widely used to study synchronization phenomena in physics, biology, and engineering, inference is often computationally prohibitive due to high-dimensional state spaces and intractable likelihood functions. We present an amortized Bayesian inference approach that learns a neural approximation of the posterior from simulated phase dynamics, enabling fast, scalable inference without repeated sampling or optimization. Applied to synthetic Kuramoto networks, the method shows promising results in approximating posterior distributions and capturing uncertainty, with computational savings compared to traditional Bayesian techniques. These findings suggest that amortized inference is a practical and flexible framework for uncertainty-aware analysis of oscillator networks.
cs.CV Thomas Mendelson, Joshua Francois, Galit Lahav et al. · Mar 22, 2026

Instance segmentation in dense microscopy images requires separating tightly packed, touching cells—a task where binary masks and pixel-wise losses often merge adjacent instances. This paper proposes predicting continuous signed distance functions (SDFs) mapped to probabilistic segmentations via a learnable sigmoid, trained with a differentiable Modified Hausdorff Distance (MHD) loss. The approach eliminates the need for interactive prompting or watershed post-processing while aiming to improve boundary fidelity in high-throughput cellular imaging.

Accurate delineation of individual cells in microscopy videos is essential for studying cellular dynamics, yet separating touching or overlapping instances remains a persistent challenge. Although foundation models for segmentation such as SAM have broadened the accessibility of image segmentation, they still struggle to separate nearby cell instances in dense microscopy scenes without extensive prompting. We propose a prompt-free, boundary-aware instance segmentation framework that predicts signed distance functions (SDFs) instead of binary masks, enabling smooth and geometry-consistent modeling of cell contours. A learned sigmoid mapping converts SDFs into probability maps, yielding sharp boundary localization and robust separation of adjacent instances. Training is guided by a unified Modified Hausdorff Distance (MHD) loss that integrates region- and boundary-based terms. Evaluations on both public and private high-throughput microscopy datasets demonstrate improved boundary accuracy and instance-level performance compared to recent SAM-based and foundation-model approaches. Source code is available at: https://github.com/ThomasMendelson/BAISeg.git
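
The SDF-to-probability step can be sketched roughly as follows. The sign convention (negative inside a cell, positive outside) and the single sharpness parameter `alpha` are assumptions for illustration; in the paper the mapping is learned, and its exact parameterization is not given in the abstract.

```python
import math

def sdf_to_probability(sdf, alpha=4.0):
    """Map a signed distance to a foreground probability with a sigmoid.

    Assumed convention: sdf < 0 inside the cell, sdf > 0 outside.
    alpha is a hypothetical stand-in for the learned sharpness.
    """
    z = -alpha * sdf
    # Numerically stable sigmoid for either sign of z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# A row of pixels crossing two touching cells: the SDF turns positive in the
# narrow gap between them, so the probability dips there and separates them.
row_sdf = [-1.5, -0.5, 0.3, -0.4, -1.2]
row_prob = [sdf_to_probability(s) for s in row_sdf]
```

A larger `alpha` gives a sharper transition at the zero level set, which is one way to read the abstract's claim of sharp boundary localization.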
cs.CV Lev Ayzenberg, Shady Abu-Hussein, Raja Giryes et al. · Mar 23, 2026

MRI acquisition is inherently slow due to sequential k-space sampling. This paper proposes TRUST-MRI, an active sampling framework that leverages discrete anatomical tokens from the pretrained MedITok tokenizer and a latent Transformer to guide measurement selection. The core innovation uses token prediction entropy as an uncertainty signal, introducing two policies: Latent Entropy Selection (LES) projects patch-wise entropy to k-space to select lines, while Gradient-based Entropy Optimization (GEO) uses gradients of total entropy with respect to input measurements. The approach trades pixel-wise fidelity for perceptual quality and computational efficiency, achieving superior feature-based metrics while running at 0.97 fps compared to 0.01 fps for diffusion-based active methods.

Full data acquisition in MRI is inherently slow, which limits clinical throughput and increases patient discomfort. Compressed Sensing MRI (CS-MRI) seeks to accelerate acquisition by reconstructing images from under-sampled k-space data, requiring both an optimal sampling trajectory and a high-fidelity reconstruction model. In this work, we propose a novel active sampling framework that leverages the inherent discrete structure of a pretrained medical image tokenizer and a latent transformer. By representing anatomy through a dictionary of quantized visual tokens, the model provides a well-defined probability distribution over the latent space. We utilize this distribution to derive a principled uncertainty measure via token entropy, which guides the active sampling process. We introduce two strategies to exploit this latent uncertainty: (1) Latent Entropy Selection (LES), projecting patch-wise token entropy into the $k$-space domain to identify informative sampling lines, and (2) Gradient-based Entropy Optimization (GEO), which identifies regions of maximum uncertainty reduction via the $k$-space gradient of a total latent entropy loss. We evaluate our framework on the fastMRI singlecoil Knee and Brain datasets at $\times 8$ and $\times 16$ acceleration. Our results demonstrate that our active policies outperform state-of-the-art baselines in perceptual metrics and feature-based distances. Our code is available at https://github.com/levayz/TRUST-MRI.
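
The uncertainty signal described here is the Shannon entropy of each patch's predicted token distribution. A minimal sketch of that computation, assuming the latent transformer has already produced one categorical distribution per patch (the function names and the toy four-token codebook are hypothetical; the projection of entropy into k-space used by LES is not shown):

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_uncertain_patches(patch_distributions, budget):
    """Rank patches by token entropy and return the top `budget` indices."""
    scores = [token_entropy(p) for p in patch_distributions]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:budget], scores

# Toy example: three patches over a four-token codebook.
patches = [
    [1.0, 0.0, 0.0, 0.0],      # fully confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.7, 0.1, 0.1, 0.1],      # in between
]
top, scores = most_uncertain_patches(patches, budget=1)
```

In this toy case the uniform patch has the highest entropy, so it would be the first candidate for additional measurements.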
cs.CL Stella Eva Tsiapali, Cong-Thanh Do, Kate Knill · Mar 23, 2026

Cross-tokenizer knowledge distillation faces a fundamental alignment challenge when Teacher and Student models use different vocabularies. This paper analyzes DSKD-CMA, the state-of-the-art method for this setting, through manual chunk alignment probes and reveals that its cross-model attention mechanism captures coarse chunk structures but suffers from noisy localization with repeated tokens. Building on this insight, the authors propose DSKD-CMA-GA, which uses generative adversarial key-query matching to align distributions between models, achieving modest improvements in ROUGE-L scores that narrow the gap between cross-tokenizer and same-tokenizer distillation.

Large language models (LLMs) achieve state-of-the-art (SOTA) performance across language tasks, but are costly to deploy due to their size and resource demands. Knowledge Distillation (KD) addresses this by training smaller Student models to mimic larger Teacher models, improving efficiency without significant performance loss. Dual-Space Knowledge Distillation with Cross-Model Attention (DSKD-CMA) has emerged as a SOTA method for KD between LLMs with distinct tokenizers, yet its internal workings remain largely opaque. In this work, we systematically analyse the attention mechanism of DSKD-CMA through manual token alignment probing and heatmap visualisations, revealing both strengths and limitations. Building on this, we introduce a novel method, DSKD-CMA-GA, based on Generative Adversarial (GA) learning, to address the mismatched distributions between the keys and queries computed from distinct models. Experiments show modest but consistent ROUGE-L gains in text generation quality, particularly on out-of-distribution data (+0.37 on average), narrowing the gap between cross- and same-tokenizer KD.
cs.CV Pengchong Hu, Zhizhong Han · Mar 22, 2026

RGBD SLAM with 3D Gaussian Splatting (3DGS) struggles to balance scalability against rendering fidelity: global Gaussians consume excessive GPU memory, while view-tied Gaussians (fixed at depth) suffer from limited novel-view quality. This paper proposes pixel-aligned Gaussians that can adjust their positions along viewing rays via learned depth offsets, paired with a fast geometry-similarity tracking strategy using Generalized ICP on depth distributions. The approach claims state-of-the-art rendering and tracking performance while maintaining smaller active memory footprints than prior 3DGS-based methods.

3D Gaussian Splatting (3DGS) has made remarkable progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified to improve system scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian distribution, and then use these distributions to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity. Please see our project page for code and videos at https://machineperceptionlab.github.io/SGAD-SLAM-Project .
cs.CV Yaelle Zribi (ENC), Florian Cafiero (ENC, LRE) et al. · Mar 23, 2026

Stand-up comedy depends as much on timing and embodied presence as on verbal content, yet computational humor has largely focused on text alone. This paper introduces TIC-TALK, a multimodal corpus of 90 professionally filmed Netflix specials (2015–2024) with temporally aligned annotations for language, gesture, and audience response. The processing pipeline combines BERTopic for thematic segmentation, Whisper-AT for laughter detection, and YOLOv8 for shot classification and pose keypoint extraction, all aligned hierarchically without resampling. The authors validate the resource through corpus-level findings including a negative correlation between kinetic energy and laughter rate ($r = -0.75$), consistent with a stillness-before-punchline pattern, and through a short-horizon laughter prediction benchmark.

Stand-up comedy, and humor in general, are often studied through their verbal content. Yet live performance relies just as much on embodied presence and audience feedback. We introduce TIC-TALK, a multimodal resource with 5,400+ temporally aligned topic segments capturing language, gesture, and audience response across 90 professionally filmed stand-up comedy specials (2015-2024). The pipeline combines BERTopic for 60 s thematic segmentation with dense sentence embeddings, Whisper-AT for 0.8 s laughter detection, a fine-tuned YOLOv8-cls shot classifier, and YOLOv8s-pose for raw keypoint extraction at 1 fps. Raw 17-joint skeletal coordinates are retained without prior clustering, enabling the computation of continuous kinematic signals (arm spread, kinetic energy, and trunk lean) that serve as proxies for performance dynamics. All streams are aligned by hierarchical temporal containment without resampling, and each topic segment stores its sentence-BERT embedding for downstream similarity and clustering tasks. As a concrete use case, we study laughter dynamics across 24 thematic topics: kinetic energy negatively predicts audience laughter rate (r = -0.75, N = 24), consistent with a stillness-before-punchline pattern; personal and bodily content elicits more laughter than geopolitical themes; and shot close-up proportion correlates positively with laughter (r = +0.28), consistent with reactive montage.
cs.LG cs.SE Tianxiang Xu, Xiaoyan Zhu, Xin Lai et al. · Mar 23, 2026

This paper addresses paper-code consistency detection in bioinformatics, tackling the reproducibility crisis where algorithmic descriptions in publications often diverge from software implementations. The authors introduce BioCon, a benchmark of 48 bioinformatics projects with expert-annotated sentence-code pairs, and propose a cross-modal framework using UniXcoder with weighted focal loss. While the task is important for computational biology reproducibility, claims of novelty require qualification given concurrent efforts in the broader scientific community.

Ensuring consistency between research papers and their corresponding software implementations is fundamental to software reliability and scientific reproducibility. However, this problem remains underexplored, particularly in the domain of bioinformatics, where discrepancies between methodological descriptions in papers and their actual code implementations are prevalent. To address this gap, this paper introduces a new task, namely paper-code consistency detection, and curates a collection of 48 bioinformatics software projects along with their associated publications. We systematically align sentence-level algorithmic descriptions from papers with function-level code snippets. Combined with expert annotations and a hybrid negative sampling strategy, we construct the first benchmark dataset in the bioinformatics domain tailored to this task, termed BioCon. Based on this benchmark, we further propose a cross-modal consistency detection framework designed to model the semantic relationships between natural language descriptions and code implementations. The framework adopts a unified input representation and leverages pre-trained models to capture deep semantic alignment between papers and code. To mitigate the effects of class imbalance and hard samples, we incorporate a weighted focal loss to enhance model robustness. Experimental results demonstrate that our framework effectively identifies consistency between papers and code in bioinformatics, achieving an accuracy of 0.9056 and an F1 score of 0.8011. Overall, this study opens a new research direction for paper-code consistency analysis and lays the foundation for automated reproducibility assessment and cross-modal understanding in scientific software.
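
For readers unfamiliar with the loss mentioned above, here is a minimal sketch of a binary weighted focal loss in its common form, FL = -alpha_t (1 - p_t)^gamma log(p_t). The abstract does not state the paper's exact weighting or hyperparameters; `alpha=0.25` and `gamma=2.0` are the commonly used defaults from the original focal loss work and are assumptions here.

```python
import math

def weighted_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one example.

    p: predicted probability of the positive (consistent) class.
    y: ground-truth label, 1 or 0.
    alpha: class weight for positives (1 - alpha for negatives).
    gamma: focusing parameter; larger values down-weight easy examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

# A confidently correct prediction contributes far less than a confident miss.
easy = weighted_focal_loss(0.9, 1)
hard = weighted_focal_loss(0.1, 1)
```

With `gamma=0` the expression reduces to class-weighted cross-entropy, which makes the role of the focusing term easy to isolate.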
cs.CV Mingle Zhou, Jiahui Liu, Jin Wan et al. · Mar 23, 2026

This paper tackles Unsupervised Continuous Anomaly Detection (UCAD), where models must sequentially learn new product categories without forgetting previous ones or storing all raw data. The core idea is to augment visual-only approaches with learnable text prompts from CLIP, storing both modalities in a Continual Multimodal Prompt Memory Bank (CMPMB) and fusing them via a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM). Benchmarked on MVTec AD and VisA, the authors claim state-of-the-art detection accuracy (+4.4% AUROC) and segmentation (+14.8% AUPR) over the prior UCAD baseline.

Unsupervised Continuous Anomaly Detection (UCAD) is gaining attention for effectively addressing the catastrophic forgetting and heavy computational burden issues in traditional Unsupervised Anomaly Detection (UAD). However, existing UCAD approaches that rely solely on visual information are insufficient to capture the manifold of normality in complex scenes, thereby impeding further gains in anomaly detection accuracy. To overcome this limitation, we propose an unsupervised continual anomaly detection framework grounded in multimodal prompting. Specifically, we introduce a Continual Multimodal Prompt Memory Bank (CMPMB) that progressively distills and retains prototypical normal patterns from both visual and textual domains across consecutive tasks, yielding a richer representation of normality. Furthermore, we devise a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) that integrates an Adaptive Normalization Module (ANM) with a Dynamic Fusion Strategy (DFS) to jointly enhance detection accuracy and adversarial robustness. Benchmark experiments on MVTec AD and VisA datasets show that our approach achieves state-of-the-art (SOTA) performance on image-level AUROC and pixel-level AUPR metrics.
cs.CV cs.AI cs.CL Haichao Zhang, Yijiang Li, Shwai He et al. · Mar 23, 2026

ThinkJEPA addresses the limitation of JEPA-style latent world models that rely on short, densely sampled windows, which bias predictions toward local dynamics while missing long-horizon semantics. The paper proposes a dual-temporal architecture combining a dense-frame V-JEPA branch for fine-grained motion with a sparsely sampled VLM "thinker" branch that provides semantic guidance via multi-layer feature pyramids. This matters because it attempts to marry the physical consistency of latent world models with the general knowledge of vision-language models for robust trajectory forecasting.

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
cs.LG Andrii Shportko · Mar 23, 2026

This paper establishes information-theoretic limits on LLM steganography, proving that any semantic-preserving embedding of a payload $P$ into a covertext $M_1$ to produce stegotext $M_2$ must increase Kolmogorov complexity by at least $K(P) - O(\log n)$. Since Kolmogorov complexity is uncomputable, the authors propose perplexity ratios (specifically the Binoculars score) as a practical proxy and validate the approach on a color-based encoding scheme with 300 samples.

Large language models can rewrite text to embed hidden payloads while preserving surface-level meaning, a capability that opens covert channels between cooperating AI systems and poses challenges for alignment monitoring. We study the information-theoretic cost of such embedding. Our main result is that any steganographic scheme that preserves the semantic load of a covertext $M_1$ while encoding a payload $P$ into a stegotext $M_2$ must satisfy $K(M_2) \geq K(M_1) + K(P) - O(\log n)$, where $K$ denotes Kolmogorov complexity and $n$ is the combined message length. A corollary is that any non-trivial payload forces a strict complexity increase in the stegotext, regardless of how cleverly the encoder distributes the signal. Because Kolmogorov complexity is uncomputable, we ask whether practical proxies can detect this predicted increase. Drawing on the classical correspondence between lossless compression and Kolmogorov complexity, we argue that language-model perplexity occupies an analogous role in the probabilistic regime and propose the Binoculars perplexity-ratio score as one such proxy. Preliminary experiments with a color-based LLM steganographic scheme support the theoretical prediction: a paired $t$-test over 300 samples yields $t = 5.11$, $p < 10^{-6}$.
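
The compression correspondence the abstract draws on can be illustrated with an off-the-shelf compressor: compressed length is a computable stand-in for Kolmogorov complexity, and adding an incompressible payload should raise it. The sketch below uses zlib, not the Binoculars perplexity ratio the paper proposes, and plain concatenation rather than a real steganographic rewrite; the cover text and payload are made up for illustration.

```python
import random
import zlib

def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input, a crude proxy for K(data)."""
    return len(zlib.compress(data, 9))

# Hypothetical covertext: highly repetitive, so it compresses well.
cover = b"The quick brown fox jumps over the lazy dog. " * 20

# Hypothetical payload: 64 pseudo-random bytes, essentially incompressible.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(64))

# Naive "embedding" by concatenation, only to show the direction of the effect.
stego = cover + payload
increase = compressed_length(stego) - compressed_length(cover)
```

Here `increase` is positive: the proxy grows once the payload is added, which matches the direction of the bound. A real scheme would spread the payload through the text, and the paper's claim is that the complexity cost remains regardless.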
cs.CV Shanmukha Vellamcheti, Uday Kiran Kothapalli, Disharee Bhowmick et al. · Mar 22, 2026

CVT-Bench evaluates whether multimodal LLMs can maintain stable spatial representations under counterfactual viewpoint transformations—such as inferring object relationships from a camera angle never shown in the image. Using 100 synthetic tabletop scenes and 6,000 relational queries across rotations from $0^{\circ}$ to $360^{\circ}$, the benchmark reveals that state-of-the-art models, despite high single-view accuracy, systematically fail at mental rotation tasks and degrade further under extended sequential context. These findings challenge the assumption that strong episodic spatial performance implies robust viewpoint-invariant representations, with critical implications for embodied AI and robotics applications requiring perspective-taking.

Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations (visual input, textual bounding boxes, and structured scene graphs) and show that increasing representational structure improves stability. Our results suggest that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning.
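
To show what cycle agreement means geometrically, the sketch below computes a ground-truth "left of" relation for two objects on a tabletop as the camera orbits the scene center. The top-down coordinate frame, the orbit convention, and the object names are assumptions for illustration and are not taken from the benchmark itself.

```python
import math

def left_of(a, b, theta_deg):
    """True if object a appears left of object b after orbiting the camera.

    a, b: (x, y) tabletop positions in the frame of the original view.
    Orbiting the camera by theta is modeled as rotating the scene by -theta
    and comparing the resulting camera-right coordinates.
    """
    t = math.radians(theta_deg)
    ax = a[0] * math.cos(t) + a[1] * math.sin(t)
    bx = b[0] * math.cos(t) + b[1] * math.sin(t)
    return ax < bx

def cycle_consistent(a, b):
    """A full 360-degree orbit must reproduce the original relation."""
    return left_of(a, b, 0.0) == left_of(a, b, 360.0)

# Hypothetical scene with two well-separated objects.
mug, bowl = (-1.0, 0.2), (1.0, -0.3)
original = left_of(mug, bowl, 0.0)    # relation in the given view
opposite = left_of(mug, bowl, 180.0)  # relation from the far side of the table
```

A model's answers can be checked against this kind of reference: the relation should flip at 180° and return to the original at 360°, and a disagreement at 360° is a cycle-consistency violation.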
cs.LG Ziyang Zhang, Zheshun Wu, Jie Liu et al. · Mar 23, 2026

SparseDVFS tackles energy-efficient DNN inference on edge devices by bridging the gap between coarse model-level and prohibitive operator-level DVFS. The core insight is using operator sparsity to distinguish compute-bound and memory-bound phases, applying specialized frequency triplets via a block-level strategy. A white-box offline modeler, greedy graph partitioner with amortization constraints, and unified co-governor with look-ahead pipelining collectively achieve substantial energy savings while managing switching overheads.

Deploying deep neural networks (DNNs) on power-sensitive edge devices presents a formidable challenge. While Dynamic Voltage and Frequency Scaling (DVFS) is widely employed for energy optimization, traditional model-level scaling is often too coarse to capture intra-inference variations, whereas fine-grained operator-level scaling suffers from prohibitive performance degradation due to significant hardware switching latency. This paper presents SparseDVFS, a fine-grained, sparse-aware DVFS framework designed for energy-efficient edge inference. Our key insight is that operator sparsity is a primary metric for hardware frequency modulation. By distinguishing between compute-bound dense operators and memory-bound sparse operators, the system can apply specialized frequency triplets to maximize energy efficiency. To overcome switching overheads and component interference, SparseDVFS incorporates three key innovations: (1) an offline modeler that establishes a deterministic mapping between operator sparsity and optimal frequency triplets (CPU/GPU/EMC) via white-box timeline analysis; (2) a runtime graph partitioner that utilizes a greedy merging heuristic to aggregate operators into super-blocks, balancing scaling granularity and DVFS switching latency through a latency amortization constraint; and (3) a unified co-governor that employs a frequency unified scaling engine (FUSE) and a look-ahead instruction queue to eliminate antagonistic effects between independent controllers and hide hardware transition latencies. Extensive evaluations show that SparseDVFS achieves an average 78.17% energy efficiency gain over state-of-the-art solutions while maintaining a superior 14% cost-gain ratio.
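
One way to picture the greedy merging step is the sketch below, which groups consecutive operators into super-blocks until each block runs long enough to amortize one frequency switch. The specific rule (block runtime at least `min_ratio` times the switching latency), the parameter names, and the latencies are hypothetical; the abstract does not spell out the actual constraint, and this sketch ignores operator sparsity entirely.

```python
def greedy_partition(op_latencies_ms, switch_latency_ms, min_ratio=10.0):
    """Greedily merge consecutive operators into super-blocks.

    A block is closed once its accumulated runtime reaches
    min_ratio * switch_latency_ms (an assumed amortization rule).
    Leftover operators at the end are folded into the last block.
    """
    blocks, current, total = [], [], 0.0
    threshold = min_ratio * switch_latency_ms
    for i, latency in enumerate(op_latencies_ms):
        current.append(i)
        total += latency
        if total >= threshold:
            blocks.append(current)
            current, total = [], 0.0
    if current:
        if blocks:
            blocks[-1].extend(current)
        else:
            blocks.append(current)
    return blocks

# Hypothetical per-operator latencies (ms) and a 0.1 ms switching cost.
latencies = [0.2, 0.3, 0.6, 1.5, 0.1, 0.2]
blocks = greedy_partition(latencies, switch_latency_ms=0.1)
```

With these numbers the six operators collapse into two super-blocks, so the governor would switch frequencies at most twice per inference instead of six times.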
cs.CV Zengqun Zhao, Yanzuo Lu, Ziquan Liu et al. · Mar 22, 2026

Autoregressive video diffusion models struggle with minute-scale generation due to error accumulation in long-horizon rollouts. This paper challenges the assumption that more memory is better, proposing instead to decompose KV-cache conditioning into three functional roles—Sink for global anchors, Tail for recent continuity, and dynamically selected History for mid-range structure. The result is a training-free inference method that improves motion dynamics by 66.8% while cutting attention overhead by roughly 2.6×.

Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
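
The three-role memory split can be sketched as an index-selection routine over the cached frames. The window sizes and the per-frame relevance `scores` used to pick History frames are assumptions; the abstract says History is selected dynamically but does not give the criterion.

```python
def select_memory(num_frames, scores, sink=1, tail=3, history=2):
    """Choose which cached frames to attend to.

    Sink: the first `sink` frames (global anchors).
    Tail: the last `tail` frames (short-term continuity).
    History: the `history` highest-scoring frames in between
             (scores are a hypothetical relevance signal).
    """
    sink_idx = list(range(min(sink, num_frames)))
    tail_idx = list(range(max(num_frames - tail, 0), num_frames))
    taken = set(sink_idx) | set(tail_idx)
    middle = [i for i in range(num_frames) if i not in taken]
    history_idx = sorted(middle, key=lambda i: scores[i], reverse=True)[:history]
    return sorted(taken | set(history_idx))

# Ten cached frames with made-up relevance scores.
scores = [0.0, 0.9, 0.1, 0.8, 0.2, 0.3, 0.1, 0.0, 0.0, 0.0]
selected = select_memory(10, scores)
```

In this example only six of the ten cached frames are kept (frame 0 as Sink, frames 1 and 3 as History, frames 7 to 9 as Tail), which is where the reduction in attention cost comes from.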
cs.CL Shixu Liu · Mar 23, 2026

Weather captioning—generating natural language descriptions from meteorological time series—sits at the intersection of time-series analysis and domain-specific NLG. This paper proposes WeatherTGD, a training-free framework that treats caption refinement as gradient descent in text space: three specialized LLM agents (Statistical, Physics, Meteorology) output textual gradients that are fused via a consensus-aware mechanism and applied iteratively to improve an initial caption. The approach aims to bridge the gap between numerical forecasting and human-interpretable explanations without any model fine-tuning.

Generating interpretable natural language captions from weather time series data remains a significant challenge at the intersection of meteorological science and natural language processing. While recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in time series forecasting and analysis, existing approaches either produce numerical predictions without human-accessible explanations or generate generic descriptions lacking domain-specific depth. We introduce WeatherTGD, a training-free multi-agent framework that reinterprets collaborative caption refinement through the lens of Text Gradient Descent (TGD). Our system deploys three specialized LLM agents including a Statistical Analyst, a Physics Interpreter, and a Meteorology Expert that generate domain-specific textual gradients from weather time series observations. These gradients are aggregated through a novel Consensus-Aware Gradient Fusion mechanism that extracts common signals while preserving unique domain perspectives. The fused gradients then guide an iterative refinement process analogous to gradient descent, where each LLM-generated feedback signal updates the caption toward an optimal solution. Experiments on real-world meteorological datasets demonstrate that WeatherTGD achieves significant improvements in both LLM-based evaluation and human expert evaluation, substantially outperforming existing multi-agent baselines while maintaining computational efficiency through parallel agent execution.
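The refinement loop described above can be sketched in a few lines. This is a toy under stated assumptions: the three agents are simple keyword checks standing in for LLM calls, and `fuse`, `refine`, `make_agent`, and `apply_update` are hypothetical names. The fusion rule (feedback raised by two or more agents first, then single-agent feedback) is only one plausible reading of "extracts common signals while preserving unique domain perspectives", not the paper's Consensus-Aware Gradient Fusion.

```python
from collections import Counter


def fuse(gradients):
    """Order feedback so consensus items come before single-agent items.

    gradients maps an agent name to its list of feedback strings.
    """
    counts = Counter(s for g in gradients.values() for s in set(g))
    consensus = sorted(s for s, c in counts.items() if c >= 2)
    unique = sorted(s for s, c in counts.items() if c == 1)
    return consensus + unique


def refine(caption, agents, apply_update, steps=4):
    """Iteratively update the caption using fused textual feedback."""
    for _ in range(steps):
        gradients = {name: agent(caption) for name, agent in agents.items()}
        fused = fuse(gradients)
        if not fused:  # no agent has feedback left: stop early
            break
        caption = apply_update(caption, fused)
    return caption


def make_agent(required_terms):
    """Stub agent (stand-in for an LLM): asks for any missing term."""
    return lambda caption: [f"add {t}" for t in required_terms if t not in caption]


def apply_update(caption, fused):
    """Stub update (stand-in for an LLM rewrite): apply the top item only."""
    return caption + f" [{fused[0][len('add '):]}]"


agents = {
    "statistical": make_agent(["trend", "range"]),
    "physics": make_agent(["pressure", "trend"]),
    "meteorology": make_agent(["front", "pressure"]),
}
result = refine("Temperatures rise through the week.", agents, apply_update)
```

Because each agent only reads the current caption, the three calls per step are independent and could run in parallel, which matches the efficiency argument in the abstract.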
cs.CV cs.MM Zhiyang Tang, Yiming Zhu, Ruimin Huang et al. · Mar 22, 2026

This paper tackles the Close Small Object Unmixing (CSOU) problem for infrared imagery, where distant clustered targets appear as overlapping mixed spots due to optical diffraction limits. The authors propose DSCSNet, a deep-unfolded network that unrolls the ADMM algorithm with learnable parameters to recover target count, sub-pixel positions, and radiant intensities from mixed spots. The core idea is to replace the traditional ℓ2-norm smoothness terms with strict ℓ1-norm sparsity constraints and add a dynamic thresholding mechanism for scene-adaptive reconstruction.

Due to the limitations of optical lens focal length and detector resolution, distant clustered infrared small targets often appear as mixed spots. The Close Small Object Unmixing (CSOU) task aims to recover the number, sub-pixel positions, and radiant intensities of individual targets from these spots, which is a highly ill-posed inverse problem. Existing methods struggle to balance the rigorous sparsity guarantees of model-driven approaches and the dynamic scene adaptability of data-driven methods. To address this dilemma, this paper proposes a Dynamic Sparse Compressed Sensing Network (DSCSNet), a deep-unfolded network that couples the Alternating Direction Method of Multipliers (ADMM) with learnable parameters. Specifically, we embed a strict $\ell_1$-norm sparsity constraint into the auxiliary variable update step of ADMM to replace the traditional $\ell_2$-norm smoothness-promoting terms, which effectively preserves the discrete energy peaks of small targets. We also integrate a self-attention-based dynamic thresholding mechanism into the reconstruction stage, which adaptively adjusts the sparsification intensity using the sparsity-enhanced information from the iterative process. These modules are jointly optimized end-to-end across the three iterative steps of ADMM. Retaining the physical logic of compressed sensing, DSCSNet achieves robust sparsity induction and scene adaptability, thus enhancing the unmixing accuracy and generalization in complex infrared scenarios. Extensive experiments on the synthetic infrared dataset CSIST-100K demonstrate that DSCSNet outperforms state-of-the-art methods in key metrics such as CSO-mAP and sub-pixel localization error.
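To see why an $\ell_1$ constraint in the auxiliary-variable update preserves isolated peaks better than an $\ell_2$ term, it helps to compare the two textbook closed-form updates. The sketch below uses the standard ADMM proximal steps with fixed, hand-picked `lam` and `rho`; DSCSNet learns its parameters and adds attention-based dynamic thresholding, neither of which is modelled here, and the input vector is made up for illustration.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||z||_1, applied elementwise."""
    return [max(abs(x) - tau, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def l2_shrink(v, lam, rho):
    """Minimizer of (lam/2)*||z||^2 + (rho/2)*||z - v||^2 (uniform scaling)."""
    scale = rho / (rho + lam)
    return [scale * x for x in v]


# Toy signal: two strong peaks (targets) among small background values.
v = [0.05, -0.1, 2.0, 0.08, 1.5, -0.02]
lam, rho = 0.4, 2.0

z_l1 = soft_threshold(v, lam / rho)  # small entries become exactly zero
z_l2 = l2_shrink(v, lam, rho)        # every entry is scaled, none is zeroed
```

The $\ell_1$ update zeroes everything below the threshold `lam / rho` and keeps the two peaks, while the $\ell_2$ update only rescales all entries, so background clutter survives alongside the targets.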
cs.CL Li Wang, Yandong Wang, Xin Yu et al. · Mar 23, 2026

TAMTRL addresses the temporal credit assignment problem in multi-turn RL for long-context document processing. When LLMs process documents chunk-by-chunk with memory updates, standard outcome-only rewards cannot distinguish good from bad intermediate memory updates. The paper proposes using the model itself as a teacher: during training, it provides the model with filtered (relevant-only) chunks and uses the normalized token probabilities of the generated memory as turn-level rewards. This avoids expensive rollouts or external judges while providing fine-grained supervision for each turn.

The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when a long document exceeds the model's context window, it cannot be processed in a single pass, so the model must read it chunk by chunk over multiple turns, updating its memory at each turn. Because supervision is typically provided only by the final outcome, it is difficult to evaluate the quality of the memory update at each turn during multi-turn training. This is a temporal credit assignment problem. Existing approaches, such as LLM-as-a-judge or process reward models, incur substantial computational overhead and suffer from estimation noise. To better address the credit assignment problem in multi-turn memory training, we propose Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning (TAMTRL). TAMTRL leverages relevant documents as teacher signals by aligning them with each turn of model input and assigns rewards through normalized probabilities in a self-supervised manner. This provides fine-grained learning signals for each memory update and improves long-context processing. Experiments with multiple models of varying scales across seven long-context benchmarks show that TAMTRL consistently outperforms strong baselines, demonstrating its effectiveness. Our code is available at https://anonymous.4open.science/r/TAMTRL-F1F8.
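One way to picture a turn-level reward built from "normalized probabilities" is a length-normalized sequence probability: the geometric mean of the per-token probabilities that a teacher pass assigns to the memory text written at that turn. The abstract does not define the normalization, so this choice, the function name, and the toy log-probabilities are assumptions made for illustration only.

```python
import math


def length_normalized_prob(token_logprobs):
    """Geometric mean of token probabilities: exp(mean of log-probs).

    Assumed normalization (not confirmed by the abstract); it keeps long
    and short memory updates on a comparable (0, 1] scale.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Hypothetical teacher log-probs for the memory tokens written at two turns.
turn_logprobs = [
    [-0.1, -0.2, -0.1],        # turn 1: memory the teacher finds likely
    [-1.5, -2.0, -1.0, -1.5],  # turn 2: memory the teacher finds unlikely
]
rewards = [length_normalized_prob(lp) for lp in turn_logprobs]
```

Each turn gets its own scalar, so a poor memory update can be penalized even when the final answer happens to be correct.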
cs.CV Yuqiu Liu, Jialin Song, Marissa Ramirez de Chanlatte et al. · Mar 22, 2026

FluidGaussian addresses a critical gap in 3D reconstruction: methods optimized solely for photometric losses often produce visually plausible but physically implausible geometries that fail in downstream simulations. The paper proposes coupling 3D Gaussian Splatting with incompressible fluid simulation (SPH/DFSPH) to define a simulation-based uncertainty metric—velocity divergence at the fluid-structure interface—and integrates it into active view selection. By reranking next-best-view candidates using this physical signal, the method improves both visual fidelity (PSNR) and physical plausibility (divergence) on synthetic and aerodynamic datasets.

Real objects that inhabit the physical world follow physical laws and thus behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view 2D images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. All these can lead to unphysical and implausible interactions. To address this, we consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object functionality, beyond visual cues? To answer this question, we propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to assess surface quality at high granularity. We define a simulation-based uncertainty metric induced by fluid simulations and integrate it with active learning to prioritize views that improve both visual and physical fidelity. In an empirical evaluation on NeRF Synthetic (Blender), Mip-NeRF 360, and DrivAerNet++, our FluidGaussian method yields up to +8.6% visual PSNR (Peak Signal-to-Noise Ratio) and -62.3% velocity divergence during fluid simulations. Our code is available at https://github.com/delta-lab-ai/FluidGaussian.
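The idea of folding a simulation-based signal into active view selection can be sketched as a two-stage choice: shortlist candidate views by a visual uncertainty score, then rerank the shortlist by a physics-based score. The abstract does not state how the two signals are combined, so this scheme, the function name, and the toy numbers are hypothetical; the real scores would come from the renderer and the fluid simulation.

```python
def pick_next_view(candidates, shortlist_size=3):
    """Choose the next view to capture.

    candidates: list of (view_id, visual_uncertainty, divergence_score)
    tuples, where a higher divergence_score is assumed to mean the view
    covers surface regions with more unphysical fluid behaviour.
    """
    # Stage 1: keep the views the visual criterion is most unsure about.
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:shortlist_size]
    # Stage 2: among those, prefer the view with the strongest physical signal.
    return max(shortlist, key=lambda c: c[2])[0]


candidates = [
    ("v0", 0.9, 0.10),
    ("v1", 0.8, 0.70),
    ("v2", 0.7, 0.30),
    ("v3", 0.2, 0.95),
]
next_view = pick_next_view(candidates)
```

Here `v3` has the largest physical score but is dropped in stage 1, so the reranking only reorders views that the visual criterion already considers informative.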
cs.CV Yasamin Medghalchi, Milad Yazdani, Amirhossein Dabiriaghdam et al. · Mar 22, 2026

Medical Vision-Language Models (Med-VLMs) for ultrasound analysis are vulnerable to subtle prompt variations that mimic real clinical communication patterns. This paper proposes a black-box attack framework using an LLM to generate minimal, clinically plausible text edits guided by Monte Carlo Tree Search (MCTS), requiring no access to the target model's weights or gradients. The study reveals that small adversarial rewrites can drastically degrade diagnostic QA accuracy—raising critical safety concerns for deploying such systems in point-of-care settings where prompt variability is inherent.

Ultrasound is widely used in clinical practice due to its portability, cost-effectiveness, safety, and real-time imaging capabilities. However, image acquisition and interpretation remain highly operator dependent, motivating the development of robust AI-assisted analysis methods. Vision-language models (VLMs) have recently demonstrated strong multimodal reasoning capabilities and competitive performance in medical image analysis, including ultrasound. However, emerging evidence highlights significant concerns about their trustworthiness. In particular, adversarial robustness is critical because Med-VLMs operate via natural-language instructions, rendering prompt formulation a realistic and practically exploitable point of vulnerability. Small variations (typos, shorthand, underspecified requests, or ambiguous wording) can meaningfully shift model outputs. We propose a scalable adversarial evaluation framework that leverages a large language model (LLM) to generate clinically plausible adversarial prompt variants via "humanized" rewrites and minimal edits that mimic routine clinical communication. Using ultrasound multiple-choice question answering benchmarks, we systematically assess the vulnerability of SOTA Med-VLMs to these attacks, examine how attacker LLM capacity influences attack success, analyze the relationship between attack success and model confidence, and identify consistent failure patterns across models. Our results highlight realistic robustness gaps that must be addressed for safe clinical translation. Code will be released publicly following the review process.