Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
482 papers
Trending mixes fresh papers with community signal.
0
cs.CLcs.CY Bros Victor, Dufraisse Evan, Popescu Adrian et al. · Mar 23, 2026

This paper analyzes temporal dynamics in Swiss digital news across French, German, and Italian language regions using a triangulated methodology that combines quantitative NLP with qualitative interpretation. The authors process 1.7 million articles to study how different event types—Brexit, Swiss Wolf, Christmas, and the British Royal Family—are covered across linguistic boundaries, introducing domestication profiles and proximity salience ratios to quantify cultural proximity effects.

Analyzing news coverage in multilingual societies can offer valuable insights into the dynamics of public discourse and the development of collective narratives, yet comprehensive studies that account for linguistic and cultural diversity within national media ecosystems remain limited, particularly in complex contexts such as Switzerland. This paper studies temporal trends in Swiss digital media across the country's three main linguistic regions, French, German, and Italian, using a triangulated methodology that combines quantitative analyses with qualitative insights. We collected and processed over 1.7 million news articles, applying lexical metrics, named entity recognition and Wikidata-based linking, targeted sentiment analysis, and consensus-based change-point detection. To enable principled cross-language comparisons and to connect to theories of domestication and cultural proximity, we derive domestication profiles together with a proximity salience ratio. Our analysis spans thematic, recurrent, and singular events. By integrating quantitative data with qualitative interpretation, we provide new insights into the dynamics of Swiss digital media and demonstrate the usefulness of triangulation in media studies. The findings reveal distinct temporal patterns and highlight how linguistic and cultural contexts influence reporting. Our approach offers a framework applicable to other multilingual or culturally diverse media environments, contributing to a deeper understanding of how news is shaped by linguistic and cultural factors.
0
cs.CV Jae Won Jang, Yeonjin Chang, Wonsik Shin et al. · Mar 23, 2026

4DGS360 addresses the ill-posed challenge of reconstructing dynamic objects from monocular video by tackling a critical failure mode: existing methods rely on 2D-native priors that overfit to visible surfaces and cannot reconstruct occluded regions at extreme viewpoints (>90°). The authors propose AnchorTAP3D, a hybrid 3D tracker that leverages high-confidence 2D track points as spatial-temporal anchors to stabilize long-term tracking and resolve depth ambiguity in occluded areas. Combined with a new iPhone360 benchmark featuring test cameras up to 135° from training views, the method enables coherent 360° 4D reconstruction without diffusion priors.

We introduce 4DGS360, a diffusion-free framework for 360$^{\circ}$ dynamic object reconstruction from casual monocular video. Existing methods often fail to reconstruct consistent 360$^{\circ}$ geometry, as their heavy reliance on 2D-native priors causes initial points to overfit to visible surface in each training view. 4DGS360 addresses this challenge through a advanced 3D-native initialization that mitigates the geometric ambiguity of occluded regions. Our proposed 3D tracker, AnchorTAP3D, produces reinforced 3D point trajectories by leveraging confident 2D track points as anchors, suppressing drift and providing reliable initialization that preserves geometry in occluded regions. This initialization, combined with optimization, yields coherent 360$^{\circ}$ 4D reconstructions. We further present iPhone360, a new benchmark where test cameras are placed up to 135$^{\circ}$ apart from training views, enabling 360$^{\circ}$ evaluation that existing datasets cannot provide. Experiments show that 4DGS360 achieves state-of-the-art performance on the iPhone360, iPhone, and DAVIS datasets, both qualitatively and quantitatively.
0
math.OCmath.PRstat.ML Samy Mekkaoui, Huy\^en Pham, Xavier Warin · Mar 23, 2026

This paper develops a neural operator framework for approximating mappings defined on constrained Wasserstein spaces $\mathcal{M}_\lambda$, consisting of probability measures on $I \times \mathbb{R}^d$ with prescribed marginal $\lambda$ on the label space $I$. The core contribution is the DeepONetCyl architecture, which combines cylindrical moment approximations $\Phi_J(\mu) = (\langle \varphi_1, \mu \rangle, \ldots, \langle \varphi_J, \mu \rangle)$ with a DeepONet-type branch–trunk structure to preserve the marginal constraint. This enables learning of heterogeneous (non-exchangeable) mean-field control problems where agent interactions depend on labels, extending prior neural methods beyond the exchangeable case.

We study the approximation of operators acting on probability measures on a product space with prescribed marginal. Let $I$ be a label space endowed with a reference measure $\lambda$, and define $\cal M_\lambda$ as the set of probability measures on $I\times \mathbb{R}^d$ with first marginal $\lambda$. By disintegration, elements of $\cal M_\lambda$ correspond to families of labeled conditional distributions. Operators defined on this constrained measure space arise naturally in mean-field control problems with heterogeneous, non-exchangeable agents. Our main theoretical result establishes a universal approximation theorem for continuous operators on $\cal M_\lambda$. The proof combines cylindrical approximations of probability measures with DeepONet-type branch-trunk neural architecture, yielding finite-dimensional representations of such operators. We further introduce a sampling strategy for generating training measures in $\cal M_\lambda$, enabling practical learning of such conditional mean-field operators. We apply the method to the numerical resolution of mean-field control problems with heterogeneous interactions, thereby extending previous neural approaches developed for homogeneous (exchangeable) systems. Numerical experiments illustrate the accuracy and computational effectiveness of the proposed framework.
0
cs.CV Meiqi Wu, Zhixin Cai, Fufangchen Zhao et al. · Mar 23, 2026

Omni-WorldBench addresses the gap between passive video generation metrics and active world model evaluation by focusing on interactive response—how actions causally drive state transitions across space and time. It introduces Omni-WorldSuite, a 1,068-prompt hierarchical taxonomy spanning three interaction levels (single-object to global environmental effects), and Omni-Metrics, an agent-based evaluation protocol that aggregates Interaction Effect Fidelity, Generated Video Quality, and Camera-Object Controllability into an adaptive AgenticScore.

Video--based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text--video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentally neglect temporal dynamics. We argue that the future of world modeling lies in 4D generation, which jointly models spatial structure and temporal evolution. In this paradigm, the core capability is interactive response: the ability to faithfully reflect how interaction actions drive state transitions across space and time. Yet no existing benchmark systematically evaluates this critical dimension. To address this gap, we propose Omni--WorldBench, a comprehensive benchmark specifically designed to evaluate the interactive response capabilities of world models in 4D settings. Omni--WorldBench comprises two key components: Omni--WorldSuite, a systematic prompt suite spanning diverse interaction levels and scene types; and Omni--Metrics, an agent-based evaluation framework that quantifies world modeling capabilities by measuring the causal impact of interaction actions on both final outcomes and intermediate state evolution trajectories. We conduct extensive evaluations of 18 representative world models across multiple paradigms. Our analysis reveals critical limitations of current world models in interactive response, providing actionable insights for future research. Omni-WorldBench will be publicly released to foster progress in interactive 4D world modeling.
0
cs.CV Guandong Li, Zhaobin Chu · Mar 23, 2026

AdaEdit tackles the injection dilemma in flow-based image editing, where source feature injection preserves backgrounds but suppresses novel content generation. The authors propose two training-free adaptations: a Progressive Injection Schedule using continuous decay functions (sigmoid, cosine, linear) instead of binary cutoffs, and Channel-Selective Latent Perturbation that applies per-channel AdaIN based on distributional gaps between inverted and random latents. Extensive experiments on PIE-Bench show AdaEdit improves background preservation metrics by 8.7% LPIPS reduction versus ProEdit while maintaining competitive CLIP scores.

Inversion-based image editing in flow matching models has emerged as a powerful paradigm for training-free, text-guided image manipulation. A central challenge in this paradigm is the injection dilemma: injecting source features during denoising preserves the background of the original image but simultaneously suppresses the model's ability to synthesize edited content. Existing methods address this with fixed injection strategies -- binary on/off temporal schedules, uniform spatial mixing ratios, and channel-agnostic latent perturbation -- that ignore the inherently heterogeneous nature of injection demand across both the temporal and channel dimensions. In this paper, we present AdaEdit, a training-free adaptive editing framework that resolves this dilemma through two complementary innovations. First, we propose a Progressive Injection Schedule that replaces hard binary cutoffs with continuous decay functions (sigmoid, cosine, or linear), enabling a smooth transition from source-feature preservation to target-feature generation and eliminating feature discontinuity artifacts. Second, we introduce Channel-Selective Latent Perturbation, which estimates per-channel importance based on the distributional gap between the inverted and random latents and applies differentiated perturbation strengths accordingly -- strongly perturbing edit-relevant channels while preserving structure-encoding channels. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing types) demonstrate that AdaEdit achieves an 8.7% reduction in LPIPS, a 2.6% improvement in SSIM, and a 2.3% improvement in PSNR over strong baselines, while maintaining competitive CLIP similarity. AdaEdit is fully plug-and-play and compatible with multiple ODE solvers including Euler, RF-Solver, and FireFlow. Code is available at https://github.com/leeguandong/AdaEdit
0
cs.CLcs.MA Mohamed Sobhi Jabal (1), Jikai Zhang (2, 3) et al. · Mar 23, 2026

This paper tackles the challenge of automating BT-RADS (Brain Tumor Reporting and Data System) classification for post-treatment glioma MRI surveillance. BT-RADS requires integrating complex information: volumetric tumor changes, medication effects (steroids, bevacizumab), and radiation timing. The authors propose an end-to-end pipeline combining CNN-based tumor segmentation with a multi-agent LLM system to extract clinical variables from unstructured notes and apply algorithmic scoring logic. This matters because manual BT-RADS scoring is error-prone, with prior studies showing substantial inter-reader variability and inconsistent application of clinical context.

The Brain Tumor Reporting and Data System (BT-RADS) standardizes post-treatment MRI response assessment in patients with diffuse gliomas but requires complex integration of imaging trends, medication effects, and radiation timing. This study evaluates an end-to-end multi-agent large language model (LLM) and convolutional neural network (CNN) system for automated BT-RADS classification. A multi-agent LLM system combined with automated CNN-based tumor segmentation was retrospectively evaluated on 509 consecutive post-treatment glioma MRI examinations from a single high-volume center. An extractor agent identified clinical variables (steroid status, bevacizumab status, radiation date) from unstructured clinical notes, while a scorer agent applied BT-RADS decision logic integrating extracted variables with volumetric measurements. Expert reference standard classifications were established by an independent board-certified neuroradiologist. Of 509 examinations, 492 met inclusion criteria. The system achieved 374/492 (76.0%; 95% CI, 72.1%-79.6%) accuracy versus 283/492 (57.5%; 95% CI, 53.1%-61.8%) for initial clinical assessments (+18.5 percentage points; P<.001). Context-dependent categories showed high sensitivity (BT-1b 100%, BT-1a 92.7%, BT-3a 87.5%), while threshold-dependent categories showed moderate sensitivity (BT-3c 74.8%, BT-2 69.2%, BT-4 69.3%, BT-3b 57.1%). For BT-4, positive predictive value was 92.9%. The multi-agent LLM system achieved higher BT-RADS classification agreement with expert reference standard compared to initial clinical scoring, with high accuracy for context-dependent scores and high positive predictive value for BT-4 detection.
0
cs.CV Altamirano-Mu\~niz Emilio Fernando · Mar 22, 2026

This paper presents PhotoBeamSolver, a hybrid system that converts hand-drawn beam diagrams into analytical structural solutions by combining computer vision with large language models. The core idea uses a custom-trained YOLO-based detector to identify supports and loads from images, feeding a symbolic solver that computes shear, moment, and deflection diagrams. While targeted at academic and quick professional verification tasks, the work highlights the challenges of integrating deep learning into safety-critical structural engineering workflows.

This paper presents the development of a documented program capable of solving idealized beam models, such as those commonly used in textbooks and academic exercises, from drawings made by a person. The system is based on computer vision and statistical learning techniques for the detection and visual interpretation of structural elements. Likewise, the main challenges and limitations associated with the integration of computer vision into structural analysis are analyzed, as well as the requirements necessary for its reliable application in the field of civil engineering. In this context, the implementation of the PhotoBeamSolver program is explored, and the current state of computer vision in civil engineering is discussed, particularly in relation to structural analysis, infrastructure inspection, and engineering decision-support systems.
0
stat.MLcs.LG Atticus Rex, Elizabeth Qian, David Peterson · Mar 23, 2026

Multifidelity surrogate modeling aims to leverage cheap low-fidelity simulations to improve predictions of expensive high-fidelity models when training data is scarce. This paper proposes MAGPI, a Gaussian process regression method that augments the high-fidelity input space with features derived from recursively-trained low-fidelity surrogate models. The approach unifies desirable properties from cokriging and autoregressive estimators while allowing non-GP models for low-fidelity levels, achieving superior accuracy and computational efficiency.

Supervised machine learning describes the practice of fitting a parameterized model to labeled input-output data. Supervised machine learning methods have demonstrated promise in learning efficient surrogate models that can (partially) replace expensive high-fidelity models, making many-query analyses, such as optimization, uncertainty quantification, and inference, tractable. However, when training data must be obtained through the evaluation of an expensive model or experiment, the amount of training data that can be obtained is often limited, which can make learned surrogate models unreliable. However, in many engineering and scientific settings, cheaper \emph{low-fidelity} models may be available, for example arising from simplified physics modeling or coarse grids. These models may be used to generate additional low-fidelity training data. The goal of \emph{multifidelity} machine learning is to use both high- and low-fidelity training data to learn a surrogate model which is cheaper to evaluate than the high-fidelity model, but more accurate than any available low-fidelity model. This work proposes a new multifidelity training approach for Gaussian process regression which uses low-fidelity data to define additional features that augment the input space of the learned model. The approach unites desirable properties from two separate classes of existing multifidelity GPR approaches, cokriging and autoregressive estimators. Numerical experiments on several test problems demonstrate both increased predictive accuracy and reduced computational cost relative to the state of the art.
0
cs.CV Yanglin Deng, Tianyang Xu, Chunyang Cheng et al. · Mar 23, 2026

This paper challenges the long-held assumption that infrared and visible image fusion (IVIF) requires strictly paired training data. The authors propose UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP), demonstrating that pixel-level self-supervision enables training on unaligned cross-modal combinations. By reformulating the maximum likelihood objective to treat infrared and visible images as independent variables, they show that a base dataset of $N$ pairs can be expanded to $N^2$ trainable combinations, potentially reducing collection costs while improving generalization.

Infrared and visible image fusion(IVIF) combines complementary modalities while preserving natural textures and salient thermal signatures. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting generalisation performance. To this end, this work challenges the necessity of Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100$\times$ larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies. The code is available at \href{https://github.com/yanglinDeng/IVIF_unpair}{\textcolor{blue}{https://github.com/yanglinDeng/IVIF\_unpair}}.
0
cs.CVcs.GRcs.RO Hang Dai, Hongwei Fan, Han Zhang et al. · Mar 23, 2026

Articulated object reconstruction typically requires either multi-view capture of discrete states or monocular video with a strict static-base-part assumption, limiting practical deployment. FreeArtGS introduces a "free-moving" setting where both joint angles and object poses vary arbitrarily during capture, using only a monocular RGB-D video. The method combines motion-based part segmentation via point tracking priors with joint estimation and 3D Gaussian Splatting optimization to jointly reconstruct geometry, appearance, and articulation.

The increasing demand for augmented reality and robotics is driving the need for articulated object reconstruction with high scalability. However, existing settings for reconstructing from discrete articulation states or casual monocular videos require non-trivial axis alignment or suffer from insufficient coverage, limiting their applicability. In this paper, we introduce FreeArtGS, a novel method for reconstructing articulated objects under free-moving scenario, a new setting with a simple setup and high scalability. FreeArtGS combines free-moving part segmentation with joint estimation and end-to-end optimization, taking only a monocular RGB-D video as input. By optimizing with the priors from off-the-shelf point-tracking and feature models, the free-moving part segmentation module identifies rigid parts from relative motion under unconstrained capture. The joint estimation module calibrates the unified object-to-camera poses and recovers joint type and axis robustly from part segmentation. Finally, 3DGS-based end-to-end optimization is implemented to jointly reconstruct visual textures, geometry, and joint angles of the articulated object. We conduct experiments on two benchmarks and real-world free-moving articulated objects. Experimental results demonstrate that FreeArtGS consistently excels in reconstructing free-moving articulated objects and remains highly competitive in previous reconstruction settings, proving itself a practical and effective solution for realistic asset generation. The project page is available at: https://freeartgs.github.io/
0
cs.ROcs.CV Zhiyan Cao, Zhengxi Wu, Yiwei Wang et al. · Mar 22, 2026

Cardiac ultrasound view acquisition is notoriously operator-dependent, limiting reproducibility and access. This paper proposes an anatomical prior (AP)-driven framework that unifies cardiac structure segmentation with autonomous probe adjustment. The core innovation is a spatial-relation graph (SRG) module that injects spatial-topological constraints into YOLO-based segmentation, coupled with an RL formulation where states and rewards are built from quantifiable anatomical features drawn from Gaussian priors. The work matters because it offers an interpretable alternative to black-box end-to-end methods, potentially enabling zero-shot sim-to-real deployment for robotic echocardiography.

Cardiac ultrasound diagnosis is critical for cardiovascular disease assessment, but acquiring standard views remains highly operator-dependent. Existing medical segmentation models often yield anatomically inconsistent results in images with poor textural differentiation between distinct feature classes, while autonomous probe adjustment methods either rely on simplistic heuristic rules or black-box learning. To address these issues, our study proposed an anatomical prior (AP)-driven framework integrating cardiac structure segmentation and autonomous probe adjustment for standard view acquisition. A YOLO-based multi-class segmentation model augmented by a spatial-relation graph (SRG) module is designed to embed AP into the feature pyramid. Quantifiable anatomical features of standard views are extracted. Their priors are fitted to Gaussian distributions to construct probabilistic APs. The probe adjustment process of robotic ultrasound scanning is formalized as a reinforcement learning (RL) problem, with the RL state built from real-time anatomical features and the reward reflecting the AP matching. Experiments validate the efficacy of the framework. The SRG-YOLOv11s improves mAP50 by 11.3% and mIoU by 6.8% on the Special Case dataset, while the RL agent achieves a 92.5% success rate in simulation and 86.7% in phantom experiments.
0
cs.CV Hanze Jia, Chunshi Wang, Yuxiao Yang et al. · Mar 23, 2026

3D fragment reassembly becomes challenging at scale because incorrect contact adjacencies trigger cascading failures. This paper proposes SARe, a generative framework that explicitly models contact structure by jointly predicting fracture-surface tokens and inter-fragment adjacency graphs, paired with an inference-time refinement stage that anchors reliable substructures to correct uncertain regions. The work demonstrates state-of-the-art results across synthetic and real fracture datasets, with notable improvements in the many-fragment ($K$) regime.

3D fragment reassembly aims to recover the rigid poses of unordered fragment point clouds or meshes in a common object coordinate system to reconstruct the complete shape. The problem becomes particularly challenging as the number of fragments grows, since the target shape is unknown and fragments provide weak semantic cues. Existing end-to-end approaches are prone to cascading failures due to unreliable contact reasoning, most notably inaccurate fragment adjacencies. To address this, we propose Structure-Aware Reassembly (SARe), a generative framework with SARe-Gen for Euclidean-space assembly generation and SARe-Refine for inference-time refinement, with explicit contact modeling. SARe-Gen jointly predicts fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies. It adopts a query-point-based conditioning scheme and extracts aligned local geometric tokens at query locations from a frozen geometry encoder, yielding queryable structural representations without additional structural pretraining. We further introduce an inference-time refinement stage, SARe-Refine. By verifying candidate contact edges with geometric-consistency checks, it selects reliable substructures and resamples the remaining uncertain regions while keeping verified parts fixed, leading to more stable and consistent assemblies in the many-fragment regime. We evaluate SARe across three settings, including synthetic fractures, simulated fractures from scanned real objects, and real physically fractured scans. The results demonstrate state-of-the-art performance, with more graceful degradation and higher success rates as the fragment count increases in challenging large-scale reassembly.
0
eess.IVcs.CV Jiahui Song, Sagar Shrestha, Xiao Fu · Mar 23, 2026

This paper tackles unregistered hyperspectral-multispectral image fusion (HMF), where spatially misaligned images with partial overlap must be mutually super-resolved without training data or co-registration. The authors propose FRESCO, a two-stage unsupervised framework that uses coupled block-term tensor decomposition (BTD) for MSI spectral super-resolution and latent-space adversarial learning for HSI spatial super-resolution. The work is notable for offering the first theoretical recoverability guarantees in the unregistered setting, addressing a practically important gap in remote sensing.

This paper addresses the fusion of a pair of spatially unregistered hyperspectral image (HSI) and multispectral image (MSI) covering roughly overlapping regions. HSIs offer high spectral but low spatial resolution, while MSIs provide the opposite. The goal is to integrate their complementary information to enhance both HSI spatial resolution and MSI spectral resolution. While hyperspectral-multispectral fusion (HMF) has been widely studied, the unregistered setting remains challenging. Many existing methods focus solely on MSI super-resolution, leaving HSI unchanged. Supervised deep learning approaches were proposed for HSI super-resolution, but rely on accurate training data, which is often unavailable. Moreover, theoretical analyses largely address the co-registered case, leaving unregistered HMF poorly understood. In this work, an unsupervised framework is proposed to simultaneously super-resolve both MSI and HSI. The method integrates coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution. Theoretical guarantees on the recoverability of the super-resolution MSI and HSI are established under reasonable generative models -- providing, to our best knowledge, the first such insights for unregistered HMF. The approach is validated on semi-real and real HSI-MSI pairs across diverse conditions.
0
cs.LG Dharshan Kumaran, Nathaniel Daw, Simon Osindero et al. · Mar 23, 2026

This paper investigates whether large language models exhibit metacognitive control—specifically, whether they use internal confidence signals to guide abstention decisions (knowing when to answer versus withhold responses). The authors develop a rigorous four-phase paradigm combining behavioral analysis, activation steering, and computational modeling to demonstrate that abstention arises from a two-stage confidence-decision pathway involving confidence representation formation followed by threshold-based policy implementation. Their findings suggest that LLMs deploy native confidence signals in a structured manner paralleling biological metacognition, with substantial implications for safe AI deployment.

Metacognition -- the ability to assess one's own cognitive performance -- is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention paradigm.Phase 1 established internal confidence estimates in the absence of an abstention option. Phase 2 revealed that LLMs apply implicit thresholds to these estimates when deciding to answer or abstain. Confidence emerged as the dominant predictor of behavior, with effect sizes an order of magnitude larger than knowledge retrieval accessibility (RAG scores) or surface-level semantic features. Phase 3 provided causal evidence through activation steering: manipulating internal confidence signals correspondingly shifted abstention rates. Finally, Phase 4 demonstrated that models can systematically vary abstention policies based on instructed thresholds.Our findings indicate that abstention arises from the joint operation of internal confidence representations and threshold-based policies, mirroring the two-stage metacognitive control found in biological systems. This capacity is essential as LLMs transition into autonomous agents that must recognize their own uncertainty to decide when to act or seek help.
0
cs.CV Masoumeh Sharafi, Muhammad Osama Zeeshan, Soufiane Belharbi et al. · Mar 22, 2026

Video facial expression recognition (FER) suffers from severe subject-specific distribution shifts that degrade CLIP model performance at test time. This paper proposes TTA-CaP, a gradient-free test-time adaptation method that personalizes models using three coordinated caches—a fixed source-domain prototype cache, a dynamic positive target cache for reliable samples, and a negative cache for uncertain predictions—coupled with a tri-gate filtering mechanism to prevent error accumulation.

Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions. Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.
0
cs.CV Young-Seo Chang, Yatong An, Jae-Sang Hyun · Mar 22, 2026

DepthTCM tackles depth map compression by combining physics-inspired Multiwavelength Depth (MWD) encoding—mapping depth to sinusoidal 3-channel images—with global 4-bit quantization and a Transformer-CNN mixed learned codec. The core claim is that this hybrid approach reshapes depth statistics into a form amenable to modern learned image compression, achieving 60% bitrate reduction over prior MWD methods while maintaining >99% geometric accuracy.

We propose DepthTCM, a physics-aware end-to-end framework for depth map compression. In our framework of DepthTCM, the high-bit depth map is first converted to a conventional 3-channel image representation losslessly using a method inspired by a physical sinusoidal fringe pattern based profiliometry system, then the 3-channel color image is encoded and decoded by a recently developed Transformer-CNN mixed neural network architecture. Specifically, DepthTCM maps depth to a smooth 3-channel using multiwavelength depth (MWD) encoding, then globally quantized the MWD encoded representation to 4 bits per channel to reduce entropy, and finally is compressed using a learned codec that combines convolutional and Transformer layers. Experiment results demonstrate the advantage of our proposed method. On Middlebury 2014, DepthTCM reaches 0.307 bpp while preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG. We additionally demonstrate practical efficiency and scalability, reporting average end-to-end inference times of 41.48 ms (encoder) and 47.45 ms (decoder) on the ScanNet++ iPhone RGB-D subset. Ablations validate our design choices: relative to 8-bit quantization, 4-bit quantization reduces bitrate by 66% while maintaining comparable reconstruction quality, with only a marginal 0.68 dB PSNR change and a 0.04% accuracy difference. In addition, Transformer--CNN blocks further improve PSNR by up to 0.75 dB over CNN-only architectures.
0
cs.CV Jianlin Chen, Gongyang Li, Zhijiang Zhang et al. · Mar 23, 2026

The paper addresses the quadratic complexity of transformer attention and limited local detail extraction in RGB-D Salient Object Detection (SOD). It proposes STENet, which introduces superpixels as intermediate tokens to reduce computational overhead while preserving structural coherence. The core idea replaces global pixel-to-pixel attention with two modules: one for pixel-to-superpixel global enhancement and another for intra-superpixel local refinement, aiming to balance efficiency and accuracy.

Transformer-based methods for RGB-D Salient Object Detection (SOD) have gained significant interest, owing to the transformer's exceptional capacity to capture long-range pixel dependencies. Nevertheless, current RGB-D SOD methods face challenges, such as the quadratic complexity of the attention mechanism and the limited local detail extraction. To overcome these limitations, we propose a novel Superpixel Token Enhancing Network (STENet), which introduces superpixels into cross-modal interaction. STENet follows the two-stream encoder-decoder structure. Its cores are two tailored superpixel-driven cross-modal interaction modules, responsible for global and local feature enhancement. Specifically, we update the superpixel generation method by expanding the neighborhood range of each superpixel, allowing for flexible transformation between pixels and superpixels. With the updated superpixel generation method, we first propose the Superpixel Attention Global Enhancing Module to model the global pixel-to-superpixel relationship rather than the traditional global pixel-to-pixel relationship, which can capture region-level information and reduce computational complexity. We also propose the Superpixel Attention Local Refining Module, which leverages pixel similarity within superpixels to filter out a subset of pixels (i.e., local pixels) and then performs feature enhancement on these local pixels, thereby capturing concerned local details. Furthermore, we fuse the globally and locally enhanced features along with the cross-scale features to achieve comprehensive feature representation. Experiments on seven RGB-D SOD datasets reveal that our STENet achieves competitive performance compared to state-of-the-art methods. The code and results of our method are available at https://github.com/Mark9010/STENet.
0
cs.SEcs.LG Idan Amit · Mar 22, 2026

This paper investigates which static analysis alert removals actually reduce bug rates—a critical question since developers constantly face noisy linting warnings. The author employs three complementary methods: a randomized controlled trial with 521 manual interventions, labeling functions to identify intervention-like events in 8,245 natural commits, and supervised learning to predict beneficial removals. The core finding is that removing complexity alerts (too-many-branches, too-many-nested-blocks) via method extraction reduces bug tendency by 4.1–5.5 percentage points, offering evidence-based guidance for prioritizing refactoring efforts.

Context: Static analysis captures software engineering knowledge and alerts on possibly problematic patterns. Previous work showed that they indeed have predictive power for various problems. However, the impact of removing the alerts is unclear. Aim: We would like to evaluate the impact of alert removals on code complexity and the tendency to bugs. Method: We evaluate the impact of removing alerts using three complementary methods. 1. We conducted a randomized controlled trial and built a dataset of 521 manual alert-removing interventions 2. We profiled intervention-like events using labeling functions. We applied these labeling functions to code commits, found intervention-like natural events, and used them to analyze the impact on the tendency to bugs. 3. We built a dataset of 8,245 alert removals, more than 15 times larger than our dataset of manual interventions. We applied supervised learning to the alert removals, aiming to predict their impact on the tendency to bugs. Results: We identified complexity-reducing interventions that reduce the probability of future bugs. Such interventions are relevant to 33\% of Python files and might reduce the tendency to bugs by 5.5 percentage points. Conclusions: We presented methods to evaluate the impact of interventions. The methods can identify a large number of natural interventions that are highly needed in causality research in many domains.
0
cs.CLcs.CV Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta et al. · Mar 22, 2026

BanglaVerse introduces a culturally grounded benchmark evaluating vision-language models on Bengali culture across standard Bangla, four historically linked languages, and five regional dialects. Built from 1,152 manually curated images expanded to ~32.3K artifacts, the work reveals that standard Bangla evaluation substantially overestimates model capabilities compared to dialectal settings. The core finding—that missing cultural knowledge, not visual grounding alone, is the primary bottleneck—challenges conventional multimodal evaluation practices for underrepresented languages.

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.
0
cs.CV Qingdong He, Chaoyi Wang, Peng Tang et al. · Mar 23, 2026

Video subtitle removal traditionally requires expensive per-frame mask annotations and external detection modules during both training and inference. CLEAR introduces a two-stage mask-free framework that decouples prior extraction (via self-supervised disentangled feature learning) from generative refinement (via LoRA-adapted diffusion with adaptive weighting). The method claims to train only 0.77% of base model parameters while achieving +6.77dB PSNR gains and zero-shot generalization across six languages without ground-truth masks at inference.

Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by + 6.77dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.