Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
186 papers in cs.CV
Trending mixes fresh papers with community signal.
0
cs.CV Nikolay Kormushev, Josip \v{S}ari\'c, Matej Kristan · Mar 22, 2026

Open-vocabulary panoptic segmentation aims to recognize and segment arbitrary object categories beyond training vocabularies, but suffers from two coupled failures: mask transformers discard proposals for unseen categories due to biased objectness scoring, while CLIP's global image-text alignment poorly localizes to image regions. OVRCOAT addresses both via COAT—which adjusts foreground probabilities using CLIP's classification confidence to rescue out-of-vocabulary masks—and OVR, a memory-efficient fine-tuning protocol for region-text alignment. The approach achieves +5.5% PQ gains on ADE20K and reduces training memory by 56% versus prior SOTA.

Open-vocabulary panoptic segmentation remains hindered by two coupled issues: (i) mask selection bias, where objectness heads trained on closed vocabularies suppress masks of categories not observed in training, and (ii) limited regional understanding in vision-language models such as CLIP, which were optimized for global image classification rather than localized segmentation. We introduce OVRCOAT, a simple, modular framework that tackles both. First, a CLIP-conditioned objectness adjustment (COAT) updates background/foreground probabilities, preserving high-quality masks for out-of-vocabulary objects. Second, an open-vocabulary mask-to-text refinement (OVR) strengthens CLIP's region-level alignment to improve classification of both seen and unseen classes with markedly lower memory cost than prior fine-tuning schemes. The two components combine to jointly improve objectness estimation and mask recognition, yielding consistent panoptic gains. Despite its simplicity, OVRCOAT sets a new state of the art on ADE20K (+5.5% PQ) and delivers clear gains on Mapillary Vistas and Cityscapes (+7.1% and +3% PQ, respectively). The code is available at: https://github.com/nickormushev/OVRCOAT
0
cs.CV Efthymios Tsaprazlis, Tiantian Feng, Anil Ramakrishna et al. · Mar 23, 2026

Existing visual privacy benchmarks treat privacy as a binary property, but this work argues that privacy is fundamentally compositional: benign attributes in isolation can combine to create severe violations. The authors introduce the Compositional Privacy Risk Taxonomy (CPRT), a four-level framework aligned with regulations like GDPR and HIPAA that assigns continuous severity scores based on attribute interactions. They construct a dataset of 6,736 images annotated for 22 privacy attributes and evaluate frontier vision-language models, finding that while structured taxonomic guidance improves alignment, models systematically underestimate composition-driven risks.

Existing visual privacy benchmarks largely treat privacy as a binary property, labeling images as private or non-private based on visible sensitive content. We argue that privacy is fundamentally compositional. Attributes that are benign in isolation may combine to produce severe privacy violations. We introduce the Compositional Privacy Risk Taxonomy (CPRT), a regulation-aware framework that organizes visual attributes according to standalone identifiability and compositional harm potential. CPRT defines four graded severity levels and is paired with an interpretable scoring function that assigns continuous privacy severity scores. We further construct a taxonomy-aligned dataset of 6.7K images and derive ground-truth compositional risk scores. By evaluating frontier and open-weight VLMs we find that frontier models align well with compositional severity when provided structured guidance, but systematically underestimate composition-driven risks. Smaller models struggle to internalize graded privacy reasoning. To bridge this gap, we introduce a deployable 8B supervised fine-tuned (SFT) model that closely matches frontier-level performance on compositional privacy assessment.
0
cs.CVcs.AI Ziyi Wang, Xinshun Wang, Shuang Chen et al. · Mar 23, 2026

UniMotion addresses the fragmentation in human motion modeling by unifying motion, text, and RGB understanding/generation within a single 1.5B parameter architecture. Unlike prior work relying on discrete tokenization or handling only partial modality subsets, it treats motion as a continuous first-class modality via a Cross-Modal Aligned Motion VAE (CMA-VAE). The framework introduces Dual-Posterior KL Alignment to distill visual semantics into motion representations without requiring images at inference, and Latent Reconstruction Alignment to bootstrap the motion pathway through dense self-supervision before sparse text calibration.

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) and predominantly rely on discrete tokenization, which introduces quantization errors and disrupts temporal continuity. UniMotion overcomes both limitations through a core principle: treating motion as a first-class continuous modality on equal footing with RGB. A novel Cross-Modal Aligned Motion VAE (CMA-VAE) and symmetric dual-path embedders construct parallel continuous pathways for Motion and RGB within a shared LLM backbone. To inject visual-semantic priors into motion representations without requiring images at inference, we propose Dual-Posterior KL Alignment (DPA), which distills a vision-fused encoder's richer posterior into the motion-only encoder. To address the cold-start problem -- where text supervision alone is too sparse to calibrate the newly introduced motion pathway -- we further propose Latent Reconstruction Alignment (LRA), a self-supervised pre-training strategy that uses dense motion latents as unambiguous conditions to co-calibrate the embedder, backbone, and flow head, establishing a stable motion-aware foundation for all downstream tasks. UniMotion achieves state-of-the-art performance across seven tasks spanning any-to-any understanding, generation, and editing among the three modalities, with especially strong advantages on cross-modal compositional tasks.
0
cs.CVcs.MM Guowei Tang, Tianwen Qian, Huanran Zheng et al. · Mar 23, 2026

StreamingEval introduces a unified evaluation framework for Video-LLMs under realistic streaming constraints, moving beyond offline benchmarks to assess continuous, real-time video understanding with limited memory. The protocol enforces a fixed-capacity memory bank and jointly measures encoding throughput (MaxFPS), decoding latency (TTFT), memory usage, and task accuracy via a composite StreamingScore. Experiments reveal that current "online" models often fail under strict streaming constraints, while offline models adapted with FIFO memory banks frequently outperform specialized streaming architectures at the cost of higher resource consumption.

Real-time, continuous understanding of visual signals is essential for real-world interactive AI applications, and poses a fundamental system-level challenge. Existing research on streaming video understanding, however, typically focuses on isolated aspects such as question-answering accuracy under limited visual context or improvements in encoding efficiency, while largely overlooking practical deployability under realistic resource constraints. To bridge this gap, we introduce StreamingEval, a unified evaluation framework for assessing the streaming video understanding capabilities of Video-LLMs under realistic constraints. StreamingEval benchmarks both mainstream offline models and recent online video models under a standardized protocol, explicitly characterizing the trade-off between efficiency, storage and accuracy. Specifically, we adopt a fixed-capacity memory bank to normalize accessible historical visual context, and jointly evaluate visual encoding efficiency, text decoding latency, and task performance to quantify overall system deployability. Extensive experiments across multiple datasets reveal substantial gaps between current Video-LLMs and the requirements of realistic streaming applications, providing a systematic basis for future research in this direction. Codes will be released at https://github.com/wwgTang-111/StreamingEval1.
0
cs.CV Injae Kim, Chaehyeon Kim, Minseong Bae et al. · Mar 22, 2026

F4Splat tackles inefficient Gaussian allocation in feed-forward 3D Gaussian Splatting (3DGS), where existing methods uniformly assign Gaussians per pixel or voxel, causing redundancy and fixed budgets. The core idea is a learnable densification score that predicts spatial regions needing additional Gaussians based on geometric complexity and multi-view overlap, enabling adaptive allocation and explicit budget control without retraining. This matters because it delivers compact scene representations—using 10–28% of the Gaussians of prior work—while maintaining or improving rendering fidelity.

Feed-forward 3D Gaussian Splatting methods enable single-pass reconstruction and real-time rendering. However, they typically adopt rigid pixel-to-Gaussian or voxel-to-Gaussian pipelines that uniformly allocate Gaussians, leading to redundant Gaussians across views. Moreover, they lack an effective mechanism to control the total number of Gaussians while maintaining reconstruction fidelity. To address these limitations, we present F4Splat, which performs Feed-Forward predictive densification for Feed-Forward 3D Gaussian Splatting, introducing a densification-score-guided allocation strategy that adaptively distributes Gaussians according to spatial complexity and multi-view overlap. Our model predicts per-region densification scores to estimate the required Gaussian density and allows explicit control over the final Gaussian budget without retraining. This spatially adaptive allocation reduces redundancy in simple regions and minimizes duplicate Gaussians across overlapping views, producing compact yet high-quality 3D representations. Extensive experiments demonstrate that our model achieves superior novel-view synthesis performance compared to prior uncalibrated feed-forward methods, while using significantly fewer Gaussians.
0
cs.ROcs.CV Yuheng Ji, Yuyang Liu, Huajie Tan et al. · Mar 23, 2026

PRM-as-a-Judge addresses the fundamental limitation of binary success metrics in robotic manipulation by repurposing Process Reward Models (PRMs) as dense evaluators. The paper introduces the OPD (Outcome–Process–Diagnosis) metric system, which decomposes execution quality via a task-aligned progress potential $\Phi(x_t) \in [0,1]$ induced from trajectory videos. Validated on the RoboPulse benchmark and RoboTwin policy auditing, the work shows that trajectory-supervised PRMs achieve superior micro-resolution compared to foundation models, revealing behavioral signatures invisible to outcome-only evaluation.

Current robotic evaluation is still largely dominated by binary success rates, which collapse rich execution processes into a single outcome and obscure critical qualities such as progress, efficiency, and stability. To address this limitation, we propose PRM-as-a-Judge, a dense evaluation paradigm that leverages Process Reward Models (PRMs) to audit policy execution directly from trajectory videos by estimating task progress from observation sequences. Central to this paradigm is the OPD (Outcome-Process-Diagnosis) metric system, which explicitly formalizes execution quality via a task-aligned progress potential. We characterize dense robotic evaluation through two axiomatic properties: macro-consistency, which requires additive and path-consistent aggregation, and micro-resolution, which requires sensitivity to fine-grained physical evolution. Under this formulation, potential-based PRM judges provide a natural instantiation of dense evaluation, with macro-consistency following directly from the induced scalar potential. We empirically validate the micro-resolution property using RoboPulse, a diagnostic benchmark specifically designed for probing micro-scale progress discrimination, where several trajectory-trained PRM judges outperform discriminative similarity-based methods and general-purpose foundation-model judges. Finally, leveraging PRM-as-a-Judge and the OPD metric system, we conduct a structured audit of mainstream policy paradigms across long-horizon tasks, revealing behavioral signatures and failure modes that are invisible to outcome-only metrics.
0
cs.CV Shenghan Chen, Yiming Liu, Yanzhen Wang et al. · Mar 22, 2026

This paper addresses long-tailed (LT) learning by proposing that the head-tail performance trade-off stems from "tail performance degradation"—where models overfit to head classes and forget tail classes. The core idea reframes LT learning as continual learning, using a Grouped Knowledge Preservation (GKP) module to maintain class-specific optimal parameters and a Grouped Sharpness Aware (GSA) module to find flatter minima. The method operates without external data or pre-trained models, showing improvements on CIFAR-LT, ImageNet-LT, and iNaturalist benchmarks.

Balancing performance trade-off on long-tail (LT) data distributions remains a long-standing challenge. In this paper, we posit that this dilemma stems from a phenomenon called "tail performance degradation" (the model tends to severely overfit on head classes while quickly forgetting tail classes) and pose a solution from a loss landscape perspective. We observe that different classes possess divergent convergence points in the loss landscape. Besides, this divergence is aggravated when the model settles into sharp and non-robust minima, rather than a shared and flat solution that is beneficial for all classes. In light of this, we propose a continual learning inspired framework to prevent "tail performance degradation". To avoid inefficient per-class parameter preservation, a Grouped Knowledge Preservation module is proposed to memorize group-specific convergence parameters, promoting convergence towards a shared solution. Concurrently, our framework integrates a Grouped Sharpness Aware module to seek flatter minima by explicitly addressing the geometry of the loss landscape. Notably, our framework requires neither external training samples nor pre-trained models, facilitating the broad applicability. Extensive experiments on four benchmarks demonstrate significant performance gains over state-of-the-art methods. The code is available at:https://gkp-gsa.github.io/.
0
cs.CV Xiaoshan Wu, Xiaoyang Lyu, Yifei Yu et al. · Mar 22, 2026

This paper introduces a new computer vision task called Anytime Interframe Semantic Segmentation: predicting dense semantic segmentation at arbitrary timestamps between low-frame-rate RGB frames using only a past frame and asynchronous event data. The core idea is feature propagation via event-driven motion fields rather than direct multi-modal fusion. The method is motivated by the perceptual gaps created by LFR cameras in high-speed autonomous driving scenarios, where critical events (e.g., pedestrians entering paths) may be missed between frames.

Dense semantic segmentation in dynamic environments is fundamentally limited by the low-frame-rate (LFR) nature of standard cameras, which creates critical perceptual gaps between frames. To solve this, we introduce Anytime Interframe Semantic Segmentation: a new task for predicting segmentation at any arbitrary time using only a single past RGB frame and a stream of asynchronous event data. This task presents a core challenge: how to robustly propagate dense semantic features using a motion field derived from sparse and often noisy event data, all while mitigating feature degradation in highly dynamic scenes. We propose LiFR-Seg, a novel framework that directly addresses these challenges by propagating deep semantic features through time. The core of our method is an uncertainty-aware warping process, guided by an event-driven motion field and its learned, explicit confidence. A temporal memory attention module further ensures coherence in dynamic scenarios. We validate our method on the DSEC dataset and a new high-frequency synthetic benchmark (SHF-DSEC) we contribute. Remarkably, our LFR system achieves performance (73.82% mIoU on DSEC) that is statistically indistinguishable from an HFR upper-bound (within 0.09%) that has full access to the target frame. This work presents a new, efficient paradigm for achieving robust, high-frame-rate perception with low-frame-rate hardware. Project Page: https://candy-crusher.github.io/LiFR_Seg_Proj/#; Code: https://github.com/Candy-Crusher/LiFR-Seg.git.
0
cs.CV Bin Hu, Zipeng Qi, Guoxi Huang et al. · Mar 22, 2026

Single-view reference-to-video methods struggle to preserve identity when faces rotate through large angles. This paper proposes Mv2ID, a multi-view conditioning framework that uses region-masking and a decoupled positional encoding scheme to prevent view-dependent copy-paste artifacts without requiring expensive cross-paired training data. The work is relevant for digital character creation and visual effects where identity must remain consistent across extreme viewpoints.

Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing additional reference images exacerbates the \textit{copy-paste} problem, particularly the \textbf{\textit{view-dependent copy-paste}} artifact, which reduces facial motion naturalness. Although cross-paired data can alleviate this issue, collecting such data is costly. To balance the consistency and naturalness, we propose $\mathrm{Mv}^2\mathrm{ID}$, a multi-view conditioned framework under in-paired supervision. We introduce a region-masking training strategy to prevent shortcut learning and extract essential identity features by encouraging the model to aggregate complementary identity cues across views. In addition, we design a reference decoupled-RoPE mechanism that assigns distinct positional encoding to video and conditioning tokens for better modeling of their heterogeneous properties. Furthermore, we construct a large-scale dataset with diverse facial-angle variations and propose dedicated evaluation metrics for identity consistency and motion naturalness. Extensive experiments demonstrate that our method significantly improves identity consistency while maintaining motion naturalness, outperforming existing approaches trained with cross-paired data.
0
cs.CV Detao Bai, Shimin Yao, Weixuan Chen et al. · Mar 23, 2026

The paper addresses the problem of identifying "Who said what and when" in multi-speaker video conversations, which current Omni-modal LLMs fail at due to sparse visual sampling (1-2 fps) and "shortcut learning" on visual biases. The authors introduce VR-SDR (Visual-Registered Speaker Diarization and Recognition), a rigorous benchmark that forces models to bind identities from natural language descriptions without visual shortcuts. They propose HumanOmni-Speaker, featuring a Visual Delta Encoder that samples video at 25 fps yet compresses inter-frame motion residuals into only 6 tokens per frame to capture fine-grained visemes while avoiding token explosion.

While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer ``Who said what and when.'' Current models suffer from an ``illusion of competence'' -- they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.
0
cs.CV Thomas Mendelson, Joshua Francois, Galit Lahav et al. · Mar 22, 2026

Instance segmentation in dense microscopy images requires separating tightly packed, touching cells—a task where binary masks and pixel-wise losses often merge adjacent instances. This paper proposes predicting continuous signed distance functions (SDFs) mapped to probabilistic segmentations via a learnable sigmoid, trained with a differentiable Modified Hausdorff Distance (MHD) loss. The approach eliminates the need for interactive prompting or watershed post-processing while aiming to improve boundary fidelity in high-throughput cellular imaging.

Accurate delineation of individual cells in microscopy videos is essential for studying cellular dynamics, yet separating touching or overlapping instances remains a persistent challenge. Although foundation-model for segmentation such as SAM have broadened the accessibility of image segmentation, they still struggle to separate nearby cell instances in dense microscopy scenes without extensive prompting. We propose a prompt-free, boundary-aware instance segmentation framework that predicts signed distance functions (SDFs) instead of binary masks, enabling smooth and geometry-consistent modeling of cell contours. A learned sigmoid mapping converts SDFs into probability maps, yielding sharp boundary localization and robust separation of adjacent instances. Training is guided by a unified Modified Hausdorff Distance (MHD) loss that integrates region- and boundary-based terms. Evaluations on both public and private high-throughput microscopy datasets demonstrate improved boundary accuracy and instance-level performance compared to recent SAM-based and foundation-model approaches. Source code is available at: https://github.com/ThomasMendelson/BAISeg.git
0
cs.CV Lev Ayzenberg, Shady Abu-Hussein, Raja Giryes et al. · Mar 23, 2026

MRI acquisition is inherently slow due to sequential k-space sampling. This paper proposes TRUST-MRI, an active sampling framework that leverages discrete anatomical tokens from the pretrained MedITok tokenizer and a latent Transformer to guide measurement selection. The core innovation uses token prediction entropy as an uncertainty signal, introducing two policies: Latent Entropy Selection (LES) projects patch-wise entropy to k-space to select lines, while Gradient-based Entropy Optimization (GEO) uses gradients of total entropy with respect to input measurements. The approach trades pixel-wise fidelity for perceptual quality and computational efficiency, achieving superior feature-based metrics while running at 0.97 fps compared to 0.01 fps for diffusion-based active methods.

Full data acquisition in MRI is inherently slow, which limits clinical throughput and increases patient discomfort. Compressed Sensing MRI (CS-MRI) seeks to accelerate acquisition by reconstructing images from under-sampled k-space data, requiring both an optimal sampling trajectory and a high-fidelity reconstruction model. In this work, we propose a novel active sampling framework that leverages the inherent discrete structure of a pretrained medical image tokenizer and a latent transformer. By representing anatomy through a dictionary of quantized visual tokens, the model provides a well-defined probability distribution over the latent space. We utilize this distribution to derive a principled uncertainty measure via token entropy, which guides the active sampling process. We introduce two strategies to exploit this latent uncertainty: (1) Latent Entropy Selection (LES), projecting patch-wise token entropy into the $k$-space domain to identify informative sampling lines, and (2) Gradient-based Entropy Optimization (GEO), which identifies regions of maximum uncertainty reduction via the $k$-space gradient of a total latent entropy loss. We evaluate our framework on the fastMRI singlecoil Knee and Brain datasets at $\times 8$ and $\times 16$ acceleration. Our results demonstrate that our active policies outperform state-of-the-art baselines in perceptual metrics, and feature-based distances. Our code is available at https://github.com/levayz/TRUST-MRI.
0
cs.CV Pengchong Hu, Zhizhong Han · Mar 22, 2026

RGBD SLAM with 3D Gaussian Splatting (3DGS) struggles to balance scalability against rendering fidelity: global Gaussians consume excessive GPU memory, while view-tied Gaussians (fixed at depth) suffer from limited novel-view quality. This paper proposes pixel-aligned Gaussians that can adjust their positions along viewing rays via learned depth offsets, paired with a fast geometry-similarity tracking strategy using Generalized ICP on depth distributions. The approach claims state-of-the-art rendering and tracking performance while maintaining smaller active memory footprints than prior 3DGS-based methods.

3D Gaussian Splatting (3DGS) has made remarkable progress in RGBD SLAM. Current methods usually use 3D Gaussians or view-tied 3D Gaussians to represent radiance fields in tracking and mapping. However, these Gaussians are either too flexible or too limited in movements, resulting in slow convergence or limited rendering quality. To resolve this issue, we adopt pixel-aligned Gaussians but allow each Gaussian to adjust its position along its ray to maximize the rendering quality, even if Gaussians are simplified to improve system scalability. To speed up the tracking, we model the depth distribution around each pixel as a Gaussian distribution, and then use these distributions to align each frame to the 3D scene quickly. We report our evaluations on widely used benchmarks, justify our designs, and show advantages over the latest methods in view rendering, camera tracking, runtime, and storage complexity. Please see our project page for code and videos at https://machineperceptionlab.github.io/SGAD-SLAM-Project .
0
cs.CV Yaelle Zribi (ENC), Florian Cafiero (ENC, LRE) et al. · Mar 23, 2026

Stand-up comedy depends as much on timing and embodied presence as on verbal content, yet computational humor has largely focused on text alone. This paper introduces TIC-TALK, a multimodal corpus of 90 professionally filmed Netflix specials (2015–2024) with temporally aligned annotations for language, gesture, and audience response. The processing pipeline combines BERTopic for thematic segmentation, Whisper-AT for laughter detection, and YOLOv8 for shot classification and pose keypoint extraction, all aligned hierarchically without resampling. The authors validate the resource through corpus-level findings including a negative correlation between kinetic energy and laughter rate ($r = -0.75$), consistent with a stillness-before-punchline pattern, and through a short-horizon laughter prediction benchmark.

Stand-up comedy, and humor in general, are often studied through their verbal content. Yet live performance relies just as much on embodied presence and audience feedback. We introduce TIC-TALK, a multimodal resource with 5,400+ temporally aligned topic segments capturing language, gesture, and audience response across 90 professionally filmed stand-up comedy specials (2015-2024). The pipeline combines BERTopic for 60 s thematic segmentation with dense sentence embeddings, Whisper-AT for 0.8 s laughter detection, a fine-tuned YOLOv8-cls shot classifier, and YOLOv8s-pose for raw keypoint extraction at 1 fps. Raw 17-joint skeletal coordinates are retained without prior clustering, enabling the computation of continuous kinematic signals-arm spread, kinetic energy, and trunk lean-that serve as proxies for performance dynamics. All streams are aligned by hierarchical temporal containment without resampling, and each topic segment stores its sentence-BERT embedding for downstream similarity and clustering tasks. As a concrete use case, we study laughter dynamics across 24 thematic topics: kinetic energy negatively predicts audience laughter rate (r = -0.75, N = 24), consistent with a stillness-before-punchline pattern; personal and bodily content elicits more laughter than geopolitical themes; and shot close-up proportion correlates positively with laughter (r = +0.28), consistent with reactive montage.
0
cs.CV Mingle Zhou, Jiahui Liu, Jin Wan et al. · Mar 23, 2026

This paper tackles Unsupervised Continuous Anomaly Detection (UCAD), where models must sequentially learn new product categories without forgetting previous ones or storing all raw data. The core idea is to augment visual-only approaches with learnable text prompts from CLIP, storing both modalities in a Continuous Multimodal Prompt Memory Bank (CMPMB) and fusing them via a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM). Benchmarked on MVTec AD and VisA, the authors claim state-of-the-art detection accuracy (+4.4% AUROC) and segmentation (+14.8% AUPR) over the prior UCAD baseline.

Unsupervised Continuous Anomaly Detection (UCAD) is gaining attention for effectively addressing the catastrophic forgetting and heavy computational burden issues in traditional Unsupervised Anomaly Detection (UAD). However, existing UCAD approaches that rely solely on visual information are insufficient to capture the manifold of normality in complex scenes, thereby impeding further gains in anomaly detection accuracy. To overcome this limitation, we propose an unsupervised continual anomaly detection framework grounded in multimodal prompting. Specifically, we introduce a Continual Multimodal Prompt Memory Bank (CMPMB) that progressively distills and retains prototypical normal patterns from both visual and textual domains across consecutive tasks, yielding a richer representation of normality. Furthermore, we devise a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM) that integrates an Adaptive Normalization Module (ANM) with a Dynamic Fusion Strategy (DFS) to jointly enhance detection accuracy and adversarial robustness. Benchmark experiments on MVTec AD and VisA datasets show that our approach achieves state-of-the-art (SOTA) performance on image-level AUROC and pixel-level AUPR metrics.
0
cs.CVcs.AIcs.CL Haichao Zhang, Yijiang Li, Shwai He et al. · Mar 23, 2026

ThinkJEPA addresses the limitation of JEPA-style latent world models that rely on short, densely sampled windows, which bias predictions toward local dynamics while missing long-horizon semantics. The paper proposes a dual-temporal architecture combining a dense-frame V-JEPA branch for fine-grained motion with a sparsely sampled VLM "thinker" branch that provides semantic guidance via multi-layer feature pyramids. This matters because it attempts to marry the physical consistency of latent world models with the general knowledge of vision-language models for robust trajectory forecasting.

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision--language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM \emph{thinker} branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
0
cs.CV Shanmukha Vellamcheti, Uday Kiran Kothapalli, Disharee Bhowmick et al. · Mar 22, 2026

CVT-Bench evaluates whether multimodal LLMs can maintain stable spatial representations under counterfactual viewpoint transformations—such as inferring object relationships from a camera angle never shown in the image. Using 100 synthetic tabletop scenes and 6,000 relational queries across rotations from $0^{\circ}$ to $360^{\circ}$, the benchmark reveals that state-of-the-art models, despite high single-view accuracy, systematically fail at mental rotation tasks and degrade further under extended sequential context. These findings challenge the assumption that strong episodic spatial performance implies robust viewpoint-invariant representations, with critical implications for embodied AI and robotics applications requiring perspective-taking.

Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360{\deg} cycle agreement, and relational stability over sequential transformations. Despite high single-view accuracy, state-of-the-art MLLMs exhibit systematic degradation under counterfactual viewpoint changes, with frequent violations of cycle consistency and rapid decay in relational stability. We further evaluate multiple input representations, visual input, textual bounding boxes, and structured scene graphs, and show that increasing representational structure improves stability. Our results suggest that single-view spatial accuracy overestimates the robustness of induced spatial representations and that representation structure plays a critical role in counterfactual spatial reasoning.
0
cs.CV Zengqun Zhao, Yanzuo Lu, Ziquan Liu et al. · Mar 22, 2026

Autoregressive video diffusion models struggle with minute-scale generation due to error accumulation in long-horizon rollouts. This paper challenges the assumption that more memory is better, proposing instead to decompose KV-cache conditioning into three functional roles—Sink for global anchors, Tail for recent continuity, and dynamically selected History for mid-range structure. The result is a training-free inference method that improves motion dynamics by 66.8% while cutting attention overhead by roughly 2.6×.

Autoregressive (AR) video diffusion has recently emerged as a promising paradigm for long video generation, enabling causal synthesis beyond the limits of bidirectional models. To address training-inference mismatch, a series of self-forcing strategies have been proposed to improve rollout stability by conditioning the model on its own predictions during training. While these approaches substantially mitigate exposure bias, extending generation to minute-scale horizons remains challenging due to progressive temporal degradation. In this work, we show that this limitation is not primarily caused by insufficient memory, but by how temporal memory is utilised during inference. Through empirical analysis, we find that increasing memory does not consistently improve long-horizon generation, and that the temporal placement of historical context significantly influences motion dynamics while leaving visual quality largely unchanged. These findings suggest that temporal memory should not be treated as a homogeneous buffer. Motivated by this insight, we introduce Relax Forcing, a structured temporal memory mechanism for AR diffusion. Instead of attending to the dense generated history, Relax Forcing decomposes temporal context into three functional roles: Sink for global stability, Tail for short-term continuity, and dynamically selected History for structural motion guidance, and selectively incorporates only the most relevant past information. This design mitigates error accumulation during extrapolation while preserving motion evolution. Experiments on VBench-Long demonstrate that Relax Forcing improves motion dynamics and overall temporal consistency while reducing attention overhead. Our results suggest that structured temporal memory is essential for scalable long video generation, complementing existing forcing-based training strategies.
0
cs.CVcs.MM Zhiyang Tang, Yiming Zhu, Ruimin Huang et al. · Mar 22, 2026

This paper tackles the Close Small Object Unmixing (CSOU) problem for infrared imagery, where distant clustered targets appear as overlapping mixed spots due to optical diffraction limits. The authors propose DSCSNet, a deep-unfolded network that unrolls the ADMM algorithm with learnable parameters to recover target count, sub-pixel positions, and radiant intensities from mixed spots. The core idea is to replace the traditional ℓ2-norm smoothness terms with strict ℓ1-norm sparsity constraints and add a dynamic thresholding mechanism for scene-adaptive reconstruction.

Due to the limitations of optical lens focal length and detector resolution, distant clustered infrared small targets often appear as mixed spots. The Close Small Object Unmixing (CSOU) task aims to recover the number, sub-pixel positions, and radiant intensities of individual targets from these spots, which is a highly ill-posed inverse problem. Existing methods struggle to balance the rigorous sparsity guarantees of model-driven approaches and the dynamic scene adaptability of data-driven methods. To address this dilemma, this paper proposes a Dynamic Sparse Compressed Sensing Network (DSCSNet), a deep-unfolded network that couples the Alternating Direction Method of Multipliers (ADMM) with learnable parameters. Specifically, we embed a strict $\ell_1$-norm sparsity constraint into the auxiliary variable update step of ADMM to replace the traditional $\ell_2$-norm smoothness-promoting terms, which effectively preserves the discrete energy peaks of small targets. We also integrate a self-attention-based dynamic thresholding mechanism into the reconstruction stage, which adaptively adjusts the sparsification intensity using the sparsity-enhanced information from the iterative process. These modules are jointly optimized end-to-end across the three iterative steps of ADMM. Retaining the physical logic of compressed sensing, DSCSNet achieves robust sparsity induction and scene adaptability, thus enhancing the unmixing accuracy and generalization in complex infrared scenarios. Extensive experiments on the synthetic infrared dataset CSIST-100K demonstrate that DSCSNet outperforms state-of-the-art methods in key metrics such as CSO-mAP and sub-pixel localization error.
0
cs.CV Yuqiu Liu, Jialin Song, Marissa Ramirez de Chanlatte et al. · Mar 22, 2026

FluidGaussian addresses a critical gap in 3D reconstruction: methods optimized solely for photometric losses often produce visually plausible but physically implausible geometries that fail in downstream simulations. The paper proposes coupling 3D Gaussian Splatting with incompressible fluid simulation (SPH/DFSPH) to define a simulation-based uncertainty metric—velocity divergence at the fluid-structure interface—and integrates it into active view selection. By reranking next-best-view candidates using this physical signal, the method improves both visual fidelity (PSNR) and physical plausibility (divergence) on synthetic and aerodynamic datasets.

Real objects that inhabit the physical world follow physical laws and thus behave plausibly during interaction with other physical objects. However, current methods that perform 3D reconstructions of real-world scenes from multi-view 2D images optimize primarily for visual fidelity, i.e., they train with photometric losses and reason about uncertainty in the image or representation space. This appearance-centric view overlooks body contacts and couplings, conflates function-critical regions (e.g., aerodynamic or hydrodynamic surfaces) with ornamentation, and reconstructs structures suboptimally, even when physical regularizers are added. All these can lead to unphysical and implausible interactions. To address this, we consider the question: How can 3D reconstruction become aware of real-world interactions and underlying object functionality, beyond visual cues? To answer this question, we propose FluidGaussian, a plug-and-play method that tightly couples geometry reconstruction with ubiquitous fluid-structure interactions to assess surface quality at high granularity. We define a simulation-based uncertainty metric induced by fluid simulations and integrate it with active learning to prioritize views that improve both visual and physical fidelity. In an empirical evaluation on NeRF Synthetic (Blender), Mip-NeRF 360, and DrivAerNet++, our FluidGaussian method yields up to +8.6% visual PSNR (Peak Signal-to-Noise Ratio) and -62.3% velocity divergence during fluid simulations. Our code is available at https://github.com/delta-lab-ai/FluidGaussian.