Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
186 papers in cs.CV
Trending mixes fresh papers with community signal.
cs.CV Zhengyao Lv, Menghan Xia, Xintao Wang et al. · Mar 23, 2026

DUO-VSR tackles the prohibitive sampling cost of diffusion-based video super-resolution by enabling efficient one-step generation. The paper identifies critical limitations when applying Distribution Matching Distillation (DMD) to VSR—specifically training instability, degraded supervision from frozen score models, and insufficient guidance capped by teacher quality—and proposes a dual-stream strategy that unifies DMD with adversarial supervision via Real–Fake Score Feature GAN (RFS-GAN). This three-stage pipeline achieves approximately $50\times$ speedup over multi-step counterparts while delivering superior perceptual quality, making high-fidelity video upscaling practical for real-world deployment.

Diffusion-based video super-resolution (VSR) has recently achieved remarkable fidelity but still suffers from prohibitive sampling costs. While distribution matching distillation (DMD) can accelerate diffusion models toward one-step generation, directly applying it to VSR often results in training instability alongside degraded and insufficient supervision. To address these issues, we propose DUO-VSR, a three-stage framework built upon a Dual-Stream Distillation strategy that unifies distribution matching and adversarial supervision for one-step VSR. Firstly, a Progressive Guided Distillation Initialization is employed to stabilize subsequent training through trajectory-preserving distillation. Next, the Dual-Stream Distillation jointly optimizes the DMD and Real-Fake Score Feature GAN (RFS-GAN) streams, with the latter providing complementary adversarial supervision leveraging discriminative features from both real and fake score models. Finally, a Preference-Guided Refinement stage further aligns the student with perceptual quality preferences. Extensive experiments demonstrate that DUO-VSR achieves superior visual quality and efficiency over previous one-step VSR approaches.
cs.CV Wang Zhou, Boran Duan, Haojun Ai et al. · Mar 23, 2026

ALADIN tackles person Re-identification by distilling fine-grained attribute knowledge from a frozen CLIP teacher into a lightweight student network. The core innovation uses a Multimodal LLM (Qwen-VL) to generate structured attribute descriptions, which are converted via CLIP into spatial attention maps for supervising local feature alignment. A Scene-Aware Prompt Generator (SAPG) creates image-specific soft prompts via $\mathbf{p}=\mathrm{MLP}(\mathbf{f}_{g})$ to adapt text embeddings to surveillance scenes. At inference, only the student runs, promising deployable efficiency.

Recent vision-language models such as CLIP provide strong cross-modal alignment, but current CLIP-guided ReID pipelines rely on global features and fixed prompts. This limits their ability to capture fine-grained attribute cues and adapt to diverse appearances. We propose ALADIN, an attribute-language distillation network that distills knowledge from a frozen CLIP teacher to a lightweight ReID student. ALADIN introduces fine-grained attribute-local alignment to establish adaptive text-visual correspondence and robust representation learning. A Scene-Aware Prompt Generator produces image-specific soft prompts to facilitate adaptive alignment. Attribute-local distillation enforces consistency between textual attributes and local visual features, significantly enhancing robustness under occlusions. Furthermore, we employ cross-modal contrastive and relation distillation to preserve the inherent structural relationships among attributes. To provide precise supervision, we leverage Multimodal LLMs to generate structured attribute descriptions, which are then converted into localized attention maps via CLIP. At inference, only the student is used. Experiments on Market-1501, DukeMTMC-reID, and MSMT17 show improvements over CNN-, Transformer-, and CLIP-based methods, with better generalization and interpretability.
cs.CV Yupeng Zhang, Ruize Han, Zhiwei Chen et al. · Mar 22, 2026

NoOVD tackles a critical issue in open-vocabulary object detection (OVD): during training, novel-category objects are forcibly aligned with background embeddings, causing them to be filtered out by the RPN and misclassified by the RoI head. The authors propose a framework built on frozen CLIP that identifies latent novel objects during training via generic text prompts (e.g., 'This is an object, specifically an animal') and integrates them through self-distillation. At test time, a Re-weighted RPN (R-RPN) boosts proposal scores using CLIP-based knowledge to improve novel-category recall. The method aims to eliminate the training-inference gap without requiring additional labeled data or pseudo-labeling noise.

Despite the remarkable progress in open-vocabulary object detection (OVD), a significant gap remains between the training and testing phases. During training, the RPN and RoI heads often misclassify unlabeled novel-category objects as background, causing some proposals to be prematurely filtered out by the RPN while others are further misclassified by the RoI head. During testing, these proposals again receive low scores and are removed in post-processing, leading to a significant drop in recall and ultimately weakening novel-category detection performance. To address these issues, we propose a novel training framework, NoOVD, which innovatively integrates a self-distillation mechanism grounded in the knowledge of frozen vision-language models (VLMs). Specifically, we design K-FPN, which leverages the pretrained knowledge of VLMs to guide the model in discovering novel-category objects and facilitates knowledge distillation without requiring additional data, thus preventing forced alignment of novel objects with background. Additionally, we introduce R-RPN, which adjusts the confidence scores of proposals during inference to improve the recall of novel-category objects. Cross-dataset evaluations on OV-LVIS, OV-COCO, and Objects365 demonstrate that our approach consistently achieves superior performance across multiple metrics.
cs.CV Thomas Savage, Evan Madill · Mar 22, 2026

This paper investigates whether video transformers can detect respiratory distress from video recordings of post-exercise recovery. The authors frame the problem as a temporal ordering task—predicting which of two clips shows greater shortness of breath—and propose augmenting ViViT with Lie Relative Encodings (LieRE) and Motion-Guided Masking (MGM). An F1 score of 0.81 is achieved, though on only 7 test videos from 3 participants.

Recognition of respiratory distress through visual inspection is a life-saving clinical skill. Clinicians can detect early signs of respiratory deterioration, creating a valuable window for earlier intervention. In this study, we evaluate whether recent advances in video transformers can enable Artificial Intelligence systems to recognize the signs of respiratory distress from video. We collected videos of healthy volunteers recovering after strenuous exercise and used the natural recovery of each participant's respiratory status to create a labeled dataset for respiratory distress. Splitting the video into short clips, with earlier clips corresponding to more shortness of breath, we designed a temporal ordering challenge to assess whether an AI system can detect respiratory distress. We found that a ViViT encoder augmented with Lie Relative Encodings (LieRE) and Motion-Guided Masking, combined with an embedding-based comparison strategy, can achieve an F1 score of 0.81 on this task. Our findings suggest that modern video transformers can recognize subtle changes in respiratory mechanics.
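The temporal ordering challenge described above can be sketched as a pair-construction step. This is a minimal illustration only, assuming each clip is identified by its start time within a single recovery video; the function name `make_ordering_pairs` and the label convention are hypothetical and are not taken from the paper.

```python
from itertools import combinations

def make_ordering_pairs(clip_start_times):
    """Build (clip_a, clip_b, label) triples from one recovery video.

    Assumption: an earlier clip shows more shortness of breath than a
    later one. label is 1 when clip_a is the more distressed clip, else 0.
    """
    pairs = []
    for a, b in combinations(range(len(clip_start_times)), 2):
        earlier_is_a = clip_start_times[a] < clip_start_times[b]
        pairs.append((a, b, 1 if earlier_is_a else 0))
        # Also add the swapped pair so both label values appear.
        pairs.append((b, a, 0 if earlier_is_a else 1))
    return pairs

# Three clips starting at 0 s, 10 s, and 20 s after exercise ends.
pairs = make_ordering_pairs([0.0, 10.0, 20.0])
```

A classifier trained on such pairs never needs an absolute distress score, only a relative judgment, which is what makes the natural recovery curve usable as a label source.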
cs.CV Bingxuan Zhao, Qing Zhou, Chuang Yang et al. · Mar 23, 2026

Remote sensing text-to-image generation suffers from a lack of domain-specific diffusion transformers and prohibitive costs for high-resolution training. Existing training-free resolution promotion methods apply static RoPE scaling that uniformly compresses the spatial spectrum, which is particularly harmful for RS imagery due to its characteristically denser high-frequency energy. This paper proposes SHARP, a spectrum-aware dynamic adaptation strategy that uses a rational decay scheduler $\kappa_{rs}(t)$ to apply strong positional extrapolation early in denoising (for layout formation) while progressively relaxing it later (for detail recovery). The approach enables robust multi-scale generation up to 2.5$\times$ extrapolation factors with negligible overhead, addressing a critical gap in large-scale RS synthesis.

Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule $\kappa_{rs}(t)$ into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.
cs.CV Jan Boysen, Hristina Uzunova, Heinz Handels et al. · Mar 23, 2026

Accurate respiratory motion modeling is critical for radiotherapy precision, yet patient-specific breathing patterns are difficult to predict outside observed ranges. This paper proposes PRISM-RM, a trajectory-aware implicit neural representation (INR) that models lung motion as a continuous diffeomorphic flow driven by external surrogate signals. By integrating neo-Hookean hyperelastic constraints with temporal total-variation regularization, the method eliminates the need for fixed reference breathing states and aims to improve extrapolation to unseen respiratory phases.

A precise spatial delivery of the radiation dose is crucial for the treatment success in radiotherapy. In the lung and upper abdominal region, respiratory motion introduces significant treatment uncertainties, requiring special motion management techniques. To address this, respiratory motion models are commonly used to infer the patient-specific respiratory motion and target the dose more efficiently. In this work, we investigate the possibility of using implicit neural representations (INR) for surrogate-based motion modeling. Therefore, we propose physics-regularized implicit surrogate-based modeling for respiratory motion (PRISM-RM). Our new integrated respiratory motion model is free of a fixed reference breathing state. Unlike conventional pairwise registration techniques, our approach provides a trajectory-aware spatio-temporally continuous and diffeomorphic motion representation, improving generalization to extrapolation scenarios. We introduce biophysical constraints, ensuring physiologically plausible motion estimation across time beyond the training data. Our results show that our trajectory-aware approach performs on par in interpolation and improves the extrapolation ability compared to our initially proposed INR-based approach. Compared to sequential registration-based approaches both our approaches perform equally well in interpolation, but underperform in extrapolation scenarios. However, the methodical features of INRs make them particularly effective for respiratory motion modeling, and with their performance steadily improving, they demonstrate strong potential for advancing this field.
cs.CV Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi et al. · Mar 23, 2026

Pointing-based methods improve Large Vision-Language Models (LVLMs) by grounding objects before answering, yet the underlying mechanism remains unclear. This work investigates why pointing helps by comparing Direct Counting against Point-then-Count (PtC) in zero-shot counting tasks using synthetic data with controlled spatial layouts. The authors find that intermediate coordinate supervision encourages skill learning rather than narrow task memorization, yielding stronger out-of-distribution generalization while providing verifiable visual explanations.

Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects' coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
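One way to picture how predicted points could be scored with F1 is the sketch below. It is an assumption-laden illustration, not the paper's evaluation code: the greedy matching rule, the `radius` threshold in normalized image units, and the helper names `point_f1` and `count_is_consistent` are all hypothetical.

```python
import math

def point_f1(pred, gt, radius=0.05):
    """F1 between predicted and ground-truth points.

    Assumption: a prediction is correct when it lies within `radius` of a
    ground-truth point that has not already been matched (greedy,
    one-to-one matching in prediction order).
    """
    unmatched = list(gt)
    true_pos = 0
    for p in pred:
        best = None
        for g in unmatched:
            d = math.dist(p, g)
            if d <= radius and (best is None or d < best[0]):
                best = (d, g)
        if best is not None:
            unmatched.remove(best[1])
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)

def count_is_consistent(pred_points, pred_count):
    # Hypothetical sanity check: the generated count should equal the
    # number of generated coordinates.
    return len(pred_points) == pred_count

pred = [(0.10, 0.10), (0.50, 0.50), (0.90, 0.90)]
gt = [(0.10, 0.12), (0.50, 0.50)]
f1 = point_f1(pred, gt)
```

In this toy case two of the three predictions match a target, so precision is 2/3 and recall is 1. The spurious third point is exactly the kind of error a count-only output would hide and a pointed output exposes.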
cs.CV Yu-Shan Tai, An-Yeu (Andy) Wu · Mar 22, 2026

Diffusion models generate high-quality images but require hundreds of denoising steps, making deployment on edge devices impractical. This paper proposes Coarse-to-Fine Diffusion Models that start with low-resolution denoising early in the process (when outputs are noisy anyway) before switching to high-resolution, plus a fast time-step search method that finds good sampling schedules in under 10 minutes instead of days.

Recently, diffusion models (DMs) have made significant strides in high-quality image generation. However, the multi-step denoising process often results in considerable computational overhead, impeding deployment on resource-constrained edge devices. Existing methods mitigate this issue by compressing models and adjusting the time step sequence. However, they overlook input redundancy and require lengthy search times. In this paper, we propose Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution. Recognizing indistinguishable early-stage generated images, we introduce Coarse-to-Fine Denoising (C2F) to reduce computation during coarse feature generation. Furthermore, we design Time Step Sequence Redistribution (TRD) for efficient sampling trajectory adjustment, requiring less than 10 minutes for search. Experimental results demonstrate that the proposed methods achieve near-lossless performance with an 80% to 90% reduction in computation on CIFAR10 and LSUN-Church.
cs.CV Wenjin Hou, Xiaoxiao Sun, Hehe Fan · Mar 22, 2026

Generative zero-shot learning (ZSL) synthesizes visual features for unseen classes conditioned on semantic prototypes, but existing methods often produce task-agnostic features that overlap for semantically similar yet visually distinct categories. This paper proposes RLVC, an outcome-reward reinforcement learning framework that treats the feature generator as a policy model and optimizes it using classifier confidence as the reward signal. The method further incorporates class-wise visual prototypes via a distillation loss to align synthesized features with real data distributions, achieving reported state-of-the-art results on CUB, SUN, and AWA2 benchmarks.

Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these issues and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning (RL) framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain.
cs.CV Purui Bai, Junxian Duan, Pin Wang et al. · Mar 23, 2026

This paper tackles real-world image restoration (Real-IR) by adapting the 12B-parameter FLUX.1-dev flow matching model to low-level vision tasks. The core innovation is ResFlow-Tuner, which combines Unified Multi-Modal Fusion (UMMF) of image and text cues with a novel test-time scaling (TTS) paradigm that greedily optimizes ODE sampling trajectories using a multi-reward ensemble during inference. This establishes a new compute-quality trade-off for generative image restoration, showing that carefully perturbing intermediate flow states can yield substantial perceptual gains without retraining the base model.

Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.
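The reward-guided steering described above can be illustrated with a toy greedy search over perturbed intermediate states. This is only a sketch under stated assumptions: the state is a single float rather than an image latent, and `greedy_tts`, `step_fn`, `reward_fn`, `num_candidates`, and `noise_scale` are hypothetical names that do not come from ResFlow-Tuner. A multi-reward ensemble would enter through `reward_fn`, for example as an average of several scores.

```python
import random

def greedy_tts(x0, step_fn, reward_fn, num_steps=4, num_candidates=4,
               noise_scale=0.1, seed=0):
    """Greedy test-time search: perturb, step, keep the best-scoring state."""
    rng = random.Random(seed)
    x = x0
    for t in range(num_steps):
        # Candidate 0 is the unperturbed state, so the state chosen at this
        # step never scores below the plain sampling step under reward_fn.
        candidates = [x] + [x + rng.gauss(0.0, noise_scale)
                            for _ in range(num_candidates - 1)]
        stepped = [step_fn(c, t) for c in candidates]
        x = max(stepped, key=reward_fn)
    return x

# Toy stand-ins: each sampling step halves the distance to a "clean"
# target of 1.0, and the reward is higher the closer the state is to it.
def toy_step(x, t):
    return x + 0.5 * (1.0 - x)

def toy_reward(x):
    return -abs(x - 1.0)

plain = 0.0
for t in range(4):
    plain = toy_step(plain, t)

steered = greedy_tts(0.0, toy_step, toy_reward)
```

The compute cost grows linearly with `num_candidates`, which is the knob behind the controllable overhead the abstract mentions: more candidates per step buy a wider search at proportionally more forward passes.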
cs.CV cs.LG eess.IV Chedly Ben Azizi, Claire Guilloteau, Gilles Roussel et al. · Mar 23, 2026

The paper tackles the computational bottleneck of radiative transfer models (RTMs) for hyperspectral image (HSI) generation by proposing a VAE-based emulation framework that learns latent representations conditioned on biophysical parameters. It introduces both pixel-to-pixel (P2P) and fully convolutional (FC-VAE) variants, trained via either direct one-step mapping or a two-step pretraining strategy that decouples representation learning from parameter-to-latent interpolation. The work is significant for remote sensing applications as it provides empirical evidence that optimal emulator architecture depends critically on whether the target data is simulated (where P2P excels) or real-world imagery (where FC-VAE-pre dominates), and demonstrates that emulated data preserves downstream utility for parameter retrieval tasks.

Synthetic hyperspectral image (HSI) generation is essential for large-scale simulation, algorithm development, and mission design, yet traditional radiative transfer models remain computationally expensive and often limited to spectrum-level outputs. In this work, we propose a latent representation-based framework for hyperspectral emulation that learns a latent generative representation of hyperspectral data. The proposed approach supports both spectrum-level and spatial-spectral emulation and can be trained either in a direct one-step formulation or in a two-step strategy that couples variational autoencoder (VAE) pretraining with parameter-to-latent interpolation. Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery demonstrate that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability. We further show that emulated HSIs preserve performance in downstream biophysical parameter retrieval, highlighting the practical relevance of emulated data for remote sensing applications.
cs.CV Yixuan Luo, Feng Qiao, Zhexiao Xiong et al. · Mar 23, 2026

Optical flow estimation traditionally requires expensive ground-truth annotations or relies on unreliable brightness constancy assumptions that fail under occlusion and illumination changes. This paper introduces GenOpticalFlow, a framework that synthesizes perfectly aligned training pairs by using monocular depth estimates to generate pseudo-optical flow, then conditioning a latent diffusion model to render corresponding next frames. The core innovation is converting unsupervised optical flow learning into a supervised training paradigm using synthetic data with geometrically consistent motion fields, potentially eliminating the need for manual annotation at scale.

Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame-flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.
cs.CV Jiawei Chen, Zhe Chen, Chaoqun Du et al. · Mar 23, 2026

StreamingClaw addresses real-time streaming video understanding for embodied intelligence applications such as autonomous driving and robotics. The framework unifies continuous perception, hierarchical multimodal memory, and proactive interaction through a main–sub-agent architecture where StreamingReasoning orchestrates StreamingMemory and StreamingProactivity sub-agents. By integrating incremental KV-cache reuse with dynamic pruning, memory evolution from atomic actions to events, and trigger-based proactive responses, it aims to close the perception–decision–action loop for physical world deployment.

Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities, such as supporting only offline video understanding, lacking long-term multimodal memory mechanisms, or struggling to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck, preventing them from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives. (3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents. (4) It supports a closed loop of perception-decision-action. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.
cs.CV Xinghan Li, Junhao Xu, Jingjing Chen · Mar 23, 2026

VIGIL tackles hallucination in multimodal deepfake detection by decoupling claim generation from evidence sourcing through a part-centric plan-then-examine pipeline. The framework first plans which facial parts to inspect using global visual cues, then examines each part with independently sourced forensic evidence delivered via a stage-gated injection mechanism. Combined with a progressive three-stage training paradigm featuring part-aware reinforcement learning rewards, the method aims to produce verifiable, anatomically grounded explanations rather than confabulated reasoning chains.

Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. Building on this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence-conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-Level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
cs.CV Wenqing Tian, Hanyi Mao, Zhaocheng Liu et al. · Mar 23, 2026

MultiBind targets a critical blind spot in evaluating multi-subject image generators: cross-subject attribute misbinding, where models assign jackets, smiles, or poses to the wrong person. The benchmark grounds each test case in a real photograph (508 instances, 2–4 human subjects each) and provides slot-ordered crops, masks, background references, and long entity-indexed prompts (~474 words). Its core technical idea is the delta-matrix evaluation: for each attribute dimension $d$, compute $\Delta^{(d)} = S_{\mathrm{gen}}^{(d)} - S_{\mathrm{gt}}^{(d)}$, subtracting ground-truth subject similarities from generated-to-ground-truth similarities to isolate generation-induced confusion from natural subject resemblance. This separates self-degradation (diagonal) from cross-subject interference (off-diagonal) and exposes interpretable failure modes—drift, swap, dominance, and blending—that holistic metrics like CLIP or FID miss.

Subject-driven image generation is increasingly expected to support fine-grained control over multiple entities within a single image. In multi-reference workflows, users may provide several subject images, a background reference, and long, entity-indexed prompts to control multiple people within one scene. In this setting, a key failure mode is cross-subject attribute misbinding: attributes are preserved, edited, or transferred to the wrong subject. Existing benchmarks and metrics largely emphasize holistic fidelity or per-subject self-similarity, making such failures hard to diagnose. We introduce MultiBind, a benchmark built from real multi-person photographs. Each instance provides slot-ordered subject crops with masks and bounding boxes, canonicalized subject references, an inpainted background reference, and a dense entity-indexed prompt derived from structured annotations. We also propose a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression. By subtracting the corresponding ground-truth similarity matrices, our method separates self-degradation from true cross-subject interference and exposes interpretable failure patterns such as drift, swap, dominance, and blending. Experiments on modern multi-reference generators show that MultiBind reveals binding failures that conventional reconstruction metrics miss.
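The subtraction of ground-truth similarity matrices can be made concrete with a small worked example. The sketch below assumes rows index generated subjects and columns index ground-truth slots for one attribute dimension; the similarity values are invented toy numbers, and the helper names `delta_matrix` and `summarize` and the two summary statistics are hypothetical, not the benchmark's official metrics.

```python
def delta_matrix(s_gen, s_gt):
    """Element-wise S_gen - S_gt for one attribute dimension."""
    return [[g - t for g, t in zip(row_gen, row_gt)]
            for row_gen, row_gt in zip(s_gen, s_gt)]

def summarize(delta):
    n = len(delta)
    diagonal = [delta[i][i] for i in range(n)]
    off_diagonal = [delta[i][j] for i in range(n) for j in range(n) if i != j]
    return {
        # Negative mean on the diagonal: subjects lost similarity to themselves.
        "self_degradation": sum(diagonal) / n,
        # Large positive off-diagonal entry: a subject picked up another
        # subject's attribute beyond their natural resemblance.
        "max_cross_interference": max(off_diagonal) if off_diagonal else 0.0,
    }

# Toy two-subject case. s_gt holds similarities among the real subjects,
# s_gen holds similarities between generated and real subjects.
s_gt = [[1.0, 0.3],
        [0.3, 1.0]]
s_gen = [[0.4, 0.8],
         [0.35, 0.9]]

delta = delta_matrix(s_gen, s_gt)
summary = summarize(delta)
```

Here generated subject 0 loses most of its self-similarity while gaining similarity to ground-truth slot 1, so both the diagonal and the off-diagonal entry in row 0 move. A per-subject self-similarity score alone would report only the first effect.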
0
cs.CV Hwasik Jeong, Seungryong Lee, Gyeongjin Kang et al. · Mar 22, 2026

This paper challenges the monolithic paradigm in pose-free feed-forward 3D Gaussian Splatting (3DGS), where a single network jointly estimates camera poses and synthesizes Gaussians. The authors propose 2Xplat, a modular two-expert framework that decouples geometry estimation (using Depth Anything 3) from appearance synthesis (using Multi-view Pyramid Transformer) via an explicit pose interface. The core claim is that separating these concerns enables superior training efficiency (<5K iterations) and novel-view synthesis quality competitive with posed methods, challenging the assumption that unified architectures are optimal.

Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such "all-in-one" designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Although conceptually simple and largely underexplored in prior work, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-expert pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.
0
cs.CV Wen Guo, Pengfei Zhao, Zongmeng Wang et al. · Mar 23, 2026

Multi-Object Tracking (MOT) models often degrade during inference due to distribution shifts between training and test data. This paper proposes TCEI (Test-time Calibration from Experience and Intuition), a cognitive-inspired framework that uses transient memory for short-term guidance and accumulated experience for long-term calibration. Unlike traditional TTA methods that require backpropagation, TCEI operates entirely via forward propagation, adapting identity predictions in real-time without additional training.

Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios. However, due to distribution shifts in appearance, motion pattern, and category between the training and testing data, model performance degrades considerably during online inference in MOT. Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts. However, existing TTA methods often fail to deliver satisfactory results in MOT, as they focus primarily on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos. Inspired by the human decision-making process, this paper proposes a Test-time Calibration from Experience and Intuition (TCEI) framework. In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions. Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation. Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model's adaptability under distribution shifts. The code will be released at https://github.com/1941Zpf/TCEI.
0
cs.CV Pengxiang Cai, Mengyang Li · Mar 22, 2026

MS-CustomNet tackles multi-subject customization for text-to-image diffusion models, where the challenge is to preserve multiple subject identities while controlling their compositional arrangement and spatial relationships. The authors propose a framework built on CustomNet that accepts multiple reference images plus a layout map $M_L$ specifying spatial arrangement, trained on a curated MSI dataset derived from COCO. The work aims to provide explicit deterministic control over subject placement and layering (e.g., "cake inside bowl" vs "cake behind bowl") rather than relying on implicit text-to-image generation.

Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to provide explicit user-defined control over the compositional structure and precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, empowers users to explicitly define the hierarchical arrangements and spatial placements of those objects within the generated image. Our approach ensures individual subject identity preservation while learning and enacting these user-specified inter-subject compositions. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. MS-CustomNet offers enhanced, fine-grained control over multi-subject image generation. Our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating its superior capability in generating high-fidelity images with precise, user-directed multi-subject compositions and spatial control.
0
cs.CV Lanbo Xu, Liang Guo, Caigui Jiang et al. · Mar 22, 2026

PAS3R tackles online monocular 3D reconstruction from long video streams, addressing the stability–adaptation dilemma where models must incorporate novel viewpoints without overwriting historical scene structure. The core idea is to dynamically modulate state update intensity based on geometric novelty: measuring inter-frame camera displacement (translation + rotation) and image frequency content via Fourier analysis. This enables faster adaptation to abrupt viewpoint changes while preserving accumulated geometry during smooth motion.

Online monocular 3D reconstruction enables dense scene recovery from streaming video but remains fundamentally limited by the stability-adaptation dilemma: the reconstruction model must rapidly incorporate novel viewpoints while preserving previously accumulated scene structure. Existing streaming approaches rely on uniform or attention-based update mechanisms that often fail to account for abrupt viewpoint transitions, leading to trajectory drift and geometric inconsistencies over long sequences. We introduce PAS3R, a pose-adaptive streaming reconstruction framework that dynamically modulates state updates according to camera motion and scene structure. Our key insight is that frames contributing significant geometric novelty should exert stronger influence on the reconstruction state, while frames with minor viewpoint variation should prioritize preserving historical context. PAS3R operationalizes this principle through a motion-aware update mechanism that jointly leverages inter-frame pose variation and image frequency cues to estimate frame importance. To further stabilize long-horizon reconstruction, we introduce trajectory-consistent training objectives that incorporate relative pose constraints and acceleration regularization. A lightweight online stabilization module further suppresses high-frequency trajectory jitter and geometric artifacts without increasing memory consumption. Extensive experiments across multiple benchmarks demonstrate that PAS3R significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality in long video sequences while maintaining competitive performance on shorter sequences.
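A rough sketch of how pose variation and an image frequency cue might be combined into a per-frame update weight follows. It is not PAS3R's actual formulation: the additive combination, the sigmoid squashing, the naive DFT, and every name and parameter here (`rotation_angle`, `high_freq_ratio`, `update_weight`, `blend_state`, `rot_scale`, `freq_scale`, `bias`, `cutoff`) are assumptions chosen only to make the principle concrete.

```python
import cmath
import math


def rotation_angle(r_rel):
    """Rotation angle (radians) of a 3x3 relative rotation matrix, from its trace."""
    trace = r_rel[0][0] + r_rel[1][1] + r_rel[2][2]
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def high_freq_ratio(img, cutoff=1):
    """Share of non-DC spectral energy above `cutoff`, via a naive 2D DFT.

    Only suitable for tiny grayscale patches; a real system would use an FFT.
    """
    h, w = len(img), len(img[0])
    total = high = 0.0
    for u in range(h):
        for v in range(w):
            if u == 0 and v == 0:
                continue  # skip the DC term (mean brightness)
            coeff = sum(
                img[y][x] * cmath.exp(-2j * math.pi * (u * y / h + v * x / w))
                for y in range(h)
                for x in range(w)
            )
            energy = abs(coeff) ** 2
            fu, fv = min(u, h - u), min(v, w - v)  # fold to frequency magnitude
            total += energy
            if max(fu, fv) > cutoff:
                high += energy
    return high / total if total > 1e-12 else 0.0


def update_weight(t_rel, r_rel, img, rot_scale=1.0, freq_scale=1.0, bias=-1.0):
    """Map geometric novelty to a weight in (0, 1) with a sigmoid (assumed form)."""
    translation = math.sqrt(sum(c * c for c in t_rel))
    novelty = (translation + rot_scale * rotation_angle(r_rel)
               + freq_scale * high_freq_ratio(img))
    return 1.0 / (1.0 + math.exp(-(novelty + bias)))


def blend_state(state, candidate, weight):
    """Convex blend: a small weight preserves history, a large one adapts quickly."""
    return [(1.0 - weight) * s + weight * c for s, c in zip(state, candidate)]
```

With these assumed defaults, a frame with no camera motion and a flat image gets a small weight, so the accumulated state dominates, while a frame with large translation or rich high-frequency content pushes the weight toward 1.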
0
cs.CV Haolan Xu, Keli Cheng, Lei Wang et al. · Mar 22, 2026

EmoTaG tackles few-shot 3D talking-head synthesis with emotional expressiveness using only 5 seconds of target video. The core insight is to predict FLAME parameters (expression and jaw pose) rather than directly deforming 3D Gaussians, providing explicit geometric priors for stability. A Gated Residual Motion Network (GRMN) disentangles phonetic articulation from emotion-driven variations with a learned gate $g \in [0,1]$, while Semantic Emotion Guidance distills knowledge from a pretrained DeepFace recognizer to supervise emotional intensity without manual labels.

Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.
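The gated residual idea can be sketched in a few lines. This is an illustrative reading of the summary, not GRMN itself: it assumes the gate is a sigmoid of a learned scalar and that it scales an emotion-driven residual added to a phonetic base in FLAME parameter space. The function name `gated_residual_motion` and the toy parameter vectors are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_motion(phonetic, emotion_residual, gate_logit):
    """Return (motion, g) where motion = phonetic + g * emotion_residual.

    `phonetic` and `emotion_residual` stand in for FLAME expression/jaw
    parameter vectors; `gate_logit` is the pre-sigmoid gate value, so the
    gate g always lies in [0, 1].
    """
    g = sigmoid(gate_logit)
    motion = [p + g * r for p, r in zip(phonetic, emotion_residual)]
    return motion, g


# Toy two-parameter example: a neutral gate blends in half of the residual.
motion, g = gated_residual_motion([0.2, -0.1], [0.4, 0.2], gate_logit=0.0)
print(motion, g)
```

Keeping the emotion term as a residual means that a gate near 0 falls back to purely articulation-driven motion, which is one plausible way such a design could help stability under expressive inputs.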