Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
482 papers
Trending mixes fresh papers with community signal.
cs.CV Thomas Savage, Evan Madill · Mar 22, 2026

This paper investigates whether video transformers can detect respiratory distress from video recordings of post-exercise recovery. The authors frame the problem as a temporal ordering task—predicting which of two clips shows greater shortness of breath—and propose augmenting ViViT with Lie Relative Encodings (LieRE) and Motion-Guided Masking (MGM). An F1 score of 0.81 is achieved, though on only 7 test videos from 3 participants.

Recognition of respiratory distress through visual inspection is a life-saving clinical skill. Clinicians can detect early signs of respiratory deterioration, creating a valuable window for earlier intervention. In this study, we evaluate whether recent advances in video transformers can enable Artificial Intelligence systems to recognize the signs of respiratory distress from video. We collected videos of healthy volunteers recovering after strenuous exercise and used the natural recovery of each participant's respiratory status to create a labeled dataset for respiratory distress. Splitting the video into short clips, with earlier clips corresponding to more shortness of breath, we designed a temporal ordering challenge to assess whether an AI system can detect respiratory distress. We found that a ViViT encoder augmented with Lie Relative Encodings (LieRE) and Motion Guided Masking, combined with an embedding-based comparison strategy, can achieve an F1 score of 0.81 on this task. Our findings suggest that modern video transformers can recognize subtle changes in respiratory mechanics.
cs.CV Bingxuan Zhao, Qing Zhou, Chuang Yang et al. · Mar 23, 2026

Remote sensing text-to-image generation suffers from a lack of domain-specific diffusion transformers and prohibitive costs for high-resolution training. Existing training-free resolution promotion methods apply static RoPE scaling that uniformly compresses the spatial spectrum, which is particularly harmful for RS imagery due to its characteristically denser high-frequency energy. This paper proposes SHARP, a spectrum-aware dynamic adaptation strategy that uses a rational decay scheduler $\kappa_{rs}(t)$ to apply strong positional extrapolation early in denoising (for layout formation) while progressively relaxing it later (for detail recovery). The approach enables robust multi-scale generation up to 2.5$\times$ extrapolation factors with negligible overhead, addressing a critical gap in large-scale RS synthesis.

Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.
cs.CL cs.SD Abner Hernandez, Eunjung Yeo, Kwanghee Choi et al. · Mar 23, 2026

Cross-lingual dysarthria detection in Parkinson's disease is hampered by language-dependent structure in self-supervised speech representations that confounds pathology classification. This paper proposes a centroid-based 'language shift' (LS) that aligns source-language embeddings toward target-language distributions using only healthy control speech, enabling zero-shot transfer without model retraining. The approach addresses the critical data scarcity in clinical speech applications while aiming to disentangle linguistic variation from motor impairment markers.

The limited availability of dysarthric speech data makes cross-lingual detection an important but challenging problem. A key difficulty is that speech representations often encode language-dependent structure that can confound dysarthria detection. We propose a representation-level language shift (LS) that aligns source-language self-supervised speech representations with the target-language distribution using centroid-based vector adaptation estimated from healthy-control speech. We evaluate the approach on oral DDK recordings from Parkinson's disease speech datasets in Czech, German, and Spanish under both cross-lingual and multilingual settings. LS substantially improves sensitivity and F1 in cross-lingual settings, while yielding smaller but consistent gains in multilingual settings. Representation analysis further shows that LS reduces language identity in the embedding space, supporting the interpretation that LS removes language-dependent structure.
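The centroid-based shift described above can be sketched in a few lines. This is a minimal illustration, assuming the shift is a plain translation by the difference between the healthy-control centroids of the two languages; the function names and the toy 2-D embeddings are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def language_shift(source_embeddings, source_healthy, target_healthy):
    """Translate source-language embeddings by the offset between the
    healthy-control centroids of the target and source languages."""
    src_c = centroid(source_healthy)
    tgt_c = centroid(target_healthy)
    offset = [t - s for s, t in zip(src_c, tgt_c)]
    return [[x + o for x, o in zip(vec, offset)] for vec in source_embeddings]


# Toy 2-D embeddings (hypothetical values, for illustration only).
source_healthy = [[0.0, 0.0], [2.0, 2.0]]   # centroid (1, 1)
target_healthy = [[4.0, 1.0], [6.0, 3.0]]   # centroid (5, 2)
source_patients = [[1.5, 0.5]]

shifted = language_shift(source_patients, source_healthy, target_healthy)
print(shifted)
```

Only healthy-control speech is used to estimate the offset, so the same translation is applied to every source embedding regardless of its pathology label.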
cs.CV Jan Boysen, Hristina Uzunova, Heinz Handels et al. · Mar 23, 2026

Accurate respiratory motion modeling is critical for radiotherapy precision, yet patient-specific breathing patterns are difficult to predict outside observed ranges. This paper proposes PRISM-RM, a trajectory-aware implicit neural representation (INR) that models lung motion as a continuous diffeomorphic flow driven by external surrogate signals. By integrating neo-Hookean hyperelastic constraints with temporal total-variation regularization, the method eliminates the need for fixed reference breathing states and aims to improve extrapolation to unseen respiratory phases.

Precise spatial delivery of the radiation dose is crucial for treatment success in radiotherapy. In the lung and upper abdominal region, respiratory motion introduces significant treatment uncertainties, requiring special motion management techniques. To address this, respiratory motion models are commonly used to infer the patient-specific respiratory motion and target the dose more efficiently. In this work, we investigate the possibility of using implicit neural representations (INRs) for surrogate-based motion modeling. To this end, we propose physics-regularized implicit surrogate-based modeling for respiratory motion (PRISM-RM). Our new integrated respiratory motion model is free of a fixed reference breathing state. Unlike conventional pairwise registration techniques, our approach provides a trajectory-aware, spatio-temporally continuous, and diffeomorphic motion representation, improving generalization to extrapolation scenarios. We introduce biophysical constraints, ensuring physiologically plausible motion estimation across time beyond the training data. Our results show that our trajectory-aware approach performs on par in interpolation and improves the extrapolation ability compared to our initially proposed INR-based approach. Compared to sequential registration-based approaches, both of our approaches perform equally well in interpolation but underperform in extrapolation scenarios. However, the methodological features of INRs make them particularly effective for respiratory motion modeling, and with their performance steadily improving, they demonstrate strong potential for advancing this field.
cs.LG cs.CY Vagish Kumar, Syed Bahauddin Alam, Souvik Chakraborty · Mar 23, 2026

Federated learning enables privacy-preserving medical AI but struggles with unreliable uncertainty estimates when clinical data is heterogeneous and imbalanced across sites. TrustFed addresses this by introducing representation-aware conformal prediction, which assigns test samples to calibration clients based on feature-space similarity and aggregates local thresholds via a soft-nearest strategy to provide finite-sample coverage guarantees without centralizing raw data. Validated on over 430,000 images across six distinct imaging modalities, the work advances federated learning from privacy-preserving training toward clinically trustworthy deployment with statistically calibrated uncertainty.

Protecting patient privacy remains a fundamental barrier to scaling machine learning across healthcare institutions, where centralizing sensitive data is often infeasible due to ethical, legal, and regulatory constraints. Federated learning offers a promising alternative by enabling privacy-preserving, multi-institutional training without sharing raw patient data; however, real-world deployments face severe challenges from data heterogeneity, site-specific biases, and class imbalance, which degrade predictive reliability and render existing uncertainty quantification methods ineffective. Here, we present TrustFed, a federated uncertainty quantification framework that provides distribution-free, finite-sample coverage guarantees under heterogeneous and imbalanced healthcare data, without requiring centralized access. TrustFed introduces a representation-aware client assignment mechanism that leverages internal model representations to enable effective calibration across institutions, along with a soft-nearest threshold aggregation strategy that mitigates assignment uncertainty while producing compact and reliable prediction sets. Using over 430,000 medical images across six clinically distinct imaging modalities, we conduct one of the most comprehensive evaluations of uncertainty-aware federated learning in medical imaging, demonstrating robust coverage guarantees across datasets with diverse class cardinalities and imbalance regimes. By validating TrustFed at this scale and breadth, our study advances uncertainty-aware federated learning from proof-of-concept toward clinically meaningful, modality-agnostic deployment, positioning statistically guaranteed uncertainty as a core requirement for next-generation healthcare AI systems.
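To make the calibration idea concrete, the sketch below combines a standard split-conformal threshold per client with a softmax-weighted ("soft-nearest") average of those thresholds. The weighting by Euclidean distance to client feature centroids, the `temperature` parameter, the score `1 - p`, and all names and numbers are assumptions for illustration; the abstract does not specify TrustFed's exact assignment or aggregation rule.

```python
import math


def conformal_threshold(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest
    calibration score (infinite if the calibration set is too small)."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def soft_nearest_threshold(feature, client_centroids, client_thresholds,
                           temperature=1.0):
    """Average client thresholds with softmax weights that favor clients
    whose feature centroid is close to the test sample (assumed rule)."""
    logits = [-math.dist(feature, c) / temperature for c in client_centroids]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(w / total * t for w, t in zip(weights, client_thresholds))


def prediction_set(probs, threshold):
    """Keep every class whose nonconformity score 1 - p is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Two hypothetical clients with local calibration scores and feature centroids.
alpha = 0.25
thresholds = [conformal_threshold([0.10, 0.20, 0.30], alpha),
              conformal_threshold([0.40, 0.50, 0.60], alpha)]
centroids = [[0.0, 0.0], [10.0, 0.0]]

t_near = soft_nearest_threshold([0.0, 0.0], centroids, thresholds)
t_mid = soft_nearest_threshold([5.0, 0.0], centroids, thresholds)
print(t_near, t_mid, prediction_set([0.8, 0.15, 0.05], t_mid))
```

A sample close to one client inherits essentially that client's threshold, while a sample between clients receives a blend, which is one way to soften hard assignment errors.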
cs.CV Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi et al. · Mar 23, 2026

Pointing-based methods improve Large Vision-Language Models (LVLMs) by grounding objects before answering, yet the underlying mechanism remains unclear. This work investigates why pointing helps by comparing Direct Counting against Point-then-Count (PtC) in zero-shot counting tasks using synthetic data with controlled spatial layouts. The authors find that intermediate coordinate supervision encourages skill learning rather than narrow task memorization, yielding stronger out-of-distribution generalization while providing verifiable visual explanations.

Pointing increases the accuracy and explainability of Large Vision-Language Models (LVLMs) by modeling grounding and reasoning as explicit sequential steps. The model grounds the objects mentioned in the natural-language query by predicting their coordinates, and then generates an answer conditioned on these points. While pointing has been shown to increase LVLMs' accuracy, it is unclear which mechanism supports these gains and its relevance in cognitive tasks. In addition, the reliability of the intermediate points remains understudied, limiting their use as visual explanations. In this work, we study the role of pointing in a cognitive task: zero-shot counting from a visual scene. We fine-tune state-of-the-art LVLMs following two approaches: Direct Counting, where models only predict the total number of objects, and Point-then-Count, where LVLMs generate the target objects' coordinates followed by their count. The results show that Point-then-Count achieves higher out-of-distribution generalization, suggesting that coordinates help LVLMs learn skills rather than overfitting on narrow tasks. Although predicted points are accurately grounded in the image in over 89% of cases (as measured by F1), performance varies across image regions, revealing spatial biases. Finally, mechanistic analyses show that gains in counting arise from the spatial information encoded in the coordinates.
eess.AS cs.CL cs.SD Xi Xuan, Wenxin Zhang, Zhiyu Li et al. · Mar 23, 2026

This paper tackles the problem of speaker traits entangling with synthesis source information in speech deepfake source verification. The authors propose a Speaker-Disentangled Metric Learning (SDML) framework that combines Chebyshev polynomial approximations for gradient stability with Riemannian geometry (hyperbolic space) to separate speaker identity from source generator artifacts. Evaluated on four new cross-protocols using the MLAAD benchmark, the method aims to prevent models from relying on speaker shortcuts when verifying synthetic speech origins.

Speech deepfake source verification systems aim to determine whether two synthetic speech utterances originate from the same source generator, often assuming that the resulting source embeddings are independent of speaker traits. However, this assumption remains unverified. In this paper, we first investigate the impact of speaker factors on source verification. We propose a speaker-disentangled metric learning (SDML) framework incorporating two novel loss functions. The first leverages Chebyshev polynomials to mitigate gradient instability during disentanglement optimization. The second projects source and speaker embeddings into hyperbolic space, leveraging Riemannian metric distances to reduce speaker information and learn more discriminative source features. Experimental results on the MLAAD benchmark, evaluated under four newly proposed protocols designed for source-speaker disentanglement scenarios, demonstrate the effectiveness of the SDML framework. The code, evaluation protocols, and demo website are available at https://github.com/xxuan-acoustics/RiemannSD-Net.
cs.CV Yu-Shan Tai, An-Yeu (Andy) Wu · Mar 22, 2026

Diffusion models generate high-quality images but require hundreds of denoising steps, making deployment on edge devices impractical. This paper proposes Coarse-to-Fine Diffusion Models that start with low-resolution denoising early in the process (when outputs are noisy anyway) before switching to high-resolution, plus a fast time-step search method that finds good sampling schedules in under 10 minutes instead of days.

Recently, diffusion models (DMs) have made significant strides in high-quality image generation. However, the multi-step denoising process often results in considerable computational overhead, impeding deployment on resource-constrained edge devices. Existing methods mitigate this issue by compressing models and adjusting the time step sequence. However, they overlook input redundancy and require lengthy search times. In this paper, we propose Coarse-to-Fine Diffusion Models with Time Step Sequence Redistribution. Recognizing indistinguishable early-stage generated images, we introduce Coarse-to-Fine Denoising (C2F) to reduce computation during coarse feature generation. Furthermore, we design Time Step Sequence Redistribution (TRD) for efficient sampling trajectory adjustment, requiring less than 10 minutes for search. Experimental results demonstrate that the proposed methods achieve near-lossless performance with an 80% to 90% reduction in computation on CIFAR10 and LSUN-Church.
cs.CL cs.LG Chi Zhang, Xixi Hu, Bo Liu et al. · Mar 23, 2026

Parallel decoding promises faster text generation than autoregressive models but historically sacrifices quality due to simplified conditional independence assumptions. This paper introduces Gumbel Distillation, which leverages the Gumbel-Max trick to create a deterministic mapping from latent noise to teacher outputs, effectively providing the parallel student a blueprint for joint token distributions. By conditioning on Gumbel noise rather than relying on naive factorization, the method narrows the quality-efficiency gap, delivering substantial improvements across masked diffusion and multi-token prediction architectures.

The slow, sequential nature of autoregressive (AR) language models has driven the adoption of parallel decoding methods. However, these non-AR models often sacrifice generation quality as they struggle to model the complex joint distribution of token sequences. To narrow this performance gap, we introduce Gumbel Distillation, a novel distillation technique that enables parallel decoders to learn this distribution effectively. Our method leverages the Gumbel-Max trick to create a deterministic mapping from a latent Gumbel noise space to the output tokens of a high-performing AR teacher. As a model-agnostic technique, Gumbel Distillation seamlessly integrates with diverse parallel decoding architectures, including MDLM and BD3-LM. Experiments on LM1B and OpenWebText show that Gumbel Distillation substantially improves the generation quality of parallel language models, achieving a 30.0% improvement in MAUVE score and a 10.5% improvement in generative perplexity over MDLM trained on the OpenWebText dataset. Code is available at https://github.com/hxixixh/gumbel-distill.
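The Gumbel-Max trick at the core of this method is easy to demonstrate in isolation: adding independent Gumbel(0, 1) noise to logits and taking the argmax yields a sample from the softmax distribution, and once the noise is fixed the chosen token is fully determined. The sketch below shows only this trick on hypothetical logits; it is not the paper's distillation procedure, and how the student is conditioned on the noise is not covered here.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform."""
    u = max(rng.random(), 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))


def gumbel_max(logits, noise):
    """Return argmax_i (logits[i] + noise[i]); deterministic once noise is fixed."""
    perturbed = [l + g for l, g in zip(logits, noise)]
    return max(range(len(perturbed)), key=perturbed.__getitem__)


def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [2.0, 1.0, 0.0]  # hypothetical teacher logits for one position

# Same noise in, same token out: the noise-to-token mapping is deterministic.
noise = [sample_gumbel(rng) for _ in logits]
token_a = gumbel_max(logits, noise)
token_b = gumbel_max(logits, noise)

# Over many fresh noise draws, token frequencies approach softmax(logits).
trials = 20000
counts = [0] * len(logits)
for _ in range(trials):
    draw = [sample_gumbel(rng) for _ in logits]
    counts[gumbel_max(logits, draw)] += 1
frequencies = [c / trials for c in counts]
print(frequencies, softmax(logits))
```

Because the randomness lives entirely in the noise vector, a student that receives the same noise can be trained against a single, well-defined teacher output rather than a distribution over outputs.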
cs.CV Wenjin Hou, Xiaoxiao Sun, Hehe Fan · Mar 22, 2026

Generative zero-shot learning (ZSL) synthesizes visual features for unseen classes conditioned on semantic prototypes, but existing methods often produce task-agnostic features that overlap for semantically similar yet visually distinct categories. This paper proposes RLVC, an outcome-reward reinforcement learning framework that treats the feature generator as a policy model and optimizes it using classifier confidence as the reward signal. The method further incorporates class-wise visual prototypes via a distillation loss to align synthesized features with real data distributions, achieving reported state-of-the-art results on CUB, SUN, and AWA2 benchmarks.

Recent advances in zero-shot learning (ZSL) have demonstrated the potential of generative models. Typically, generative ZSL synthesizes visual features conditioned on semantic prototypes to model the data distribution of unseen classes, followed by training a classifier on the synthesized data. However, the synthesized features often remain task-agnostic, leading to degraded performance. Moreover, inferring a faithful distribution from semantic prototypes alone is insufficient for classes that are semantically similar but visually distinct. To address these issues and advance ZSL, we propose RLVC, an outcome-reward reinforcement learning (RL) framework with visual cues for generative ZSL. At its core, RL empowers the generative model to self-evolve, implicitly enhancing its generation capability. In particular, RLVC updates the generative model using an outcome-based reward, encouraging the synthesis of task-relevant features. Furthermore, we introduce class-wise visual cues that (i) align synthesized features with visual prototypes and (ii) stabilize the RL training updates. For the training process, we present a novel cold-start strategy. Comprehensive experiments and analyses on three prevalent ZSL benchmarks demonstrate that RLVC achieves state-of-the-art results with a 4.7% gain.
cs.CV Purui Bai, Junxian Duan, Pin Wang et al. · Mar 23, 2026

This paper tackles real-world image restoration (Real-IR) by adapting the 12B-parameter FLUX.1-dev flow matching model to low-level vision tasks. The core innovation is ResFlow-Tuner, which combines Unified Multi-Modal Fusion (UMMF) of image and text cues with a novel test-time scaling (TTS) paradigm that greedily optimizes ODE sampling trajectories using a multi-reward ensemble during inference. This establishes a new compute-quality trade-off for generative image restoration, showing that carefully perturbing intermediate flow states can yield substantial perceptual gains without retraining the base model.

Although diffusion-based real-world image restoration (Real-IR) has achieved remarkable progress, efficiently leveraging ultra-large-scale pre-trained text-to-image (T2I) models and fully exploiting their potential remain significant challenges. To address this issue, we propose ResFlow-Tuner, an image restoration framework based on the state-of-the-art flow matching model, FLUX.1-dev, which integrates unified multi-modal fusion (UMMF) with test-time scaling (TTS) to achieve unprecedented restoration performance. Our approach fully leverages the advantages of the Multi-Modal Diffusion Transformer (MM-DiT) architecture by encoding multi-modal conditions into a unified sequence that guides the synthesis of high-quality images. Furthermore, we introduce a training-free test-time scaling paradigm tailored for image restoration. During inference, this technique dynamically steers the denoising direction through feedback from a reward model (RM), thereby achieving significant performance gains with controllable computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple standard benchmarks. This work not only validates the powerful capabilities of the flow matching model in low-level vision tasks but, more importantly, proposes a novel and efficient inference-time scaling paradigm suitable for large pre-trained models.
cs.CV cs.LG eess.IV Chedly Ben Azizi, Claire Guilloteau, Gilles Roussel et al. · Mar 23, 2026

The paper tackles the computational bottleneck of radiative transfer models (RTMs) for hyperspectral image (HSI) generation by proposing a VAE-based emulation framework that learns latent representations conditioned on biophysical parameters. It introduces both pixel-to-pixel (P2P) and fully convolutional (FC-VAE) variants, trained via either direct one-step mapping or a two-step pretraining strategy that decouples representation learning from parameter-to-latent interpolation. The work is significant for remote sensing applications as it provides empirical evidence that optimal emulator architecture depends critically on whether the target data is simulated (where P2P excels) or real-world imagery (where FC-VAE-pre dominates), and demonstrates that emulated data preserves downstream utility for parameter retrieval tasks.

Synthetic hyperspectral image (HSI) generation is essential for large-scale simulation, algorithm development, and mission design, yet traditional radiative transfer models remain computationally expensive and often limited to spectrum-level outputs. In this work, we propose a latent representation-based framework for hyperspectral emulation that learns a latent generative representation of hyperspectral data. The proposed approach supports both spectrum-level and spatial-spectral emulation and can be trained either in a direct one-step formulation or in a two-step strategy that couples variational autoencoder (VAE) pretraining with parameter-to-latent interpolation. Experiments on PROSAIL-simulated vegetation data and Sentinel-3 OLCI imagery demonstrate that the method outperforms classical regression-based emulators in reconstruction accuracy, spectral fidelity, and robustness to real-world spatial variability. We further show that emulated HSIs preserve performance in downstream biophysical parameter retrieval, highlighting the practical relevance of emulated data for remote sensing applications.
cs.CV Yixuan Luo, Feng Qiao, Zhexiao Xiong et al. · Mar 23, 2026

Optical flow estimation traditionally requires expensive ground-truth annotations or relies on unreliable brightness constancy assumptions that fail under occlusion and illumination changes. This paper introduces GenOpticalFlow, a framework that synthesizes perfectly aligned training pairs by using monocular depth estimates to generate pseudo-optical flow, then conditioning a latent diffusion model to render corresponding next frames. The core innovation is converting unsupervised optical flow learning into a supervised training paradigm using synthetic data with geometrically consistent motion fields, potentially eliminating the need for manual annotation at scale.

Optical flow estimation is a fundamental problem in computer vision, yet the reliance on expensive ground-truth annotations limits the scalability of supervised approaches. Although unsupervised and semi-supervised methods alleviate this issue, they often suffer from unreliable supervision signals based on brightness constancy and smoothness assumptions, leading to inaccurate motion estimation in complex real-world scenarios. To overcome these limitations, we introduce GenOpticalFlow, a novel framework that synthesizes large-scale, perfectly aligned frame-flow data pairs for supervised optical flow training without human annotations. Specifically, our method leverages a pre-trained depth estimation network to generate pseudo optical flows, which serve as conditioning inputs for a next-frame generation model trained to produce high-fidelity, pixel-aligned subsequent frames. This process enables the creation of abundant, high-quality synthetic data with precise motion correspondence. Furthermore, we propose an inconsistent pixel filtering strategy that identifies and removes unreliable pixels in generated frames, effectively enhancing fine-tuning performance on real-world datasets. Extensive experiments on KITTI2012, KITTI2015, and Sintel demonstrate that GenOpticalFlow achieves competitive or superior results compared to existing unsupervised and semi-supervised approaches, highlighting its potential as a scalable and annotation-free solution for optical flow learning. We will release our code upon acceptance.
cs.LG Yunchi Yang, Longlong Li, Jianliang Wu et al. · Mar 23, 2026

Next app prediction struggles when user intent shifts rapidly and historical profiles are sparse. MISApp tackles this via multi-hop session graphs that decompose transitions into 1-, 2-, and 3-hop structural ranges, using LightGCN for lightweight propagation and a Transformer encoder-decoder to model intent evolution without requiring static user profiles, aiming for robust cold-start performance.

Predicting the next mobile app a user will launch is essential for proactive mobile services. Yet accurate prediction remains challenging in real-world settings, where user intent can shift rapidly within short sessions and user-specific historical profiles are often sparse or unavailable, especially under cold-start conditions. Existing approaches mainly model app usage as sequential behavior or local session transitions, limiting their ability to capture higher-order structural dependencies and evolving session intent. To address this issue, we propose MISApp, a profile-free framework for next app prediction based on multi-hop session graph learning. MISApp constructs multi-hop session graphs to capture transition dependencies at different structural ranges, learns session representations through lightweight graph propagation, incorporates temporal and spatial context to characterize session conditions, and captures intent evolution from recent interactions. Experiments on two real-world app usage datasets show that MISApp consistently outperforms competitive baselines under both standard and cold-start settings, while maintaining a favorable balance between predictive accuracy and practical efficiency. Further analyses show that the learned hop-level attention weights align well with structural relevance, offering interpretable evidence for the effectiveness of the proposed multi-hop modeling strategy.
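One simple way to picture multi-hop session graphs is to count, for each hop distance, which app follows which within a session. The sketch below assumes a k-hop edge links two apps that sit k steps apart in the same session and uses hops 1 to 3; this edge definition, the function name, and the toy session are illustrative assumptions, and the graph propagation and intent modeling stages are not shown.

```python
from collections import Counter


def multi_hop_edges(session, max_hop=3):
    """Return {hop: Counter of (src, dst) pairs} where dst appears exactly
    `hop` steps after src in the session (assumed edge definition)."""
    graphs = {}
    for hop in range(1, max_hop + 1):
        graphs[hop] = Counter(zip(session, session[hop:]))
    return graphs


# Hypothetical app-launch session, oldest first.
session = ["maps", "chat", "mail", "chat", "camera"]
graphs = multi_hop_edges(session)

for hop, edges in graphs.items():
    print(hop, dict(edges))
```

Larger hops connect apps that never appear back to back, which is the kind of higher-order dependency that purely sequential models tend to miss.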
cs.CV Jiawei Chen, Zhe Chen, Chaoqun Du et al. · Mar 23, 2026

StreamingClaw addresses real-time streaming video understanding for embodied intelligence applications such as autonomous driving and robotics. The framework unifies continuous perception, hierarchical multimodal memory, and proactive interaction through a main–sub-agent architecture where StreamingReasoning orchestrates StreamingMemory and StreamingProactivity sub-agents. By integrating incremental KV-cache reuse with dynamic pruning, memory evolution from atomic actions to events, and trigger-based proactive responses, it aims to close the perception–decision–action loop for physical world deployment.

Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities, such as supporting only offline video understanding, lacking long-term multimodal memory mechanisms, or struggling to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck, preventing agents from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction under the online evolution of interaction objectives. (3) It supports multimodal long-term storage, hierarchical evolution, and efficient retrieval of shared memory across multiple agents. (4) It supports a closed loop of perception-decision-action. In addition to conventional tools and skills, it also provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.
cs.CV Xinghan Li, Junhao Xu, Jingjing Chen · Mar 23, 2026

VIGIL tackles hallucination in multimodal deepfake detection by decoupling claim generation from evidence sourcing through a part-centric plan-then-examine pipeline. The framework first plans which facial parts to inspect using global visual cues, then examines each part with independently sourced forensic evidence delivered via a stage-gated injection mechanism. Combined with a progressive three-stage training paradigm featuring part-aware reinforcement learning rewards, the method aims to produce verifiable, anatomically grounded explanations rather than confabulated reasoning chains.

Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a unified step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. To address this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice and built around a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence-conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
cs.SD cs.LG Risa Shinoda, Kaede Shiohara, Nakamasa Inoue et al. · Mar 23, 2026

AnimalCLAP addresses zero-shot species recognition from vocalizations—a critical challenge for biodiversity monitoring when training data is scarce for rare species. The core idea is to inject hierarchical taxonomic knowledge (class, order, family, genus, species) into audio-text contrastive learning via multiple prompt templates, paired with a large dataset of 4,225 hours covering 6,823 species annotated with 22 ecological traits. This matters because it enables automated monitoring in visually occluded habitats like dense forests while inferring biological traits directly from sound.

Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
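One way to picture the taxonomy-aware text side is to build one prompt per taxonomic rank for each recording. The sketch below is an assumption-laden illustration, not AnimalCLAP's code: the template wording and the names `RANKS` and `taxonomic_prompts` are hypothetical, and the real templates may differ.

```python
from typing import Dict, List

# Ranks named in the summary, ordered from coarse to fine.
RANKS = ("class", "order", "family", "genus", "species")


def taxonomic_prompts(taxonomy: Dict[str, str]) -> List[str]:
    """Build one text prompt per available taxonomic rank.

    The template string is a hypothetical example; any CLAP-style caption
    template could be substituted.
    """
    prompts = []
    for rank in RANKS:
        name = taxonomy.get(rank)
        if name:  # skip ranks that are missing from the annotation
            prompts.append(f"the sound of an animal of the {rank} {name}")
    return prompts


# Example lineage for the great tit (Parus major).
great_tit = {
    "class": "Aves",
    "order": "Passeriformes",
    "family": "Paridae",
    "genus": "Parus",
    "species": "Parus major",
}
prompts = taxonomic_prompts(great_tit)
```

The intuition is that a species unseen during training still shares its coarser prompts (class, order, family, often genus) with species that were seen, which gives the audio encoder something to align to.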
cs.CV Wenqing Tian, Hanyi Mao, Zhaocheng Liu et al. · Mar 23, 2026

MultiBind targets a critical blind spot in evaluating multi-subject image generators: cross-subject attribute misbinding, where models assign jackets, smiles, or poses to the wrong person. The benchmark grounds each test case in a real photograph (508 instances, 2–4 human subjects each) and provides slot-ordered crops, masks, background references, and long entity-indexed prompts (~474 words). Its core technical idea is the delta-matrix evaluation: for each attribute dimension $d$, compute $\Delta^{(d)} = S_{\mathrm{gen}}^{(d)} - S_{\mathrm{gt}}^{(d)}$, subtracting ground-truth subject similarities from generated-to-ground-truth similarities to isolate generation-induced confusion from natural subject resemblance. This separates self-degradation (diagonal) from cross-subject interference (off-diagonal) and exposes interpretable failure modes—drift, swap, dominance, and blending—that holistic metrics like CLIP or FID miss.

Subject-driven image generation is increasingly expected to support fine-grained control over multiple entities within a single image. In multi-reference workflows, users may provide several subject images, a background reference, and long, entity-indexed prompts to control multiple people within one scene. In this setting, a key failure mode is cross-subject attribute misbinding: attributes are preserved, edited, or transferred to the wrong subject. Existing benchmarks and metrics largely emphasize holistic fidelity or per-subject self-similarity, making such failures hard to diagnose. We introduce MultiBind, a benchmark built from real multi-person photographs. Each instance provides slot-ordered subject crops with masks and bounding boxes, canonicalized subject references, an inpainted background reference, and a dense entity-indexed prompt derived from structured annotations. We also propose a dimension-wise confusion evaluation protocol that matches generated subjects to ground-truth slots and measures slot-to-slot similarity using specialists for face identity, appearance, pose, and expression. By subtracting the corresponding ground-truth similarity matrices, our method separates self-degradation from true cross-subject interference and exposes interpretable failure patterns such as drift, swap, dominance, and blending. Experiments on modern multi-reference generators show that MultiBind reveals binding failures that conventional reconstruction metrics miss.
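The delta-matrix idea, $\Delta^{(d)} = S_{\mathrm{gen}}^{(d)} - S_{\mathrm{gt}}^{(d)}$, can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the benchmark's code: it assumes `s_gen[i, j]` is the similarity between generated subject `i` and ground-truth subject `j`, that `s_gt[i, j]` is the similarity between ground-truth subjects `i` and `j`, and that slot matching has already been done. The function names, the mean-based summaries, and the toy numbers are hypothetical.

```python
import numpy as np


def delta_matrix(s_gen, s_gt) -> np.ndarray:
    """Return S_gen - S_gt for one attribute dimension (e.g. face identity)."""
    s_gen = np.asarray(s_gen, dtype=float)
    s_gt = np.asarray(s_gt, dtype=float)
    if s_gen.shape != s_gt.shape:
        raise ValueError("similarity matrices must have the same shape")
    if s_gen.ndim != 2 or s_gen.shape[0] != s_gen.shape[1] or s_gen.shape[0] < 2:
        raise ValueError("expected square matrices with at least two subjects")
    return s_gen - s_gt


def summarize(delta: np.ndarray):
    """Split a delta matrix into diagonal and off-diagonal means."""
    off_diagonal = ~np.eye(delta.shape[0], dtype=bool)
    # Diagonal: how much each subject drifted from its own reference.
    self_change = float(np.diag(delta).mean())
    # Off-diagonal: similarity to *other* subjects beyond natural resemblance.
    cross_change = float(delta[off_diagonal].mean())
    return self_change, cross_change


# Toy two-subject example with made-up similarity scores.
s_gt = [[1.0, 0.3], [0.3, 1.0]]
s_gen = [[0.8, 0.5], [0.35, 0.9]]
delta = delta_matrix(s_gen, s_gt)
self_change, cross_change = summarize(delta)
```

Here a negative diagonal mean would indicate self-degradation, and a positive off-diagonal mean would indicate that generated subjects look more like the wrong person than the ground-truth subjects already do. Subtracting `s_gt` is what keeps two naturally similar people from being counted as a binding failure.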
cs.CV Hwasik Jeong, Seungryong Lee, Gyeongjin Kang et al. · Mar 22, 2026

This paper challenges the monolithic paradigm in pose-free feed-forward 3D Gaussian Splatting (3DGS), where a single network jointly estimates camera poses and synthesizes Gaussians. The authors propose 2Xplat, a modular two-expert framework that decouples geometry estimation (using Depth Anything 3) from appearance synthesis (using Multi-view Pyramid Transformer) via an explicit pose interface. The core claim is that separating these concerns enables superior training efficiency (<5K iterations) and novel-view synthesis quality competitive with posed methods, challenging the assumption that unified architectures are optimal.

Pose-free feed-forward 3D Gaussian Splatting (3DGS) has opened a new frontier for rapid 3D modeling, enabling high-quality Gaussian representations to be generated from uncalibrated multi-view images in a single forward pass. The dominant approach in this space adopts unified monolithic architectures, often built on geometry-centric 3D foundation models, to jointly estimate camera poses and synthesize 3DGS representations within a single network. While architecturally streamlined, such "all-in-one" designs may be suboptimal for high-fidelity 3DGS generation, as they entangle geometric reasoning and appearance modeling within a shared representation. In this work, we introduce 2Xplat, a pose-free feed-forward 3DGS framework based on a two-expert design that explicitly separates geometry estimation from Gaussian generation. A dedicated geometry expert first predicts camera poses, which are then explicitly passed to a powerful appearance expert that synthesizes 3D Gaussians. Although conceptually simple and largely underexplored in prior work, the proposed approach proves highly effective. In fewer than 5K training iterations, the proposed two-expert pipeline substantially outperforms prior pose-free feed-forward 3DGS approaches and achieves performance on par with state-of-the-art posed methods. These results challenge the prevailing unified paradigm and suggest the potential advantages of modular design principles for complex 3D geometric estimation and appearance synthesis tasks.
cs.CV Wen Guo, Pengfei Zhao, Zongmeng Wang et al. · Mar 23, 2026

Multi-Object Tracking (MOT) models often degrade during inference due to distribution shifts between training and test data. This paper proposes TCEI (Test-time Calibration from Experience and Intuition), a cognitive-inspired framework that uses transient memory for short-term guidance and accumulated experience for long-term calibration. Unlike traditional test-time adaptation (TTA) methods that require backpropagation, TCEI operates entirely via forward propagation, adapting identity predictions in real time without additional training.

Multiple Object Tracking (MOT) has long been a fundamental task in computer vision, with broad applications in various real-world scenarios. However, due to distribution shifts in appearance, motion pattern, and category between the training and testing data, model performance degrades considerably during online inference in MOT. Test-Time Adaptation (TTA) has emerged as a promising paradigm to alleviate such distribution shifts. However, existing TTA methods often fail to deliver satisfactory results in MOT, as they focus primarily on frame-level adaptation while neglecting temporal consistency and identity association across frames and videos. Inspired by the human decision-making process, this paper proposes a Test-time Calibration from Experience and Intuition (TCEI) framework. In this framework, the Intuitive system utilizes transient memory to recall recently observed objects for rapid predictions, while the Experiential system leverages the accumulated experience from prior test videos to reassess and calibrate these intuitive predictions. Furthermore, both confident and uncertain objects during online testing are exploited as historical priors and reflective cases, respectively, enabling the model to adapt to the testing environment and alleviate performance degradation. Extensive experiments demonstrate that the proposed TCEI framework consistently achieves superior performance across multiple benchmark datasets and significantly enhances the model's adaptability under distribution shifts. The code will be released at https://github.com/1941Zpf/TCEI.