Feed - arxlens

ACPO: Counteracting Likelihood Displacement in Vision-Language Alignment with Asymmetric Constraints

cs.CV Kaili Huang, Hongming Zhang, Rui Shen et al. · Mar 23, 2026

Direct Preference Optimization (DPO) for Vision-Language Models suffers from Likelihood Displacement, where optimization collapses the probabilities of both chosen and rejected responses, causing models to abandon visual evidence for language priors. This paper proposes Asymmetric Constrained Preference Optimization (ACPO), which applies dynamic, length-aware scaling exclusively to the rejected reward term, preserving the chosen distribution as a stable anchor while selectively suppressing incorrect outputs.

While Direct Preference Optimization (DPO) has become the de facto approach for aligning Large Vision-Language Models (LVLMs), it suffers from Likelihood Displacement, where the probability of both chosen and rejected responses collapses. This optimization flaw is especially detrimental in multimodal settings: the erosion of chosen likelihoods -- a failure we term Visual Anchor Collapse -- causes models to abandon visual evidence for strong language priors, precipitating significant hallucinations. To address this, we propose Asymmetric Constrained Preference Optimization (ACPO), a modality-agnostic alignment mechanism that applies dynamic, target-oriented scaling to preference optimization. ACPO derives a complexity-aware scaling coefficient applied exclusively to the rejected reward, asymmetrically suppressing the gradient flow on the rejected term while preserving the chosen distribution as a gradient-stable reference. While fundamentally a general-purpose objective, breaking this gradient symmetry is crucial for multimodal tasks, as it mitigates the suppression of visual tokens by language priors. Experiments on InternVL models demonstrate that ACPO effectively reverses the chosen-reward degradation of standard DPO. By halting Visual Anchor Collapse, ACPO generally outperforms baselines on hallucination benchmarks (HallusionBench, MM-IFEval) and general leaderboards (MMBench, MMStar, OCRBenchV2) while driving concurrent improvements in general capabilities.
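
The abstract does not state the ACPO objective itself, so the following is only a minimal sketch of the asymmetry it describes, assuming a standard DPO-style loss in which a hypothetical coefficient `rejected_scale` multiplies the rejected log-ratio and is treated as a constant. How ACPO actually derives its complexity-aware coefficient is not given here; `rejected_scale` and both function names are illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_loss(chosen_logratio, rejected_logratio, beta=0.1, rejected_scale=1.0):
    """DPO-style loss; rejected_scale=1.0 recovers the standard DPO loss.

    rejected_scale is a hypothetical stand-in for ACPO's scaling coefficient.
    """
    margin = beta * (chosen_logratio - rejected_scale * rejected_logratio)
    return -log_sigmoid(margin)


def logratio_gradients(chosen_logratio, rejected_logratio, beta=0.1, rejected_scale=1.0):
    """Gradients of the loss w.r.t. the chosen and rejected log-ratios."""
    margin = beta * (chosen_logratio - rejected_scale * rejected_logratio)
    weight = math.exp(log_sigmoid(-margin))  # equals sigmoid(-margin)
    grad_chosen = -beta * weight
    grad_rejected = beta * rejected_scale * weight
    return grad_chosen, grad_rejected


# Standard DPO: both terms receive gradients of equal magnitude.
print(logratio_gradients(0.2, -0.1, rejected_scale=1.0))
# Scaled variant: the gradient on the rejected term is damped by the coefficient.
print(logratio_gradients(0.2, -0.1, rejected_scale=0.5))
```

With `rejected_scale=1.0` the two gradients are equal and opposite, which is the symmetry the abstract refers to. Any value below 1 shrinks only the rejected-side gradient.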

Benchmarking Recurrent Event-Based Object Detection for Industrial Multi-Class Recognition on MTEvent

cs.CV Lokeshwaran Manohar, Moritz Roidl · Mar 23, 2026

This paper evaluates whether recurrent temporal modeling helps event-based object detection in industrial settings. The authors benchmark ReYOLOv8s (a recurrent ConvLSTM-augmented detector) against a vanilla YOLOv8s baseline on MTEvent, an industrial warehouse/factory dataset with 17 classes and severe class imbalance. The key question is whether memory across temporal clip lengths (3-21 frames) improves detection over single-window baselines.

Event cameras are attractive for industrial robotics because they provide high temporal resolution, high dynamic range, and reduced motion blur. However, most event-based object detection studies focus on outdoor driving scenarios or limited class settings. In this work, we benchmark recurrent ReYOLOv8s on MTEvent for industrial multi-class recognition and use a non-recurrent YOLOv8s variant as a baseline to analyze the effect of temporal memory. On the MTEvent validation split, the best scratch recurrent model (C21) reaches 0.285 mAP50, corresponding to a 9.6% relative improvement over the non-recurrent YOLOv8s baseline (0.260). Event-domain pretraining has a stronger effect: GEN1-initialized fine-tuning yields the best overall result of 0.329 mAP50 at clip length 21, and unlike scratch training, GEN1-pretrained models improve consistently with clip length. PEDRo initialization drops to 0.251, indicating that mismatched source-domain pretraining can be less effective than training from scratch. Persistent failure modes are dominated by class imbalance and human-object interaction. Overall, we position this work as a focused benchmarking and analysis study of recurrent event-based detection in industrial environments.
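
The 9.6% figure can be checked directly from the reported mAP50 values. The short calculation below reproduces it; the GEN1 and PEDRo percentages it prints are derived here from the abstract's numbers relative to the same non-recurrent baseline and are not figures stated by the authors.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


BASELINE_YOLOV8S = 0.260   # non-recurrent baseline, mAP50
SCRATCH_C21 = 0.285        # best scratch recurrent model, clip length 21
GEN1_C21 = 0.329           # GEN1-initialized fine-tuning, clip length 21
PEDRO = 0.251              # PEDRo-initialized model

for name, score in [("scratch C21", SCRATCH_C21), ("GEN1 C21", GEN1_C21), ("PEDRo", PEDRO)]:
    pct = 100 * relative_improvement(score, BASELINE_YOLOV8S)
    print(f"{name}: {pct:+.1f}% vs. non-recurrent baseline")
```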

ADaFuSE: Adaptive Diffusion-generated Image and Text Fusion for Interactive Text-to-Image Retrieval

cs.IR cs.CV Zhuocheng Zhang, Xingwu Zhang, Kangheng Liang et al. · Mar 23, 2026

This paper addresses interactive text-to-image retrieval (I-TIR) where diffusion models generate visual proxies from dialogue, but static additive fusion of text and generated images introduces harmful noise. The core idea is ADaFuSE, a lightweight plug-in module combining adaptive gating (to dynamically weight modalities per instance) with a semantic-aware mixture-of-experts branch (to capture fine-grained cross-modal cues). The work matters because it challenges the assumption that diffusion-augmented retrieval always benefits from generated images, showing that up to 55.62% of queries suffer degradation under static fusion.

Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% of samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating branch to dynamically balance modality reliability, alongside a semantic-aware mixture-of-experts branch to capture fine-grained cross-modal nuances. Via thorough evaluation over four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.
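
To make the contrast between static addition and per-instance gating concrete, here is a minimal sketch. It assumes a single scalar sigmoid gate computed from the concatenated embeddings; the gate parameterization (`gate_weights`, `gate_bias`) is hypothetical, and the mixture-of-experts branch described in the abstract is not modeled.

```python
import math


def additive_fusion(text_emb, image_emb):
    """Static fusion: every query weights both views equally."""
    return [t + v for t, v in zip(text_emb, image_emb)]


def gated_fusion(text_emb, image_emb, gate_weights, gate_bias=0.0):
    """Per-instance fusion with a scalar gate in (0, 1).

    gate_weights has one entry per dimension of [text_emb; image_emb].
    A gate near 1 trusts the text view; a gate near 0 trusts the generated image.
    """
    features = list(text_emb) + list(image_emb)
    logit = sum(w * x for w, x in zip(gate_weights, features)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-logit))
    fused = [gate * t + (1.0 - gate) * v for t, v in zip(text_emb, image_emb)]
    return fused, gate


text = [1.0, 0.0]
noisy_image = [0.0, 4.0]  # stands in for a poorly generated visual proxy
print(additive_fusion(text, noisy_image))
print(gated_fusion(text, noisy_image, gate_weights=[0.0] * 4, gate_bias=3.0))
```

In the additive case the noisy image embedding dominates the fused vector, whereas a gate that leans toward the text view keeps most of that noise out.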

Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

cs.LG cs.CL Xixi Wu, Qianguo Sun, Ruiyang Zhang et al. · Mar 23, 2026

This paper tackles the challenge of scaling reinforcement learning for long-horizon tool-using agents, where LLMs must orchestrate dozens of tool calls to satisfy multifaceted constraints. The authors propose STAR, a post-training pipeline that decomposes the RL design space across five axes—reward shaping, model scaling, data composition, algorithm selection, and environmental stability—to derive a practical, scale-aware recipe for training.

Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent as smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards, (2) ~1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance, and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.

VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

cs.CV Ruoliu Yang, Chu Wu, Caifeng Shan et al. · Mar 23, 2026

Long video understanding remains challenging for multimodal large language models due to limited context windows. VideoDetective addresses this by modeling videos as visual–temporal affinity graphs that fuse visual similarity with temporal continuity. The framework propagates query relevance through an iterative hypothesis–verification–refinement loop, enabling sparse but informed sampling of critical segments for question answering.

Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
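
The idea of propagating relevance from observed segments to unseen ones over a visual-temporal affinity graph can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the affinity is assumed to be a convex mix of cosine similarity and an exponential temporal-proximity term (`alpha` and `tau` are hypothetical parameters), and propagation is assumed to be a simple iterative weighted average with observed scores held fixed.

```python
import math


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0 else 0.0


def build_affinity(features, alpha=0.5, tau=2.0):
    """Affinity between segments i and j: visual similarity mixed with temporal proximity."""
    n = len(features)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                visual = max(0.0, cosine(features[i], features[j]))
                temporal = math.exp(-abs(i - j) / tau)
                affinity[i][j] = alpha * visual + (1.0 - alpha) * temporal
    return affinity


def propagate(affinity, observed, steps=50):
    """Spread relevance from observed segments (index -> score) to unseen ones."""
    n = len(affinity)
    scores = [observed.get(i, 0.0) for i in range(n)]
    for _ in range(steps):
        updated = []
        for i in range(n):
            if i in observed:
                updated.append(observed[i])  # verified scores stay fixed
                continue
            total = sum(affinity[i])
            weighted = sum(w * s for w, s in zip(affinity[i], scores))
            updated.append(weighted / total if total > 0 else 0.0)
        scores = updated
    return scores


# Four segments: the first two look alike, the last two look alike.
segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
scores = propagate(build_affinity(segments), observed={0: 1.0, 3: 0.0})
print(scores)
```

Segment 1 was never observed, yet it ends up with a higher score than segment 2 because it is both visually similar and temporally adjacent to the relevant segment 0. Sparse sampling would then concentrate on the highest-scoring unseen segments.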

Deep S2P: Integrating Learning Based Stereo Matching Into the Satellite Stereo Pipeline

cs.CV Elías Masquil, Thibaud Ehret, Pablo Musé et al. · Mar 23, 2026

Deep S2P modernizes the Satellite Stereo Pipeline (S2P) by replacing classical SGM and MGM correlators with contemporary learned matchers including FoundationStereo, MonSter, and StereoAnywhere. The core technical contribution adapts the rectification stage to enforce unipolar disparities with proper altitude consistency and disparity range constraints, enabling off-the-shelf deep networks to operate on satellite imagery. This matters for operational Earth observation because it delivers sharper Digital Surface Models with finer geometric detail, though the work also candidly exposes how standard metrics saturate and how vegetation remains a stubborn failure mode.

Digital Surface Model generation from satellite imagery is a core task in Earth observation and is commonly addressed using classical stereoscopic matching algorithms in satellite pipelines such as the Satellite Stereo Pipeline (S2P). While recent learning-based stereo matchers achieve state-of-the-art performance on standard benchmarks, their integration into operational satellite pipelines remains challenging due to differences in viewing geometry and disparity assumptions. In this work, we integrate several modern learning-based stereo matchers, including StereoAnywhere, MonSter, FoundationStereo, and a satellite fine-tuned variant of MonSter, into the Satellite Stereo Pipeline, adapting the rectification stage to enforce compatible disparity polarity and range. We release the corresponding code to enable reproducible use of these methods in large-scale Earth observation workflows. Experiments on satellite imagery show consistent improvements over classical cost-volume-based approaches in terms of Digital Surface Model accuracy, although commonly used metrics such as mean absolute error exhibit saturation effects. Qualitative results reveal substantially improved geometric detail and sharper structures, highlighting the need for evaluation strategies that better reflect perceptual and structural fidelity. At the same time, performance over challenging surface types such as vegetation remains limited across all evaluated models, indicating open challenges for learning-based stereo in natural environments.

Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection

cs.CV Youbin Kim, Jinho Park, Hogun Park et al. · Mar 23, 2026

Group3D addresses open-vocabulary 3D object detection from multi-view RGB images by integrating semantic constraints directly into instance construction. Unlike prior work that merges fragments based solely on geometric consistency, it leverages a multimodal large language model to organize scene vocabularies into semantic compatibility groups that gate cross-view fragment association. This prevents irreversible over-merging when geometric evidence is incomplete, achieving state-of-the-art results on ScanNet and ARKitScenes in both pose-known and challenging pose-free zero-shot settings.

Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.
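
The merge-time rule (associate fragments only when they are both semantically compatible and geometrically consistent) can be sketched as below. The geometric test is assumed here to be an IoU threshold on axis-aligned 3D boxes, and the compatibility groups are hand-written examples; in the paper the groups come from an MLLM and the actual geometric criterion is not specified in this abstract.

```python
def box_volume(box):
    """box = (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return max(0.0, box[3] - box[0]) * max(0.0, box[4] - box[1]) * max(0.0, box[5] - box[2])


def iou_3d(a, b):
    inter_box = (
        max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
        min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]),
    )
    inter = box_volume(inter_box)
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def semantically_compatible(label_a, label_b, groups):
    """Labels are compatible if identical or listed in the same compatibility group."""
    return label_a == label_b or any(label_a in g and label_b in g for g in groups)


def should_merge(frag_a, frag_b, groups, iou_threshold=0.3):
    """Semantically gated merging: both conditions must hold."""
    return (
        semantically_compatible(frag_a["label"], frag_b["label"], groups)
        and iou_3d(frag_a["box"], frag_b["box"]) >= iou_threshold
    )


# Hypothetical groups; per-view labels for one object may differ ("sofa" vs. "couch").
groups = [{"sofa", "couch"}, {"table", "desk"}]
sofa = {"label": "sofa", "box": (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)}
couch = {"label": "couch", "box": (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)}
table = {"label": "table", "box": (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)}

print(should_merge(sofa, couch, groups))  # overlapping and compatible
print(should_merge(sofa, table, groups))  # overlapping but incompatible
```

A geometry-only rule would merge the sofa and table fragments because their boxes overlap. The semantic gate blocks that merge while still absorbing the sofa/couch label variation across views.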

FontCrafter: High-Fidelity Element-Driven Artistic Font Creation with Visual In-Context Generation

cs.CV Wuyang Luo, Chengkai Tan, Chang Ge et al. · Mar 23, 2026

Artistic font generation seeks to transfer visual styles from reference images onto text glyphs while preserving readability. This paper proposes a paradigm shift from feature-fusion or adapter-based diffusion approaches to visual in-context generation, treating element images as pixel-level context for an inpainting model (FLUX.1-Fill). The core innovation lies in repurposing image inpainting as style transfer: element images are concatenated with a blank canvas, and the model fills glyph masks by propagating visual cues from the reference. This enables high-fidelity texture preservation and fine-grained control via a lightweight Context-aware Mask Adapter (CMA), supporting both object elements (structured) and amorphous elements (textures).

Artistic font generation aims to synthesize stylized glyphs based on a reference style. However, existing approaches suffer from limited style diversity and coarse control. In this work, we explore the potential of element-driven artistic font generation. Elements are the fundamental visual units of a font, serving as reference images for the desired style. Conceptually, we categorize elements into object elements (e.g., flowers or stones) with distinct structures and amorphous elements (e.g., flames or clouds) with unstructured textures. We introduce FontCrafter, an element-driven framework for font creation, and construct a large-scale dataset, ElementFont, which contains diverse element types and high-quality glyph images. However, achieving high-fidelity reconstruction of both texture and structure of reference elements remains challenging. To address this, we propose an in-context generation strategy that treats element images as visual context and uses an inpainting model to transfer element styles into glyph regions at the pixel level. To further control glyph shapes, we design a lightweight Context-aware Mask Adapter (CMA) that injects shape information. Moreover, a training-free attention redirection mechanism enables region-aware style control and suppresses stroke hallucination. In addition, edge repainting is applied to make boundaries more natural. Extensive experiments demonstrate that FontCrafter achieves strong zero-shot generation performance, particularly in preserving structural and textural fidelity, while also supporting flexible controls such as style mixture.

Identification of physiological shock in intensive care units via Bayesian regime switching models

stat.AP stat.ME stat.ML Emmett B. Kendall, Jonathan P. Williams, Curtis B. Storlie et al. · Mar 23, 2026

This paper addresses the critical challenge of detecting occult hemorrhage (internal bleeding) in intensive care units, where delayed diagnosis leads to preventable physiological shock and death. The authors develop a Bayesian regime switching model (RSM) that tracks five latent physiological states—including stable, hemorrhage, and recovery—using longitudinal vital signs (heart rate, MAP, hemoglobin, lactate) and medication history. Applied to 33,924 Mayo Clinic ICU encounters, the model aims to provide interpretable, probabilistic early warnings that outperform standard vital sign monitoring by accounting for autoregressive trends and pre-admission physiological changes.

Detection of occult hemorrhage (i.e., internal bleeding) in patients in intensive care units (ICUs) can pose significant challenges for critical care workers. Because blood loss may not always be clinically apparent, clinicians rely on monitoring vital signs for specific trends indicative of a hemorrhage event. The inherent difficulties of diagnosing such an event can lead to late intervention by clinicians, which has catastrophic consequences. Therefore, a methodology for early detection of hemorrhage has wide utility. We develop a Bayesian regime switching model (RSM) that analyzes trends in patients' vitals and labs to provide a probabilistic assessment of the underlying physiological state that a patient is in at any given time. This article is motivated by a comprehensive dataset we curated from Mayo Clinic of 33,924 real ICU patient encounters. Longitudinal response measurements are modeled as a vector autoregressive process conditional on all latent states up to the current time point, and the latent states follow a Markov process. We present a novel Bayesian sampling routine to learn the posterior probability distribution of the latent physiological states, as well as develop an approach to account for pre-ICU-admission physiological changes. A simulation and real case study illustrate the effectiveness of our approach.

The Golden Subspace: Where Efficiency Meets Generalization in Continual Test-Time Adaptation

cs.CV cs.LG Guannan Lai, Da-Wei Zhou, Zhenguo Li et al. · Mar 23, 2026

This paper tackles the efficiency–generalization trade-off in Continual Test-Time Adaptation (CTTA), where models must adapt online to unlabeled streams under distribution shift without source data. The core insight is that feature updates need only occur within a low-rank "golden subspace" coinciding with the row space of the classifier. To avoid costly retraining, the authors propose using the Average Gradient Outer Product (AGOP) as an online proxy for the classifier weight structure, leading to the GOLD method that projects features onto this subspace and learns a compact scaling vector. If the theoretical claims hold under realistic nonlinear settings, this could significantly reduce deployment costs for adaptive systems.

Continual Test-Time Adaptation (CTTA) aims to enable models to adapt online to unlabeled data streams under distribution shift without accessing source data. Existing CTTA methods face an efficiency-generalization trade-off: updating more parameters improves adaptation but severely reduces online inference efficiency. An ideal solution is to achieve comparable adaptation with minimal feature updates; we call this minimal subspace the golden subspace. We prove its existence in a single-step adaptation setting and show that it coincides with the row space of the pretrained classifier. To enable online maintenance of this subspace, we introduce the sample-wise Average Gradient Outer Product (AGOP) as an efficient proxy for estimating the classifier weights without retraining. Building on these insights, we propose Guided Online Low-rank Directional adaptation (GOLD), which uses a lightweight adapter to project features onto the golden subspace and learns a compact scaling vector while the subspace is dynamically updated via AGOP. Extensive experiments on classification and segmentation benchmarks, including autonomous-driving scenarios, demonstrate that GOLD attains superior efficiency, stability, and overall performance. Our code is available at https://github.com/AIGNLAI/GOLD.
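
One way to see why the classifier's row space is a natural candidate for the golden subspace: for a linear classifier, any feature component orthogonal to the rows of the weight matrix leaves the logits unchanged. The sketch below illustrates only this linear-algebra fact using Gram-Schmidt. It is not the GOLD method, and it does not implement AGOP or the learned scaling vector; all function names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(rows, tol=1e-10):
    """Gram-Schmidt basis for the span of the given rows (the row space)."""
    basis = []
    for row in rows:
        residual = list(row)
        for q in basis:
            coeff = dot(residual, q)
            residual = [r - coeff * qi for r, qi in zip(residual, q)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:
            basis.append([r / norm for r in residual])
    return basis


def project(feature, basis):
    """Orthogonal projection of a feature vector onto the span of the basis."""
    projected = [0.0] * len(feature)
    for q in basis:
        coeff = dot(feature, q)
        projected = [p + coeff * qi for p, qi in zip(projected, q)]
    return projected


def logits(weights, feature):
    return [dot(row, feature) for row in weights]


# Toy linear classifier: 2 classes, 3-dimensional features.
W = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
feature = [3.0, -1.0, 2.0]
projected = project(feature, orthonormal_basis(W))

print(logits(W, feature))
print(logits(W, projected))  # same logits, up to floating-point error
```

Because the projected feature yields the same logits, updates restricted to this low-rank subspace lose nothing for a linear head, which is consistent with the single-step result the abstract describes.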

Thermal Topology Collapse: Universal Physical Patch Attacks on Infrared Vision Systems

cs.CV Chengyin Hu, Yikun Guo, Yuxian Dong et al. · Mar 23, 2026

UPPA introduces the first universal physical adversarial patch attack for infrared pedestrian detection, replacing costly instance-specific optimization with offline Particle Swarm Optimization over Bézier curve parameters. The method generates cold thermal patches that maintain topological stability under deformation while claiming zero online deployment overhead.

Although infrared pedestrian detectors have been widely deployed in visual perception tasks, their vulnerability to physical adversarial attacks is becoming increasingly apparent. Existing physical attack methods predominantly rely on instance-specific online optimization and rigid pattern design, leading to high deployment costs and insufficient physical robustness. To address these limitations, this work proposes the Universal Physical Patch Attack (UPPA), the first universal physical attack method in the infrared domain. This method employs geometrically constrained parameterized Bézier blocks to model perturbations and utilizes the Particle Swarm Optimization (PSO) algorithm to perform unified optimization across the global data distribution, thus maintaining topological stability under dynamic deformations. In the physical deployment phase, we materialize the optimized digital perturbations into physical cold patches, achieving a continuous and smooth low-temperature distribution that naturally aligns with the thermal radiation characteristics of infrared imaging. Extensive experiments demonstrate that UPPA achieves an outstanding physical attack success rate without any online computational overhead, while also exhibiting strong cross-domain generalization and reliable black-box transferability.

OpenEarth-Agent: From Tool Calling to Tool Creation for Open-Environment Earth Observation

cs.CV Sijie Zhao, Feng Liu, Xueliang Zhang et al. · Mar 23, 2026

OpenEarth-Agent tackles the challenge of deploying autonomous Earth Observation (EO) agents in open environments characterized by diverse multi-modal data and heterogeneous tasks. Unlike existing tool-calling agents confined to closed environments with predefined tools, this work introduces a tool-creation paradigm where the agent adaptively generates specialized tools tailored to unseen data and tasks. The paper proposes a multi-agent architecture and OpenEarth-Bench (596 real-world cases across 7 domains) to evaluate this approach.

Earth Observation (EO) is essential for perceiving dynamic land surface changes, yet deploying autonomous EO in open environments is hindered by the immense diversity of multi-source data and heterogeneous tasks. While remote sensing agents have emerged to streamline EO workflows, existing tool-calling agents are confined to closed environments. They rely on predefined tools and are restricted to a narrow scope, limiting their generalization to diverse data and tasks. To overcome these limitations, we introduce OpenEarth-Agent, the first tool-creation agent framework tailored for open-environment EO. Rather than calling predefined tools, OpenEarth-Agent employs adaptive workflow planning and tool creation to generalize to unseen data and tasks. This adaptability is bolstered by an open-ended integration of multi-stage tools and cross-domain knowledge bases, enabling robust execution in the entire EO pipeline across multiple application domains. To comprehensively evaluate EO agents in open environments, we propose OpenEarth-Bench, a novel benchmark comprising 596 real-world, full-pipeline cases across seven application domains, explicitly designed to assess agents' adaptive planning and tool creation capabilities. Only essential pre-trained model tools are provided in this benchmark, devoid of any other predefined task-specific tools. Extensive experiments demonstrate that OpenEarth-Agent successfully masters full-pipeline EO across multiple domains in the open environment. Notably, on the cross-benchmark Earth-Bench, our tool-creating agent equipped with 6 essential pre-trained models achieves performance comparable to tool-calling agents relying on 104 specialized tools, and significantly outperforms them when provided with the complete toolset. In several cases, the created tools exhibit superior robustness to data anomalies compared to human-engineered counterparts.

DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

cs.CV cs.RO Zhide Zhong, Junfeng Li, Junjie He et al. · Mar 23, 2026

Vision-Language-Action models excel at direct visuomotor mapping but struggle with tasks requiring both fine-grained 3D spatial understanding and long-horizon logical planning. DualCoT-VLA proposes a parallel dual-stream reasoning mechanism that processes visual Chain-of-Thought for spatial perception and linguistic Chain-of-Thought for task planning simultaneously in latent space, using learnable query tokens to bypass autoregressive decoding and achieve single-step inference.

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained spatial perception. Recent efforts have incorporated Chain-of-Thought (CoT) reasoning to endow VLA models with a "thinking before acting" capability. However, current CoT-based VLA models face two critical limitations: 1) an inability to simultaneously capture low-level visual details and high-level logical planning due to their reliance on isolated, single-modal CoT; 2) high inference latency with compounding errors caused by step-by-step autoregressive decoding. To address these limitations, we propose DualCoT-VLA, a visual-linguistic CoT method for VLA models with a parallel reasoning mechanism. To achieve comprehensive multi-modal reasoning, our method integrates a visual CoT for low-level spatial understanding and a linguistic CoT for high-level task planning. Furthermore, to overcome the latency bottleneck, we introduce a parallel CoT mechanism that incorporates two sets of learnable query tokens, shifting autoregressive reasoning to single-step forward reasoning. Extensive experiments demonstrate that our DualCoT-VLA achieves state-of-the-art performance on the LIBERO and RoboCasa GR1 benchmarks, as well as in real-world platforms.

The Universal Normal Embedding

cs.CV eess.IV Chen Tasker, Roy Betser, Eyal Gofer et al. · Mar 23, 2026

This paper proposes the Universal Normal Embedding (UNE) hypothesis: that generative models and vision encoders, despite different objectives, both approximate noisy linear projections of a shared Gaussian latent space. The authors argue that DDIM-inverted diffusion noise and encoder embeddings (CLIP, DINO) share this approximately Gaussian geometry, enabling linear semantic editing without architectural changes. They introduce NoiseZoo, a dataset of paired latents, to empirically test whether generative noise encodes semantic structure comparable to foundation encoders.

Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian latent space from which encoder embeddings and DDIM-inverted noise arise as noisy linear projections. To test our hypothesis, we introduce NoiseZoo, a dataset of per-image latents comprising DDIM-inverted diffusion noise and matching encoder representations (CLIP, DINO). On CelebA, linear probes in both spaces yield strong, aligned attribute predictions, indicating that generative noise encodes meaningful semantics along linear directions. These directions further enable faithful, controllable edits (e.g., smile, gender, age) without architectural changes, where simple orthogonalization mitigates spurious entanglements. Taken together, our results provide empirical support for the UNE hypothesis and reveal a shared Gaussian-like latent geometry that concretely links encoding and generation. Code and data are available at https://rbetser.github.io/UNE/
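
The role of orthogonalization in linear editing can be illustrated with a small sketch. It assumes that each attribute corresponds to a direction in latent space (for example, a linear probe's weight vector) and that an edit adds a multiple of that direction to the latent; the specific vectors and the function names are hypothetical, and the abstract does not say exactly how the authors orthogonalize.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(direction, nuisance):
    """Remove from `direction` its component along `nuisance`."""
    coeff = dot(direction, nuisance) / dot(nuisance, nuisance)
    return [d - coeff * n for d, n in zip(direction, nuisance)]


def edit(latent, direction, strength):
    """Linear edit: move the latent along an attribute direction."""
    return [z + strength * d for z, d in zip(latent, direction)]


# Toy 3-D latent space with an entangled "smile" direction.
smile = [1.0, 1.0, 0.0]    # partly aligned with the gender direction
gender = [0.0, 1.0, 0.0]
latent = [0.5, 0.5, 0.5]

clean_smile = orthogonalize(smile, gender)

entangled = edit(latent, smile, 2.0)
disentangled = edit(latent, clean_smile, 2.0)

# Score of a linear gender probe before and after each edit.
print(dot(latent, gender), dot(entangled, gender), dot(disentangled, gender))
```

Editing along the raw smile direction shifts the gender probe score, while editing along the orthogonalized direction leaves it unchanged. This is the kind of spurious entanglement the abstract says orthogonalization mitigates.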

DTVI: Dual-Stage Textual and Visual Intervention for Safe Text-to-Image Generation

cs.CV Binhong Tan, Zhaoxin Wang, Handing Wang · Mar 23, 2026

DTVI proposes a dual-stage inference-time defense for unsafe text-to-image generation. Unlike existing token-level interventions, it applies category-aware sequence-level embedding purification followed by visual feature suppression during denoising, aiming to block adversarial prompts that distribute malicious semantics across the full token sequence while maintaining benign generation quality.

Text-to-Image (T2I) diffusion models have demonstrated strong generation ability, but their potential to generate unsafe content raises significant safety concerns. Existing inference-time defense methods typically perform category-agnostic token-level intervention in the text embedding space, which fails to capture malicious semantics distributed across the full token sequence and remains vulnerable to adversarial prompts. In this paper, we propose DTVI, a dual-stage inference-time defense framework for safe T2I generation. Unlike existing methods that intervene on specific token embeddings, our method introduces category-aware sequence-level intervention on the full prompt embedding to better capture distributed malicious semantics, and further attenuates the remaining unsafe influences during the visual generation stage. Experimental results on real-world unsafe prompts, adversarial prompts, and multiple harmful categories show that our method achieves effective and robust defense, obtaining an average Defense Success Rate (DSR) of 94.43% across sexual-category benchmarks and 88.56% across seven unsafe categories, while preserving reasonable generation quality on benign prompts.

A Job I Like or a Job I Can Get: Designing Job Recommender Systems Using Field Experiments

econ.EM stat.ML Guillaume Bied, Philippe Caillou, Bruno Crépon et al. · Mar 23, 2026

Job recommender systems deployed by public employment services are typically optimized for predictive metrics like clicks, applications, or hires rather than job seeker welfare. This paper develops a structural job-search model where vacancy value depends on utility $U$ and hiring probability $p$, deriving a welfare-optimal ranking based on an expected-surplus index $\Gamma(p, U) = p \sigma \log(1 + e^{\Delta(p,U)/\sigma})$. Through two randomized field experiments with the French public employment service, the authors demonstrate that algorithms approximating this theoretical benchmark substantially outperform existing approaches, while formalizing the "inversion problem" where behavior-based rankings diverge from welfare-maximizing ones.

Recommendation systems (RSs) are increasingly used to guide job seekers on online platforms, yet the algorithms currently deployed are typically optimized for predictive objectives such as clicks, applications, or hires, rather than job seekers' welfare. We develop a job-search model with an application stage in which the value of a vacancy depends on two dimensions: the utility it delivers to the worker and the probability that an application succeeds. The model implies that welfare-optimal RSs rank vacancies by an expected-surplus index combining both, and shows why rankings based solely on utility, hiring probabilities, or observed application behavior are generically suboptimal, an instance of the inversion problem between behavior and welfare. We test these predictions and quantify their practical importance through two randomized field experiments conducted with the French public employment service. The first experiment, comparing existing algorithms and their combinations, provides behavioral evidence that both dimensions shape application decisions. Guided by the model and these results, the second experiment extends the comparison to an RS designed to approximate the welfare-optimal ranking. The experiments generate exogenous variation in the vacancies shown to job seekers, allowing us to estimate the model, validate its behavioral predictions, and construct a welfare metric. Algorithms informed by the model-implied optimal ranking substantially outperform existing approaches and perform close to the welfare-optimal benchmark. Our results show that embedding predictive tools within a simple job-search framework and combining it with experimental evidence yields recommendation rules with substantial welfare gains in practice.
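
As a rough illustration of ranking by the expected-surplus index $\Gamma(p, U) = p \sigma \log(1 + e^{\Delta(p,U)/\sigma})$ quoted in the summary, the sketch below scores three made-up vacancies. The form of $\Delta(p,U)$ is not given here, so it is simply passed in as a number per vacancy (the `delta` field); that field, the vacancy values, $\sigma = 1$, and the function names are all assumptions for illustration, not the authors' implementation.

```python
import math

def softplus(x):
    """Numerically stable log(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def expected_surplus(p, delta, sigma=1.0):
    """Gamma = p * sigma * log(1 + exp(delta / sigma))."""
    return p * sigma * softplus(delta / sigma)

def rank_by_surplus(vacancies, sigma=1.0):
    """Sort vacancies from highest to lowest expected-surplus index."""
    return sorted(
        vacancies,
        key=lambda v: expected_surplus(v["p"], v["delta"], sigma),
        reverse=True,
    )

# Hypothetical vacancies: p is the hiring probability, delta stands in for Delta(p, U).
vacancies = [
    {"name": "A", "p": 0.9, "delta": 0.1},  # easy to get, low surplus
    {"name": "B", "p": 0.1, "delta": 3.0},  # attractive, hard to get
    {"name": "C", "p": 0.5, "delta": 1.5},  # balanced
]

ranked = [v["name"] for v in rank_by_surplus(vacancies)]  # ['C', 'A', 'B']
```

In this toy example, ranking by `p` alone would put A first and ranking by `delta` alone would put B first, while the combined index puts C first. That mirrors the abstract's point that single-dimension rankings are generically suboptimal.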

PROBE: Diagnosing Residual Concept Capacity in Erased Text-to-Video Diffusion Models

cs.CV Yiwei Xie, Zheng Zhang, Ping Liu · Mar 23, 2026

Text-to-video concept erasure methods claim to remove sensitive content, but current evaluation only checks if the concept is absent from generated frames. PROBE introduces a diagnostic protocol that optimizes a pseudo-token embedding with frozen model weights to test whether erased concepts can be reactivated. By probing residual capacity across three architectures and three erasure strategies, the authors find that all tested methods leave measurable residual capacity and identify temporal re-emergence—a video-specific failure mode where concepts suppressed in early frames resurface later in the sequence.

Concept erasure techniques for text-to-video (T2V) diffusion models report substantial suppression of sensitive content, yet current evaluation is limited to checking whether the target concept is absent from generated frames, treating output-level suppression as evidence of representational removal. We introduce PROBE, a diagnostic protocol that quantifies the reactivation potential of erased concepts in T2V models. With all model parameters frozen, PROBE optimizes a lightweight pseudo-token embedding through a denoising reconstruction objective combined with a novel latent alignment constraint that anchors recovery to the spatiotemporal structure of the original concept. We make three contributions: (1) a multi-level evaluation framework spanning classifier-based detection, semantic similarity, temporal reactivation analysis, and human validation; (2) systematic experiments across three T2V architectures, three concept categories, and three erasure strategies revealing that all tested methods leave measurable residual capacity whose robustness correlates with intervention depth; and (3) the identification of temporal re-emergence, a video-specific failure mode where suppressed concepts progressively resurface across frames, invisible to frame-level metrics. These findings suggest that current erasure methods achieve output-level suppression rather than representational removal. We release our protocol to support reproducible safety auditing. Our code is available at https://github.com/YiweiXie/PRObingBasedEvaluation.

Image-Conditioned Adaptive Parameter Tuning for Visual Odometry Frontends

cs.CV Simone Nascivera, Leonard Bauersfeld, Jeff Delaune et al. · Mar 23, 2026

This paper tackles the brittleness of static hyperparameters in visual odometry frontends by training an RL agent to dynamically tune feature detection and tracking parameters based on raw image content. The key insight is that conditioning decisions on visual appearance enables proactive adaptation to texture density, motion blur, and noise, embedding expert knowledge directly into the system.

Resource-constrained autonomous robots rely on sparse direct and semi-direct visual-(inertial)-odometry (VO) pipelines, as they provide a favorable tradeoff between accuracy, robustness, and computational cost. However, the performance of most systems depends critically on hand-tuned hyperparameters governing feature detection, tracking, and outlier rejection. These parameters are typically fixed during deployment, even though their optimal values vary with scene characteristics such as texture density, illumination, motion blur, and sensor noise, leading to brittle performance in real-world environments. We propose the first image-conditioned reinforcement learning framework for online tuning of VO frontend parameters, effectively embedding the expert into the system. Our key idea is to formulate the frontend configuration as a sequential decision-making problem and learn a policy that directly maps visual input to feature detection and tracking parameters. The policy uses a lightweight texture-aware CNN encoder and a privileged critic during training. Unlike prior RL-based approaches that rely solely on internal VO statistics, our method observes the image content and proactively adapts parameters before tracking degrades. Experiments on TartanAirV2 and TUM RGB-D show 3x longer feature tracks and 3x lower computational cost, despite training entirely in simulation.

Repurposing Geometric Foundation Models for Multi-view Diffusion

cs.CV Wooseok Jang, Seonghu Jeon, Jisang Han et al. · Mar 23, 2026

This paper proposes Geometric Latent Diffusion (GLD), a framework for novel view synthesis (NVS) that repurposes the feature space of geometric foundation models (specifically Depth Anything 3) as the latent space for multi-view diffusion. Unlike conventional approaches that operate in view-independent VAE latent spaces, GLD leverages geometrically consistent features that natively encode cross-view correspondences, enabling both high-fidelity RGB reconstruction and zero-shot geometry decoding while accelerating training convergence by 4.4× compared to standard VAE spaces.

While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.

GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction

cs.CV Ayesh Abu Lehyeh, Xiaohan Zhang, Ahmad Arrabi et al. · Mar 23, 2026

The paper tackles Fine-Grained Cross-View Geolocalization (FG-CVG), where the goal is to estimate the precise 2-DoF ground location of a camera given a ground-view image and a satellite map. Current approaches force a difficult accuracy-speed trade-off: high-precision models are too slow for real-time autonomous navigation. GeoFlow introduces a lightweight framework that learns a probabilistic regression field to predict displacement vectors (distance and direction) from arbitrary location hypotheses toward the ground truth. A novel Iterative Refinement Sampling (IRS) algorithm then refines multiple random hypotheses over several rounds to reach a robust consensus. The system claims to break the accuracy-speed barrier, achieving 29 FPS on an NVIDIA V100—significantly faster than competitors—while maintaining accuracy competitive with much heavier models.

Accurate and fast localization is vital for safe autonomous navigation in GPS-denied areas. Fine-Grained Cross-View Geolocalization (FG-CVG) aims to estimate the precise 2-Degree-of-Freedom (2-DoF) location of a ground image relative to a satellite image. However, current methods force a difficult trade-off, with high-accuracy models being too slow for real-time use. In this paper, we introduce GeoFlow, a lightweight and highly efficient framework that breaks this accuracy-speed trade-off. Our technique learns a direct probabilistic mapping, predicting the displacement (in distance and direction) required to correct any given location hypothesis. This is complemented by our novel inference algorithm, Iterative Refinement Sampling (IRS). Instead of trusting a single prediction, IRS refines a population of hypotheses, allowing them to iteratively 'flow' from random starting points to a robust, converged consensus. Despite its iterative nature, this approach offers flexible inference-time scaling, allowing a direct trade-off between performance and computation without any re-training. Experiments on the KITTI and VIGOR datasets show that GeoFlow achieves state-of-the-art efficiency, running at real-time speeds of 29 FPS while maintaining competitive localization accuracy. This work opens a new path for the development of practical real-time geolocalization systems.
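
The hypothesis-refinement idea described above can be sketched in a few lines. This is a toy under stated assumptions, not the GeoFlow implementation: `toy_displacement` is a hypothetical stand-in for the learned model (it peeks at the ground truth and adds noise), and taking the per-axis median as the "consensus" is an assumption, as are the hypothesis count, round count, and search extent.

```python
import math
import random
import statistics

def toy_displacement(hypothesis, truth, rng, noise=0.5):
    """Hypothetical stand-in for the learned predictor: noisy (distance, direction) to truth."""
    dx, dy = truth[0] - hypothesis[0], truth[1] - hypothesis[1]
    distance = math.hypot(dx, dy) + rng.gauss(0.0, noise)
    direction = math.atan2(dy, dx) + rng.gauss(0.0, 0.1)
    return distance, direction

def iterative_refinement(predict, n_hypotheses=32, n_rounds=5, extent=50.0, seed=0):
    """Move random 2-D hypotheses along predicted displacements, then take a consensus."""
    rng = random.Random(seed)
    hyps = [
        (rng.uniform(-extent, extent), rng.uniform(-extent, extent))
        for _ in range(n_hypotheses)
    ]
    for _ in range(n_rounds):
        moved = []
        for x, y in hyps:
            distance, direction = predict((x, y), rng)
            moved.append(
                (x + distance * math.cos(direction), y + distance * math.sin(direction))
            )
        hyps = moved
    # Assumed consensus rule: per-axis median of the refined hypotheses.
    return (
        statistics.median([x for x, _ in hyps]),
        statistics.median([y for _, y in hyps]),
    )

truth = (12.0, -7.0)  # made-up ground-truth location
estimate = iterative_refinement(lambda h, rng: toy_displacement(h, truth, rng))
```

Raising `n_hypotheses` or `n_rounds` spends more computation for a tighter consensus without touching the predictor, which is the kind of inference-time scaling the abstract refers to.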
