UPPA introduces the first universal physical adversarial patch attack for infrared pedestrian detection, replacing costly instance-specific optimization with offline Particle Swarm Optimization over Bézier curve parameters. The method generates cold thermal patches that maintain topological stability under deformation, and the authors claim zero online deployment overhead.
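As a rough illustration of the offline search described above, the sketch below runs a basic particle swarm over the control points of a single cubic Bézier curve. The objective `toy_score`, the swarm hyperparameters, and all function names are assumptions made for illustration; the actual attack would score candidate patches against an infrared detector, which is not reproduced here.

```python
import random

def bezier_point(ctrl, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    pts = [tuple(p) for p in ctrl]
    while len(pts) > 1:
        pts = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(pts, pts[1:])]
    return pts[0]

def toy_score(params):
    """Hypothetical objective: squared distance of the curve midpoint
    from (0.5, 0.5). Lower is better. Not the paper's attack loss."""
    ctrl = [(params[i], params[i + 1]) for i in range(0, len(params), 2)]
    x, y = bezier_point(ctrl, 0.5)
    return (x - 0.5) ** 2 + (y - 0.5) ** 2

def pso(score, dim, n_particles=20, iters=100, seed=0, w=0.7, c1=1.5, c2=1.5):
    """Plain global-best particle swarm minimizer."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_val = [score(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = score(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val

# Four 2-D control points -> 8 parameters for one cubic curve.
best_params, best_value = pso(toy_score, dim=8)
```

Because the search runs entirely offline, the resulting parameters could be reused at deployment without further optimization, which is one way to read the zero-overhead claim.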
OpenEarth-Agent tackles the challenge of deploying autonomous Earth Observation (EO) agents in open environments characterized by diverse multi-modal data and heterogeneous tasks. Unlike existing tool-calling agents confined to closed environments with predefined tools, this work introduces a tool-creation paradigm where the agent adaptively generates specialized tools tailored to unseen data and tasks. The paper proposes a multi-agent architecture and OpenEarth-Bench (596 real-world cases across 7 domains) to evaluate this approach.
Vision-Language-Action models excel at direct visuomotor mapping but struggle with tasks requiring both fine-grained 3D spatial understanding and long-horizon logical planning. DualCoT-VLA proposes a parallel dual-stream reasoning mechanism that processes visual Chain-of-Thought for spatial perception and linguistic Chain-of-Thought for task planning simultaneously in latent space, using learnable query tokens to bypass autoregressive decoding and achieve single-step inference.
This paper proposes the Universal Normal Embedding (UNE) hypothesis: that generative models and vision encoders, despite different objectives, both approximate noisy linear projections of a shared Gaussian latent space. The authors argue that DDIM-inverted diffusion noise and encoder embeddings (CLIP, DINO) share this approximately Gaussian geometry, enabling linear semantic editing without architectural changes. They introduce NoiseZoo, a dataset of paired latents, to empirically test whether generative noise encodes semantic structure comparable to foundation encoders.
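To make the idea of linear semantic editing concrete, here is a minimal sketch on synthetic data. It assumes latents are roughly Gaussian and that a binary attribute corresponds to a shift along one direction; the direction is estimated as a difference of class means. The data, the `edit` helper, and the choice of estimator are illustrative assumptions, not the paper's procedure or the NoiseZoo dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 64, 500

# Synthetic stand-in for latents: a hidden unit direction carries the attribute.
true_dir = rng.standard_normal(dim)
true_dir /= np.linalg.norm(true_dir)
labels = rng.integers(0, 2, size=n)
latents = rng.standard_normal((n, dim)) + 2.0 * labels[:, None] * true_dir

# Estimate the attribute direction as the difference of class means.
direction = latents[labels == 1].mean(axis=0) - latents[labels == 0].mean(axis=0)
direction /= np.linalg.norm(direction)

def edit(z, alpha):
    """Shift a latent by alpha units along the estimated attribute direction."""
    return z + alpha * direction

z = latents[labels == 0][0]
z_edited = edit(z, alpha=2.0)

# How well the estimate recovers the hidden direction.
cosine = float(direction @ true_dir)
```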
DTVI proposes a dual-stage inference-time defense for unsafe text-to-image generation. Unlike existing token-level interventions, it applies category-aware sequence-level embedding purification followed by visual feature suppression during denoising, aiming to block adversarial prompts that distribute malicious semantics across the full token sequence while maintaining benign generation quality.
Text-to-video concept erasure methods claim to remove sensitive content, but current evaluation only checks if the concept is absent from generated frames. PROBE introduces a diagnostic protocol that optimizes a pseudo-token embedding with frozen model weights to test whether erased concepts can be reactivated. By probing residual capacity across three architectures and three erasure strategies, the authors find that all tested methods leave measurable residual capacity and identify temporal re-emergence—a video-specific failure mode where concepts suppressed in early frames resurface later in the sequence.
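The probing idea, optimizing only an input embedding while the model stays frozen, can be illustrated with a toy linear map. In this sketch `frozen_w` stands in for the frozen model and `target` for the response associated with the erased concept; both, along with the squared-error objective and plain gradient descent, are assumptions for illustration and say nothing about the actual video architectures tested.

```python
import numpy as np

def probe_embedding(frozen_w, target, steps=2000):
    """Gradient descent on 0.5 * ||frozen_w @ emb - target||^2 over emb only."""
    # Step size 1 / sigma_max^2 keeps the iteration stable.
    step = 1.0 / np.linalg.norm(frozen_w, 2) ** 2
    emb = np.zeros(frozen_w.shape[1])
    for _ in range(steps):
        residual = frozen_w @ emb - target
        emb -= step * frozen_w.T @ residual  # the weights are never updated
    return emb

rng = np.random.default_rng(0)
frozen_w = rng.standard_normal((16, 8))
w_snapshot = frozen_w.copy()
target = frozen_w @ rng.standard_normal(8)  # a response the frozen map can still produce

emb = probe_embedding(frozen_w, target)
initial_loss = float(np.sum(target ** 2))
final_loss = float(np.sum((frozen_w @ emb - target) ** 2))
```

If a small embedding search like this can drive the loss down, the frozen map still has the capacity to produce the target, which is the sense of residual capacity in the summary above.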
This paper tackles the brittleness of static hyperparameters in visual odometry frontends by training an RL agent to dynamically tune feature detection and tracking parameters based on raw image content. The key insight is that conditioning decisions on visual appearance enables proactive adaptation to texture density, motion blur, and noise, embedding expert knowledge directly into the system.
This paper proposes Geometric Latent Diffusion (GLD), a framework for novel view synthesis (NVS) that repurposes the feature space of geometric foundation models (specifically Depth Anything 3) as the latent space for multi-view diffusion. Unlike conventional approaches that operate in view-independent VAE latent spaces, GLD leverages geometrically consistent features that natively encode cross-view correspondences, enabling both high-fidelity RGB reconstruction and zero-shot geometry decoding while accelerating training convergence by 4.4× compared to standard VAE spaces.
The paper tackles Fine-Grained Cross-View Geolocalization (FG-CVG), where the goal is to estimate the precise 2-DoF ground location of a camera given a ground-view image and a satellite map. Current approaches force a difficult accuracy-speed trade-off: high-precision models are too slow for real-time autonomous navigation. GeoFlow introduces a lightweight framework that learns a probabilistic regression field to predict displacement vectors (distance and direction) from arbitrary location hypotheses toward the ground truth. A novel Iterative Refinement Sampling (IRS) algorithm then refines multiple random hypotheses over several rounds to reach a robust consensus. The authors claim that the system breaks the accuracy-speed barrier: it runs at 29 FPS on an NVIDIA V100, which they report as significantly faster than competing methods, while maintaining accuracy competitive with much heavier models.
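The refinement loop described above can be sketched with a simulated displacement field. Here `toy_field` is a hypothetical stand-in for the learned regression field: it returns a noisy (distance, direction) pair pointing toward a known target. The noise model, the number of hypotheses and rounds, and the median consensus rule are all assumptions for illustration, not details from the paper.

```python
import math
import random

def toy_field(hyp, target, rng, noise=0.2):
    """Hypothetical regression field: noisy distance and direction to the target."""
    dx, dy = target[0] - hyp[0], target[1] - hyp[1]
    dist = math.hypot(dx, dy) * (1.0 + rng.uniform(-noise, noise))
    angle = math.atan2(dy, dx) + rng.uniform(-noise, noise)
    return dist, angle

def iterative_refinement(target, n_hyp=32, rounds=5, size=100.0, seed=0):
    """Move random location hypotheses along predicted displacements,
    then take a per-coordinate median as the consensus estimate."""
    rng = random.Random(seed)
    hyps = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n_hyp)]
    for _ in range(rounds):
        moved = []
        for h in hyps:
            dist, angle = toy_field(h, target, rng)
            moved.append((h[0] + dist * math.cos(angle),
                          h[1] + dist * math.sin(angle)))
        hyps = moved
    xs = sorted(h[0] for h in hyps)
    ys = sorted(h[1] for h in hyps)
    mid = n_hyp // 2
    return xs[mid], ys[mid]

estimate = iterative_refinement(target=(30.0, 70.0))
```

With bounded noise, each round shrinks every hypothesis's error, so several cheap rounds can stand in for a single expensive exhaustive search.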
This paper proposes a conditional video diffusion model trained on ERA5 reanalysis to synthesize the Madden-Julian Oscillation (MJO)—the dominant mode of tropical intraseasonal variability. The core innovation is "climate prompting," where low-dimensional physical indices (MJO phase/amplitude via RMM-PCs, seasonal cycles, ENSO state) serve as conditioning tokens to generate physically consistent high-dimensional atmospheric fields. The work bridges the gap between interpretable low-order climate theory and high-resolution generative models, enabling controlled experiments like perpetual MJOs or isolated seasonal modulations for hypothesis testing.
Reconstructing translucent objects from multi-view images is challenging because subsurface scattering causes standard surface reconstruction methods to fail. This paper proposes GTSR, a 3D Gaussian Splatting (3DGS) pipeline that separates surface geometry from scattering effects by using two Gaussian sets—surface Gaussians for geometry and interior Gaussians for scattering—blended via a Fresnel term. A physically-based rendering (PBR) module with deferred shading further constrains the geometry. The method achieves state-of-the-art surface reconstruction on the NeuralTO Syn dataset while training in approximately 2.5 hours, significantly faster than prior neural implicit approaches.
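A Fresnel-weighted blend between the two Gaussian sets might look like the sketch below. The summary only says the sets are blended via a Fresnel term; the use of Schlick's approximation, the base reflectance `f0 = 0.04` (a common value for dielectrics), and the direction of the blend are assumptions made here for illustration.

```python
def schlick_fresnel(cos_theta, f0=0.04):
    """Schlick's approximation of Fresnel reflectance.
    cos_theta is the cosine between the view direction and the surface normal."""
    cos_theta = min(max(cos_theta, 0.0), 1.0)
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

def blend(surface_rgb, interior_rgb, cos_theta, f0=0.04):
    """Weight the surface color by the reflectance and the interior
    (scattering) color by the remaining transmitted fraction."""
    f = schlick_fresnel(cos_theta, f0)
    return tuple(f * s + (1.0 - f) * i for s, i in zip(surface_rgb, interior_rgb))

# Head-on view: mostly interior color. Grazing view: mostly surface color.
head_on = blend((1.0, 1.0, 1.0), (0.2, 0.6, 0.3), cos_theta=1.0)
grazing = blend((1.0, 1.0, 1.0), (0.2, 0.6, 0.3), cos_theta=0.0)
```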
The paper addresses continual unlearning in Large Vision-Language Models (LVLMs), where models must sequentially remove specific vision-instruction pairs without full retraining while preserving general utility. Prior methods suffer from distorted shared representations that create spurious associations, leading to irrelevant refusals for past forget data and over-refusal of retain queries. The proposed framework, CORE (COncept-aware REfuser), decomposes deletion targets into fine-grained visual attributes and textual intents, using a concept modulator to identify which combinations characterize each forget category and a mixture of specialized refusal experts to generate contextually appropriate refusals.
Burst image restoration in low-light conditions typically relies on fixed exposure settings that limit complementary information across frames. This paper proposes DEBIR, a pipeline that dynamically predicts per-frame exposure times using a Burst Auto-Exposure Network (BAENet) conditioned on preview images, motion, and gain. The key insight is that scene-adaptive exposures can optimally trade off noise and blur across the burst, and the authors enable end-to-end training via a novel differentiable burst simulator that eliminates the need for ground-truth exposure sequences.
DA-VAE tackles the challenge of scaling latent diffusion models to higher resolutions without linearly increasing token counts. The core idea is a structured latent representation: keep the original pretrained VAE latent channels as a 'base' and append additional 'detail' channels that encode high-resolution information, enforced by a simple alignment loss. This allows a pretrained diffusion model to be fine-tuned rather than retrained from scratch, promising significant compute savings.
FeatDistill tackles robust detection of AI-generated images under real-world degradations via a multi-expert ensemble of CLIP and SigLIP backbones. The framework combines extensive data expansion with a two-stage training paradigm featuring feature-level self-distillation. It aims to balance strong generalization across unseen generators with practical inference efficiency.
This paper tackles coronary artery segmentation from CTA images, a challenging task due to slender tubular morphology and severe class imbalance. The authors propose MDSVM-UNet, a two-stage framework that combines multidirectional snake convolution (MDSConv)—extending deformable convolution to three anatomical planes—with residual visual Mamba (RVM) for linear-complexity long-range dependency modeling. The approach aims to capture both local geometric priors of vessels and global inter-slice context while maintaining computational efficiency suitable for clinical deployment.
This paper tackles a fundamental question in multimodal large language models (MLLMs): should the vision encoder be fine-tuned or frozen during instruction tuning? The authors identify visual preference conflicts—where diverse linguistic instructions pull encoder parameters in conflicting directions—as the root cause of instability in existing visual fine-tuning (VFT) methods. They propose CoVFT, a context-aware framework that extracts multimodal context vectors and routes visual tokens through mixture-of-experts layers to decompose these conflicts, achieving consistent gains across 12 benchmarks.
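One way to picture context-conditioned routing is a soft mixture of experts whose gate depends on a context vector rather than on each token. The sketch below makes that assumption explicit: the dense softmax gate, the linear experts, and every name and shape are hypothetical and chosen for brevity, not taken from CoVFT.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_routed_moe(tokens, context, gate_w, expert_ws):
    """tokens: (n_tokens, dim); context: (ctx_dim,);
    gate_w: (ctx_dim, n_experts); expert_ws: list of (dim, dim) matrices."""
    gates = softmax(context @ gate_w)                        # (n_experts,)
    expert_out = np.stack([tokens @ w for w in expert_ws])   # (n_experts, n_tokens, dim)
    mixed = np.tensordot(gates, expert_out, axes=1)          # (n_tokens, dim)
    return mixed, gates

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 4))
context = rng.standard_normal(3)
gate_w = rng.standard_normal((3, 2))
expert_ws = [rng.standard_normal((4, 4)) for _ in range(2)]

out, gates = context_routed_moe(tokens, context, gate_w, expert_ws)
```

Because the gate is driven by the context, two instructions that would otherwise pull shared weights in different directions can be served by different expert mixtures.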
Federated learning for medical imaging typically requires task-specific pipelines and assumes homogeneous modalities across institutions, limiting real-world deployment where hospitals use diverse scanners (MRI, CT, PET) and need to support multiple downstream tasks. OmniFM proposes a frequency-domain insight: low-frequency spectral components exhibit cross-modality consistency and encode modality-invariant anatomical structures, enabling a single reusable optimization pipeline. The framework combines Global Spectral Knowledge Retrieval, Embedding-wise Cross-Attention Fusion, and Prefix-Suffix Spectral Prompting, regularized by Spectral-Proximal Alignment to stabilize aggregation under severe modality heterogeneity.
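The low-frequency components referred to above can be isolated with a simple Fourier-domain mask. This is a generic sketch of low-pass filtering with NumPy, assuming a circular mask of hypothetical radius around the DC term; it is not the paper's retrieval or prompting pipeline.

```python
import numpy as np

def low_frequency_component(image, radius):
    """Keep only spatial frequencies within `radius` of the DC term."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # move DC to the array center
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return np.real(filtered)

rng = np.random.default_rng(0)
image = rng.random((32, 32))          # random stand-in for a scan slice
low = low_frequency_component(image, radius=4)
```

A radius of zero keeps only the mean intensity, while a radius larger than the image returns the input unchanged, so the radius controls how much coarse structure is retained.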
This paper proposes SparseVoxelDet, the first fully sparse object detector for event cameras that processes asynchronous event data using 3D sparse convolutions throughout the entire pipeline—from voxelization through backbone, feature pyramid, and detection head—without ever instantiating a dense feature tensor. On the FRED drone detection benchmark, the model achieves 83.38% mAP@50 (within 4.3 points of the dense YOLOv11 baseline) while processing only ~14,900 active voxels per frame (0.23% occupancy at 640×640) instead of all 409,600 pixel positions, yielding 858× GPU memory compression and storage costs that scale with scene activity rather than sensor resolution.
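The sparse representation can be illustrated by turning raw events into a list of active voxel coordinates, with no dense grid ever allocated. The event layout `(x, y, t_us)`, the time-bin width, and the grid dimensions used for the occupancy figure are assumptions for this sketch; the detector's sparse convolutions are not shown.

```python
import numpy as np

def voxelize_events(events, bin_us):
    """events: integer array (N, 3) with columns (x, y, t_us).
    Returns unique active voxel coordinates (x, y, time_bin) and event counts."""
    coords = events.copy()
    coords[:, 2] = coords[:, 2] // bin_us        # quantize timestamps into bins
    active, counts = np.unique(coords, axis=0, return_counts=True)
    return active, counts

events = np.array([
    [10, 20, 0],
    [10, 20, 400],
    [10, 20, 1500],
    [300, 5, 900],
])
active, counts = voxelize_events(events, bin_us=1000)

# Occupancy relative to a hypothetical 640 x 640 x 2 voxel grid.
occupancy = len(active) / (640 * 640 * 2)
```

Storage here grows with the number of active voxels rather than with the sensor resolution, which is the scaling property the summary highlights.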
PEARL tackles training-free open-vocabulary semantic segmentation (OVSS), where the goal is to segment images into classes defined by arbitrary text prompts without fine-tuning the vision-language backbone. The core idea is an align-then-propagate pipeline: (1) Procrustes alignment rotates attention keys toward the query subspace inside the last self-attention block to fix spatially inconsistent patch geometry, and (2) a text-aware Laplacian propagation refines logits on a compact grid using a confidence-weighted graph that couples image gradients with text-based semantic similarity. This matters because it delivers state-of-the-art training-free accuracy with a frozen CLIP encoder, adding only modest computational overhead.
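The alignment step in (1) can be related to the classical orthogonal Procrustes problem, which has a closed-form solution via the SVD. The sketch below solves that generic problem on synthetic matrices; treating keys and queries as plain matrices and aligning them in full (rather than in a subspace inside an attention block) is a simplifying assumption, and the solution is orthogonal, so it may include a reflection.

```python
import numpy as np

def procrustes_rotation(keys, queries):
    """Orthogonal R minimizing ||keys @ R - queries||_F.
    With keys.T @ queries = U S V^T, the minimizer is R = U V^T."""
    u, _, vt = np.linalg.svd(keys.T @ queries)
    return u @ vt

rng = np.random.default_rng(0)
queries = rng.standard_normal((50, 8))

# Build keys as an orthogonally transformed copy of the queries.
true_rot, _ = np.linalg.qr(rng.standard_normal((8, 8)))
keys = queries @ true_rot.T

rot = procrustes_rotation(keys, queries)
aligned = keys @ rot
```

In this noise-free toy case the recovered transform maps the keys exactly back onto the queries; with real features it would only be the best orthogonal fit.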