This paper evaluates whether recurrent temporal modeling helps event-based object detection in industrial settings. The authors benchmark ReYOLOv8s (a recurrent, ConvLSTM-augmented detector) against a vanilla YOLOv8s baseline on MTEvent, an industrial warehouse/factory dataset with 17 classes and severe class imbalance. The key question is whether recurrent memory, evaluated at temporal clip lengths of 3 to 21 frames, improves detection over single-window baselines.
This paper addresses interactive text-to-image retrieval (I-TIR) where diffusion models generate visual proxies from dialogue, but static additive fusion of text and generated images introduces harmful noise. The core idea is ADaFuSE, a lightweight plug-in module combining adaptive gating (to dynamically weight modalities per instance) with a semantic-aware mixture-of-experts branch (to capture fine-grained cross-modal cues). The work matters because it challenges the assumption that diffusion-augmented retrieval always benefits from generated images, showing that up to 55.62% of queries suffer degradation under static fusion.
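To make the contrast between static and adaptive fusion concrete, here is a minimal sketch. It is not the ADaFuSE architecture: the function names, the single scalar sigmoid gate, and the hand-set gate weights are assumptions chosen for illustration, and the mixture-of-experts branch is omitted entirely.

```python
import math

def static_fusion(text_emb, image_emb):
    """Static additive fusion: both modalities always contribute equally."""
    return [t + v for t, v in zip(text_emb, image_emb)]

def gated_fusion(text_emb, image_emb, gate_weights, gate_bias=0.0):
    """Hypothetical per-instance gate (an assumption, not the paper's module).

    A scalar g in (0, 1) is computed from the concatenated embeddings and
    used to mix the text embedding with the generated-image embedding.
    """
    concat = list(text_emb) + list(image_emb)
    logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    g = 1.0 / (1.0 + math.exp(-logit))
    fused = [g * t + (1.0 - g) * v for t, v in zip(text_emb, image_emb)]
    return fused, g

text = [1.0, 0.0]
noisy_image = [-0.5, 2.0]            # a generated proxy that disagrees with the text
weights = [4.0, 0.0, 0.0, 0.0]       # illustrative values; a real gate would be learned

added = static_fusion(text, noisy_image)
fused, gate = gated_fusion(text, noisy_image, weights)
```

With these toy values the gate leans heavily toward the text embedding, so the noisy image proxy contributes little, whereas static addition lets it dominate the second coordinate.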
This paper tackles the challenge of scaling reinforcement learning for long-horizon tool-using agents, where LLMs must orchestrate dozens of tool calls to satisfy multifaceted constraints. The authors propose STAR, a post-training pipeline that decomposes the RL design space across five axes—reward shaping, model scaling, data composition, algorithm selection, and environmental stability—to derive a practical, scale-aware recipe for training.
Long video understanding remains challenging for multimodal large language models due to limited context windows. VideoDetective addresses this by modeling videos as visual–temporal affinity graphs that fuse visual similarity with temporal continuity. The framework propagates query relevance through an iterative hypothesis–verification–refinement loop, enabling sparse but informed sampling of critical segments for question answering.
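The idea of fusing visual similarity with temporal continuity and then spreading query relevance over the resulting graph can be sketched as follows. The specific affinity formula, the parameters `alpha`, `tau`, and `beta`, and the restart-style propagation rule are assumptions for illustration; the summary does not specify the framework's actual equations, and the hypothesis-verification-refinement loop is not modeled here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def affinity_matrix(features, alpha=0.5, tau=2.0):
    """Assumed affinity: weighted sum of visual similarity and temporal closeness."""
    n = len(features)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            visual = max(cosine(features[i], features[j]), 0.0)
            temporal = math.exp(-abs(i - j) / tau)
            A[i][j] = alpha * visual + (1.0 - alpha) * temporal
    return A

def propagate(A, seed, beta=0.5, steps=20):
    """Spread seed relevance over row-normalized affinities, with restart to the seed."""
    n = len(seed)
    row_sums = [sum(row) or 1.0 for row in A]
    r = list(seed)
    for _ in range(steps):
        r = [
            (1.0 - beta) * seed[i]
            + beta * sum(A[i][j] / row_sums[i] * r[j] for j in range(n))
            for i in range(n)
        ]
    return r

# Five segments; segments 2 and 3 look alike, the rest share a different appearance.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
seed = [0.0, 0.0, 1.0, 0.0, 0.0]     # the query is initially matched to segment 2
A = affinity_matrix(features)
relevance = propagate(A, seed)
```

In this toy example, segment 3 ends up more relevant than segment 1: both are adjacent to the seed, but only segment 3 is also visually similar to it.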
Deep S2P modernizes the Satellite Stereo Pipeline (S2P) by replacing classical SGM and MGM correlators with contemporary learned matchers including FoundationStereo, MonSter, and StereoAnywhere. The core technical contribution adapts the rectification stage to enforce unipolar disparities with proper altitude consistency and disparity range constraints, enabling off-the-shelf deep networks to operate on satellite imagery. This matters for operational Earth observation because it delivers sharper Digital Surface Models with finer geometric detail, though the work also candidly exposes how standard metrics saturate and how vegetation remains a stubborn failure mode.
Group3D addresses open-vocabulary 3D object detection from multi-view RGB images by integrating semantic constraints directly into instance construction. Unlike prior work that merges fragments based solely on geometric consistency, it leverages a multimodal large language model to organize scene vocabularies into semantic compatibility groups that gate cross-view fragment association. This prevents irreversible over-merging when geometric evidence is incomplete, achieving state-of-the-art results on ScanNet and ARKitScenes in both pose-known and challenging pose-free zero-shot settings.
Artistic font generation seeks to transfer visual styles from reference images onto text glyphs while preserving readability. This paper proposes a paradigm shift from feature-fusion or adapter-based diffusion approaches to visual in-context generation, treating element images as pixel-level context for an inpainting model (FLUX.1-Fill). The core innovation lies in repurposing image inpainting as style transfer: element images are concatenated with a blank canvas, and the model fills glyph masks by propagating visual cues from the reference. This enables high-fidelity texture preservation and fine-grained control via a lightweight Context-aware Mask Adapter (CMA), supporting both object elements (structured) and amorphous elements (textures).
This paper addresses the critical challenge of detecting occult hemorrhage (internal bleeding) in intensive care units, where delayed diagnosis leads to preventable physiological shock and death. The authors develop a Bayesian regime switching model (RSM) that tracks five latent physiological states (including stable, hemorrhage, and recovery) using longitudinal vital signs and laboratory values (heart rate, mean arterial pressure (MAP), hemoglobin, lactate) together with medication history. Applied to 33,924 Mayo Clinic ICU encounters, the model aims to provide interpretable, probabilistic early warnings that outperform standard vital sign monitoring by accounting for autoregressive trends and pre-admission physiological changes.
This paper tackles the efficiency–generalization trade-off in Continual Test-Time Adaptation (CTTA), where models must adapt online to unlabeled streams under distribution shift without source data. The core insight is that feature updates need only occur within a low-rank "golden subspace" coinciding with the row space of the classifier. To avoid costly retraining, the authors propose using the Average Gradient Outer Product (AGOP) as an online proxy for the classifier weight structure, leading to the GOLD method that projects features onto this subspace and learns a compact scaling vector. If the theoretical claims hold under realistic nonlinear settings, this could significantly reduce deployment costs for adaptive systems.
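For a linear classifier, the claim that only the row space matters can be checked directly: any feature component orthogonal to the classifier's rows leaves the logits unchanged. The sketch below illustrates that fact with a Gram-Schmidt projection. It uses the classifier weights themselves and does not implement the AGOP proxy or the learned scaling vector; the toy matrix `W` and all helper names are assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(rows, eps=1e-12):
    """Gram-Schmidt over the classifier rows; spans the row space."""
    basis = []
    for row in rows:
        v = list(row)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > eps:                      # skip linearly dependent rows
            basis.append([vi / norm for vi in v])
    return basis

def project(x, basis):
    """Orthogonal projection of x onto the span of the basis vectors."""
    out = [0.0] * len(x)
    for q in basis:
        c = dot(x, q)
        out = [o + c * qi for o, qi in zip(out, q)]
    return out

def logits(W, x):
    return [dot(w, x) for w in W]

# Toy 2-class linear classifier over 4-dimensional features (illustrative values).
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 1.0, 0.0]]
x = [0.3, -1.2, 0.8, 5.0]

basis = orthonormal_basis(W)
x_proj = project(x, basis)
```

Here `x_proj` lives in a 2-dimensional subspace of the 4-dimensional feature space, yet `logits(W, x_proj)` matches `logits(W, x)`. Whether the same structure carries over to nonlinear settings is exactly the question raised above.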
UPPA introduces the first universal physical adversarial patch attack for infrared pedestrian detection, replacing costly instance-specific optimization with offline Particle Swarm Optimization over Bézier curve parameters. The method generates cold thermal patches that maintain topological stability under deformation, and the authors claim it adds zero online deployment overhead.
OpenEarth-Agent tackles the challenge of deploying autonomous Earth Observation (EO) agents in open environments characterized by diverse multi-modal data and heterogeneous tasks. Unlike existing tool-calling agents confined to closed environments with predefined tools, this work introduces a tool-creation paradigm where the agent adaptively generates specialized tools tailored to unseen data and tasks. The paper proposes a multi-agent architecture and OpenEarth-Bench (596 real-world cases across 7 domains) to evaluate this approach.
Vision-Language-Action models excel at direct visuomotor mapping but struggle with tasks requiring both fine-grained 3D spatial understanding and long-horizon logical planning. DualCoT-VLA proposes a parallel dual-stream reasoning mechanism that processes visual Chain-of-Thought for spatial perception and linguistic Chain-of-Thought for task planning simultaneously in latent space, using learnable query tokens to bypass autoregressive decoding and achieve single-step inference.
This paper proposes the Universal Normal Embedding (UNE) hypothesis: that generative models and vision encoders, despite different objectives, both approximate noisy linear projections of a shared Gaussian latent space. The authors argue that DDIM-inverted diffusion noise and encoder embeddings (CLIP, DINO) share this approximately Gaussian geometry, enabling linear semantic editing without architectural changes. They introduce NoiseZoo, a dataset of paired latents, to empirically test whether generative noise encodes semantic structure comparable to foundation encoders.
DTVI proposes a dual-stage inference-time defense for unsafe text-to-image generation. Unlike existing token-level interventions, it applies category-aware sequence-level embedding purification followed by visual feature suppression during denoising, aiming to block adversarial prompts that distribute malicious semantics across the full token sequence while maintaining benign generation quality.
Job recommender systems deployed by public employment services are typically optimized for predictive metrics like clicks, applications, or hires rather than job seeker welfare. This paper develops a structural job-search model where vacancy value depends on utility $U$ and hiring probability $p$, deriving a welfare-optimal ranking based on an expected-surplus index $\Gamma(p, U) = p \sigma \log(1 + e^{\Delta(p,U)/\sigma})$. Through two randomized field experiments with the French public employment service, the authors demonstrate that algorithms approximating this theoretical benchmark substantially outperform existing approaches, while formalizing the "inversion problem" where behavior-based rankings diverge from welfare-maximizing ones.
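The expected-surplus index can be evaluated numerically as in the sketch below. The summary does not define $\Delta(p, U)$, so the code takes it as a precomputed input `delta`; the function names, the example vacancies, and their values are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def expected_surplus_index(p, delta, sigma):
    """Gamma = p * sigma * log(1 + exp(delta / sigma)).

    `delta` stands in for Delta(p, U), which is not specified in the text.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    return p * sigma * softplus(delta / sigma)

def rank_vacancies(vacancies, sigma):
    """Sort (name, p, delta) tuples by descending index value."""
    scored = [(expected_surplus_index(p, d, sigma), name) for name, p, d in vacancies]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical vacancies: (name, hiring probability p, surplus term delta).
vacancies = [("A", 0.9, -1.0), ("B", 0.2, 3.0), ("C", 0.5, 1.0)]
ranking = rank_vacancies(vacancies, sigma=1.0)
```

In this made-up example the vacancy with the highest hiring probability ("A") ranks last because its surplus term is negative, which hints at how a ranking driven by hiring likelihood alone can diverge from a welfare-based one.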
Text-to-video concept erasure methods claim to remove sensitive content, but current evaluation only checks if the concept is absent from generated frames. PROBE introduces a diagnostic protocol that optimizes a pseudo-token embedding with frozen model weights to test whether erased concepts can be reactivated. By probing residual capacity across three architectures and three erasure strategies, the authors find that all tested methods leave measurable residual capacity and identify temporal re-emergence—a video-specific failure mode where concepts suppressed in early frames resurface later in the sequence.
This paper tackles the brittleness of static hyperparameters in visual odometry frontends by training an RL agent to dynamically tune feature detection and tracking parameters based on raw image content. The key insight is that conditioning decisions on visual appearance enables proactive adaptation to texture density, motion blur, and noise, embedding expert knowledge directly into the system.
This paper proposes Geometric Latent Diffusion (GLD), a framework for novel view synthesis (NVS) that repurposes the feature space of geometric foundation models (specifically Depth Anything 3) as the latent space for multi-view diffusion. Unlike conventional approaches that operate in view-independent VAE latent spaces, GLD leverages geometrically consistent features that natively encode cross-view correspondences, enabling both high-fidelity RGB reconstruction and zero-shot geometry decoding while accelerating training convergence by 4.4× compared to standard VAE spaces.
The paper tackles Fine-Grained Cross-View Geolocalization (FG-CVG), where the goal is to estimate the precise 2-DoF ground location of a camera given a ground-view image and a satellite map. Current approaches force a difficult accuracy-speed trade-off: high-precision models are too slow for real-time autonomous navigation. GeoFlow introduces a lightweight framework that learns a probabilistic regression field to predict displacement vectors (distance and direction) from arbitrary location hypotheses toward the ground truth. A novel Iterative Refinement Sampling (IRS) algorithm then refines multiple random hypotheses over several rounds to reach a robust consensus. The system claims to break the accuracy-speed barrier, achieving 29 FPS on an NVIDIA V100—significantly faster than competitors—while maintaining accuracy competitive with much heavier models.
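The refine-then-agree loop can be sketched with a toy stand-in for the learned field. Everything here is an assumption for illustration: the real regression field is learned from image features and is probabilistic, whereas `toy_displacement_field` simply returns a deliberately shortened step toward a known target, and the median is just one possible consensus rule.

```python
import math
import random
import statistics

def toy_displacement_field(hypothesis, target, shrink=0.8):
    """Stand-in for the learned field: returns (distance, direction in radians).

    It underestimates the true distance by `shrink`, so one step is not enough.
    """
    dx = target[0] - hypothesis[0]
    dy = target[1] - hypothesis[1]
    return shrink * math.hypot(dx, dy), math.atan2(dy, dx)

def iterative_refinement(field, n_hyp=16, rounds=4, extent=100.0, seed=0):
    """Move random location hypotheses along predicted displacements, then take a consensus."""
    rng = random.Random(seed)
    hyps = [(rng.uniform(0.0, extent), rng.uniform(0.0, extent)) for _ in range(n_hyp)]
    for _ in range(rounds):
        moved = []
        for x, y in hyps:
            dist, theta = field((x, y))
            moved.append((x + dist * math.cos(theta), y + dist * math.sin(theta)))
        hyps = moved
    # Assumed consensus rule: coordinate-wise median of the refined hypotheses.
    return (statistics.median(h[0] for h in hyps),
            statistics.median(h[1] for h in hyps))

target = (42.0, 17.0)                       # hypothetical ground-truth location
estimate = iterative_refinement(lambda h: toy_displacement_field(h, target))
error = math.hypot(estimate[0] - target[0], estimate[1] - target[1])
```

Because each round removes 80% of the remaining offset in this toy setup, four rounds shrink every hypothesis's error by a factor of 0.2^4, which is why a few cheap refinement passes can stand in for a single expensive prediction.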
This paper proposes a conditional video diffusion model trained on ERA5 reanalysis to synthesize the Madden-Julian Oscillation (MJO)—the dominant mode of tropical intraseasonal variability. The core innovation is "climate prompting," where low-dimensional physical indices (MJO phase/amplitude via RMM-PCs, seasonal cycles, ENSO state) serve as conditioning tokens to generate physically consistent high-dimensional atmospheric fields. The work bridges the gap between interpretable low-order climate theory and high-resolution generative models, enabling controlled experiments like perpetual MJOs or isolated seasonal modulations for hypothesis testing.