DUO-VSR tackles the prohibitive sampling cost of diffusion-based video super-resolution by enabling efficient one-step generation. The paper identifies critical limitations when applying Distribution Matching Distillation (DMD) to VSR—specifically training instability, degraded supervision from frozen score models, and insufficient guidance capped by teacher quality—and proposes a dual-stream strategy that unifies DMD with adversarial supervision via Real–Fake Score Feature GAN (RFS-GAN). This three-stage pipeline achieves approximately $50\times$ speedup over multi-step counterparts while delivering superior perceptual quality, making high-fidelity video upscaling practical for real-world deployment.
ALADIN tackles person re-identification by distilling fine-grained attribute knowledge from a frozen CLIP teacher into a lightweight student network. The core innovation uses a Multimodal LLM (Qwen-VL) to generate structured attribute descriptions, which are converted via CLIP into spatial attention maps for supervising local feature alignment. A Scene-Aware Prompt Generator (SAPG) creates image-specific soft prompts via $\mathbf{p}=\mathrm{MLP}(\mathbf{f}_{g})$ to adapt text embeddings to surveillance scenes. At inference, only the student runs, so the teacher, the MLLM, and the prompt generator add no deployment cost.
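The mapping $\mathbf{p}=\mathrm{MLP}(\mathbf{f}_{g})$ can be illustrated with a minimal sketch. The two-layer shape, the ReLU activation, the reshaping into prompt tokens, and every name below (`soft_prompt`, `n_tokens`, the weight arguments) are assumptions for illustration; the summary does not specify the SAPG architecture.

```python
def linear(x, w, b):
    """Dense layer: each row of w produces one output value."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def soft_prompt(f_g, w1, b1, w2, b2, n_tokens):
    """Map a global image feature f_g to n_tokens prompt vectors.

    Hypothetical layout: a two-layer MLP whose flat output is split
    evenly into n_tokens soft-prompt embeddings.
    """
    hidden = relu(linear(f_g, w1, b1))
    flat = linear(hidden, w2, b2)
    if len(flat) % n_tokens != 0:
        raise ValueError("MLP output size must be divisible by n_tokens")
    dim = len(flat) // n_tokens
    return [flat[i * dim:(i + 1) * dim] for i in range(n_tokens)]


# Toy weights, chosen by hand so the result is easy to follow.
f_g = [1.0, -1.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]], [0.0] * 4
prompts = soft_prompt(f_g, w1, b1, w2, b2, n_tokens=2)
```

Because the prompt depends on the image feature, two images of the same identity taken in different scenes would receive different prompt tokens, which is the adaptation the summary describes.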
NoOVD tackles a critical issue in open-vocabulary object detection (OVD): during training, novel-category objects are forcibly aligned with background embeddings, causing them to be filtered out by the RPN and misclassified by the RoI head. The authors propose a framework built on frozen CLIP that identifies latent novel objects during training via generic text prompts (e.g., 'This is an object, specifically an animal') and integrates them through self-distillation. At test time, a Re-weighted RPN (R-RPN) boosts proposal scores using CLIP-based knowledge to improve novel-category recall. The method aims to eliminate the training-inference gap without requiring additional labeled data or pseudo-labeling noise.
This paper investigates whether video transformers can detect respiratory distress from video recordings of post-exercise recovery. The authors frame the problem as a temporal ordering task (predicting which of two clips shows greater shortness of breath) and propose augmenting ViViT with Lie Relative Encodings (LieRE) and Motion-Guided Masking (MGM). The reported F1 score is 0.81, but it is measured on only 7 test videos from 3 participants.
Remote sensing text-to-image generation suffers from a lack of domain-specific diffusion transformers and prohibitive costs for high-resolution training. Existing training-free resolution promotion methods apply static RoPE scaling that uniformly compresses the spatial spectrum, which is particularly harmful for RS imagery due to its characteristically denser high-frequency energy. This paper proposes SHARP, a spectrum-aware dynamic adaptation strategy that uses a rational decay scheduler $\kappa_{rs}(t)$ to apply strong positional extrapolation early in denoising (for layout formation) while progressively relaxing it later (for detail recovery). The approach enables robust multi-scale generation at extrapolation factors up to $2.5\times$ with negligible overhead, addressing a critical gap in large-scale RS synthesis.
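The summary does not give the form of $\kappa_{rs}(t)$, so the sketch below uses one hypothetical rational function purely to show what "strong early, relaxed late" looks like as a schedule. The formula, the parameter names `kappa_max` and `c`, and the convention that $t=1$ is the start of denoising and $t=0$ the end are all assumptions, not the paper's definition.

```python
def kappa_rs(t, kappa_max=2.5, c=3.0):
    """Hypothetical rational decay schedule for a positional scaling factor.

    Returns kappa_max at t = 1 (assumed start of denoising) and 1.0 at
    t = 0 (assumed end). c > 0 controls how quickly the factor relaxes;
    c = 1 reduces to a linear ramp.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if c <= 0.0:
        raise ValueError("c must be positive")
    return 1.0 + (kappa_max - 1.0) * t / (t + c * (1.0 - t))


# Sample the schedule from the first denoising step to the last.
schedule = [kappa_rs(1.0 - i / 10) for i in range(11)]
```

A static method corresponds to holding the factor constant for every step; the dynamic variant differs only in letting it fall toward 1 as detail recovery begins.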
Accurate respiratory motion modeling is critical for radiotherapy precision, yet patient-specific breathing patterns are difficult to predict outside observed ranges. This paper proposes PRISM-RM, a trajectory-aware implicit neural representation (INR) that models lung motion as a continuous diffeomorphic flow driven by external surrogate signals. By integrating neo-Hookean hyperelastic constraints with temporal total-variation regularization, the method eliminates the need for fixed reference breathing states and aims to improve extrapolation to unseen respiratory phases.
Pointing-based methods improve Large Vision-Language Models (LVLMs) by grounding objects before answering, yet the underlying mechanism remains unclear. This work investigates why pointing helps by comparing Direct Counting against Point-then-Count (PtC) in zero-shot counting tasks using synthetic data with controlled spatial layouts. The authors find that intermediate coordinate supervision encourages skill learning rather than narrow task memorization, yielding stronger out-of-distribution generalization while providing verifiable visual explanations.
Diffusion models generate high-quality images but require hundreds of denoising steps, making deployment on edge devices impractical. This paper proposes Coarse-to-Fine Diffusion Models, which denoise at low resolution early in the process, when outputs are still dominated by noise, and then switch to high resolution. It also introduces a fast time-step search method that finds good sampling schedules in under 10 minutes instead of days.
Generative zero-shot learning (ZSL) synthesizes visual features for unseen classes conditioned on semantic prototypes, but existing methods often produce task-agnostic features that overlap for semantically similar yet visually distinct categories. This paper proposes RLVC, an outcome-reward reinforcement learning framework that treats the feature generator as a policy model and optimizes it using classifier confidence as the reward signal. The method further incorporates class-wise visual prototypes via a distillation loss to align synthesized features with real data distributions, achieving reported state-of-the-art results on CUB, SUN, and AWA2 benchmarks.
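Using classifier confidence as an outcome reward can be sketched as follows. This is a minimal illustration under stated assumptions: the reward is taken to be the softmax probability of the target class, and the mean-baseline advantage is a generic policy-gradient device rather than something the summary attributes to RLVC. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_reward(logits, target):
    """Reward for one synthesized feature: classifier probability of its class."""
    return softmax(logits)[target]


def advantages(rewards):
    """Centre rewards on their mean so above-average samples are reinforced."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Classifier logits for three features synthesized for class 0 (toy values).
batch_logits = [[2.0, 0.0, 0.0], [0.5, 0.4, 0.1], [0.0, 2.0, 0.0]]
rewards = [confidence_reward(l, target=0) for l in batch_logits]
adv = advantages(rewards)
```

Features that the classifier confuses with a neighbouring class receive a negative advantage, which is the pressure against overlapping features for visually distinct categories.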
This paper tackles real-world image restoration (Real-IR) by adapting the 12B-parameter FLUX.1-dev flow matching model to low-level vision tasks. The core innovation is ResFlow-Tuner, which combines Unified Multi-Modal Fusion (UMMF) of image and text cues with a novel test-time scaling (TTS) paradigm that greedily optimizes ODE sampling trajectories using a multi-reward ensemble during inference. This establishes a new compute-quality trade-off for generative image restoration, showing that carefully perturbing intermediate flow states can yield substantial perceptual gains without retraining the base model.
The paper tackles the computational bottleneck of radiative transfer models (RTMs) for hyperspectral image (HSI) generation by proposing a VAE-based emulation framework that learns latent representations conditioned on biophysical parameters. It introduces both pixel-to-pixel (P2P) and fully convolutional (FC-VAE) variants, trained via either direct one-step mapping or a two-step pretraining strategy that decouples representation learning from parameter-to-latent interpolation. The work is significant for remote sensing applications as it provides empirical evidence that optimal emulator architecture depends critically on whether the target data is simulated (where P2P excels) or real-world imagery (where FC-VAE-pre dominates), and demonstrates that emulated data preserves downstream utility for parameter retrieval tasks.
Optical flow estimation traditionally requires expensive ground-truth annotations or relies on unreliable brightness constancy assumptions that fail under occlusion and illumination changes. This paper introduces GenOpticalFlow, a framework that synthesizes perfectly aligned training pairs by using monocular depth estimates to generate pseudo-optical flow, then conditioning a latent diffusion model to render corresponding next frames. The core innovation is converting unsupervised optical flow learning into a supervised training paradigm using synthetic data with geometrically consistent motion fields, potentially eliminating the need for manual annotation at scale.
StreamingClaw addresses real-time streaming video understanding for embodied intelligence applications such as autonomous driving and robotics. The framework unifies continuous perception, hierarchical multimodal memory, and proactive interaction through a main–sub-agent architecture where StreamingReasoning orchestrates StreamingMemory and StreamingProactivity sub-agents. By integrating incremental KV-cache reuse with dynamic pruning, memory evolution from atomic actions to events, and trigger-based proactive responses, it aims to close the perception–decision–action loop for physical world deployment.
VIGIL tackles hallucination in multimodal deepfake detection by decoupling claim generation from evidence sourcing through a part-centric plan-then-examine pipeline. The framework first plans which facial parts to inspect using global visual cues, then examines each part with independently sourced forensic evidence delivered via a stage-gated injection mechanism. Combined with a progressive three-stage training paradigm featuring part-aware reinforcement learning rewards, the method aims to produce verifiable, anatomically grounded explanations rather than confabulated reasoning chains.
MultiBind targets a critical blind spot in evaluating multi-subject image generators: cross-subject attribute misbinding, where models assign jackets, smiles, or poses to the wrong person. The benchmark grounds each test case in a real photograph (508 instances, 2–4 human subjects each) and provides slot-ordered crops, masks, background references, and long entity-indexed prompts (~474 words). Its core technical idea is the delta-matrix evaluation: for each attribute dimension $d$, compute $\Delta^{(d)} = S_{\mathrm{gen}}^{(d)} - S_{\mathrm{gt}}^{(d)}$, subtracting ground-truth subject similarities from generated-to-ground-truth similarities to isolate generation-induced confusion from natural subject resemblance. This separates self-degradation (diagonal) from cross-subject interference (off-diagonal) and exposes interpretable failure modes—drift, swap, dominance, and blending—that holistic metrics like CLIP or FID miss.
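The delta-matrix computation can be sketched in a few lines. The indexing convention is an assumption: `s_gen[i][j]` is taken as the similarity between generated subject `i` and ground-truth subject `j`, and `s_gt[i][j]` as the similarity between ground-truth subjects `i` and `j`. The summary statistics and function names are illustrative, not the benchmark's official metrics.

```python
def delta_matrix(s_gen, s_gt):
    """Elementwise S_gen - S_gt for one attribute dimension d."""
    n = len(s_gen)
    return [[s_gen[i][j] - s_gt[i][j] for j in range(n)] for i in range(n)]


def summarize(delta):
    """Average the diagonal and off-diagonal entries separately."""
    n = len(delta)
    diag = [delta[i][i] for i in range(n)]
    off = [delta[i][j] for i in range(n) for j in range(n) if i != j]
    return {
        # Negative values: a subject resembles its own reference less than it should.
        "self_degradation": sum(diag) / n,
        # Positive values: a subject resembles another subject more than the
        # real photograph would explain.
        "cross_interference": sum(off) / len(off) if off else 0.0,
    }


# Two subjects who naturally look somewhat alike (0.3) in the real photo.
s_gt = [[1.0, 0.3], [0.3, 1.0]]
s_gen = [[0.8, 0.5], [0.35, 0.9]]
delta = delta_matrix(s_gen, s_gt)
stats = summarize(delta)
```

In this toy case the 0.3 baseline resemblance is subtracted out, so only the extra 0.2 similarity of generated subject 0 to subject 1 counts as interference.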
This paper challenges the monolithic paradigm in pose-free feed-forward 3D Gaussian Splatting (3DGS), where a single network jointly estimates camera poses and synthesizes Gaussians. The authors propose 2Xplat, a modular two-expert framework that decouples geometry estimation (using Depth Anything 3) from appearance synthesis (using Multi-view Pyramid Transformer) via an explicit pose interface. The core claim is that separating these concerns enables superior training efficiency (<5K iterations) and novel-view synthesis quality competitive with posed methods, challenging the assumption that unified architectures are optimal.
Multi-Object Tracking (MOT) models often degrade during inference due to distribution shifts between training and test data. This paper proposes TCEI (Test-time Calibration from Experience and Intuition), a cognitive-inspired framework that uses transient memory for short-term guidance and accumulated experience for long-term calibration. Unlike traditional TTA methods that require backpropagation, TCEI operates entirely via forward propagation, adapting identity predictions in real-time without additional training.
MS-CustomNet tackles multi-subject customization for text-to-image diffusion models, where the challenge is to preserve multiple subject identities while controlling their compositional arrangement and spatial relationships. The authors propose a framework built on CustomNet that accepts multiple reference images plus a layout map $M_L$ specifying spatial arrangement, trained on a curated MSI dataset derived from COCO. The work aims to provide explicit deterministic control over subject placement and layering (e.g., "cake inside bowl" vs "cake behind bowl") rather than relying on implicit text-to-image generation.
PAS3R tackles online monocular 3D reconstruction from long video streams, addressing the stability–adaptation dilemma where models must incorporate novel viewpoints without overwriting historical scene structure. The core idea is to dynamically modulate state update intensity based on geometric novelty: measuring inter-frame camera displacement (translation + rotation) and image frequency content via Fourier analysis. This enables faster adaptation to abrupt viewpoint changes while preserving accumulated geometry during smooth motion.
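One way to picture novelty-modulated updates is the sketch below. It is a hedged illustration, not the paper's formulation: the additive novelty score, the weights `w_t`, `w_r`, `w_f`, the saturating map to an update intensity, and the linear state blend are all assumptions. The frequency term is passed in as a precomputed scalar `freq_score` rather than derived from a Fourier transform.

```python
import math


def rotation_angle(trace):
    """Rotation angle in radians of a 3x3 rotation matrix, from its trace."""
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_theta)


def update_intensity(translation, rel_rot_trace, freq_score,
                     w_t=1.0, w_r=1.0, w_f=1.0):
    """Map geometric novelty to an update intensity in [0, 1).

    translation: inter-frame camera translation vector.
    rel_rot_trace: trace of the relative rotation matrix between frames.
    freq_score: assumed scalar summary of image frequency content.
    """
    t_norm = math.sqrt(sum(x * x for x in translation))
    novelty = w_t * t_norm + w_r * rotation_angle(rel_rot_trace) + w_f * freq_score
    return 1.0 - math.exp(-novelty)


def blend_state(old, new, alpha):
    """Move the persistent state toward the new observation by alpha."""
    return [(1.0 - alpha) * o + alpha * n for o, n in zip(old, new)]


smooth = update_intensity([0.01, 0.0, 0.0], rel_rot_trace=3.0, freq_score=0.0)
abrupt = update_intensity([1.5, 0.0, 0.5], rel_rot_trace=1.0, freq_score=0.2)
state = blend_state([1.0, 1.0], [0.0, 2.0], smooth)
```

With near-zero displacement the intensity stays close to 0 and the state is almost untouched, while a large jump pushes it toward 1 so the new view dominates.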
EmoTaG tackles few-shot 3D talking-head synthesis with emotional expressiveness using only 5 seconds of target video. The core insight is to predict FLAME parameters (expression and jaw pose) rather than directly deforming 3D Gaussians, providing explicit geometric priors for stability. A Gated Residual Motion Network (GRMN) disentangles phonetic articulation from emotion-driven variations with a learned gate $g \in [0,1]$, while Semantic Emotion Guidance distills knowledge from a pretrained DeepFace recognizer to supervise emotional intensity without manual labels.
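The role of the gate $g \in [0,1]$ can be illustrated with a minimal sketch. The additive composition (phonetic base plus gated emotion residual), the sigmoid parameterization, and the names below are assumptions suggested by the phrase "gated residual"; the summary does not state how GRMN combines the two streams.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual(base, residual, gate_logit):
    """Combine a phonetic base motion with an emotion-driven residual.

    base, residual: lists of FLAME-style parameters (toy values here).
    gate_logit: unconstrained scalar; the sigmoid maps it to g in (0, 1).
    Returns the combined parameters and the gate value.
    """
    g = sigmoid(gate_logit)
    combined = [b + g * r for b, r in zip(base, residual)]
    return combined, g


# Toy expression parameters: articulation from audio plus an emotional offset.
base = [0.2, 0.1]
residual = [0.4, -0.2]
half, g_half = gated_residual(base, residual, gate_logit=0.0)
neutral, g_low = gated_residual(base, residual, gate_logit=-50.0)
```

When the gate is near 0 the output falls back to the phonetic base, so lip articulation is preserved even if the emotion branch is unreliable.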