Open-vocabulary panoptic segmentation aims to recognize and segment arbitrary object categories beyond the training vocabulary, but it suffers from two coupled failures: mask transformers discard proposals for unseen categories because of biased objectness scoring, and CLIP's global image-text alignment transfers poorly to individual image regions. OVRCOAT addresses both with COAT, which adjusts foreground probabilities using CLIP's classification confidence to rescue out-of-vocabulary masks, and OVR, a memory-efficient fine-tuning protocol for region-text alignment. The approach achieves a +5.5% PQ gain on ADE20K and reduces training memory by 56% relative to the prior state of the art.
Existing visual privacy benchmarks treat privacy as a binary property, but this work argues that privacy is fundamentally compositional: benign attributes in isolation can combine to create severe violations. The authors introduce the Compositional Privacy Risk Taxonomy (CPRT), a four-level framework aligned with regulations like GDPR and HIPAA that assigns continuous severity scores based on attribute interactions. They construct a dataset of 6,736 images annotated for 22 privacy attributes and evaluate frontier vision-language models, finding that while structured taxonomic guidance improves alignment, models systematically underestimate composition-driven risks.
UniMotion addresses the fragmentation in human motion modeling by unifying motion, text, and RGB understanding/generation within a single 1.5B parameter architecture. Unlike prior work relying on discrete tokenization or handling only partial modality subsets, it treats motion as a continuous first-class modality via a Cross-Modal Aligned Motion VAE (CMA-VAE). The framework introduces Dual-Posterior KL Alignment to distill visual semantics into motion representations without requiring images at inference, and Latent Reconstruction Alignment to bootstrap the motion pathway through dense self-supervision before sparse text calibration.
StreamingEval introduces a unified evaluation framework for Video-LLMs under realistic streaming constraints, moving beyond offline benchmarks to assess continuous, real-time video understanding with limited memory. The protocol enforces a fixed-capacity memory bank and jointly measures encoding throughput (MaxFPS), decoding latency (TTFT), memory usage, and task accuracy via a composite StreamingScore. Experiments reveal that current "online" models often fail under strict streaming constraints, while offline models adapted with FIFO memory banks frequently outperform specialized streaming architectures at the cost of higher resource consumption.
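As a rough illustration of the fixed-capacity FIFO memory bank mentioned in the StreamingEval summary, the sketch below keeps only the most recent frame features and evicts the oldest on overflow. The class name `FIFOMemoryBank`, its methods, and the toy two-element "features" are assumptions made for this example; the summary does not specify the benchmark's actual interface.

```python
from collections import deque


class FIFOMemoryBank:
    """Hypothetical fixed-capacity store that evicts the oldest frame first."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # deque(maxlen=...) silently drops the oldest item on overflow.
        self._frames = deque(maxlen=capacity)

    def push(self, timestamp: float, feature: list) -> None:
        """Add one encoded frame; the oldest is evicted if the bank is full."""
        self._frames.append((timestamp, feature))

    def snapshot(self) -> list:
        """Return the retained (timestamp, feature) pairs, oldest first."""
        return list(self._frames)

    def __len__(self) -> int:
        return len(self._frames)


bank = FIFOMemoryBank(capacity=3)
for t in range(5):
    # Stand-in for an encoded frame feature (assumption for illustration).
    bank.push(float(t), [float(t), float(t)])

retained_timestamps = [ts for ts, _ in bank.snapshot()]
```

With a capacity of 3 and five pushed frames, only the three most recent timestamps remain, so memory stays bounded regardless of stream length.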
F4Splat tackles inefficient Gaussian allocation in feed-forward 3D Gaussian Splatting (3DGS), where existing methods uniformly assign Gaussians per pixel or voxel, causing redundancy and fixed budgets. The core idea is a learnable densification score that predicts spatial regions needing additional Gaussians based on geometric complexity and multi-view overlap, enabling adaptive allocation and explicit budget control without retraining. This matters because it delivers compact scene representations—using 10–28% of the Gaussians of prior work—while maintaining or improving rendering fidelity.
PRM-as-a-Judge addresses the fundamental limitation of binary success metrics in robotic manipulation by repurposing Process Reward Models (PRMs) as dense evaluators. The paper introduces the OPD (Outcome–Process–Diagnosis) metric system, which decomposes execution quality via a task-aligned progress potential $\Phi(x_t) \in [0,1]$ induced from trajectory videos. Validated on the RoboPulse benchmark and RoboTwin policy auditing, the work shows that trajectory-supervised PRMs achieve superior micro-resolution compared to foundation models, revealing behavioral signatures invisible to outcome-only evaluation.
This paper addresses long-tailed (LT) learning by proposing that the head-tail performance trade-off stems from "tail performance degradation"—where models overfit to head classes and forget tail classes. The core idea reframes LT learning as continual learning, using a Grouped Knowledge Preservation (GKP) module to maintain class-specific optimal parameters and a Grouped Sharpness Aware (GSA) module to find flatter minima. The method operates without external data or pre-trained models, showing improvements on CIFAR-LT, ImageNet-LT, and iNaturalist benchmarks.
This paper introduces a new computer vision task called Anytime Interframe Semantic Segmentation: predicting dense semantic segmentation at arbitrary timestamps between low-frame-rate (LFR) RGB frames using only a past frame and asynchronous event data. The core idea is feature propagation via event-driven motion fields rather than direct multi-modal fusion. The method is motivated by the perceptual gaps that LFR cameras leave in high-speed autonomous driving scenarios, where critical events (e.g., pedestrians entering the vehicle's path) may be missed between frames.
Single-view reference-to-video methods struggle to preserve identity when faces rotate through large angles. This paper proposes Mv2ID, a multi-view conditioning framework that uses region-masking and a decoupled positional encoding scheme to prevent view-dependent copy-paste artifacts without requiring expensive cross-paired training data. The work is relevant for digital character creation and visual effects where identity must remain consistent across extreme viewpoints.
The paper addresses the problem of identifying "who said what and when" in multi-speaker video conversations, a task at which current omni-modal LLMs fail because of sparse visual sampling (1-2 fps) and shortcut learning on visual biases. The authors introduce VR-SDR (Visual-Registered Speaker Diarization and Recognition), a rigorous benchmark that forces models to bind identities from natural language descriptions without visual shortcuts. They propose HumanOmni-Speaker, featuring a Visual Delta Encoder that samples video at 25 fps yet compresses inter-frame motion residuals into only 6 tokens per frame, capturing fine-grained visemes while avoiding token explosion.
Instance segmentation in dense microscopy images requires separating tightly packed, touching cells—a task where binary masks and pixel-wise losses often merge adjacent instances. This paper proposes predicting continuous signed distance functions (SDFs) mapped to probabilistic segmentations via a learnable sigmoid, trained with a differentiable Modified Hausdorff Distance (MHD) loss. The approach eliminates the need for interactive prompting or watershed post-processing while aiming to improve boundary fidelity in high-throughput cellular imaging.
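To make the two ingredients of the microscopy summary concrete, here is a minimal sketch of a sigmoid that maps a signed distance to a foreground probability, together with the classical Modified Hausdorff Distance between two point sets (the larger of the two directed mean nearest-neighbor distances). The sign convention (positive inside the cell), the `steepness` parameter standing in for the learnable sigmoid parameter, and the plain point-set formulation are assumptions; the paper's differentiable variant may differ.

```python
import math


def sdf_to_probability(sdf: float, steepness: float) -> float:
    """Map a signed distance to a probability in (0, 1).

    Assumes positive distances lie inside the object. `steepness` is a
    stand-in for the learnable sigmoid parameter.
    """
    return 1.0 / (1.0 + math.exp(-steepness * sdf))


def directed_mean_distance(a: list, b: list) -> float:
    """Mean over points in `a` of the distance to the nearest point in `b`."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def modified_hausdorff(a: list, b: list) -> float:
    """Classical MHD: the larger of the two directed mean distances."""
    return max(directed_mean_distance(a, b), directed_mean_distance(b, a))


# Toy boundary point sets (illustrative values only).
predicted = [(0.0, 0.0), (1.0, 0.0)]
reference = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
mhd = modified_hausdorff(predicted, reference)

prob_on_boundary = sdf_to_probability(0.0, steepness=5.0)
prob_inside = sdf_to_probability(1.0, steepness=5.0)
```

Because MHD averages nearest-neighbor distances instead of taking the single worst case, one outlying boundary point shifts the value gradually, which is one plausible reason to prefer it as a training signal.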
MRI acquisition is inherently slow due to sequential k-space sampling. This paper proposes TRUST-MRI, an active sampling framework that leverages discrete anatomical tokens from the pretrained MedITok tokenizer and a latent Transformer to guide measurement selection. The core innovation uses token prediction entropy as an uncertainty signal, introducing two policies: Latent Entropy Selection (LES) projects patch-wise entropy to k-space to select lines, while Gradient-based Entropy Optimization (GEO) uses gradients of total entropy with respect to input measurements. The approach trades pixel-wise fidelity for perceptual quality and computational efficiency, achieving superior feature-based metrics while running at 0.97 fps compared to 0.01 fps for diffusion-based active methods.
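The uncertainty signal described in the TRUST-MRI summary can be illustrated with the Shannon entropy of each patch's predicted token distribution. The sketch below only computes per-patch entropy and picks the most uncertain patch; the projection of patch entropy to k-space lines used by LES is not specified in the summary and is not attempted here. The probability values and variable names are made up for illustration.

```python
import math


def token_entropy(probs: list) -> float:
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical per-patch distributions over a 4-token vocabulary.
patch_token_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
]

patch_entropies = [token_entropy(p) for p in patch_token_probs]

# Index of the patch whose token prediction is least certain.
most_uncertain_patch = max(
    range(len(patch_entropies)), key=patch_entropies.__getitem__
)
```

A uniform distribution over four tokens reaches the maximum entropy of ln 4, so the second patch is the one an entropy-driven policy would prioritize.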
RGBD SLAM with 3D Gaussian Splatting (3DGS) struggles to balance scalability against rendering fidelity: global Gaussians consume excessive GPU memory, while view-tied Gaussians (fixed at depth) suffer from limited novel-view quality. This paper proposes pixel-aligned Gaussians that can adjust their positions along viewing rays via learned depth offsets, paired with a fast geometry-similarity tracking strategy using Generalized ICP on depth distributions. The approach claims state-of-the-art rendering and tracking performance while maintaining smaller active memory footprints than prior 3DGS-based methods.
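The idea of letting a pixel-aligned Gaussian slide along its viewing ray can be sketched as a single back-projection step: the center is the camera origin plus the ray direction scaled by the measured depth plus an offset. This assumes depth is measured along the ray (not along the camera z-axis), and `gaussian_center` and its parameters are hypothetical names chosen for this example.

```python
import math


def gaussian_center(origin: list, ray_dir: list, depth: float,
                    depth_offset: float) -> list:
    """Place a Gaussian center on a viewing ray.

    Assumes `depth` is a distance along the ray; `depth_offset` stands in
    for the learned correction.
    """
    norm = math.sqrt(sum(c * c for c in ray_dir))
    if norm == 0.0:
        raise ValueError("ray direction must be non-zero")
    unit = [c / norm for c in ray_dir]
    t = depth + depth_offset
    return [o + t * u for o, u in zip(origin, unit)]


# A ray looking down +z; the offset moves the center 0.5 units farther away.
center = gaussian_center([0.0, 0.0, 0.0], [0.0, 0.0, 2.0],
                         depth=2.0, depth_offset=0.5)
```

Setting `depth_offset` to zero recovers a Gaussian fixed at the sensor depth, so the offset is the only extra degree of freedom per pixel in this sketch.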
Stand-up comedy depends as much on timing and embodied presence as on verbal content, yet computational humor has largely focused on text alone. This paper introduces TIC-TALK, a multimodal corpus of 90 professionally filmed Netflix specials (2015–2024) with temporally aligned annotations for language, gesture, and audience response. The processing pipeline combines BERTopic for thematic segmentation, Whisper-AT for laughter detection, and YOLOv8 for shot classification and pose keypoint extraction, all aligned hierarchically without resampling. The authors validate the resource through corpus-level findings including a negative correlation between kinetic energy and laughter rate ($r = -0.75$), consistent with a stillness-before-punchline pattern, and through a short-horizon laughter prediction benchmark.
This paper tackles Unsupervised Continuous Anomaly Detection (UCAD), where models must sequentially learn new product categories without forgetting previous ones or storing all raw data. The core idea is to augment visual-only approaches with learnable text prompts from CLIP, storing both modalities in a Continuous Multimodal Prompt Memory Bank (CMPMB) and fusing them via a Defect-Semantic-Guided Adaptive Fusion Mechanism (DSG-AFM). Benchmarked on MVTec AD and VisA, the authors claim state-of-the-art detection accuracy (+4.4% AUROC) and segmentation (+14.8% AUPR) over the prior UCAD baseline.
ThinkJEPA addresses the limitation of JEPA-style latent world models that rely on short, densely sampled windows, which bias predictions toward local dynamics while missing long-horizon semantics. The paper proposes a dual-temporal architecture combining a dense-frame V-JEPA branch for fine-grained motion with a sparsely sampled VLM "thinker" branch that provides semantic guidance via multi-layer feature pyramids. This matters because it attempts to marry the physical consistency of latent world models with the general knowledge of vision-language models for robust trajectory forecasting.
CVT-Bench evaluates whether multimodal LLMs can maintain stable spatial representations under counterfactual viewpoint transformations—such as inferring object relationships from a camera angle never shown in the image. Using 100 synthetic tabletop scenes and 6,000 relational queries across rotations from $0^{\circ}$ to $360^{\circ}$, the benchmark reveals that state-of-the-art models, despite high single-view accuracy, systematically fail at mental rotation tasks and degrade further under extended sequential context. These findings challenge the assumption that strong episodic spatial performance implies robust viewpoint-invariant representations, with critical implications for embodied AI and robotics applications requiring perspective-taking.
Autoregressive video diffusion models struggle with minute-scale generation due to error accumulation in long-horizon rollouts. This paper challenges the assumption that more memory is better, proposing instead to decompose KV-cache conditioning into three functional roles—Sink for global anchors, Tail for recent continuity, and dynamically selected History for mid-range structure. The result is a training-free inference method that improves motion dynamics by 66.8% while cutting attention overhead by roughly 2.6×.
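One way to picture the Sink/Tail/History decomposition is as an index selection over cached frames: keep the first few entries as anchors, the last few for continuity, and a small number of mid-range entries chosen by some relevance score. The function below is a sketch under those assumptions; the `relevance` scores and all parameter names are hypothetical, and the summary does not say how the paper actually scores History entries.

```python
def select_cache_indices(num_cached: int, n_sink: int, n_tail: int,
                         n_history: int, relevance: list) -> list:
    """Pick which cached frame indices to attend to.

    Sink: the earliest entries. Tail: the most recent entries.
    History: top-scoring entries from the remaining middle range.
    """
    sink = list(range(min(n_sink, num_cached)))
    tail_start = max(len(sink), num_cached - n_tail)
    tail = list(range(tail_start, num_cached))
    middle = range(len(sink), tail_start)
    # Rank mid-range entries by the (assumed) relevance score.
    ranked = sorted(middle, key=lambda i: relevance[i], reverse=True)
    history = sorted(ranked[:n_history])  # restore temporal order
    return sink + history + tail


# Toy relevance scores for 10 cached frames (illustrative values only).
scores = [0.0, 0.0, 0.1, 0.9, 0.2, 0.8, 0.3, 0.0, 0.0, 0.0]
kept = select_cache_indices(num_cached=10, n_sink=2, n_tail=3,
                            n_history=2, relevance=scores)
```

Here 7 of 10 cached entries are kept; with longer rollouts the kept set stays at `n_sink + n_history + n_tail` entries, which is where the attention savings would come from.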
This paper tackles the Close Small Object Unmixing (CSOU) problem for infrared imagery, where distant clustered targets appear as overlapping mixed spots due to optical diffraction limits. The authors propose DSCSNet, a deep-unfolded network that unrolls the ADMM algorithm with learnable parameters to recover target count, sub-pixel positions, and radiant intensities from mixed spots. The core idea is to replace the traditional ℓ2-norm smoothness terms with strict ℓ1-norm sparsity constraints and add a dynamic thresholding mechanism for scene-adaptive reconstruction.
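In ADMM-style solvers, an ℓ1 sparsity term is typically handled by the soft-thresholding operator, the proximal operator of the ℓ1 norm. The sketch below shows that operator on its own; treating `tau` as the quantity a dynamic thresholding mechanism would adapt per scene is an assumption, and the exact unrolled updates of DSCSNet are not reproduced here.

```python
def soft_threshold(values: list, tau: float) -> list:
    """Proximal operator of tau * ||x||_1, applied elementwise.

    Entries with magnitude at most `tau` become zero; the rest shrink
    toward zero by `tau`.
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    result = []
    for v in values:
        if v > tau:
            result.append(v - tau)
        elif v < -tau:
            result.append(v + tau)
        else:
            result.append(0.0)
    return result


# Small responses are zeroed, leaving a sparse set of candidate targets.
shrunk = soft_threshold([3.0, -0.5, 0.2, -2.0], tau=1.0)
```

A larger `tau` suppresses more weak responses and yields fewer recovered targets, so adapting it to the scene trades missed detections against false ones.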
FluidGaussian addresses a critical gap in 3D reconstruction: methods optimized solely for photometric losses often produce visually plausible but physically implausible geometries that fail in downstream simulations. The paper proposes coupling 3D Gaussian Splatting with incompressible fluid simulation (SPH/DFSPH) to define a simulation-based uncertainty metric—velocity divergence at the fluid-structure interface—and integrates it into active view selection. By reranking next-best-view candidates using this physical signal, the method improves both visual fidelity (PSNR) and physical plausibility (divergence) on synthetic and aerodynamic datasets.