Predicting how complex scenes evolve is essential for intelligent systems, yet dense video generation expends enormous compute on appearance rather than dynamics. This paper introduces Myriad, an autoregressive diffusion model that predicts future motion via sparse point trajectories, explicitly avoiding the 'visual tax' of pixel-level generation. By modeling step-wise uncertainty accumulation through flow matching and using fused transformer blocks, the method achieves a throughput of 2,200 samples/min, compared with less than 1 sample/min for video models, while matching or exceeding their predictive accuracy on motion-focused benchmarks.
This paper introduces the Vision Transformer (ViT), which applies a standard Transformer encoder directly to sequences of image patches for image classification. The core insight is that convolutional inductive biases (locality and translation equivariance) are unnecessary when models are pre-trained at sufficient scale, specifically on datasets containing 14M to 300M images. When transferred to downstream benchmarks, ViT matches or exceeds state-of-the-art CNNs while requiring substantially fewer computational resources to pre-train.
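To make "sequences of image patches" concrete, the sketch below splits a toy image into non-overlapping patches and flattens each one into a vector, which is the form a Transformer encoder would consume as a token. The function name `image_to_patches`, the 4x4 image, and the patch size of 2 are illustrative assumptions, not values taken from the summary.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    patch.extend(image[y][x])  # append every channel of this pixel
            patches.append(patch)
    return patches


# Toy 4x4 image with 3 channels; each pixel stores (row, column, 0).
image = [[[y, x, 0] for x in range(4)] for y in range(4)]
patches = image_to_patches(image, patch_size=2)
print(len(patches), len(patches[0]))
```

Each flattened patch would then typically be mapped to the model width by a learned linear projection before entering the encoder; that step is omitted here.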
Dynamic visual effects like explosions require complex temporal reasoning that is difficult to capture in text prompts. P-Flow introduces a training-free framework that treats prompts as optimization variables, using vision-language models to iteratively refine descriptions based on discrepancies between generated and reference videos. The method combines flow-matching noise inversion with lightweight historical context to achieve model-agnostic customization without fine-tuning.
This paper addresses weakly-supervised video scene graph generation (WS-VSGG), where models must parse videos into structured relational triplets using only sparse unlocalized annotations without bounding boxes. The core insight is that off-the-shelf object detectors indiscriminately detect all visible objects, overwhelming relation models with noisy non-interactive pairs, while fully-supervised detectors implicitly filter relationally irrelevant objects. To bridge this gap, the authors propose a three-component framework: Relation-Aware Matching (RAM) refines pseudo-labels via vision-language grounding, Pair Affinity Learning and Scoring (PALS) learns to distinguish interactive from non-interactive pairs, and Pair Affinity Modulation (PAM) gates attention based on affinity scores. This substantially narrows the gap to full supervision while reducing annotation costs.
This paper tackles test-time adaptation (TTA) for large multimodal 3D vision-language models under distribution shifts. The core idea is BayesMM, which models both textual and geometric features as Gaussian distributions and fuses them via Bayesian model averaging. Unlike cache-based methods that store discrete samples, this approach claims to avoid progressive information loss and heuristic hyperparameter tuning while maintaining training-free operation.
GeoFusion-CAD tackles the scalability bottleneck in parametric CAD generation, where Transformer-based methods struggle with long command sequences due to quadratic attention costs. The authors propose an end-to-end diffusion framework that encodes CAD programs as hierarchical trees and processes them with G-Mamba blocks—geometry-conditioned state-space models that achieve linear complexity $\mathcal{O}(Ld)$ while capturing geometric and topological dependencies. This enables scaling to sequences of up to 240 commands while maintaining high geometric fidelity.
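The linear cost $\mathcal{O}(Ld)$ can be illustrated with a generic diagonal state-space recurrence. This is not the G-Mamba block itself, whose details the summary does not give; it is a minimal sketch, assuming a per-channel update `h_t = decay * h_{t-1} + gain * x_t`, with all names hypothetical. The point is only that each of the L steps does d units of work, with no pairwise interaction between sequence positions.

```python
def diagonal_ssm_scan(inputs, decay, gain):
    """Run h_t = decay * h_{t-1} + gain * x_t independently for each channel.

    Returns the per-step states and the number of scalar updates performed.
    """
    dim = len(decay)
    state = [0.0] * dim
    outputs = []
    updates = 0
    for x in inputs:  # L sequential steps
        state = [decay[i] * state[i] + gain[i] * x[i] for i in range(dim)]
        updates += dim  # d scalar updates per step
        outputs.append(list(state))
    return outputs, updates


# Tiny check of the recurrence on one channel.
small_outputs, small_updates = diagonal_ssm_scan(
    [[1.0], [1.0], [1.0]], decay=[0.5], gain=[1.0]
)

# A 240-command sequence with a hypothetical width of 4 channels.
long_inputs = [[1.0] * 4 for _ in range(240)]
_, long_updates = diagonal_ssm_scan(long_inputs, decay=[0.9] * 4, gain=[1.0] * 4)
print(small_outputs, long_updates)
```

Doubling the sequence length doubles `long_updates`, whereas full self-attention over the same sequence would compare every pair of positions.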
Multiple Instance Learning (MIL) for gigapixel pathology images relies on a single linear layer to transform general patch features into task-specific representations before aggregation. This paper identifies this linear layer as a critical yet overlooked bottleneck and proposes Mammoth, a parameter-efficient mixture-of-experts module that replaces it with multi-headed soft routing to specialized low-rank experts. By routing morphologically similar patches to distinct expert slots, Mammoth achieves superior performance without increasing model size, demonstrating that the feature transformation step matters more than the choice of aggregation function.
Single-image 3D reconstruction is fundamentally ill-posed: one view admits many valid 3D explanations, especially under occlusion and structural variation. This paper tackles the problem by learning an adaptive part-whole hierarchy rather than fixed-decomposition or monolithic representations. The core idea is a slot-based architecture where an image-conditioned gating mechanism predicts which latent structural slots to activate per instance, coupled with a class-agnostic prototype bank that aligns active slots to shared geometric priors via soft attention. This eliminates the need for user-specified part counts while encouraging cross-category reuse of recurring structural patterns like legs or handles.
Pre-trained vision encoders excel at 2D recognition but lack 3D spatial awareness. SpatialBoost addresses this by converting dense 3D spatial information from 2D images into linguistic expressions, then injecting them into frozen vision encoders via LLM-based training with a novel dual-channel attention mechanism. The framework improves performance on spatial tasks (depth estimation, robot control) while maintaining or enhancing general vision capabilities (ImageNet classification), suggesting language serves as an effective supervision signal for geometric understanding.
Hand-object interaction (HOI) video generation is currently split between pose-only synthesis, static appearance generation, and motion methods requiring ground-truth first frames. This paper introduces PAM, a three-stage Pose–Appearance–Motion engine that generates high-resolution HOI videos from only initial/target poses and object geometry, achieving true sim-to-real transfer. The system combines GraspXL for pose trajectory generation, Flux for appearance synthesis with multimodal ControlNet conditioning, and CogVideoX for motion generation, producing 480×720 videos while improving FVD from 38.83 to 29.13 on DexYCB compared to prior work.
Video-LLMs struggle with high computational costs from massive visual token volumes (e.g., 6,272 tokens for a 32-frame video). This paper challenges the standard two-stage spatiotemporal compression paradigm—which assumes spatial and temporal redundancy are separable—by reformulating compression as a global allocation problem. The authors propose a unified selection mechanism combining attention weights and semantic similarity to identify high-contribution, low-redundancy tokens, plus a text-aware merging module for secondary compression inside the LLM. The result is a training-free, plug-and-play method that retains ~90% performance with only 2% of tokens.
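The numbers above imply 196 tokens per frame and a budget of roughly 125 tokens at a 2% keep ratio. The sketch below works through that arithmetic and then shows one possible form of global selection: a greedy loop that picks tokens across the whole video by importance score minus similarity to tokens already kept. The scoring rule, `redundancy_weight`, and the function names are assumptions for illustration; the summary does not specify the paper's exact criterion.

```python
import math


def token_budget(total_tokens, keep_ratio):
    """Number of tokens to keep, at least one."""
    return max(1, round(total_tokens * keep_ratio))


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def select_tokens(scores, features, budget, redundancy_weight=1.0):
    """Greedy global selection: prefer high score, penalize similarity to kept tokens."""
    selected = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < budget:
        def gain(idx):
            redundancy = max(
                (cosine(features[idx], features[s]) for s in selected), default=0.0
            )
            return scores[idx] - redundancy_weight * redundancy

        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


tokens_per_frame = 6272 // 32
budget = token_budget(6272, 0.02)

# Tokens 0 and 1 are duplicates; token 2 is distinct but scores lower.
kept = select_tokens(
    scores=[0.9, 0.85, 0.5],
    features=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    budget=2,
)
print(tokens_per_frame, budget, kept)
```

In the toy example the duplicate token is skipped in favor of the lower-scoring but non-redundant one, which is the behavior a fixed per-frame quota cannot guarantee.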
This paper benchmarks Vision Transformer backbones (ViT-B, ViT-L, ViT-H) within a Local pattern Self-Supervised Auxiliary Task (L-SSAT) framework. The core idea fuses Local Directional Pattern (LDP) texture descriptors with RGB inputs via Masked Autoencoder reconstruction as an auxiliary task to primary face classification. The study addresses whether a unified backbone exists across diverse face analysis tasks including deepfake detection (FaceForensics++), attribute prediction (CelebA), and emotion recognition (AffectNet).
Direct Preference Optimization (DPO) for Vision-Language Models suffers from Likelihood Displacement, where optimization collapses the probabilities of both chosen and rejected responses, causing models to abandon visual evidence for language priors. This paper proposes Asymmetric Constrained Preference Optimization (ACPO), which applies dynamic, length-aware scaling exclusively to the rejected reward term, preserving the chosen distribution as a stable anchor while selectively suppressing incorrect outputs.
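The asymmetry can be sketched against the standard DPO objective, which is the negative log-sigmoid of a scaled difference between the chosen and rejected log-probability ratios. In the sketch below, `rejected_scale` stands in for the dynamic, length-aware factor; how ACPO actually computes that factor is not stated in the summary, so it is passed in as a plain number and should be read as a placeholder.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """Standard DPO loss for one pair; each log-ratio is log(pi / pi_ref)."""
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


def asymmetric_loss(chosen_logratio, rejected_logratio, rejected_scale, beta=0.1):
    """Scale only the rejected term; the chosen term is left untouched."""
    margin = chosen_logratio - rejected_scale * rejected_logratio
    return -log_sigmoid(beta * margin)


baseline = dpo_loss(1.0, -2.0)
same_as_baseline = asymmetric_loss(1.0, -2.0, rejected_scale=1.0)
scaled = asymmetric_loss(1.0, -2.0, rejected_scale=0.5)
print(baseline, same_as_baseline, scaled)
```

With `rejected_scale=1.0` the two losses coincide, which makes the modification easy to isolate: only the weight on the rejected response changes the gradient signal.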
This paper evaluates whether recurrent temporal modeling helps event-based object detection in industrial settings. The authors benchmark ReYOLOv8s (a recurrent ConvLSTM-augmented detector) against a vanilla YOLOv8s baseline on MTEvent, an industrial warehouse/factory dataset with 17 classes and severe class imbalance. The key question is whether memory across temporal clip lengths (3-21 frames) improves detection over single-window baselines.
This paper addresses interactive text-to-image retrieval (I-TIR) where diffusion models generate visual proxies from dialogue, but static additive fusion of text and generated images introduces harmful noise. The core idea is ADaFuSE, a lightweight plug-in module combining adaptive gating (to dynamically weight modalities per instance) with a semantic-aware mixture-of-experts branch (to capture fine-grained cross-modal cues). The work matters because it challenges the assumption that diffusion-augmented retrieval always benefits from generated images, showing that up to 55.62% of queries suffer degradation under static fusion.
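The contrast between static additive fusion and per-instance gating can be sketched as follows. This covers only the gating idea, not the mixture-of-experts branch, and the gate parameterization (a sigmoid over a linear function of both embeddings) is an assumption; the gate weights shown are hypothetical and would be learned in practice.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def static_fusion(text_vec, image_vec):
    """Baseline: add both embeddings with fixed, equal weight."""
    return [t + v for t, v in zip(text_vec, image_vec)]


def gated_fusion(text_vec, image_vec, gate_weights, gate_bias):
    """Per-query gate g in (0, 1): g weights the text, (1 - g) the generated image."""
    gate_input = text_vec + image_vec  # list concatenation of both embeddings
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, gate_input)) + gate_bias)
    fused = [gate * t + (1.0 - gate) * v for t, v in zip(text_vec, image_vec)]
    return fused, gate


text_vec = [1.0, 0.0]
noisy_image_vec = [-0.5, 2.0]  # a generated proxy that disagrees with the text

added = static_fusion(text_vec, noisy_image_vec)
# Hypothetical gate parameters that strongly favor the text for this query.
fused, gate = gated_fusion(text_vec, noisy_image_vec, [0.0, 0.0, 0.0, 0.0], 4.0)
print(added, fused, gate)
```

Under static fusion the noisy image embedding dominates the second coordinate, while the gated result stays close to the text embedding for this query.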
Long video understanding remains challenging for multimodal large language models due to limited context windows. VideoDetective addresses this by modeling videos as visual–temporal affinity graphs that fuse visual similarity with temporal continuity. The framework propagates query relevance through an iterative hypothesis–verification–refinement loop, enabling sparse but informed sampling of critical segments for question answering.
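One way to picture an affinity graph that fuses visual similarity with temporal continuity is to multiply a cosine similarity by a decay in temporal distance, then spread query relevance along the resulting edges. The sketch below does exactly that for three toy segments. The product form, the decay constant `tau`, the mixing weight `alpha`, and the single propagation step are all illustrative assumptions, not the paper's formulation.

```python
import math


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def affinity_matrix(features, tau=2.0):
    """Edge weight = non-negative visual similarity * temporal proximity."""
    n = len(features)
    return [
        [
            max(cosine(features[i], features[j]), 0.0) * math.exp(-abs(i - j) / tau)
            for j in range(n)
        ]
        for i in range(n)
    ]


def propagate(seed, affinity, alpha=0.5, steps=1):
    """Blend the seed relevance with relevance averaged over weighted neighbors."""
    relevance = list(seed)
    for _ in range(steps):
        updated = []
        for i, row in enumerate(affinity):
            total = sum(row)
            spread = sum(w * r for w, r in zip(row, relevance)) / total if total else 0.0
            updated.append((1.0 - alpha) * seed[i] + alpha * spread)
        relevance = updated
    return relevance


# Segments 0 and 1 look alike and are adjacent; segment 2 looks different.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
seed = [1.0, 0.0, 0.0]  # only segment 0 was verified as relevant so far
relevance = propagate(seed, affinity_matrix(features))
print(relevance)
```

Relevance leaks from the verified segment to its similar neighbor but not to the dissimilar one, which is what lets a sampler prioritize unvisited segments without scoring every frame.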
Deep S2P modernizes the Satellite Stereo Pipeline (S2P) by replacing classical SGM and MGM correlators with contemporary learned matchers including FoundationStereo, MonSter, and StereoAnywhere. The core technical contribution adapts the rectification stage to enforce unipolar disparities with proper altitude consistency and disparity range constraints, enabling off-the-shelf deep networks to operate on satellite imagery. This matters for operational Earth observation because it delivers sharper Digital Surface Models with finer geometric detail, though the work also candidly exposes how standard metrics saturate and how vegetation remains a stubborn failure mode.
Group3D addresses open-vocabulary 3D object detection from multi-view RGB images by integrating semantic constraints directly into instance construction. Unlike prior work that merges fragments based solely on geometric consistency, it leverages a multimodal large language model to organize scene vocabularies into semantic compatibility groups that gate cross-view fragment association. This prevents irreversible over-merging when geometric evidence is incomplete, achieving state-of-the-art results on ScanNet and ARKitScenes in both pose-known and challenging pose-free zero-shot settings.
Artistic font generation seeks to transfer visual styles from reference images onto text glyphs while preserving readability. This paper proposes a paradigm shift from feature-fusion or adapter-based diffusion approaches to visual in-context generation, treating element images as pixel-level context for an inpainting model (FLUX.1-Fill). The core innovation lies in repurposing image inpainting as style transfer: element images are concatenated with a blank canvas, and the model fills glyph masks by propagating visual cues from the reference. This enables high-fidelity texture preservation and fine-grained control via a lightweight Context-aware Mask Adapter (CMA), supporting both object elements (structured) and amorphous elements (textures).
This paper tackles the efficiency–generalization trade-off in Continual Test-Time Adaptation (CTTA), where models must adapt online to unlabeled streams under distribution shift without source data. The core insight is that feature updates need only occur within a low-rank "golden subspace" coinciding with the row space of the classifier. To avoid costly retraining, the authors propose using the Average Gradient Outer Product (AGOP) as an online proxy for the classifier weight structure, leading to the GOLD method that projects features onto this subspace and learns a compact scaling vector. If the theoretical claims hold under realistic nonlinear settings, this could significantly reduce deployment costs for adaptive systems.
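For a purely linear classifier, the reason the row space suffices can be checked directly: any feature component orthogonal to the rows of the weight matrix contributes nothing to the logits. The sketch below builds an orthonormal basis of the row space with Gram-Schmidt, projects a feature onto it, and compares logits before and after. It does not implement AGOP or the GOLD method, and the weights and feature are made-up toy values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(rows, tol=1e-10):
    """Gram-Schmidt over the classifier rows; near-zero residuals are dropped."""
    basis = []
    for row in rows:
        residual = list(row)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        norm = dot(residual, residual) ** 0.5
        if norm > tol:
            basis.append([r / norm for r in residual])
    return basis


def project(feature, basis):
    """Orthogonal projection of a feature onto the span of the basis vectors."""
    projected = [0.0] * len(feature)
    for b in basis:
        coeff = dot(feature, b)
        projected = [p + coeff * bi for p, bi in zip(projected, b)]
    return projected


def logits(weights, feature):
    return [dot(row, feature) for row in weights]


# Toy classifier: 2 classes, 4-dimensional features, so the row space has rank 2.
weights = [[1.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]
feature = [3.0, -1.0, 5.0, 2.0]

basis = orthonormal_basis(weights)
projected = project(feature, basis)
print(logits(weights, feature), logits(weights, projected))
```

The last two feature coordinates are discarded by the projection, yet the logits are unchanged, so updates confined to this subspace lose nothing for a linear head. Whether the same holds for nonlinear networks is the open question the summary flags.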