This paper investigates whether video transformers can detect respiratory distress from video recordings of post-exercise recovery. The authors frame the problem as a temporal ordering task (predicting which of two clips shows greater shortness of breath) and propose augmenting ViViT with Lie Relative Encodings (LieRE) and Motion-Guided Masking (MGM). The model achieves an F1 score of 0.81, though the test set contains only 7 videos from 3 participants.
Remote sensing (RS) text-to-image generation suffers from a lack of domain-specific diffusion transformers and from the prohibitive cost of high-resolution training. Existing training-free resolution promotion methods apply static RoPE scaling that uniformly compresses the spatial spectrum, which is particularly harmful for RS imagery because of its characteristically denser high-frequency energy. This paper proposes SHARP, a spectrum-aware dynamic adaptation strategy that uses a rational decay scheduler $\kappa_{rs}(t)$ to apply strong positional extrapolation early in denoising (for layout formation) and progressively relax it later (for detail recovery). The approach enables robust multi-scale generation at extrapolation factors of up to 2.5$\times$ with negligible overhead, addressing a critical gap in large-scale RS synthesis.
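To make the scheduling idea concrete, here is a minimal sketch. The summary does not give the closed form of $\kappa_{rs}(t)$, so the rational function below, its shape parameter `c`, and the way the scale is applied to the RoPE frequencies are all assumptions, chosen only to reproduce the described behavior: strong scaling at high noise that relaxes to no scaling by the end of denoising.

```python
def kappa_rs(t: float, max_scale: float, c: float = 0.5) -> float:
    """Hypothetical rational decay schedule (an assumed form, not the paper's).

    t is the normalized noise level: 1.0 at the start of denoising and
    0.0 at the end. Returns max_scale at t = 1 and 1.0 at t = 0.
    """
    return 1.0 + (max_scale - 1.0) * t / (t + c * (1.0 - t))


def rope_frequencies(dim: int, scale: float, base: float = 10000.0) -> list:
    """Standard RoPE inverse frequencies, uniformly divided by `scale`."""
    return [base ** (-2.0 * i / dim) / scale for i in range(dim // 2)]


# Scale factor over a 5-step denoising trajectory for 2.5x extrapolation.
timesteps = [1.0, 0.75, 0.5, 0.25, 0.0]
schedule = [kappa_rs(t, max_scale=2.5) for t in timesteps]

# Early steps compress the frequencies strongly; the last step leaves them
# untouched, so fine detail is rendered with the original spectrum.
freqs_early = rope_frequencies(dim=8, scale=schedule[0])
freqs_late = rope_frequencies(dim=8, scale=schedule[-1])
```

In this sketch, a static method would correspond to passing `max_scale` as the scale at every step instead of following the schedule.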
Cross-lingual dysarthria detection in Parkinson's disease is hampered by language-dependent structure in self-supervised speech representations that confounds pathology classification. This paper proposes a centroid-based 'language shift' (LS) that aligns source-language embeddings toward target-language distributions using only healthy control speech, enabling zero-shot transfer without model retraining. The approach addresses the critical data scarcity in clinical speech applications while aiming to disentangle linguistic variation from motor impairment markers.
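The centroid-based shift can be sketched as follows, assuming the shift vector is the difference between the target-language and source-language healthy-control centroids, added to every source embedding. The function names and the toy 2-D vectors are hypothetical, and the paper's exact formulation may differ.

```python
from typing import List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def language_shift(source_embeddings: List[Vector],
                   source_healthy: List[Vector],
                   target_healthy: List[Vector]) -> List[Vector]:
    """Translate source embeddings by (target centroid - source centroid).

    Both centroids are computed from healthy-control speech only, so no
    patient data from the target language is needed (assumed setup).
    """
    src_c = centroid(source_healthy)
    tgt_c = centroid(target_healthy)
    delta = [t - s for s, t in zip(src_c, tgt_c)]
    return [[x + d for x, d in zip(vec, delta)] for vec in source_embeddings]


# Toy data: healthy-control embeddings per language, one source patient.
source_healthy = [[0.0, 0.0], [2.0, 0.0]]   # centroid (1, 0)
target_healthy = [[4.0, 1.0], [6.0, 3.0]]   # centroid (5, 2)
source_patients = [[1.0, 1.0]]

shifted = language_shift(source_patients, source_healthy, target_healthy)
```

Because the shift is a pure translation, distances between source embeddings are preserved; only their position relative to the target-language distribution changes.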
Accurate respiratory motion modeling is critical for radiotherapy precision, yet patient-specific breathing patterns are difficult to predict outside observed ranges. This paper proposes PRISM-RM, a trajectory-aware implicit neural representation (INR) that models lung motion as a continuous diffeomorphic flow driven by external surrogate signals. By integrating neo-Hookean hyperelastic constraints with temporal total-variation regularization, the method eliminates the need for fixed reference breathing states and aims to improve extrapolation to unseen respiratory phases.
Federated learning enables privacy-preserving medical AI but struggles with unreliable uncertainty estimates when clinical data is heterogeneous and imbalanced across sites. TrustFed addresses this by introducing representation-aware conformal prediction, which assigns test samples to calibration clients based on feature-space similarity and aggregates local thresholds via a soft-nearest strategy to provide finite-sample coverage guarantees without centralizing raw data. Validated on over 430,000 images across six distinct imaging modalities, the work advances federated learning from privacy-preserving training toward clinically trustworthy deployment with statistically calibrated uncertainty.
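A rough sketch of the idea, under stated assumptions: each client computes a standard split-conformal quantile over its own calibration scores, and a test sample blends those thresholds with softmax-style weights based on its distance to each client's feature centroid. The weighting rule, the `temperature` parameter, and the dictionary layout are illustrative guesses, not TrustFed's actual procedure, and this simplified blend is not claimed to carry the paper's coverage guarantee.

```python
import math


def local_threshold(cal_scores, alpha):
    """Split-conformal quantile of one client's calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]


def soft_nearest_threshold(feature, clients, alpha, temperature=1.0):
    """Blend client thresholds, weighting clients whose centroid is closer."""
    weights = [math.exp(-math.dist(feature, c["centroid"]) / temperature)
               for c in clients]
    thresholds = [local_threshold(c["scores"], alpha) for c in clients]
    return sum(w * q for w, q in zip(weights, thresholds)) / sum(weights)


# Two hypothetical clients: a feature centroid plus nonconformity scores
# (here 1 - probability of the true class) from local calibration data.
clients = [
    {"centroid": [0.0, 0.0], "scores": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]},
    {"centroid": [3.0, 4.0], "scores": [0.2, 0.3, 0.5, 0.6, 0.8, 0.9, 1.0]},
]

# A test sample whose features sit on the first client's centroid.
tau = soft_nearest_threshold([0.0, 0.0], clients, alpha=0.25)

# Prediction set: every label whose nonconformity score is within tau.
probs = {"benign": 0.7, "malignant": 0.3}
prediction_set = [label for label, p in probs.items() if 1.0 - p <= tau]
```

Only centroids and scalar thresholds cross site boundaries in this sketch, which is the property that keeps raw images local.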
Pointing-based methods improve Large Vision-Language Models (LVLMs) by grounding objects before answering, yet the underlying mechanism remains unclear. This work investigates why pointing helps by comparing Direct Counting against Point-then-Count (PtC) in zero-shot counting tasks using synthetic data with controlled spatial layouts. The authors find that intermediate coordinate supervision encourages skill learning rather than narrow task memorization, yielding stronger out-of-distribution generalization while providing verifiable visual explanations.
This paper tackles the problem of speaker traits entangling with synthesis source information in speech deepfake source verification. The authors propose a Speaker-Disentangled Metric Learning (SDML) framework that combines Chebyshev polynomial approximations for gradient stability with Riemannian geometry (hyperbolic space) to separate speaker identity from source generator artifacts. Evaluated on four new cross-protocols using the MLAAD benchmark, the method aims to prevent models from relying on speaker shortcuts when verifying synthetic speech origins.
Diffusion models generate high-quality images but require hundreds of denoising steps, making deployment on edge devices impractical. This paper proposes Coarse-to-Fine Diffusion Models, which denoise at low resolution early in the process (when outputs are still dominated by noise) before switching to high resolution, together with a fast time-step search method that finds good sampling schedules in under 10 minutes instead of days.
Parallel decoding promises faster text generation than autoregressive models but historically sacrifices quality due to simplified conditional independence assumptions. This paper introduces Gumbel Distillation, which leverages the Gumbel-Max trick to create a deterministic mapping from latent noise to teacher outputs, effectively providing the parallel student a blueprint for joint token distributions. By conditioning on Gumbel noise rather than relying on naive factorization, the method narrows the quality-efficiency gap, delivering substantial improvements across masked diffusion and multi-token prediction architectures.
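The Gumbel-Max trick underlying this idea can be illustrated in a few lines. Taking the argmax of logits plus independent Gumbel(0, 1) noise yields a sample from the softmax distribution, and once the noise is fixed the chosen token is a deterministic function of it. The logits below stand in for a teacher's output at one position and are hypothetical; how the student is conditioned on the noise is not shown.

```python
import math
import random


def sample_gumbel(rng: random.Random) -> float:
    """Draw one Gumbel(0, 1) sample via the inverse CDF."""
    u = rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))


def gumbel_max(logits, noise) -> int:
    """Index of the largest perturbed logit (deterministic given noise)."""
    scores = [l + g for l, g in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]                 # hypothetical teacher logits
noise = [sample_gumbel(rng) for _ in logits]

# The same noise vector always maps to the same token, so (noise, token)
# pairs can serve as deterministic supervision for a student model.
token_a = gumbel_max(logits, noise)
token_b = gumbel_max(logits, noise)
```

Repeating the draw with fresh noise many times recovers the softmax frequencies of the logits, which is what makes the noise a faithful latent for the teacher's distribution.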
Generative zero-shot learning (ZSL) synthesizes visual features for unseen classes conditioned on semantic prototypes, but existing methods often produce task-agnostic features that overlap for semantically similar yet visually distinct categories. This paper proposes RLVC, an outcome-reward reinforcement learning framework that treats the feature generator as a policy model and optimizes it using classifier confidence as the reward signal. The method further incorporates class-wise visual prototypes via a distillation loss to align synthesized features with real data distributions, achieving reported state-of-the-art results on CUB, SUN, and AWA2 benchmarks.
This paper tackles real-world image restoration (Real-IR) by adapting the 12B-parameter FLUX.1-dev flow matching model to low-level vision tasks. The core innovation is ResFlow-Tuner, which combines Unified Multi-Modal Fusion (UMMF) of image and text cues with a novel test-time scaling (TTS) paradigm that greedily optimizes ODE sampling trajectories using a multi-reward ensemble during inference. This establishes a new compute-quality trade-off for generative image restoration, showing that carefully perturbing intermediate flow states can yield substantial perceptual gains without retraining the base model.
The paper tackles the computational bottleneck of radiative transfer models (RTMs) for hyperspectral image (HSI) generation by proposing a VAE-based emulation framework that learns latent representations conditioned on biophysical parameters. It introduces both pixel-to-pixel (P2P) and fully convolutional (FC-VAE) variants, trained via either direct one-step mapping or a two-step pretraining strategy that decouples representation learning from parameter-to-latent interpolation. The work is significant for remote sensing applications as it provides empirical evidence that optimal emulator architecture depends critically on whether the target data is simulated (where P2P excels) or real-world imagery (where FC-VAE-pre dominates), and demonstrates that emulated data preserves downstream utility for parameter retrieval tasks.
Optical flow estimation traditionally requires expensive ground-truth annotations or relies on unreliable brightness constancy assumptions that fail under occlusion and illumination changes. This paper introduces GenOpticalFlow, a framework that synthesizes perfectly aligned training pairs by using monocular depth estimates to generate pseudo-optical flow, then conditioning a latent diffusion model to render corresponding next frames. The core innovation is converting unsupervised optical flow learning into a supervised training paradigm using synthetic data with geometrically consistent motion fields, potentially eliminating the need for manual annotation at scale.
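One way to picture the depth-to-flow step is the sketch below, which assumes a static scene, a pinhole camera, and a purely translational virtual camera motion. These assumptions, along with the function name and parameters, are illustrative; the paper's motion model may be richer. Each pixel is back-projected with its depth, moved into the new camera frame, and reprojected, and the pixel displacement is the pseudo-flow.

```python
def flow_from_depth(depth, focal, cx, cy, translation):
    """Per-pixel (du, dv) induced by translating a pinhole camera.

    depth:       2-D list of positive depths, indexed [row][col]
    focal:       focal length in pixels
    cx, cy:      principal point in pixels
    translation: (tx, ty, tz) camera motion in the depth's units;
                 assumes tz is smaller than every depth value
    """
    tx, ty, tz = translation
    flow = []
    for v, row in enumerate(depth):
        flow_row = []
        for u, z in enumerate(row):
            # Back-project the pixel to a 3-D point in the first camera frame.
            x = (u - cx) * z / focal
            y = (v - cy) * z / focal
            # Express the same point in the translated camera frame.
            x2, y2, z2 = x - tx, y - ty, z - tz
            # Reproject and take the pixel displacement.
            u2 = focal * x2 / z2 + cx
            v2 = focal * y2 / z2 + cy
            flow_row.append((u2 - u, v2 - v))
        flow.append(flow_row)
    return flow


# One image row with a near pixel (depth 2) and a far pixel (depth 4).
depth = [[2.0, 4.0]]
flow = flow_from_depth(depth, focal=100.0, cx=0.5, cy=0.0,
                       translation=(0.1, 0.0, 0.0))
```

In this toy case the near pixel moves about twice as far as the far one, the parallax cue that gives the synthesized motion field its geometric consistency.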
Next app prediction struggles when user intent shifts rapidly and historical profiles are sparse. MISApp tackles this via multi-hop session graphs that decompose transitions into 1-, 2-, and 3-hop structural ranges, using LightGCN for lightweight propagation and a Transformer encoder-decoder to model intent evolution without requiring static user profiles, aiming for robust cold-start performance.
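The multi-hop decomposition can be sketched as follows, assuming a k-hop edge links two apps that are k steps apart within a session. This reading of "hop", the function name, and the toy session are assumptions; the LightGCN propagation and the Transformer stages are not shown.

```python
from collections import Counter


def multi_hop_edges(session, max_hop=3):
    """Count directed app-to-app transitions at each hop distance.

    Returns {k: Counter({(src, dst): count})} for k = 1..max_hop, where
    dst is the app opened k steps after src in the same session.
    """
    return {k: Counter(zip(session, session[k:]))
            for k in range(1, max_hop + 1)}


# A single hypothetical session of app launches, in order.
session = ["mail", "calendar", "maps", "mail", "chat"]
graphs = multi_hop_edges(session)
```

Each hop level yields its own weighted edge list, so a graph model can propagate over short-range and longer-range transitions separately, without relying on any per-user profile.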
StreamingClaw addresses real-time streaming video understanding for embodied intelligence applications such as autonomous driving and robotics. The framework unifies continuous perception, hierarchical multimodal memory, and proactive interaction through a main–sub-agent architecture where StreamingReasoning orchestrates StreamingMemory and StreamingProactivity sub-agents. By integrating incremental KV-cache reuse with dynamic pruning, memory evolution from atomic actions to events, and trigger-based proactive responses, it aims to close the perception–decision–action loop for physical world deployment.
VIGIL tackles hallucination in multimodal deepfake detection by decoupling claim generation from evidence sourcing through a part-centric plan-then-examine pipeline. The framework first plans which facial parts to inspect using global visual cues, then examines each part with independently sourced forensic evidence delivered via a stage-gated injection mechanism. Combined with a progressive three-stage training paradigm featuring part-aware reinforcement learning rewards, the method aims to produce verifiable, anatomically grounded explanations rather than confabulated reasoning chains.
AnimalCLAP addresses zero-shot species recognition from vocalizations—a critical challenge for biodiversity monitoring when training data is scarce for rare species. The core idea is to inject hierarchical taxonomic knowledge (class, order, family, genus, species) into audio-text contrastive learning via multiple prompt templates, paired with a large dataset of 4,225 hours covering 6,823 species annotated with 22 ecological traits. This matters because it enables automated monitoring in visually occluded habitats like dense forests while inferring biological traits directly from sound.
MultiBind targets a critical blind spot in evaluating multi-subject image generators: cross-subject attribute misbinding, where models assign jackets, smiles, or poses to the wrong person. The benchmark grounds each test case in a real photograph (508 instances, 2–4 human subjects each) and provides slot-ordered crops, masks, background references, and long entity-indexed prompts (~474 words). Its core technical idea is the delta-matrix evaluation: for each attribute dimension $d$, compute $\Delta^{(d)} = S_{\mathrm{gen}}^{(d)} - S_{\mathrm{gt}}^{(d)}$, subtracting ground-truth subject similarities from generated-to-ground-truth similarities to isolate generation-induced confusion from natural subject resemblance. This separates self-degradation (diagonal) from cross-subject interference (off-diagonal) and exposes interpretable failure modes—drift, swap, dominance, and blending—that holistic metrics like CLIP or FID miss.
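A small worked example of the delta matrix for one attribute dimension is given below. The indexing convention (rows are generated subjects, columns are ground-truth subjects) and the two summary statistics are assumptions made for illustration; the similarity values are made up.

```python
def delta_matrix(s_gen, s_gt):
    """Element-wise S_gen - S_gt for one attribute dimension."""
    return [[g - t for g, t in zip(row_gen, row_gt)]
            for row_gen, row_gt in zip(s_gen, s_gt)]


def summarize(delta):
    """Mean diagonal (self-degradation) and worst off-diagonal (interference)."""
    n = len(delta)
    diagonal = [delta[i][i] for i in range(n)]
    off_diagonal = [delta[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(diagonal) / n, max(off_diagonal)


# S_gt[i][j]: similarity between ground-truth subjects i and j.
s_gt = [[1.0, 0.25],
        [0.25, 1.0]]
# S_gen[i][j]: similarity between generated subject i and ground-truth j.
s_gen = [[0.75, 0.5],
         [0.25, 0.5]]

delta = delta_matrix(s_gen, s_gt)
self_degradation, worst_interference = summarize(delta)
```

Here the negative diagonal shows both subjects losing fidelity to their own references. The positive entry in row 0, column 1 shows generated subject 0 resembling subject 1 more than the two real people resemble each other, which is the kind of cross-subject leakage the benchmark targets.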
This paper challenges the monolithic paradigm in pose-free feed-forward 3D Gaussian Splatting (3DGS), where a single network jointly estimates camera poses and synthesizes Gaussians. The authors propose 2Xplat, a modular two-expert framework that decouples geometry estimation (using Depth Anything 3) from appearance synthesis (using Multi-view Pyramid Transformer) via an explicit pose interface. The core claim is that separating these concerns enables superior training efficiency (<5K iterations) and novel-view synthesis quality competitive with posed methods, challenging the assumption that unified architectures are optimal.
Multi-Object Tracking (MOT) models often degrade during inference due to distribution shifts between training and test data. This paper proposes TCEI (Test-time Calibration from Experience and Intuition), a cognitive-inspired framework that uses transient memory for short-term guidance and accumulated experience for long-term calibration. Unlike traditional test-time adaptation (TTA) methods that require backpropagation, TCEI operates entirely via forward propagation, adapting identity predictions in real time without additional training.