The paper addresses sample-efficient selection among multiple pretrained generative models, formulated as a diversity-aware multi-armed bandit problem where the optimal solution may be a mixture rather than a single model. The authors challenge the necessity of explicit UCB exploration bonuses, proposing that Mixture-Greedy—which directly optimizes empirical diversity objectives without optimism bonuses—can achieve sublinear regret through implicit exploration induced by the objective geometry. This matters because sampling from suboptimal generative models is computationally expensive, and their results suggest that structural properties of diversity metrics (FID, Vendi, RKE) naturally enforce sufficient exploration without costly confidence bound computations.
CICTM addresses deformable brain MRI registration by combining transformer-based global context modeling with cycle inverse-consistency constraints. The core idea uses a Swin-UNet to jointly estimate forward and backward deformation fields, penalizing inconsistencies at both image and flow levels while enforcing topology preservation via Jacobian regularization. The work matters for large-scale neuroimaging studies where deformation stability and physical plausibility are as important as alignment accuracy.
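The topology-preservation term can be illustrated with a small NumPy sketch. It assumes a 2D displacement field and a hinge penalty on negative Jacobian determinants, which is one common form of Jacobian regularization; the summary does not state the exact loss CICTM uses, and the function names here are hypothetical.

```python
import numpy as np

def jacobian_determinant_2d(disp):
    """Determinant of the Jacobian of phi(x) = x + disp(x).

    disp has shape (2, H, W): disp[0] is the row (y) displacement,
    disp[1] is the column (x) displacement, both in pixels.
    """
    dy, dx = disp[0], disp[1]
    # np.gradient returns derivatives along axis 0 (y) and axis 1 (x).
    ddy_dy, ddy_dx = np.gradient(dy)
    ddx_dy, ddx_dx = np.gradient(dx)
    # Jacobian of phi is the identity plus the gradient of the displacement.
    return (1.0 + ddy_dy) * (1.0 + ddx_dx) - ddy_dx * ddx_dy

def folding_penalty(disp):
    """Mean magnitude of negative determinants (assumed hinge form).

    Zero when the deformation preserves local orientation everywhere.
    """
    det = jacobian_determinant_2d(disp)
    return float(np.mean(np.maximum(0.0, -det)))
```

A zero displacement yields a determinant of 1 everywhere and no penalty, while a field that mirrors the image along one axis yields negative determinants and a positive penalty.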
Long-tail class incremental learning (LT-CIL) suffers from scarce tail-class data and catastrophic forgetting. This paper tackles both issues by using large language models to generate a stratified language tree (SL-Tree) that hierarchically organizes semantic information from coarse to fine granularity. Two parallel guidance mechanisms dynamically supervise tail classes and constrain optimization: adaptive language guidance, which uses learnable per-class weights, and alignment language guidance, which exploits the stability of the semantic space. The paper reports state-of-the-art results on the ImageNet-R, CIFAR100, and CUB200 benchmarks.
This paper tackles the inefficiency of Interleaved-Modal Chain-of-Thought (ICoT) reasoning, where current methods statically insert visual tokens after every reasoning step, which wastes compute on redundant image embeddings and relies on semantically fragmented patches. DaP-ICoT introduces a confidence-aware gating mechanism that pulls in visual context only when model certainty drops below a threshold, combined with SAM2-based object segmentation to provide coherent visual thoughts instead of fragmented patches.
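The gating idea can be sketched in a few lines. This is a minimal illustration, assuming that "model certainty" is the maximum next-token softmax probability and that the threshold is 0.5; both are assumptions, since the summary does not specify the confidence measure or the threshold DaP-ICoT uses.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def needs_visual_context(logits, threshold=0.5):
    """Return True when the top next-token probability is below threshold.

    Both the confidence measure and the default threshold are
    illustrative assumptions, not values taken from the paper.
    """
    probs = softmax(np.asarray(logits, dtype=float))
    return bool(probs.max() < threshold)
```

Under this sketch, a reasoning step with a sharply peaked next-token distribution proceeds on text alone, and visual tokens are fetched only at steps where the distribution is flat.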
This work addresses zero-shot detection of AI-generated images by measuring how Vision Foundation Model (VFM) representations respond to structured high-frequency perturbations. The core idea is that synthetic images contain characteristic frequency biases, causing their embeddings to shift differently than real images when high-frequency noise is applied to local patches. The method achieves strong detection accuracy while requiring only a single Fourier transform and one forward pass, making it one to two orders of magnitude faster than comparable training-free approaches.
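The probing step can be sketched as follows. The code assumes a frequency cutoff in cycles per pixel, a perturbation applied to the whole image rather than to local patches, and cosine distance as the shift measure. The `encoder` argument is a placeholder for any callable that maps an image to an embedding. None of these choices are confirmed by the summary.

```python
import numpy as np

def high_frequency_noise(shape, cutoff=0.25, seed=0):
    """White noise with all spatial frequencies below `cutoff` removed.

    `cutoff` is in cycles per pixel and is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    spectrum = np.fft.fft2(rng.standard_normal(shape))
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    # Zero out the low-frequency disc, keeping only high frequencies.
    spectrum[np.sqrt(fy**2 + fx**2) < cutoff] = 0.0
    return np.real(np.fft.ifft2(spectrum))

def embedding_shift(encoder, image, noise):
    """Cosine distance between embeddings of the clean and perturbed image."""
    a = np.ravel(encoder(image))
    b = np.ravel(encoder(image + noise))
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cos)
```

A detector built on this sketch would compare the shift of a query image against a threshold calibrated on real images; how the paper sets that decision rule is not stated here.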
MetaCompress addresses token reduction for multi-turn VQA in Large Vision-Language Models, where future questions are unpredictable and may target any image region. The paper proposes a learning-based prompt-agnostic compression module trained via KL divergence minimization between original and compressed outputs, demonstrating that heuristic attention-based pruning is suboptimal for this scenario. The method achieves strong efficiency-accuracy trade-offs across five LVLM architectures while training on only ~20k samples.
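The training signal described above can be sketched as a KL divergence between the output distributions produced with the full and the compressed visual tokens. The direction KL(original || compressed) and the averaging over positions are assumptions; the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))

def kl_original_vs_compressed(original_logits, compressed_logits):
    """Mean KL(p_original || p_compressed) over output positions.

    Inputs have shape (positions, vocab). The KL direction is an
    assumption made for this sketch.
    """
    log_p = log_softmax(np.asarray(original_logits, dtype=float))
    log_q = log_softmax(np.asarray(compressed_logits, dtype=float))
    kl = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return float(np.mean(kl))
```

The loss is zero when compression leaves the output distribution unchanged and grows as the compressed model's predictions drift from the original, without reference to any particular question.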
This paper tackles the domain generalization problem in image deraining, where models trained on synthetic data fail catastrophically on out-of-distribution (OOD) real-world scenarios. The authors propose a three-stage pipeline—Superpixel Generation, Resolution-adaptive Fusion, and Pseudo-label Re-Synthesis—that adapts source-domain models to target domains using only unpaired rain-free images, eliminating the need for costly paired rainy data collection.
CataractSAM-2 adapts Meta's Segment Anything Model 2 (SAM-2) for real-time semantic segmentation in cataract surgery videos. The core idea is to fine-tune only the prompt encoder and mask decoder while freezing the image encoder, enabling precise segmentation of anatomical structures and surgical instruments under challenging conditions like glare and occlusion. The paper also introduces an interactive annotation framework that propagates sparse user prompts across video frames to accelerate ground-truth generation.
This paper tackles SAR (Synthetic Aperture Radar) automatic target recognition under coherent speckle noise. It proposes FSCE, a framework combining frequency-domain wavelet decomposition with spatial multi-scale convolutions in a shallow feature enhancement module (DSAF), guided by online knowledge distillation from a ResNet101 teacher. The work matters because SAR imagery suffers from multiplicative speckle noise that obscures target features. However, the claimed improvements appear marginal on benchmarks that are already saturated.
Multi-modal tracking suffers from scarce paired training data, forcing reliance on RGB pre-trained models with lightweight fine-tuning. PATrack proposes a progressive adaptation framework using three complementary adapters—Modality-Dependent (MDA), Cross-Modality Entangled (CEA), and Head Adaptation (HA)—to bridge the domain gap between RGB and auxiliary modalities (Thermal, Depth, Event) at the intra-modal, inter-modal, and task levels. The approach decomposes features into frequency bands and uses fusion-guided cross-attention, yielding state-of-the-art results on LasHeR, RGBT234, and VisEvent benchmarks.
Phase unwrapping recovers absolute interferometric phase from observations wrapped modulo $2\pi$, but existing approaches fail near surface-breaking faults, which create abrupt discontinuities, and in large-scale scenes that exceed GPU memory. This work proposes a diffusion-based framework that conditions on SNAPHU estimates and processes large interferograms via overlapping 256$\times$256 tiles with weighted averaging. It claims to handle fault-related phase jumps and scale to real-world Sentinel-1 interferograms without resizing.
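The tiling scheme can be sketched as follows. The stride of 192 and the Hann-shaped blending window are assumptions, since the summary only specifies overlapping 256×256 tiles with weighted averaging. The `process` argument stands in for the per-tile model and is hypothetical.

```python
import numpy as np

def tile_starts(size, tile, stride):
    """Start offsets so that tiles cover [0, size), last tile edge-aligned."""
    if size <= tile:
        return [0]
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] != size - tile:
        starts.append(size - tile)
    return starts

def blend_tiles(image, process, tile=256, stride=192):
    """Apply `process` per tile and merge overlaps by weighted averaging.

    Assumes stride <= tile so every pixel is covered at least once.
    """
    h, w = image.shape
    th, tw = min(tile, h), min(tile, w)
    # Strictly positive separable window (Hann without its zero endpoints).
    window = np.outer(np.hanning(th + 2)[1:-1], np.hanning(tw + 2)[1:-1])
    out = np.zeros((h, w))
    norm = np.zeros((h, w))
    for y in tile_starts(h, th, stride):
        for x in tile_starts(w, tw, stride):
            patch = process(image[y:y + th, x:x + tw])
            out[y:y + th, x:x + tw] += window * patch
            norm[y:y + th, x:x + tw] += window
    return out / norm
```

Because the window down-weights tile borders, predictions near a tile edge contribute less than predictions from a neighboring tile where the same pixel lies closer to the center, which reduces visible seams.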
Thyroid ultrasound reporting requires joint assessment of nodule boundaries and TI-RADS risk categories, yet annotator variability creates inconsistent supervision that destabilizes standard multitask learning. This paper proposes RLAR (Representation-Level Adversarial Regularization), which uses normalized adversarial directions in latent space as geometric probes of task sensitivity and penalizes excessive angular alignment between task gradients to control negative transfer. Combined with a clinically guided embedding that distills TI-RADS-aligned radiomics targets during training, the framework aims to stabilize joint segmentation and classification while grounding predictions in interpretable evidence.
Forward-looking sonar images suffer from severe speckle noise, acoustic shadows, and energy attenuation that break standard semi-supervised teacher-student frameworks. This paper proposes CTFS, a collaborative multi-teacher architecture where one general teacher and two sonar-specific teachers (simulating acoustic shadows and energy decay) alternate to guide a student model. A cross-teacher reliability assessment mechanism filters noisy pseudo-labels by measuring prediction consistency across teacher views. The work matters because sonar annotation is expensive and existing methods fail when fewer than 10% of the training images are labeled, owing to domain mismatch.
Domain Elastic Transform (DET) addresses the registration of high-dimensional vector-valued functions on irregular, sparse manifolds—a critical bottleneck in spatial transcriptomics where gene expression data resides on scattered cell positions rather than regular grids. The core idea is a Bayesian framework that treats registration as elastic domain deformation guided by a joint spatial-functional likelihood, bypassing the lossy voxelization required by image-based methods while exploiting functional signals that pure geometric point-set registration ignores. This matters because it enables training-free analysis of massive atlases (e.g., MERFISH, Stereo-seq) without sacrificing single-cell resolution.
QMoP tackles the computational bottleneck in multimodal LLMs caused by excessive visual tokens, which dwarf text tokens in memory and compute costs. The paper proposes a Query Guided Mixture-of-Projector that dynamically combines three compression strategies—pooling for global semantics, resampling for high-level features, and pruning for fine-grained details—via a learned router. This adaptive approach matters because fixed compression rules inherently sacrifice different information types (global context vs. local details) depending on the task.
Current multimodal large language models rely on expensive annotated data or teacher distillation for reasoning improvements. This paper proposes an unsupervised self-evolution framework that trains without ground-truth labels or external reward models by instantiating two roles: an Actor that generates multiple reasoning trajectories and a frozen Judge that modulates consistency-based rewards. The method models rewards group-wise using Group Relative Policy Optimization (GRPO), converting absolute scores into relative advantages, and reports absolute accuracy gains of up to +5.9 on MathVision while maintaining healthier training entropy than majority-voting baselines.
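The conversion from absolute scores to relative advantages follows the standard GRPO normalization, sketched below for a single group of trajectories sampled for one prompt. The small `eps` term is an assumption added to avoid division by zero when all rewards in a group are equal.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled trajectories.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

For rewards `[1, 0, 0, 1]` the advantages are approximately `[1, -1, -1, 1]`. A group with identical rewards yields zero advantages, so it contributes no gradient signal.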
Sonny tackles the compute barrier in medium-range weather forecasting by proposing a hierarchical transformer that trains on a single A40 GPU in 5.5 days. The core idea is a two-stage StepsNet pipeline: a narrow 'slow path' processes large-scale dynamics (U,V,Z,P) first, then a full-width 'fast path' integrates thermodynamics (T,Q). Combined with EMA during training, randomized dynamics forecasting, and pressure-weighted losses, Sonny aims to deliver competitive forecast skill without the TPU/GPU cluster requirements of models like Pangu-Weather or GraphCast.
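Two of the training ingredients can be sketched in NumPy. The EMA update is the standard exponential moving average of weights. The loss assumes per-level weights proportional to pressure, a weighting used in some prior forecasting models; whether Sonny uses this exact form is not stated in the summary, and the function names are hypothetical.

```python
import numpy as np

def ema_update(ema_weights, weights, decay=0.999):
    """One exponential-moving-average step over model weights."""
    return decay * ema_weights + (1.0 - decay) * weights

def pressure_weighted_mse(pred, target, pressure_levels_hpa):
    """MSE with per-level weights proportional to pressure (assumed form).

    pred and target have shape (levels, H, W); weights sum to one.
    """
    p = np.asarray(pressure_levels_hpa, dtype=float)
    level_weights = p / p.sum()
    per_level_mse = np.mean((pred - target) ** 2, axis=(1, 2))
    return float(np.sum(level_weights * per_level_mse))
```

Under this weighting, errors near the surface (high pressure) count more than errors in the upper atmosphere, which matches the intuition that lower levels carry more mass.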
Existing counterfactual image generation methods produce either global changes or require tedious user-defined masks. This paper proposes Positional Seg-CFT, which subdivides anatomical structures into regional segments (e.g., proximal, mid, distal) and derives independent measurements per region from pretrained segmentors. The extension enables spatially localized interventions for modeling regional disease progression, demonstrated on coronary CT angiography.
This paper tackles Practical Test-Time Adaptation (PTTA), where models must adapt to temporally correlated, non-i.i.d. test streams without source data. Unlike prior work that stores samples in a single pool, the authors propose Multi-Cluster Memory (MCM), which organizes memory into multiple clusters based on pixel-level descriptors. The core insight, validated via Gaussian Mixture Model analysis, is that PTTA streams are inherently multi-modal (optimal K* ≈ 6–10), making single-cluster memory structurally mismatched. MCM introduces descriptor-based assignment, Adjacent Cluster Consolidation (ACC), and Uniform Cluster Retrieval (UCR), achieving consistent gains of up to 12.13% on DomainNet.
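One plausible reading of Uniform Cluster Retrieval is to draw the same number of samples from every non-empty cluster, so that no single mode of the stream dominates the adaptation batch. The sketch below implements that reading only. The data layout and function name are hypothetical, and the paper's actual retrieval rule may differ.

```python
import random

def uniform_cluster_retrieval(clusters, batch_size, seed=0):
    """Draw an equal share of the batch from each non-empty cluster.

    clusters maps a cluster id to a list of stored samples. A cluster
    with fewer samples than its share contributes all it has, so the
    returned batch may be smaller than batch_size.
    """
    rng = random.Random(seed)
    non_empty = [samples for samples in clusters.values() if samples]
    if not non_empty:
        return []
    per_cluster = batch_size // len(non_empty)
    batch = []
    for samples in non_empty:
        k = min(per_cluster, len(samples))
        batch.extend(rng.sample(samples, k))
    return batch
```

With a memory of 100 samples in one cluster and 10 in another, a batch of 8 contains 4 samples from each, whereas sampling from a single pooled memory would favor the larger cluster roughly ten to one.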
This paper addresses temporal action localization (TAL) for distracted driver behaviors in untrimmed in-cabin videos, a critical task for intelligent transportation systems. The authors propose a two-stage framework combining VideoMAE-based feature extraction with an Augmented Self-Mask Attention (AMA) detector enhanced by a Spatial Pyramid Pooling-Fast (SPPF) module for multi-scale temporal modeling. The work targets deployment scenarios such as fleet management and transportation safety checkpoints, aiming to balance accuracy against computational constraints.