3D fragment reassembly becomes challenging at scale because incorrect contact adjacencies trigger cascading failures. This paper proposes SARe, a generative framework that explicitly models contact structure by jointly predicting fracture-surface tokens and inter-fragment adjacency graphs, paired with an inference-time refinement stage that anchors reliable substructures to correct uncertain regions. The work demonstrates state-of-the-art results across synthetic and real fracture datasets, with notable improvements in the regime with many fragments (large $K$).
This paper tackles unregistered hyperspectral-multispectral image fusion (HMF), where spatially misaligned images with partial overlap must be mutually super-resolved without training data or co-registration. The authors propose FRESCO, a two-stage unsupervised framework that uses coupled block-term tensor decomposition (BTD) for MSI spectral super-resolution and latent-space adversarial learning for HSI spatial super-resolution. The work is notable for offering the first theoretical recoverability guarantees in the unregistered setting, addressing a practically important gap in remote sensing.
Video facial expression recognition (FER) suffers from severe subject-specific distribution shifts that degrade CLIP model performance at test time. This paper proposes TTA-CaP, a gradient-free test-time adaptation method that personalizes models using three coordinated caches—a fixed source-domain prototype cache, a dynamic positive target cache for reliable samples, and a negative cache for uncertain predictions—coupled with a tri-gate filtering mechanism to prevent error accumulation.
DepthTCM tackles depth map compression by combining physics-inspired Multiwavelength Depth (MWD) encoding—mapping depth to sinusoidal 3-channel images—with global 4-bit quantization and a Transformer-CNN mixed learned codec. The core claim is that this hybrid approach reshapes depth statistics into a form amenable to modern learned image compression, achieving 60% bitrate reduction over prior MWD methods while maintaining >99% geometric accuracy.
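The depth-to-sinusoid mapping can be illustrated with a minimal sketch. It assumes the commonly cited multiwavelength depth formulation, in which two channels carry the sine and cosine of depth at a fringe period and the third carries normalized depth for phase unwrapping. The paper's exact channel layout, its 4-bit quantization, and the learned codec are not reproduced here, and the names `mwd_encode`, `mwd_decode`, `period`, and `depth_range` are illustrative.

```python
import numpy as np

def mwd_encode(depth, period, depth_range):
    """Map depth values in [0, depth_range) to a 3-channel image in [0, 1]."""
    phase = 2.0 * np.pi * depth / period
    ch_sin = 0.5 + 0.5 * np.sin(phase)      # fine detail, wraps every `period`
    ch_cos = 0.5 + 0.5 * np.cos(phase)      # quadrature partner of ch_sin
    ch_coarse = depth / depth_range         # coarse depth, used only to unwrap
    return np.stack([ch_sin, ch_cos, ch_coarse], axis=-1)

def mwd_decode(image, period, depth_range):
    """Recover depth: wrapped phase from sin/cos, fringe order from the coarse channel."""
    ch_sin, ch_cos, ch_coarse = image[..., 0], image[..., 1], image[..., 2]
    wrapped = np.arctan2(ch_sin - 0.5, ch_cos - 0.5)           # in (-pi, pi]
    coarse_phase = 2.0 * np.pi * ch_coarse * depth_range / period
    order = np.round((coarse_phase - wrapped) / (2.0 * np.pi))  # integer fringe order
    return (wrapped + 2.0 * np.pi * order) * period / (2.0 * np.pi)
```

In this formulation the coarse channel only needs to be accurate to within half a fringe period, which is one plausible reason an encoding of this kind tolerates aggressive quantization and lossy coding.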
The paper addresses the quadratic complexity of transformer attention and limited local detail extraction in RGB-D Salient Object Detection (SOD). It proposes STENet, which introduces superpixels as intermediate tokens to reduce computational overhead while preserving structural coherence. The core idea replaces global pixel-to-pixel attention with two modules: one for pixel-to-superpixel global enhancement and another for intra-superpixel local refinement, aiming to balance efficiency and accuracy.
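The efficiency argument can be sketched as follows. This is a simplified illustration, not STENet's actual modules: it assumes a precomputed superpixel label map (for example from SLIC), uses mean pooling to form tokens, and omits the learned query, key, and value projections as well as the intra-superpixel refinement step. All function names are hypothetical.

```python
import numpy as np

def superpixel_tokens(features, labels, num_superpixels):
    """Mean-pool pixel features (N, C) into superpixel tokens (S, C)."""
    sums = np.zeros((num_superpixels, features.shape[1]))
    np.add.at(sums, labels, features)   # accumulate features per superpixel
    counts = np.bincount(labels, minlength=num_superpixels).astype(float)
    return sums / np.maximum(counts, 1.0)[:, None]

def pixel_to_superpixel_attention(features, tokens):
    """Each pixel attends over S tokens: an (N, S) score matrix, not (N, N)."""
    scores = features @ tokens.T / np.sqrt(features.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights
```

With N pixels and S superpixels, the score matrix has N·S entries instead of N², so the saving grows as S becomes small relative to N.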
BanglaVerse introduces a culturally grounded benchmark evaluating vision-language models on Bengali culture across standard Bangla, four historically linked languages, and five regional dialects. Built from 1,152 manually curated images expanded to ~32.3K artifacts, the work reveals that standard Bangla evaluation substantially overestimates model capabilities compared to dialectal settings. The core finding—that missing cultural knowledge, not visual grounding alone, is the primary bottleneck—challenges conventional multimodal evaluation practices for underrepresented languages.
Video subtitle removal traditionally requires expensive per-frame mask annotations and external detection modules during both training and inference. CLEAR introduces a two-stage mask-free framework that decouples prior extraction (via self-supervised disentangled feature learning) from generative refinement (via LoRA-adapted diffusion with adaptive weighting). The authors report training only 0.77% of the base model's parameters while achieving PSNR gains of +6.77 dB and zero-shot generalization across six languages, with no ground-truth masks needed at inference.
daVinci-MagiHuman tackles joint audio-video generation with a single-stream Transformer that processes text, video, and audio tokens through self-attention only, avoiding the cross-attention and fusion modules common in competing multi-stream architectures. The model achieves strong human-centric generation quality across six languages and reports fast inference: 2 seconds for a 5-second 256p video on an H100.
Beta-KD tackles the problem of balancing data supervision against teacher guidance when distilling multimodal large language models. The authors frame knowledge distillation as Bayesian MAP estimation with teacher-informed Gibbs priors over student activations, deriving a closed-form uncertainty-aware weighting mechanism via Laplace approximation. This eliminates manual tuning of loss weights and achieves consistent improvements across six VQA benchmarks.
This paper addresses hypertension screening from inexpensive retinal fundus images by distilling knowledge from high-fidelity brain MRI—without requiring paired acquisitions from the same patients. The proposed Clinical Graph-Mediated Distillation (CGMD) constructs a clinical similarity graph using shared biomarkers (age, labs, etc.) to bridge disjoint MRI and fundus cohorts, propagates MRI teacher embeddings over the graph to impute patient-specific targets for fundus patients, and trains a fundus student with supervised, prior, and relational distillation losses. The approach aims to capture subtle vascular signals in fundus images by leveraging MRI-derived markers of small-vessel disease.
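The graph-mediated imputation step can be pictured with a small sketch. It assumes a single-hop, Gaussian-kernel k-nearest-neighbour graph from each fundus patient to the MRI cohort over standardized biomarkers; the paper's actual graph construction and propagation rule may differ, and `impute_teacher_targets`, `k`, and `bandwidth` are hypothetical names.

```python
import numpy as np

def impute_teacher_targets(fundus_clinical, mri_clinical, mri_embeddings,
                           k=3, bandwidth=1.0):
    """Impute a teacher target for each fundus patient from its k clinically
    nearest MRI patients (assumes biomarkers are already standardized)."""
    diff = fundus_clinical[:, None, :] - mri_clinical[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)                  # (n_fundus, n_mri)
    neighbours = np.argsort(dist2, axis=1)[:, :k]       # k nearest MRI patients
    rows = np.arange(dist2.shape[0])[:, None]
    weights = np.zeros_like(dist2)
    weights[rows, neighbours] = np.exp(
        -dist2[rows, neighbours] / (2.0 * bandwidth ** 2))
    weights = weights / weights.sum(axis=1, keepdims=True)  # rows sum to 1
    return weights @ mri_embeddings                     # (n_fundus, embed_dim)
```

Because each row of weights is a convex combination, every imputed target stays inside the range spanned by the MRI teacher embeddings of its neighbours.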
HMS-VesselNet addresses the challenge of segmenting thin peripheral retinal vessels in fundus images—a critical task for early diabetic retinopathy detection where standard overlap losses fail due to class imbalance and topological fragmentation. The paper proposes a four-scale hierarchical Attention U-Net architecture with learned fusion weights, combining Dice, binary cross-entropy, and centerline Dice ($\text{clDice}$) losses alongside hard example mining to boost sensitivity on sub-2-pixel vessels. Evaluated on 68 images from DRIVE, STARE, and CHASE_DB1 via 5-fold cross-validation and leave-one-dataset-out protocols, the model achieves $90.78\pm1.42\%$ Sensitivity, demonstrating that explicit topology preservation and targeted hard example oversampling can recover fine vascular structures missed by standard area-based losses.
Vision-Language Models face escalating safety risks from adversarial jailbreak attacks that bypass alignment via manipulated visual inputs. This paper introduces NullSteer, a training-free defense that applies activation steering constrained to the null space of benign representations—mathematically guaranteeing that safe inputs remain unchanged while harmful activations are redirected toward refusal semantics. The approach aims to solve the over-refusal problem plaguing existing steering methods, offering a principled trade-off between robust safety and preserved utility.
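The null-space guarantee can be checked numerically with a minimal sketch. It assumes a linear steering map applied after projecting onto the orthogonal complement of the span of benign activations; how NullSteer estimates that subspace and derives its refusal direction is not shown, and `null_space_projector`, `steer`, and `steering_map` are hypothetical names.

```python
import numpy as np

def null_space_projector(benign, tol=1e-8):
    """Projector onto directions orthogonal to every benign activation.
    `benign` has shape (n, d), one activation per row."""
    _, s, vt = np.linalg.svd(benign, full_matrices=True)
    rank = int(np.sum(s > tol * s.max())) if s.size and s.max() > 0 else 0
    null_basis = vt[rank:]              # (d - rank, d) orthonormal rows
    return null_basis.T @ null_basis    # (d, d), symmetric and idempotent

def steer(h, steering_map, projector):
    """Add a steering offset that depends only on the null-space part of h."""
    return h + steering_map @ (projector @ h)
```

Any activation inside the benign span has a zero null-space component, so it passes through unchanged, while an activation with an out-of-span component is shifted. In practice benign activations rarely span an exact low-rank subspace, so a real system would need a tolerance or a truncated estimate of that subspace.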
Most visual counting benchmarks focus on rigid objects like crowds and vehicles, leaving fine-grained biological counting understudied. This paper introduces TPC–268, a dataset of 10,000 images spanning 268 countable plant categories across 242 species, annotated with full Linnaean taxonomies and biological organization levels. By framing plant counting as class-agnostic counting with taxonomic constraints, the authors provide a testbed for evaluating hierarchical generalization in vision models.
This paper tackles Mandarin Chinese visual speech recognition (VSR), where the tonal nature of the language and its large vocabulary make lipreading more challenging than for non-tonal languages like English. Existing approaches use cascade architectures with intermediate representations such as pinyin to bridge the gap, but this introduces error accumulation and increases inference latency. The core idea is a cascade-free multitask architecture that jointly learns phoneme and viseme representations during training, with on-demand activation during inference for efficiency-accuracy trade-offs. This matters because cascade-free designs could eliminate error propagation while retaining the benefits of intermediate representations.
Multi-focus image fusion (MFIF) combines source images from different focal planes into a single all-in-focus image. This paper targets a critical flaw in diffusion-based MFIF: defocus blur warps geometric structures, producing artifacts. The authors propose ReDiffuse, which embeds B-Conv (Fourier-series-based rotation-equivariant filters) into a U-Net diffusion backbone. By enforcing that rotations induce predictable feature transformations, the method aims to preserve edge orientation and structural consistency while reducing model size through parameter sharing.
Whole Slide Images (WSIs) present a unique challenge for computational pathology due to their gigapixel scale and the scarcity of annotated data. This paper addresses few-shot weakly supervised WSI classification (FSWC) by proposing HIPSS, which combines parameter-efficient prompt tuning via Scaling and Shifting Features (SSF) in the text encoder with a hierarchical textual guidance strategy for WSI representation learning. The core innovation replaces expensive cross-attention mechanisms with lightweight linear transformations $y = \gamma \cdot x + \beta$ while avoiding hard instance filtering through soft cosine-similarity-based attention refinement, achieving up to 13.8% accuracy gains with 18.1% fewer parameters than state-of-the-art methods.
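The transformation $y = \gamma \cdot x + \beta$ is small enough to write out directly. The sketch below assumes per-channel parameters with an identity initialization (scale one, shift zero); where HIPSS places these layers inside the text encoder and how it initializes them are not specified here, and the class name `ScaleShift` is illustrative.

```python
import numpy as np

class ScaleShift:
    """Per-channel scale and shift: y = gamma * x + beta."""

    def __init__(self, dim):
        self.gamma = np.ones(dim)   # assumed init: leaves features unchanged
        self.beta = np.zeros(dim)

    def __call__(self, x):
        # x has shape (..., dim); gamma and beta broadcast over leading axes
        return self.gamma * x + self.beta

    def num_parameters(self):
        return self.gamma.size + self.beta.size
```

A layer like this adds only 2·dim trainable values per insertion point, compared with the dim² or more needed by a dense projection or an attention block of the same width.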
Most road extraction benchmarks focus on binary segmentation, lacking the hierarchical attributes critical for transport infrastructure planning and management. This paper introduces SYSU-HiRoads, a large-scale dataset spanning 3,631 km² with aligned pixel masks, vector centerlines, and three-level road grades, alongside RoadReasoner—a framework that combines frequency-domain feature extraction with vision-language models to infer road hierarchy from geometric descriptors. The work bridges a significant gap in automated mapping by moving beyond "where are the roads" to "what roles do these roads play."
This paper addresses privacy-preserving facial expression recognition (FER) in video without requiring identity labels—a critical gap since real-world deployment often lacks identity annotations. The core idea leverages intra- and inter-video knowledge priors to train an identity suppression network followed by a denoising module, enabling open-set privacy preservation. This matters because current methods either require closed-set identity supervision or suffer from entangled privacy-utility trade-offs that degrade performance.
The paper tackles semi-supervised 3D rotation regression from monocular images, addressing the rigidity of fixed entropy thresholds in pseudo-label filtering used by prior work like FisherMatch. It proposes HACMatch, a hardness-aware curriculum learning framework that dynamically selects unlabeled samples by difficulty using either multi-stage or adaptive strategies, paired with PoseMosaic, a patch-based augmentation that applies diverse transformations while preserving geometric integrity. This matters because rotation annotations are expensive to obtain, and effectively leveraging unlabeled data could reduce costs for autonomous driving and robotics applications.
Federated video action recognition faces a dual challenge: gradient sharing risks leaking sensitive motion patterns, while synchronizing high-dimensional video models incurs prohibitive bandwidth costs. This paper proposes FedDP-STECAR, which selectively fine-tunes only task-relevant layers under differential privacy and transmits only those layers, claiming over 99% communication reduction alongside strong privacy guarantees ($\epsilon \leq 1.33$). The work matters for enabling practical privacy-preserving video analysis in healthcare and surveillance where data cannot be centralized.
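The two ingredients, selective transmission and a differentially private update, can be sketched together. This assumes the standard clip-then-add-Gaussian-noise mechanism applied to the selected layers only; it does not compute $\epsilon$, which requires a separate privacy accountant, and it is not the paper's layer-selection rule. The function names and the layer names in the usage note are hypothetical.

```python
import numpy as np

def privatize_update(update, selected, clip_norm, noise_multiplier, rng):
    """Keep only the selected layers, clip their joint L2 norm to clip_norm,
    and add Gaussian noise with std noise_multiplier * clip_norm."""
    chosen = {name: update[name] for name in selected}
    total_norm = np.sqrt(sum(np.sum(v ** 2) for v in chosen.values()))
    scale = min(1.0, clip_norm / (total_norm + 1e-12))
    sigma = noise_multiplier * clip_norm
    return {name: v * scale + rng.normal(0.0, sigma, size=v.shape)
            for name, v in chosen.items()}

def communication_fraction(update, selected):
    """Fraction of all parameters that is actually transmitted."""
    total = sum(v.size for v in update.values())
    return sum(update[name].size for name in selected) / total
```

For example, with a 10,000-parameter `"backbone"` and a 100-parameter `"head"`, sending only the head transmits 100 of 10,100 values, which is a reduction of about 99%.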