Feed - arxlens

SARe: Structure-Aware Large-Scale 3D Fragment Reassembly

cs.CV Hanze Jia, Chunshi Wang, Yuxiao Yang et al. · Mar 23, 2026

3D fragment reassembly becomes challenging at scale because incorrect contact adjacencies trigger cascading failures. This paper proposes SARe, a generative framework that explicitly models contact structure by jointly predicting fracture-surface tokens and inter-fragment adjacency graphs, paired with an inference-time refinement stage that anchors reliable substructures to correct uncertain regions. The work demonstrates state-of-the-art results across synthetic and real fracture datasets, with notable improvements in the many-fragment (large-$K$) regime.

3D fragment reassembly aims to recover the rigid poses of unordered fragment point clouds or meshes in a common object coordinate system to reconstruct the complete shape. The problem becomes particularly challenging as the number of fragments grows, since the target shape is unknown and fragments provide weak semantic cues. Existing end-to-end approaches are prone to cascading failures due to unreliable contact reasoning, most notably inaccurate fragment adjacencies. To address this, we propose Structure-Aware Reassembly (SARe), a generative framework with SARe-Gen for Euclidean-space assembly generation and SARe-Refine for inference-time refinement, with explicit contact modeling. SARe-Gen jointly predicts fracture-surface token probabilities and an inter-fragment contact graph to localize contact regions and infer candidate adjacencies. It adopts a query-point-based conditioning scheme and extracts aligned local geometric tokens at query locations from a frozen geometry encoder, yielding queryable structural representations without additional structural pretraining. We further introduce an inference-time refinement stage, SARe-Refine. By verifying candidate contact edges with geometric-consistency checks, it selects reliable substructures and resamples the remaining uncertain regions while keeping verified parts fixed, leading to more stable and consistent assemblies in the many-fragment regime. We evaluate SARe across three settings, including synthetic fractures, simulated fractures from scanned real objects, and real physically fractured scans. The results demonstrate state-of-the-art performance, with more graceful degradation and higher success rates as the fragment count increases in challenging large-scale reassembly.

Unregistered Spectral Image Fusion: Unmixing, Adversarial Learning, and Recoverability

eess.IV cs.CV Jiahui Song, Sagar Shrestha, Xiao Fu · Mar 23, 2026

This paper tackles unregistered hyperspectral-multispectral image fusion (HMF), where spatially misaligned images with partial overlap must be mutually super-resolved without training data or co-registration. The authors propose FRESCO, a two-stage unsupervised framework that uses coupled block-term tensor decomposition (BTD) for MSI spectral super-resolution and latent-space adversarial learning for HSI spatial super-resolution. The work is notable for offering the first theoretical recoverability guarantees in the unregistered setting, addressing a practically important gap in remote sensing.

This paper addresses the fusion of a pair of spatially unregistered hyperspectral image (HSI) and multispectral image (MSI) covering roughly overlapping regions. HSIs offer high spectral but low spatial resolution, while MSIs provide the opposite. The goal is to integrate their complementary information to enhance both HSI spatial resolution and MSI spectral resolution. While hyperspectral-multispectral fusion (HMF) has been widely studied, the unregistered setting remains challenging. Many existing methods focus solely on MSI super-resolution, leaving HSI unchanged. Supervised deep learning approaches were proposed for HSI super-resolution, but rely on accurate training data, which is often unavailable. Moreover, theoretical analyses largely address the co-registered case, leaving unregistered HMF poorly understood. In this work, an unsupervised framework is proposed to simultaneously super-resolve both MSI and HSI. The method integrates coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution. Theoretical guarantees on the recoverability of the super-resolution MSI and HSI are established under reasonable generative models -- providing, to our best knowledge, the first such insights for unregistered HMF. The approach is validated on semi-real and real HSI-MSI pairs across diverse conditions.

Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos

cs.CV Masoumeh Sharafi, Muhammad Osama Zeeshan, Soufiane Belharbi et al. · Mar 22, 2026

Video facial expression recognition (FER) suffers from severe subject-specific distribution shifts that degrade CLIP model performance at test time. This paper proposes TTA-CaP, a gradient-free test-time adaptation method that personalizes models using three coordinated caches—a fixed source-domain prototype cache, a dynamic positive target cache for reliable samples, and a negative cache for uncertain predictions—coupled with a tri-gate filtering mechanism to prevent error accumulation.

Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions. Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.

DepthTCM: High Efficient Depth Compression via Physics-aware Transformer-CNN Mixed Architecture

cs.CV Young-Seo Chang, Yatong An, Jae-Sang Hyun · Mar 22, 2026

DepthTCM tackles depth map compression by combining physics-inspired Multiwavelength Depth (MWD) encoding—mapping depth to sinusoidal 3-channel images—with global 4-bit quantization and a Transformer-CNN mixed learned codec. The core claim is that this hybrid approach reshapes depth statistics into a form amenable to modern learned image compression, achieving 60% bitrate reduction over prior MWD methods while maintaining >99% geometric accuracy.

We propose DepthTCM, a physics-aware end-to-end framework for depth map compression. In DepthTCM, the high-bit depth map is first losslessly converted to a conventional 3-channel image representation using a method inspired by physical sinusoidal fringe-pattern profilometry systems; the 3-channel color image is then encoded and decoded by a recently developed Transformer-CNN mixed neural network architecture. Specifically, DepthTCM maps depth to a smooth 3-channel image using multiwavelength depth (MWD) encoding, globally quantizes the MWD-encoded representation to 4 bits per channel to reduce entropy, and finally compresses it using a learned codec that combines convolutional and Transformer layers. Experimental results demonstrate the advantage of our proposed method. On Middlebury 2014, DepthTCM reaches 0.307 bpp while preserving 99.38% accuracy, a level of fidelity commensurate with lossless PNG. We additionally demonstrate practical efficiency and scalability, reporting average end-to-end inference times of 41.48 ms (encoder) and 47.45 ms (decoder) on the ScanNet++ iPhone RGB-D subset. Ablations validate our design choices: relative to 8-bit quantization, 4-bit quantization reduces bitrate by 66% while maintaining comparable reconstruction quality, with only a marginal 0.68 dB PSNR change and a 0.04% accuracy difference. In addition, Transformer-CNN blocks further improve PSNR by up to 0.75 dB over CNN-only architectures.
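
The encode-then-quantize step can be illustrated with a minimal sketch. It assumes a common multiwavelength depth formulation (sine and cosine of depth at a fringe period, plus normalized depth as a coarse channel); the abstract does not state the exact formulas, so `period`, `z_max`, the uniform quantizer, and all function names here are hypothetical. The learned Transformer-CNN codec is omitted.

```python
import math

def mwd_encode(z, z_max, period):
    """Map one depth value to three channels in [0, 1] (assumed MWD form)."""
    phase = 2.0 * math.pi * z / period
    return (0.5 + 0.5 * math.sin(phase),   # fine channel 1
            0.5 + 0.5 * math.cos(phase),   # fine channel 2
            z / z_max)                     # coarse channel

def quantize(channel, bits=4):
    """Uniformly quantize a value in [0, 1] to an integer code of `bits` bits."""
    levels = (1 << bits) - 1
    return round(channel * levels)

def dequantize(code, bits=4):
    """Map an integer code back to a value in [0, 1]."""
    levels = (1 << bits) - 1
    return code / levels

def mwd_decode(i1, i2, i3, z_max, period):
    """Recover depth: the coarse channel selects the fringe order,
    the wrapped phase from the two fine channels refines it."""
    phase = math.atan2(i1 - 0.5, i2 - 0.5) % (2.0 * math.pi)
    fine = phase / (2.0 * math.pi) * period
    coarse = i3 * z_max
    order = round((coarse - fine) / period)
    return order * period + fine

# Unquantized round trip on one depth sample.
i1, i2, i3 = mwd_encode(1234.5, z_max=4096.0, period=256.0)
print(mwd_decode(i1, i2, i3, z_max=4096.0, period=256.0))
# 4-bit codes that a codec would then compress.
print([quantize(c) for c in (i1, i2, i3)])
```

In this sketch the unquantized round trip is exact up to floating-point error. Once the channels are quantized to 4 bits the recovered depth is only approximate, and how much error is tolerable depends on the chosen period; the paper's own numbers, not this sketch, are the reference for achievable accuracy.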

STENet: Superpixel Token Enhancing Network for RGB-D Salient Object Detection

cs.CV Jianlin Chen, Gongyang Li, Zhijiang Zhang et al. · Mar 23, 2026

The paper addresses the quadratic complexity of transformer attention and limited local detail extraction in RGB-D Salient Object Detection (SOD). It proposes STENet, which introduces superpixels as intermediate tokens to reduce computational overhead while preserving structural coherence. The core idea replaces global pixel-to-pixel attention with two modules: one for pixel-to-superpixel global enhancement and another for intra-superpixel local refinement, aiming to balance efficiency and accuracy.

Transformer-based methods for RGB-D Salient Object Detection (SOD) have gained significant interest, owing to the transformer's exceptional capacity to capture long-range pixel dependencies. Nevertheless, current RGB-D SOD methods face challenges, such as the quadratic complexity of the attention mechanism and limited local detail extraction. To overcome these limitations, we propose a novel Superpixel Token Enhancing Network (STENet), which introduces superpixels into cross-modal interaction. STENet follows the two-stream encoder-decoder structure. At its core are two tailored superpixel-driven cross-modal interaction modules, responsible for global and local feature enhancement. Specifically, we update the superpixel generation method by expanding the neighborhood range of each superpixel, allowing for flexible transformation between pixels and superpixels. With the updated superpixel generation method, we first propose the Superpixel Attention Global Enhancing Module to model the global pixel-to-superpixel relationship rather than the traditional global pixel-to-pixel relationship, which can capture region-level information and reduce computational complexity. We also propose the Superpixel Attention Local Refining Module, which leverages pixel similarity within superpixels to filter out a subset of pixels (i.e., local pixels) and then performs feature enhancement on these local pixels, thereby capturing the relevant local details. Furthermore, we fuse the globally and locally enhanced features along with the cross-scale features to achieve comprehensive feature representation. Experiments on seven RGB-D SOD datasets reveal that our STENet achieves competitive performance compared to state-of-the-art methods. The code and results of our method are available at https://github.com/Mark9010/STENet.
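
To make the complexity argument concrete, here is a minimal sketch of pixel-to-superpixel attention. It is not the authors' module: it assumes hard superpixel labels and mean-pooled tokens (the paper uses an updated superpixel generation method), and every function name is hypothetical. With N pixels and M superpixels, each pixel scores M tokens instead of N pixels, so the cost scales as N·M rather than N².

```python
import math

def superpixel_tokens(features, labels, num_superpixels):
    """Average pixel features within each superpixel: one token per region."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_superpixels)]
    counts = [0] * num_superpixels
    for feat, lab in zip(features, labels):
        counts[lab] += 1
        for d in range(dim):
            sums[lab][d] += feat[d]
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pixel_to_superpixel_attention(features, tokens):
    """Each pixel attends to the M superpixel tokens (scaled dot product)."""
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in features:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, tokens))
                    for d in range(dim)])
    return out

# Four pixels grouped into two superpixels.
feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
tokens = superpixel_tokens(feats, labels=[0, 0, 1, 1], num_superpixels=2)
enhanced = pixel_to_superpixel_attention(feats, tokens)
```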

Many Dialects, Many Languages, One Cultural Lens: Evaluating Multilingual VLMs for Bengali Culture Understanding Across Historically Linked Languages and Regional Dialects

cs.CL cs.CV Nurul Labib Sayeedi, Md. Faiyaz Abdullah Sayeedi, Shubhashis Roy Dipta et al. · Mar 22, 2026

BanglaVerse introduces a culturally grounded benchmark evaluating vision-language models on Bengali culture across standard Bangla, four historically linked languages, and five regional dialects. Built from 1,152 manually curated images expanded to ~32.3K artifacts, the work reveals that standard Bangla evaluation substantially overestimates model capabilities compared to dialectal settings. The core finding—that missing cultural knowledge, not visual grounding alone, is the primary bottleneck—challenges conventional multimodal evaluation practices for underrepresented languages.

Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.

CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

cs.CV Qingdong He, Chaoyi Wang, Peng Tang et al. · Mar 23, 2026

Video subtitle removal traditionally requires expensive per-frame mask annotations and external detection modules during both training and inference. CLEAR introduces a two-stage mask-free framework that decouples prior extraction (via self-supervised disentangled feature learning) from generative refinement (via LoRA-adapted diffusion with adaptive weighting). The method claims to train only 0.77% of base model parameters while achieving +6.77dB PSNR gains and zero-shot generalization across six languages without ground-truth masks at inference.

Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods necessitate explicit mask sequences during both training and inference phases, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.

Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model

cs.CV SII-GAIR, Sand.ai: Ethan Chern, Hansi Teng et al. · Mar 23, 2026

daVinci-MagiHuman tackles joint audio-video generation using a refreshingly simple single-stream Transformer that processes text, video, and audio tokens through self-attention only, avoiding the cross-attention and fusion modules common in competing multi-stream architectures. The model achieves strong human-centric generation quality across six languages while delivering impressive inference speed: 2 seconds for a 5-second 256p video on an H100.

We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The model is particularly strong in human-centric scenarios, producing expressive facial performance, natural speech-expression coordination, realistic body motion, and precise audio-video synchronization. It supports multilingual spoken generation across Chinese (Mandarin and Cantonese), English, Japanese, Korean, German, and French. For efficient inference, we combine the single-stream backbone with model distillation, latent-space super-resolution, and a Turbo VAE decoder, enabling generation of a 5-second 256p video in 2 seconds on a single H100 GPU. In automatic evaluation, daVinci-MagiHuman achieves the highest visual quality and text alignment among leading open models, along with the lowest word error rate (14.60%) for speech intelligibility. In pairwise human evaluation, it achieves win rates of 80.0% against Ovi 1.1 and 60.9% against LTX 2.3 over 2000 comparisons. We open-source the complete model stack, including the base model, the distilled model, the super-resolution model, and the inference codebase.

Uncertainty-Aware Knowledge Distillation for Multimodal Large Language Models

cs.CV Jingchen Sun, Shaobo Han, Deep Patel et al. · Mar 22, 2026

Beta-KD tackles the problem of balancing data supervision against teacher guidance when distilling multimodal large language models. The authors frame knowledge distillation as Bayesian MAP estimation with teacher-informed Gibbs priors over student activations, deriving a closed-form uncertainty-aware weighting mechanism via Laplace approximation. This eliminates manual tuning of loss weights and achieves consistent improvements across six VQA benchmarks.

Knowledge distillation establishes a learning paradigm that leverages both data supervision and teacher guidance. However, determining the optimal balance between learning from data and learning from the teacher is challenging, as some samples may be noisy while others are subject to teacher uncertainty. This motivates the need for adaptively balancing data and teacher supervision. We propose Beta-weighted Knowledge Distillation (Beta-KD), an uncertainty-aware distillation framework that adaptively modulates how much the student relies on teacher guidance. Specifically, we formulate teacher-student learning from a unified Bayesian perspective and interpret teacher supervision as a Gibbs prior over student activations. This yields a closed-form, uncertainty-aware weighting mechanism and supports arbitrary distillation objectives and their combinations. Extensive experiments on multimodal VQA benchmarks demonstrate that distilling student Vision-Language Models from a large teacher VLM consistently improves performance. The results show that Beta-KD outperforms existing knowledge distillation methods. The code is available at https://github.com/Jingchensun/beta-kd.

Clinical Graph-Mediated Distillation for Unpaired MRI-to-CFI Hypertension Prediction

cs.CV Dillan Imans, Phuoc-Nguyen Bui, Duc-Tai Le et al. · Mar 23, 2026

This paper addresses hypertension screening from inexpensive retinal fundus images by distilling knowledge from high-fidelity brain MRI—without requiring paired acquisitions from the same patients. The proposed Clinical Graph-Mediated Distillation (CGMD) constructs a clinical similarity graph using shared biomarkers (age, labs, etc.) to bridge disjoint MRI and fundus cohorts, propagates MRI teacher embeddings over the graph to impute patient-specific targets for fundus patients, and trains a fundus student with supervised, prior, and relational distillation losses. The approach aims to capture subtle vascular signals in fundus images by leveraging MRI-derived markers of small-vessel disease.

Retinal fundus imaging enables low-cost and scalable hypertension (HTN) screening, but HTN-related retinal cues are subtle, yielding high-variance predictions. Brain MRI provides stronger vascular and small-vessel-disease markers of HTN, yet it is expensive and rarely acquired alongside fundus images, resulting in modality-siloed datasets with disjoint MRI and fundus cohorts. We study this unpaired MRI-fundus regime and introduce Clinical Graph-Mediated Distillation (CGMD), a framework that transfers MRI-derived HTN knowledge to a fundus model without paired multimodal data. CGMD leverages shared structured biomarkers as a bridge by constructing a clinical similarity kNN graph spanning both cohorts. We train an MRI teacher, propagate its representations over the graph, and impute brain-informed representation targets for fundus patients. A fundus student is then trained with a joint objective combining HTN supervision, target distillation, and relational distillation. Experiments on our newly collected unpaired MRI-fundus-biomarker dataset show that CGMD consistently improves fundus-based HTN prediction over standard distillation and non-graph imputation baselines, with ablations confirming the importance of clinically grounded graph connectivity. Code is available at https://github.com/DillanImans/CGMD-unpaired-distillation.
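
The graph-mediated imputation step can be sketched as follows. This is a simplified, one-hop illustration rather than the authors' pipeline: it assumes Euclidean distance on already-normalized biomarker vectors and an unweighted mean over the k nearest MRI patients, and the function name and toy data are hypothetical.

```python
import math

def knn_impute(fundus_biomarkers, mri_biomarkers, mri_embeddings, k=2):
    """For each fundus patient, average the teacher embeddings of the k MRI
    patients that are closest in biomarker space (one-hop propagation)."""
    dim = len(mri_embeddings[0])
    targets = []
    for fb in fundus_biomarkers:
        # Rank MRI patients by clinical similarity to this fundus patient.
        ranked = sorted((math.dist(fb, mb), j)
                        for j, mb in enumerate(mri_biomarkers))
        neighbors = [j for _, j in ranked[:k]]
        targets.append([sum(mri_embeddings[j][d] for j in neighbors) / len(neighbors)
                        for d in range(dim)])
    return targets

# Toy cohorts: three MRI patients with teacher embeddings, one fundus patient.
mri_bio = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
mri_emb = [[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]
imputed = knn_impute([[0.4, 0.0]], mri_bio, mri_emb, k=2)
print(imputed)  # target for the fundus patient, built from its two nearest MRI patients
```

The imputed vectors would then serve as distillation targets for the fundus student alongside the supervised HTN loss; the relational distillation term described above is not covered by this sketch.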

HMS-VesselNet: Hierarchical Multi-Scale Attention Network with Topology-Preserving Loss for Retinal Vessel Segmentation

eess.IV cs.CV Amarnath R · Mar 23, 2026

HMS-VesselNet addresses the challenge of segmenting thin peripheral retinal vessels in fundus images—a critical task for early diabetic retinopathy detection where standard overlap losses fail due to class imbalance and topological fragmentation. The paper proposes a four-scale hierarchical Attention U-Net architecture with learned fusion weights, combining Dice, binary cross-entropy, and centerline Dice ($\text{clDice}$) losses alongside hard example mining to boost sensitivity on sub-2-pixel vessels. Evaluated on 68 images from DRIVE, STARE, and CHASE_DB1 via 5-fold cross-validation and leave-one-dataset-out protocols, the model achieves $90.78\pm1.42\%$ Sensitivity, demonstrating that explicit topology preservation and targeted hard example oversampling can recover fine vascular structures missed by standard area-based losses.

Retinal vessel segmentation methods based on standard overlap losses tend to miss thin peripheral vessels because these structures occupy very few pixels and have low contrast against the background. We propose HMS-VesselNet, a hierarchical multi-scale network that processes fundus images across four parallel branches at different resolutions and combines their outputs using learned fusion weights. The training loss combines Dice, binary cross-entropy, and centerline Dice to jointly optimize area overlap and vessel continuity. Hard example mining is applied from epoch 20 onward to concentrate gradient updates on the most difficult training images. Tested on 68 images from DRIVE, STARE, and CHASE_DB1 using 5-fold cross-validation, the model achieves a mean Dice of 88.72 +/- 0.67%, Sensitivity of 90.78 +/- 1.42%, and AUC of 98.25 +/- 0.21%. In leave-one-dataset-out experiments, AUC remains above 95% on each unseen dataset. The largest improvement is in the recall of thin peripheral vessels, which are the structures most frequently missed by standard methods and most critical for early detection of diabetic retinopathy.
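
As a rough illustration of the area-overlap part of the training loss, the sketch below combines soft Dice and binary cross-entropy over flattened per-pixel probabilities. The centerline Dice term is deliberately left out because it needs a soft-skeletonization step, and the weights `w_dice` and `w_bce` are hypothetical; the abstract does not give the weighting.

```python
import math

def soft_dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient, computed on probabilities rather than hard masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; predictions are clamped to avoid log(0)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

def combined_loss(pred, target, w_dice=0.5, w_bce=0.5):
    """Weighted sum of the two area-based terms (weights are assumptions)."""
    return w_dice * soft_dice_loss(pred, target) + w_bce * bce_loss(pred, target)

# A confident correct prediction scores lower than an uncertain one.
target = [1.0, 0.0, 1.0, 0.0]
print(combined_loss([0.9, 0.1, 0.9, 0.1], target))
print(combined_loss([0.5, 0.5, 0.5, 0.5], target))
```

Both terms depend only on pixel overlap, which is why the abstract adds a centerline term: a one-pixel break in a thin vessel barely changes Dice or BCE but does break vessel continuity.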

Principled Steering via Null-space Projection for Jailbreak Defense in Vision-Language Models

cs.CV Xingyu Zhu, Beier Zhu, Shuo Wang et al. · Mar 23, 2026

Vision-Language Models face escalating safety risks from adversarial jailbreak attacks that bypass alignment via manipulated visual inputs. This paper introduces NullSteer, a training-free defense that applies activation steering constrained to the null space of benign representations—mathematically guaranteeing that safe inputs remain unchanged while harmful activations are redirected toward refusal semantics. The approach aims to solve the over-refusal problem plaguing existing steering methods, offering a principled trade-off between robust safety and preserved utility.

As vision-language models (VLMs) are increasingly deployed in open-world scenarios, they can be easily induced by visual jailbreak attacks to generate harmful content, posing serious risks to model safety and trustworthy usage. Recent activation steering methods inject directional vectors into model activations during inference to induce refusal behaviors and have demonstrated effectiveness. However, a steering vector may both enhance refusal ability and cause over-refusal, thereby degrading model performance on benign inputs. Moreover, due to the lack of theoretical interpretability, these methods still suffer from limited robustness and effectiveness. To better balance safety and utility, we propose NullSteer, a null-space projected activation defense framework. Our method constructs refusal directions within model activations through a linear transformation: it maintains zero perturbation within the benign subspace while dynamically inducing refusal along potentially harmful directions, thereby theoretically achieving safety enhancement without impairing the model's general capabilities. Extensive experiments show that NullSteer significantly reduces harmful outputs under various jailbreak attacks (average ASR reduction over 15 percent on MiniGPT-4) while maintaining comparable performance to the original model on general benchmarks.
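
The "zero perturbation within the benign subspace" property can be illustrated with a small linear-algebra sketch. It is an assumption-laden toy, not the NullSteer implementation: the benign subspace is taken as the span of a few example activations, the perturbation is the rank-one linear map alpha · (harmful_dir · P h) · refusal_dir with P the projector onto the orthogonal complement of that span, and `harmful_dir`, `refusal_dir`, and `alpha` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the benign activations."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def null_space_project(h, basis):
    """Apply P = I - B B^T: remove the part of h inside the benign subspace."""
    out = list(h)
    for b in basis:
        c = dot(h, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

def steer(h, basis, harmful_dir, refusal_dir, alpha=1.0):
    """Add a refusal shift driven only by the component of h outside the
    benign subspace, so benign activations pass through unchanged."""
    residual = null_space_project(h, basis)
    gain = alpha * dot(harmful_dir, residual)
    return [hi + gain * ri for hi, ri in zip(h, refusal_dir)]

# Benign activations span the x-y plane of a 3-D toy activation space.
basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
benign = steer([2.0, 3.0, 0.0], basis, [0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
harmful = steer([0.0, 0.0, 2.0], basis, [0.0, 0.0, 1.0], [1.0, 0.0, 0.0])
```

Here `benign` equals its input, while `harmful` is shifted along the refusal direction. Because the map is linear and vanishes on the benign span, utility on benign inputs is preserved by construction in this toy setting; real activations are only approximately low-rank, which this sketch ignores.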

Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species

cs.CV Jinyu Xu, Tianqi Hu, Xiaonan Hu et al. · Mar 22, 2026

Most visual counting benchmarks focus on rigid objects like crowds and vehicles, leaving fine-grained biological counting understudied. This paper introduces TPC-268, a dataset of 10,000 images spanning 268 countable plant categories across 242 species, annotated with full Linnaean taxonomies and biological organization levels. By framing plant counting as class-agnostic counting with taxonomic constraints, the authors provide a testbed for evaluating hierarchical generalization in vision models.

Visually cataloging and quantifying the natural world requires pushing the boundaries of both detailed visual classification and counting at scale. Despite significant progress, particularly in crowd and traffic analysis, fine-grained, taxonomy-aware plant counting remains underexplored in vision. In contrast to crowds, plants exhibit nonrigid morphologies and physical appearance variations across growth stages and environments. To fill this gap, we present TPC-268, the first plant counting benchmark incorporating plant taxonomy. Our dataset couples instance-level point annotations with Linnaean labels (kingdom -> species) and organ categories, enabling hierarchical reasoning and species-aware evaluation. The dataset features 10,000 images with 678,050 point annotations, includes 268 countable plant categories over 242 plant species in Plantae and Fungi, and spans observation scales from canopy-level remote sensing imagery to tissue-level microscopy. We follow the problem setting of class-agnostic counting (CAC), provide taxonomy-consistent, scale-aware data splits, and benchmark state-of-the-art regression- and detection-based CAC approaches. By capturing the biodiversity, hierarchical structure, and multi-scale nature of botanical and mycological taxa, TPC-268 provides a biologically grounded testbed to advance fine-grained class-agnostic counting. Dataset and code are available at https://github.com/tiny-smart/TPC-268.

Cascade-Free Mandarin Visual Speech Recognition via Semantic-Guided Cross-Representation Alignment

cs.CV Lei Yang, Yi He, Fei Wu et al. · Mar 23, 2026

This paper tackles Chinese Mandarin visual speech recognition (VSR), where the tonal nature of the language and large vocabulary make lipreading more challenging than for non-tonal languages like English. Existing approaches use cascade architectures with intermediate representations like pinyin to bridge the gap, but this introduces error accumulation and increases inference latency. The core idea is a cascade-free multitask architecture that jointly learns phoneme and viseme representations during training, with on-demand activation during inference for efficiency-accuracy trade-offs. This matters because cascade-free designs could eliminate error propagation while maintaining the benefits of intermediate representations.

Chinese Mandarin visual speech recognition (VSR) is a task that has advanced in recent years, yet still lags behind the performance on non-tonal languages such as English. One primary challenge arises from the tonal nature of Mandarin, which limits the effectiveness of conventional sequence-to-sequence modeling approaches. To alleviate this issue, existing Chinese VSR systems commonly incorporate intermediate representations, most notably pinyin, within cascade architectures to enhance recognition accuracy. While beneficial, in these cascaded designs, the subsequent stage during inference depends on the output of the preceding stage, leading to error accumulation and increased inference latency. To address these limitations, we propose a cascade-free architecture based on multitask learning that jointly integrates multiple intermediate representations, including phoneme and viseme, to better exploit contextual information. The proposed semantic-guided local contrastive loss temporally aligns the features, enabling on-demand activation during inference, thereby providing a trade-off between inference efficiency and performance while mitigating error accumulation caused by projection and re-embedding. Experiments conducted on publicly available datasets demonstrate that our method achieves superior recognition performance.

ReDiffuse: Rotation Equivariant Diffusion Model for Multi-focus Image Fusion

cs.CV Bo Li, Tingting Bao, Lingling Zhang et al. · Mar 22, 2026

Multi-focus image fusion (MFIF) combines source images from different focal planes into a single all-in-focus image. This paper targets a critical flaw in diffusion-based MFIF: defocus blur warps geometric structures, producing artifacts. The authors propose ReDiffuse, which embeds B-Conv (Fourier-series-based rotation-equivariant filters) into a U-Net diffusion backbone. By enforcing that rotations induce predictable feature transformations, the method aims to preserve edge orientation and structural consistency while reducing model size through parameter sharing.

Diffusion models have achieved impressive performance on multi-focus image fusion (MFIF). However, a key challenge in applying diffusion models to the ill-posed MFIF problem is that defocus blur can make common symmetric geometric structures (e.g., textures and edges) appear warped and deformed, often leading to unexpected artifacts in the fused images. Therefore, embedding rotation equivariance into diffusion networks is essential, as it enables the fusion results to faithfully preserve the original orientation and structural consistency of geometric patterns underlying the input images. Motivated by this, we propose ReDiffuse, a rotation-equivariant diffusion model for MFIF. Specifically, we carefully construct the basic diffusion architectures to achieve end-to-end rotation equivariance. We also provide a rigorous theoretical analysis to evaluate its intrinsic equivariance error, demonstrating the validity of embedding equivariance structures. ReDiffuse is comprehensively evaluated against various MFIF methods across four datasets (Lytro, MFFW, MFI-WHU, and Road-MF). Results demonstrate that ReDiffuse achieves competitive performance, with improvements of 0.28-6.64% across six evaluation metrics. The code is available at https://github.com/MorvanLi/ReDiffuse.
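
Rotation equivariance means that rotating the input and then applying the layer gives the same result as applying the layer and then rotating the output. The sketch below checks this property for 90-degree rotations only, using a plain 3x3 filter with zero padding. It does not implement the Fourier-series-based filters used in the paper; the kernels and helper names are illustrative assumptions.

```python
def rot90(grid):
    """Rotate a square grid by 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]

def conv3x3(grid, kernel):
    """3x3 cross-correlation with zero padding on a square grid."""
    n = len(grid)
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n and 0 <= cc < n:
                        acc += kernel[dr + 1][dc + 1] * grid[rr][cc]
            out[r][c] = acc
    return out

def is_rot90_equivariant(grid, kernel):
    """True if filtering commutes with a 90-degree rotation on this input."""
    return conv3x3(rot90(grid), kernel) == rot90(conv3x3(grid, kernel))

LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]    # symmetric under 90-degree turns
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # direction-specific

image = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
print(is_rot90_equivariant(image, LAPLACIAN))
print(is_rot90_equivariant(image, SOBEL_X))
```

The symmetric kernel passes the check and the directional one fails it, which is the behavior an equivariant architecture enforces by construction through weight sharing across orientations.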

Parameter-efficient Prompt Tuning and Hierarchical Textual Guidance for Few-shot Whole Slide Image Classification

cs.CV Jayanie Bogahawatte, Sachith Seneviratne, Saman Halgamuge · Mar 23, 2026

Whole Slide Images (WSIs) present a unique challenge for computational pathology due to their gigapixel scale and the scarcity of annotated data. This paper addresses few-shot weakly supervised WSI classification (FSWC) by proposing HIPSS, which combines parameter-efficient prompt tuning via Scaling and Shifting Features (SSF) in the text encoder with a hierarchical textual guidance strategy for WSI representation learning. The core innovation replaces expensive cross-attention mechanisms with lightweight linear transformations $y = \gamma \cdot x + \beta$ while avoiding hard instance filtering through soft cosine-similarity-based attention refinement, achieving up to 13.8% accuracy gains with 18.1% fewer parameters than state-of-the-art methods.

Whole Slide Images (WSIs) are gigapixel in scale and are typically partitioned into small instances in WSI classification pipelines for computational feasibility. However, obtaining extensive instance-level annotations is costly, making few-shot weakly supervised WSI classification (FSWC) crucial for learning from limited slide-level labels. Recently, pre-trained vision-language models (VLMs) have been adopted in FSWC, yet they exhibit several limitations. Existing prompt tuning methods in FSWC substantially increase both the number of trainable parameters and inference overhead. Moreover, current methods discard instances with low alignment to text embeddings from VLMs, potentially leading to information loss. To address these challenges, we make two key contributions. First, we introduce a new parameter-efficient prompt tuning method by scaling and shifting features in the text encoder, which significantly reduces the computational cost. Second, to leverage not only the pre-trained knowledge of VLMs, but also the inherent hierarchical structure of WSIs, we introduce a WSI representation learning approach with a soft hierarchical textual guidance strategy without utilizing hard instance filtering. Comprehensive evaluations on pathology datasets covering breast, lung, and ovarian cancer types demonstrate consistent improvements of up to 10.9%, 7.8%, and 13.8%, respectively, over the state-of-the-art methods in FSWC. Our method reduces the number of trainable parameters by 18.1% on both breast and lung cancer datasets, and by 5.8% on the ovarian cancer dataset, while also excelling at weakly supervised tumor localization. Code at https://github.com/Jayanie/HIPSS.
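
A minimal sketch of the scale-and-shift transform $y = \gamma \cdot x + \beta$ mentioned in the summary, to show why it is cheap in parameters. This is an illustration, not the HIPSS implementation: the class name `ScaleShift`, the width `dim = 512`, and the identity initialization are assumptions, and where the transform sits inside the text encoder is not specified here.

```python
import numpy as np


class ScaleShift:
    """Per-channel affine transform y = gamma * x + beta applied to frozen features."""

    def __init__(self, dim):
        # Identity initialization (an assumption of this sketch): output equals input at start.
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)

    def __call__(self, x):
        # Broadcasts over the leading (batch or token) axis.
        return self.gamma * x + self.beta

    def num_trainable(self):
        return self.gamma.size + self.beta.size


dim = 512  # hypothetical text-embedding width
ssf = ScaleShift(dim)
features = np.random.default_rng(0).standard_normal((4, dim))
out = ssf(features)

# For comparison: one dense dim x dim projection with a bias vector.
dense_params = dim * dim + dim
```

Under these assumptions the transform carries `2 * dim` trainable values per insertion point, against `dim * dim + dim` for a single dense projection of the same width.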

A Large-Scale Remote Sensing Dataset and VLM-based Algorithm for Fine-Grained Road Hierarchy Classification

cs.CV Ting Han, Xiangyi Xie, Yiping Chen et al. · Mar 22, 2026

Most road extraction benchmarks focus on binary segmentation, lacking the hierarchical attributes critical for transport infrastructure planning and management. This paper introduces SYSU-HiRoads, a large-scale dataset spanning 3,631 km² with aligned pixel masks, vector centerlines, and three-level road grades, alongside RoadReasoner—a framework that combines frequency-domain feature extraction with vision-language models to infer road hierarchy from geometric descriptors. The work bridges a significant gap in automated mapping by moving beyond "where are the roads" to "what roles do these roads play."

In this work, we present SYSU-HiRoads, a large-scale hierarchical road dataset, and RoadReasoner, a vision-language-geometry framework for automatic multi-grade road mapping from remote sensing imagery. SYSU-HiRoads is built from GF-2 imagery covering 3631 km² in Henan Province, China, and contains 1079 image tiles at 0.8 m spatial resolution. Each tile is annotated with dense road masks, vectorized centerlines, and three-level hierarchy labels, enabling the joint training and evaluation of segmentation, topology reconstruction, and hierarchy classification. Building on this dataset, RoadReasoner is designed to generate robust road surface masks, topology-preserving road networks, and semantically coherent hierarchy assignments. We strengthen road feature representation and network connectivity by explicitly enhancing frequency-sensitive cues and multi-scale context. Moreover, we perform hierarchy inference at the skeleton-segment level with geometric descriptors and geometry-aware textual prompts, queried by vision-language models to obtain linguistically interpretable grade decisions. Experiments on SYSU-HiRoads and the CHN6-CUG dataset show that RoadReasoner surpasses state-of-the-art road extraction baselines and produces accurate and semantically consistent road hierarchy maps with 72.6% OA, 64.2% F1 score, and 60.6% SegAcc. The dataset and code will be publicly released to support automated transport infrastructure mapping, road inventory updating, and broader infrastructure management applications.

Knowledge Priors for Identity-Disentangled Open-Set Privacy-Preserving Video FER

cs.CV Feng Xu, Xun Li, Lars Petersson et al. · Mar 22, 2026

This paper addresses privacy-preserving facial expression recognition (FER) in video without requiring identity labels—a critical gap since real-world deployment often lacks identity annotations. The core idea leverages intra- and inter-video knowledge priors to train an identity suppression network followed by a denoising module, enabling open-set privacy preservation. This matters because current methods either require closed-set identity supervision or suffer from entangled privacy-utility trade-offs that degrade performance.

Facial expression recognition relies on facial data that inherently expose identity and thus raise significant privacy concerns. Current privacy-preserving methods typically fail in realistic open-set video settings where identities are unknown and identity labels are unavailable. We propose a two-stage framework for video-based privacy-preserving FER in challenging open-set settings that requires no identity labels at any stage. To decouple privacy and utility, we first train an identity-suppression network using intra- and inter-video knowledge priors derived from real-world videos without identity labels. This network anonymizes identity while preserving expressive cues. A subsequent denoising module restores expression-related information and helps recover FER performance. Furthermore, we introduce a falsification-based validation method that uses recognition priors to rigorously evaluate privacy robustness without requiring annotated identity labels. Experiments on three video datasets demonstrate that our method effectively protects privacy while maintaining FER accuracy comparable to identity-supervised baselines.

HACMatch: Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling

cs.CV Mei Li, Huayi Zhou, Suizhi Huang et al. · Mar 23, 2026

The paper tackles semi-supervised 3D rotation regression from monocular images, addressing the rigidity of fixed entropy thresholds in pseudo-label filtering used by prior work like FisherMatch. It proposes HACMatch, a hardness-aware curriculum learning framework that dynamically selects unlabeled samples by difficulty using either multi-stage or adaptive strategies, paired with PoseMosaic, a patch-based augmentation that applies diverse transformations while preserving geometric integrity. This matters because rotation annotations are expensive to obtain, and effectively leveraging unlabeled data could reduce costs for autonomous driving and robotics applications.

Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. Existing rotation regression models often rely on large amounts of labeled data for training or require additional information beyond 2D images, such as point clouds or CAD models. Therefore, exploring semi-supervised rotation regression using only a limited number of labeled 2D images is highly valuable. While recent work FisherMatch introduces semi-supervised learning to rotation regression, it suffers from rigid entropy-based pseudo-label filtering that fails to effectively distinguish between reliable and unreliable unlabeled samples. To address this limitation, we propose a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. We introduce both multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering with more flexible, hardness-aware mechanisms. Additionally, we present a novel structured data augmentation strategy specifically tailored for rotation estimation, which assembles composite images from augmented patches to introduce feature diversity while preserving critical geometric integrity. Comprehensive experiments on PASCAL3D+ and ObjectNet3D demonstrate that our method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes, validating the effectiveness of our curriculum learning framework and structured augmentation approach.
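
The easy-to-hard selection idea can be sketched as follows. The abstract does not give the details of the paper's multi-stage or adaptive strategies, so this is an illustration under stated assumptions: each unlabeled sample already has a scalar hardness score (for example, the predictive entropy of its pseudo label), and the admitted share of samples grows linearly over training. The functions `curriculum_fraction` and `select_pseudo_labels` and the `start`/`end` defaults are hypothetical.

```python
import numpy as np


def curriculum_fraction(step, total_steps, start=0.2, end=1.0):
    """Linearly grow the admitted share of unlabeled samples from `start` to `end`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def select_pseudo_labels(hardness, step, total_steps):
    """Return indices of the easiest samples admitted at this training step."""
    hardness = np.asarray(hardness, dtype=float)
    keep = max(1, int(round(curriculum_fraction(step, total_steps) * hardness.size)))
    # Lower hardness counts as easier; a stable sort keeps ties in input order.
    order = np.argsort(hardness, kind="stable")
    return order[:keep]


# Hypothetical hardness scores for ten unlabeled samples.
hardness = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0])
early = select_pseudo_labels(hardness, step=0, total_steps=100)
late = select_pseudo_labels(hardness, step=100, total_steps=100)
```

Unlike a fixed entropy threshold, the cut-off here is relative to the current batch and moves with training progress, so harder samples are deferred rather than discarded.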

Privacy-Preserving Federated Action Recognition via Differentially Private Selective Tuning and Efficient Communication

cs.CV Idris Zakariyya, Pai Chet Ng, Kaushik Bhargav Sivangi et al. · Mar 22, 2026

Federated video action recognition faces a dual challenge: gradient sharing risks leaking sensitive motion patterns, while synchronizing high-dimensional video models incurs prohibitive bandwidth costs. This paper proposes FedDP-STECAR, which selectively fine-tunes only task-relevant layers under differential privacy and transmits only those layers, claiming over 99% communication reduction alongside strong privacy guarantees ($\epsilon \leq 1.33$). The work matters for enabling practical privacy-preserving video analysis in healthcare and surveillance where data cannot be centralized.

Federated video action recognition enables collaborative model training without sharing raw video data, yet remains vulnerable to two key challenges: model exposure and communication overhead. Gradients exchanged between clients and the server can leak private motion patterns, while full-model synchronization of high-dimensional video networks causes significant bandwidth and communication costs. To address these issues, we propose Federated Differential Privacy with Selective Tuning and Efficient Communication for Action Recognition, namely FedDP-STECAR. Our FedDP-STECAR framework selectively fine-tunes and perturbs only a small subset of task-relevant layers under Differential Privacy (DP), reducing the surface of information leakage while preserving temporal coherence in video features. By transmitting only the tuned layers during aggregation, communication traffic is reduced by over 99% compared to full-model updates. Experiments on the UCF-101 dataset using the MViT-B-16x4 transformer show that FedDP-STECAR achieves up to 70.2% higher accuracy under strict privacy ($\epsilon=0.65$) in centralized settings and 48% faster training with 73.1% accuracy in federated setups, enabling scalable and privacy-preserving video action recognition. Code available at https://github.com/izakariyya/mvit-federated-videodp
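
A rough sketch of the two mechanisms named above: perturbing only the tuned layers and transmitting only those layers. It is not the FedDP-STECAR code. It assumes a standard clip-then-add-Gaussian-noise step applied per layer, uses made-up layer names and sizes, and does no privacy accounting, so it makes no claim about the resulting $\epsilon$. All function names are hypothetical.

```python
import numpy as np


def clip_and_perturb(update, clip_norm, noise_multiplier, rng):
    """Clip an update to L2 norm `clip_norm`, then add Gaussian noise scaled to that norm."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise


def build_payload(updates, tuned_layers, clip_norm, noise_multiplier, rng):
    """Keep only the tuned layers and privatize each one before transmission."""
    return {
        name: clip_and_perturb(updates[name], clip_norm, noise_multiplier, rng)
        for name in tuned_layers
    }


def payload_fraction(updates, payload):
    """Share of all parameters that is actually sent to the server."""
    sent = sum(v.size for v in payload.values())
    total = sum(v.size for v in updates.values())
    return sent / total


rng = np.random.default_rng(0)
updates = {
    "backbone": rng.standard_normal(99_000),  # frozen in this sketch, never sent
    "head": rng.standard_normal(1_000),       # hypothetical tuned layer
}
payload = build_payload(updates, ["head"], clip_norm=1.0, noise_multiplier=1.0, rng=rng)
fraction = payload_fraction(updates, payload)
```

With these made-up sizes the client sends 1% of the parameters. The reduction reported in the abstract depends on which layers the method actually selects.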
