Reconstructing translucent objects from multi-view images is challenging because subsurface scattering causes standard surface reconstruction methods to fail. This paper proposes GTSR, a 3D Gaussian Splatting (3DGS) pipeline that separates surface geometry from scattering effects by using two Gaussian sets—surface Gaussians for geometry and interior Gaussians for scattering—blended via a Fresnel term. A physically-based rendering (PBR) module with deferred shading further constrains the geometry. The method achieves state-of-the-art surface reconstruction on the NeuralTO Syn dataset while training in approximately 2.5 hours, significantly faster than prior neural implicit approaches.
The paper addresses continual unlearning in Large Vision-Language Models (LVLMs), where models must sequentially remove specific vision-instruction pairs without full retraining while preserving general utility. Prior methods suffer from distorted shared representations that create spurious associations, leading to irrelevant refusals for past forget data and over-refusal of retain queries. The proposed framework, CORE (COncept-aware REfuser), decomposes deletion targets into fine-grained visual attributes and textual intents, using a concept modulator to identify which combinations characterize each forget category and a mixture of specialized refusal experts to generate contextually appropriate refusals.
ShapDBM addresses the fragmentation problem in Decision Boundary Maps (DBMs) by transforming data into Shapley space before applying dimensionality reduction. This creates more compact decision zones that reflect model behavior rather than raw data distribution, enabling high-quality visualization of complex datasets like SVHN where traditional data-space DBMs fail.
Burst image restoration in low-light conditions typically relies on fixed exposure settings that limit complementary information across frames. This paper proposes DEBIR, a pipeline that dynamically predicts per-frame exposure times using a Burst Auto-Exposure Network (BAENet) conditioned on preview images, motion, and gain. The key insight is that scene-adaptive exposures can optimally trade off noise and blur across the burst, and the authors enable end-to-end training via a novel differentiable burst simulator that eliminates the need for ground-truth exposure sequences.
DA-VAE tackles the challenge of scaling latent diffusion models to higher resolutions without linearly increasing token counts. The core idea is a structured latent representation: keep the original pretrained VAE latent channels as a 'base' and append additional 'detail' channels that encode high-resolution information, enforced by a simple alignment loss. This allows a pretrained diffusion model to be fine-tuned rather than retrained from scratch, promising significant compute savings.
Diffusion Language Models (DLMs) train with a static single-step masked prediction objective but infer via multi-step progressive denoising, creating a train-inference mismatch that compounds errors. MemDLM bridges this gap through Bi-level Optimization: an inner loop updates fast weights (Parametric Memory) to capture local trajectory experience, while an outer loop conditions the base model on this memory. The approach yields faster convergence, lower exposure bias, and substantial gains on long-context needle-in-a-haystack tasks, with an optional inference-time adaptation that acts as an emergent in-weight retrieval mechanism.
FeatDistill tackles robust detection of AI-generated images under real-world degradations via a multi-expert ensemble of CLIP and SigLIP backbones. The framework combines extensive data expansion with a two-stage training paradigm featuring feature-level self-distillation. It aims to balance strong generalization across unseen generators with practical inference efficiency.
This paper tackles coronary artery segmentation from CTA images, a challenging task due to slender tubular morphology and severe class imbalance. The authors propose MDSVM-UNet, a two-stage framework that combines multidirectional snake convolution (MDSConv)—extending deformable convolution to three anatomical planes—with residual visual Mamba (RVM) for linear-complexity long-range dependency modeling. The approach aims to capture both local geometric priors of vessels and global inter-slice context while maintaining computational efficiency suitable for clinical deployment.
This paper tackles a fundamental question in multimodal large language models (MLLMs): should the vision encoder be fine-tuned or frozen during instruction tuning? The authors identify visual preference conflicts—where diverse linguistic instructions pull encoder parameters in conflicting directions—as the root cause of instability in existing visual fine-tuning (VFT) methods. They propose CoVFT, a context-aware framework that extracts multimodal context vectors and routes visual tokens through mixture-of-experts layers to decompose these conflicts, achieving consistent gains across 12 benchmarks.
Federated learning for medical imaging typically requires task-specific pipelines and assumes homogeneous modalities across institutions, limiting real-world deployment where hospitals use diverse scanners (MRI, CT, PET) and need to support multiple downstream tasks. OmniFM proposes a frequency-domain insight: low-frequency spectral components exhibit cross-modality consistency and encode modality-invariant anatomical structures, enabling a single reusable optimization pipeline. The framework combines Global Spectral Knowledge Retrieval, Embedding-wise Cross-Attention Fusion, and Prefix-Suffix Spectral Prompting, regularized by Spectral-Proximal Alignment to stabilize aggregation under severe modality heterogeneity.
Selective prediction systems in LLMs abstain from answering uncertain questions to mitigate hallucination harms in high-stakes domains. This paper identifies a critical failure mode of entropy-based uncertainty quantification: the 'confidently wrong' regime where models produce low-entropy hallucinations. The authors propose combining entropy signals with correctness probes using logistic regression, and advocate for deployment-facing metrics—E-AURC and TCE—over AUROC to ensure systems can reliably operate at strict safety thresholds.
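As a rough illustration of why a deployment-facing metric can expose the 'confidently wrong' regime, the sketch below computes E-AURC under its common definition: the area under the risk-coverage curve minus the area achieved by an oracle ranking. The function names and toy data are hypothetical, and the paper's exact variant (and its TCE metric) may differ.

```python
import numpy as np

def aurc(confidence, correct):
    """Area under the risk-coverage curve: mean selective risk over all coverage levels."""
    order = np.argsort(-np.asarray(confidence, dtype=float), kind="stable")
    errors = 1.0 - np.asarray(correct, dtype=float)[order]
    # Selective risk when answering only the k most confident questions, k = 1..n.
    selective_risk = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    return float(selective_risk.mean())

def e_aurc(confidence, correct):
    """Excess AURC: gap to an oracle that ranks every correct answer above every wrong one."""
    correct = np.asarray(correct, dtype=float)
    return aurc(confidence, correct) - aurc(correct, correct)

# Toy data (assumption): the most confident answer is a hallucination.
confidently_wrong = e_aurc([0.99, 0.9, 0.8, 0.6], [0, 1, 1, 1])
# Toy data (assumption): confidence ranks all correct answers first.
well_ranked = e_aurc([0.9, 0.8, 0.2], [1, 1, 0])
print(confidently_wrong, well_ranked)
```

A single high-confidence error at the top of the ranking inflates the selective risk at every coverage level, which is why the first score is large while the well-ranked case scores zero.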
This paper proposes SparseVoxelDet, the first fully sparse object detector for event cameras that processes asynchronous event data using 3D sparse convolutions throughout the entire pipeline—from voxelization through backbone, feature pyramid, and detection head—without ever instantiating a dense feature tensor. On the FRED drone detection benchmark, the model achieves 83.38% mAP@50 (within 4.3 points of the dense YOLOv11 baseline) while processing only ~14,900 active voxels per frame (0.23% occupancy at 640×640) instead of all 409,600 pixel positions, yielding 858× GPU memory compression and storage costs that scale with scene activity rather than sensor resolution.
PEARL tackles training-free open-vocabulary semantic segmentation (OVSS), where the goal is to segment images into classes defined by arbitrary text prompts without fine-tuning the vision-language backbone. The core idea is an align-then-propagate pipeline: (1) Procrustes alignment rotates attention keys toward the query subspace inside the last self-attention block to fix spatially inconsistent patch geometry, and (2) a text-aware Laplacian propagation refines logits on a compact grid using a confidence-weighted graph that couples image gradients with text-based semantic similarity. This matters because it delivers state-of-the-art training-free accuracy with a frozen CLIP encoder, adding only modest computational overhead.
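The alignment step can be pictured with the classic orthogonal Procrustes problem, sketched below in NumPy. This is a minimal illustration under stated assumptions, not PEARL's implementation: the shapes, the synthetic data, and the function name `procrustes_rotation` are hypothetical, and the summary does not say exactly how the method builds its key and query matrices.

```python
import numpy as np

def procrustes_rotation(keys, queries):
    """Orthogonal R minimizing ||keys @ R - queries||_F (classic orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(keys.T @ queries)
    return u @ vt  # orthogonal; may be a rotation or a reflection

rng = np.random.default_rng(0)
queries = rng.normal(size=(16, 8))                 # hypothetical: 16 patch tokens, 8-dim head
true_rot, _ = np.linalg.qr(rng.normal(size=(8, 8)))
keys = queries @ true_rot.T                        # keys are an orthogonally transformed copy

rot = procrustes_rotation(keys, queries)
aligned_keys = keys @ rot                          # keys mapped back onto the query subspace
print(np.linalg.norm(keys - queries), np.linalg.norm(aligned_keys - queries))
```

Because the solution is a closed-form SVD of a small matrix, a step of this kind needs no training, which fits the training-free setting described above.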
DUO-VSR tackles the prohibitive sampling cost of diffusion-based video super-resolution by enabling efficient one-step generation. The paper identifies critical limitations when applying Distribution Matching Distillation (DMD) to VSR—specifically training instability, degraded supervision from frozen score models, and insufficient guidance capped by teacher quality—and proposes a dual-stream strategy that unifies DMD with adversarial supervision via Real–Fake Score Feature GAN (RFS-GAN). This three-stage pipeline achieves approximately $50\times$ speedup over multi-step counterparts while delivering superior perceptual quality, making high-fidelity video upscaling practical for real-world deployment.
ALADIN tackles person Re-identification by distilling fine-grained attribute knowledge from a frozen CLIP teacher into a lightweight student network. The core innovation uses a Multimodal LLM (Qwen-VL) to generate structured attribute descriptions, which are converted via CLIP into spatial attention maps for supervising local feature alignment. A Scene-Aware Prompt Generator (SAPG) creates image-specific soft prompts via $\mathbf{p}=\mathrm{MLP}(\mathbf{f}_{g})$ to adapt text embeddings to surveillance scenes. At inference, only the student runs, promising deployable efficiency.
This paper proposes a fundamental shift in evaluating probabilistic time series forecasting by replacing passive observation of historical trajectories with an interventionist "noise titration" protocol. By injecting calibrated Gaussian noise into known chaotic and stochastic dynamical systems, the authors transform forecasting into an exact distributional inference task where statistical calibration can be verified against ground-truth likelihoods. They extend the Fern architecture to output full covariance structures via SPD cone parameterization, then use the framework to expose severe failures in zero-shot foundation models under non-stationarity.
Ara-BEST-RQ introduces dedicated self-supervised speech models for Arabic dialects. The authors curate 5,640 hours of Creative Commons Arabic speech covering 20 dialects and train Conformer-based BEST-RQ models up to 600M parameters. Their 300M model achieves state-of-the-art dialect identification performance using fewer parameters than competing Whisper-based systems. This work helps close the gap for underrepresented Arabic dialects in speech technology.
Traditional concentration indices like the Herfindahl-Hirschman Index ($HHI = \sum_i w_i^2$) measure weight dispersion but ignore network topology, meaning two systems with identical weight distributions can exhibit different effective concentration. This paper introduces the Network Concentration Index (NCI), defined as $\psi(w,A) = \frac{w^{\top}Aw}{1-\sum_i w_i^2}$, which measures the fraction of potential weighted interconnection realized along observed network links. The framework unifies weight distributions with interaction structures, providing a theoretically grounded tool for assessing systemic risk in financial networks, supply chains, and economic production systems.
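A small worked example makes the contrast concrete. The sketch below assumes weights that sum to one and a symmetric binary adjacency matrix with a zero diagonal, in which case the denominator $1-\sum_i w_i^2$ equals $\sum_{i\neq j} w_i w_j$, the total weighted interconnection available if every pair were linked. The function names and the two toy networks are hypothetical.

```python
import numpy as np

def hhi(w):
    """Herfindahl-Hirschman Index: sum of squared weights."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(w ** 2))

def nci(w, adjacency):
    """Network Concentration Index: w^T A w / (1 - sum_i w_i^2)."""
    w = np.asarray(w, dtype=float)
    a = np.asarray(adjacency, dtype=float)
    return float(w @ a @ w) / (1.0 - hhi(w))

w = np.array([0.5, 0.3, 0.2])               # identical weights in both networks
complete = np.ones((3, 3)) - np.eye(3)      # every pair of nodes is linked
chain = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]])               # nodes 0 and 2 are not linked

print(hhi(w))            # same HHI for both networks
print(nci(w, complete))  # all potential interconnection realized
print(nci(w, chain))     # lower, because the 0-2 link is missing
```

Both networks share the same HHI, yet the complete graph realizes all potential interconnection (NCI of 1) while the chain realizes only part of it.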
NoOVD tackles a critical issue in open-vocabulary object detection (OVD): during training, novel-category objects are forcibly aligned with background embeddings, causing them to be filtered out by the region proposal network (RPN) and misclassified by the RoI head. The authors propose a framework built on frozen CLIP that identifies latent novel objects during training via generic text prompts (e.g., 'This is an object, specifically an animal') and integrates them through self-distillation. At test time, a Re-weighted RPN (R-RPN) boosts proposal scores using CLIP-based knowledge to improve novel-category recall. The method aims to eliminate the training-inference gap without requiring additional labeled data or introducing pseudo-labeling noise.
MIHT tackles Time Series Classification (TSC) with variable-length, multivariate data—common in sensor and healthcare applications. The core idea combines Multiple Instance Learning (MIL) with Hoeffding Trees (incremental decision trees) to represent series as overlapping subseries bags and iteratively optimize which $k$ consecutive subseries are most discriminative. The approach promises both handling of unequal-length inputs and interpretability via a single tree structure.
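The bag construction can be sketched as follows. This is only an illustration of the representation, assuming a fixed window and stride (both hypothetical parameters, as are the function names); the Hoeffding Tree and the iterative search for the most discriminative $k$ consecutive subseries are not shown.

```python
import numpy as np

def make_bag(series, window, stride):
    """Split a (time, channels) series into overlapping subseries, the instances of a bag.

    Assumes the series has at least `window` time steps.
    """
    series = np.asarray(series, dtype=float)
    starts = range(0, series.shape[0] - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])

def consecutive_groups(bag, k):
    """Enumerate every run of k consecutive instances as (start_index, instances)."""
    return [(i, bag[i:i + k]) for i in range(len(bag) - k + 1)]

rng = np.random.default_rng(0)
series_a = rng.normal(size=(10, 3))   # 10 time steps, 3 variables
series_b = rng.normal(size=(25, 3))   # a longer series with the same variables

bag_a = make_bag(series_a, window=4, stride=2)
bag_b = make_bag(series_b, window=4, stride=2)
print(bag_a.shape, bag_b.shape)       # bags differ in size, instances share one shape
candidates = consecutive_groups(bag_a, k=2)
```

Series of different lengths yield bags with different numbers of instances, but every instance has the same shape, which is what lets a single model consume unequal-length inputs.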