Your paper timeline
Scroll AI takes the way you would scroll a great paper aggregator: quick signal first, deeper critique when something earns your attention, and challenges when a claim feels off.
482 papers
Trending mixes fresh papers with community signal.
cs.CV Pengxiang Cai, Mengyang Li · Mar 22, 2026

MS-CustomNet tackles multi-subject customization for text-to-image diffusion models, where the challenge is to preserve multiple subject identities while controlling their compositional arrangement and spatial relationships. The authors propose a framework built on CustomNet that accepts multiple reference images plus a layout map $M_L$ specifying spatial arrangement, trained on a curated MSI dataset derived from COCO. The work aims to provide explicit deterministic control over subject placement and layering (e.g., "cake inside bowl" vs "cake behind bowl") rather than relying on implicit text-to-image generation.

Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to provide explicit user-defined control over the compositional structure and precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, empowers users to explicitly define these hierarchical arrangements and spatial placements within the generated image. Our approach ensures individual subject identity preservation while learning and enacting these user-specified inter-subject compositions. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. MS-CustomNet offers enhanced, fine-grained control over multi-subject image generation. Our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating its superior capability in generating high-fidelity images with precise, user-directed multi-subject compositions and spatial control.
cs.CV Lanbo Xu, Liang Guo, Caigui Jiang et al. · Mar 22, 2026

PAS3R tackles online monocular 3D reconstruction from long video streams, addressing the stability–adaptation dilemma where models must incorporate novel viewpoints without overwriting historical scene structure. The core idea is to dynamically modulate state update intensity based on geometric novelty: measuring inter-frame camera displacement (translation + rotation) and image frequency content via Fourier analysis. This enables faster adaptation to abrupt viewpoint changes while preserving accumulated geometry during smooth motion.

Online monocular 3D reconstruction enables dense scene recovery from streaming video but remains fundamentally limited by the stability-adaptation dilemma: the reconstruction model must rapidly incorporate novel viewpoints while preserving previously accumulated scene structure. Existing streaming approaches rely on uniform or attention-based update mechanisms that often fail to account for abrupt viewpoint transitions, leading to trajectory drift and geometric inconsistencies over long sequences. We introduce PAS3R, a pose-adaptive streaming reconstruction framework that dynamically modulates state updates according to camera motion and scene structure. Our key insight is that frames contributing significant geometric novelty should exert stronger influence on the reconstruction state, while frames with minor viewpoint variation should prioritize preserving historical context. PAS3R operationalizes this principle through a motion-aware update mechanism that jointly leverages inter-frame pose variation and image frequency cues to estimate frame importance. To further stabilize long-horizon reconstruction, we introduce trajectory-consistent training objectives that incorporate relative pose constraints and acceleration regularization. A lightweight online stabilization module further suppresses high-frequency trajectory jitter and geometric artifacts without increasing memory consumption. Extensive experiments across multiple benchmarks demonstrate that PAS3R significantly improves trajectory accuracy, depth estimation, and point cloud reconstruction quality in long video sequences while maintaining competitive performance on shorter sequences.
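The motion-aware update described above can be pictured with a toy gate. This is a minimal sketch, not PAS3R's implementation: the function names, the weights `w_t`, `w_r`, `w_f`, and the exponential squashing are all assumptions chosen for illustration, and `high_freq_ratio` stands in for whatever frequency cue the method actually derives from the image.

```python
import math

def frame_novelty(translation, rotation_deg, high_freq_ratio,
                  w_t=1.0, w_r=0.05, w_f=0.5):
    """Score how much new geometry a frame likely brings (weights are illustrative)."""
    t_norm = math.sqrt(sum(c * c for c in translation))
    return w_t * t_norm + w_r * abs(rotation_deg) + w_f * high_freq_ratio

def update_gate(novelty):
    """Squash a non-negative novelty score into an update intensity in [0, 1)."""
    return 1.0 - math.exp(-novelty)

def update_state(state, observation, gate):
    """Blend the stored state with the new observation, element by element."""
    return [(1.0 - gate) * s + gate * o for s, o in zip(state, observation)]

# Smooth motion yields a small gate; an abrupt viewpoint change yields a large one.
smooth = update_gate(frame_novelty((0.01, 0.0, 0.0), 0.5, 0.1))
abrupt = update_gate(frame_novelty((1.0, 0.0, 0.0), 90.0, 0.1))
```

With a gate near 0 the accumulated state is preserved, and with a gate near 1 the new frame dominates, which mirrors the stability versus adaptation trade-off the abstract describes.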
cs.CV Haolan Xu, Keli Cheng, Lei Wang et al. · Mar 22, 2026

EmoTaG tackles few-shot 3D talking-head synthesis with emotional expressiveness using only 5 seconds of target video. The core insight is to predict FLAME parameters (expression and jaw pose) rather than directly deforming 3D Gaussians, providing explicit geometric priors for stability. A Gated Residual Motion Network (GRMN) disentangles phonetic articulation from emotion-driven variations with a learned gate $g \in [0,1]$, while Semantic Emotion Guidance distills knowledge from a pretrained DeepFace recognizer to supervise emotional intensity without manual labels.

Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.
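The gated residual idea from the summary (a gate $g \in [0,1]$ mixing an emotion-driven residual into phonetic motion) can be sketched as follows. This is an assumption-laden illustration, not the GRMN architecture: the sigmoid gate, the function names, and the flat parameter lists are hypothetical stand-ins for the FLAME expression and jaw-pose parameters.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_residual(base_params, emotion_residual, gate_logit):
    """Return base + g * residual, where g = sigmoid(gate_logit)."""
    g = sigmoid(gate_logit)
    mixed = [b + g * r for b, r in zip(base_params, emotion_residual)]
    return mixed, g

# Hypothetical values: two parameters predicted from audio, plus an emotion residual.
params, gate = gated_residual([1.0, 2.0], [0.2, -0.4], gate_logit=0.0)
```

A strongly negative logit drives the gate toward 0 and leaves the articulation-only motion untouched, while a strongly positive one applies the full emotion residual.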
cs.CV Mohammed El Amine Lazouni, Leila Ryma Lazouni, Zineb Aziza Elaouaber et al. · Mar 22, 2026

CornOrb addresses a persistent gap in ophthalmic AI by providing one of the first large-scale, publicly accessible Orbscan 3 corneal topography datasets. The collection comprises 1,454 eyes from 744 Algerian patients, offering four standardized corneal maps (axial curvature, anterior/posterior elevation, pachymetry) alongside structured clinical parameters including Kmax, astigmatism, and asphericity. By releasing this multimodal resource in standardized PNG and CSV formats, the authors aim to enable robust AI-driven detection of keratoconus using device-specific data from an underrepresented African population.

In this paper, we present CornOrb, a publicly accessible multimodal dataset of Orbscan corneal topography images and clinical annotations collected from patients in Algeria. The dataset comprises 1,454 eyes from 744 patients, including 889 normal eyes and 565 keratoconus cases. For each eye, four corneal maps are provided (axial curvature, anterior elevation, posterior elevation, and pachymetry), together with structured tabular data including demographic information and key clinical parameters such as astigmatism, maximum keratometry (Kmax), central and thinnest pachymetry, and anterior/posterior asphericity. All data were retrospectively acquired, fully anonymized, and pre-processed into standardized PNG and CSV formats to ensure direct usability for artificial intelligence research. This dataset represents one of the first large-scale Orbscan-based resources from Africa, specifically built to enable robust AI-driven detection and analysis of keratoconus using multimodal data. The data are openly available at Zenodo.
cs.CV Jiacheng Lu, Hui Ding, Shiyu Zhang et al. · Mar 23, 2026

PGR-Net addresses brain tumor MRI segmentation by tackling the challenge of spatial sparsity—where lesions occupy only ~10.7% of the image volume—through explicit data-driven spatial priors. The framework introduces a hierarchical Top-K ROI selection mechanism and a Windowed Gaussian–Spatial Decay (WinGS-ROI) module to concentrate computational resources on lesion-relevant regions rather than background. This yields competitive Dice scores (89.02–91.82% on Whole Tumor across benchmarks) with only 8.64M parameters, offering a lightweight alternative to contemporary Transformer and Mamba architectures.

Brain tumor MRI segmentation is essential for clinical diagnosis and treatment planning, enabling accurate lesion detection and radiotherapy target delineation. However, tumor lesions occupy only a small fraction of the volumetric space, resulting in severe spatial sparsity, while existing segmentation networks often overlook clinically observed spatial priors of tumor occurrence, leading to redundant feature computation over extensive background regions. To address this issue, we propose PGR-Net (Prior-Guided ROI Reasoning Network) - an explicit ROI-aware framework that incorporates a data-driven spatial prior set to capture the distribution and scale characteristics of tumor lesions, providing global guidance for more stable segmentation. Leveraging these priors, PGR-Net introduces a hierarchical Top-K ROI decision mechanism that progressively selects the most confident lesion candidate regions across encoder layers to improve localization precision. We further develop the WinGS-ROI (Windowed Gaussian-Spatial Decay ROI) module, which uses multi-window Gaussian templates with a spatial decay function to produce center-enhanced guidance maps, thus directing feature learning throughout the network. With these ROI features, a windowed RetNet backbone is adopted to enhance localization reliability. Experiments on BraTS-2019/2023 and MSD Task01 show that PGR-Net consistently outperforms existing approaches while using only 8.64M Params, achieving Dice scores of 89.02%, 91.82%, and 89.67% on the Whole Tumor region. Code is available at https://github.com/CNU-MedAI-Lab/PGR-Net.
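The two mechanisms named in the abstract, Top-K candidate selection and center-enhanced Gaussian guidance, can be sketched in 2D. This is a simplified illustration under stated assumptions: the function names are hypothetical, the real method works on volumetric features across encoder layers, and averaging the windows is only one plausible way to combine multiple Gaussian templates.

```python
import math

def top_k_rois(scores, k):
    """Return the indices of the k highest-scoring candidate regions, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def gaussian_guidance(height, width, center, sigma):
    """Build a map that equals 1.0 at `center` and decays with squared distance."""
    cy, cx = center
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

def multi_window_guidance(height, width, center, sigmas):
    """Average several Gaussian windows of different widths (an assumed combination rule)."""
    maps = [gaussian_guidance(height, width, center, s) for s in sigmas]
    return [[sum(m[y][x] for m in maps) / len(maps) for x in range(width)]
            for y in range(height)]

selected = top_k_rois([0.1, 0.9, 0.4, 0.7], k=2)
guidance = multi_window_guidance(5, 5, center=(2, 2), sigmas=[1.0, 2.0])
```

Multiplying features by such a map emphasizes the selected region's center and suppresses distant background, which is the intuition behind spending computation on lesion-relevant areas.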
cs.CV cs.RO Suresh Guttikonda, Maximilian Neidhardt, Vidas Raudonis et al. · Mar 23, 2026

This paper tackles robotic optical coherence tomography (OCT) scanning of curved tissue surfaces, addressing the limitation that existing approaches restrict motion to pure translations to avoid challenging hand-eye calibration. The core contribution is a custom ChArUco calibration pattern enabling full six-degree-of-freedom hand-eye calibration, allowing the OCT probe to rotate and follow curved surfaces. This matters because pure translational scanning accumulates registration errors on curved geometries, whereas full 6D motion enables accurate, large-area surface reconstruction.

Optical coherence tomography (OCT) is a non-invasive volumetric imaging modality with high spatial and temporal resolution. For imaging larger tissue structures, OCT probes need to be moved to scan the respective area. For handheld scanning, stitching of the acquired OCT volumes requires overlap to register the images. For robotic scanning and stitching, a typical approach is to restrict the motion to translations, as this avoids a full hand-eye calibration, which is complicated by the small field of view of most OCT probes. However, stitching by registration or by translational scanning is limited when curved tissue surfaces need to be scanned. We propose a marker for full six-dimensional hand-eye calibration of a robot-mounted OCT probe. We show that the calibration results in highly repeatable estimates of the transformation. Moreover, we evaluate robotic scanning of two phantom surfaces to demonstrate that the proposed calibration allows for consistent scanning of large, curved tissue surfaces. As the proposed approach does not rely on image registration, it does not suffer from a potential accumulation of errors along a scan path. We also illustrate the improvement compared to conventional 3D-translational robotic scanning.
cs.CV Roy Amoyal, Oren Freifeld, Chaim Baskin · Mar 23, 2026

The paper addresses the novel challenge of aligning independent 3D Gaussian Splatting models across different object instances within the same category—a task beyond existing same-object registration methods. The core innovation is a two-stage pipeline: first, a coarse alignment using a feature-guided iterative absolute orientation solver that handles extreme initializations (180° rotations, 10× scale differences); second, a fine alignment that enforces multi-view feature consistency via an inverse-radiance-field formulation generalized to the similarity group $\text{Sim}(3)$. This enables the first viable category-level 3DGS registration, unlocking applications like geometrically-consistent object replacement.

We present Gaussian Splatting Alignment (GSA), a novel method for aligning two independent 3D Gaussian Splatting (3DGS) models via a similarity transformation (rotation, translation, and scale), even when they are of different objects in the same category (e.g., different cars). In contrast, existing methods can only align 3DGS models of the same object (e.g., the same car) and often must be given true scale as input, while we estimate it successfully. GSA leverages viewpoint-guided spherical map features to obtain robust correspondences and introduces a two-step optimization framework that aligns 3DGS models while keeping them fixed. First, we apply an iterative feature-guided absolute orientation solver as our coarse registration, which is robust to poor initialization (e.g., 180 degrees misalignment or a 10x scale gap). Next, we use a fine registration step that enforces multi-view feature consistency, inspired by inverse radiance-field formulations. The first step already achieves state-of-the-art performance, and the second further improves results. In the same-object case, GSA outperforms prior works, often by a large margin, even when the other methods are given the true scale. In the harder case of different objects in the same category, GSA vastly surpasses them, providing the first effective solution for category-level 3DGS registration and unlocking new applications. Project webpage: https://bgu-cs-vil.github.io/GSA-project/
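To make the similarity-transform estimate concrete, here is a planar analogue of an absolute orientation step. It is a sketch under strong assumptions, not GSA's solver: correspondences are taken as known (GSA obtains them from features and iterates), the problem is 2D rather than 3D, and `align_similarity_2d` is a hypothetical name. In 2D a rotation plus uniform scale is a single complex multiplier, so the least-squares fit has a closed form.

```python
import cmath

def align_similarity_2d(src, dst):
    """Least-squares scale, rotation, and translation mapping src points onto dst points."""
    n = len(src)
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / n
    q_mean = sum(q) / n
    # a = scale * exp(i * angle) minimizes sum |a * (p - p_mean) - (q - q_mean)|^2.
    num = sum((pi - p_mean).conjugate() * (qi - q_mean) for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    a = num / den
    t = q_mean - a * p_mean
    return abs(a), cmath.phase(a), (t.real, t.imag)

# Target is the source scaled 10x, rotated 180 degrees, then shifted by (3, -2).
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
target = [(3.0, -2.0), (-7.0, -2.0), (3.0, -12.0)]
scale, angle, shift = align_similarity_2d(source, target)
```

With exact correspondences the closed form recovers even the extreme case of a 180 degree flip and a 10x scale gap in one step; the hard part in practice is finding reliable correspondences, which is where the feature guidance comes in.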
cs.LG Kangqi Ni, Wenyue Hua, Xiaoxiang Shi et al. · Mar 23, 2026

Multi-agent applications execute tasks through multi-stage workflows where each stage is an LLM call feeding into the next. While heterogeneous clusters (mixing model sizes/families) enable better latency–performance trade-offs than homogeneous deployments, they introduce complex scheduling challenges: model selection affects both task accuracy and queue congestion. Chimera addresses this by predicting per-model confidence scores, forecasting total workflow output lengths, and estimating real-time load via in-flight token volumes to jointly optimize end-to-end latency and task performance.

Multi-agent applications often execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output becomes part of context for subsequent steps. Existing LLM serving systems largely assume homogeneous clusters with identical model replicas. This design overlooks the potential of heterogeneous deployments, where models of different sizes and capabilities enable finer trade-offs between latency and performance. However, heterogeneity introduces new challenges in scheduling across models with diverse throughput and performance. We present Chimera, a predictive scheduling system for multi-agent workflow serving on heterogeneous LLM clusters that jointly improves end-to-end latency and task performance. Chimera applies semantic routing to estimate per-model confidence scores for each request, predicts the total remaining output length of the workflow, and estimates per-model congestion using in-flight predicted token volumes for load balancing. We evaluate Chimera on representative agentic workflows for code generation and math reasoning using multiple heterogeneous LLM configurations. Across comparable settings, Chimera traces the best latency-performance frontier, reducing end-to-end latency by 1.2-2.4× and improving task performance by 8.0-9.5 percentage points on average over competitive baselines including vLLM.
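The three signals named in the abstract (per-model confidence, predicted remaining output length, and in-flight token volume) can be combined into a toy routing rule. This is a hedged sketch, not Chimera's scheduler: the linear score, the `latency_weight` constant, and the throughput table are assumptions made for illustration.

```python
def route(confidence, inflight_tokens, tokens_per_sec, predicted_len,
          latency_weight=0.01):
    """Pick the model with the best confidence after subtracting a latency penalty."""
    best_model, best_score = None, float("-inf")
    for model in confidence:
        # Estimated wait: tokens already queued plus this workflow's predicted output.
        est_latency = (inflight_tokens[model] + predicted_len) / tokens_per_sec[model]
        score = confidence[model] - latency_weight * est_latency
        if score > best_score:
            best_model, best_score = model, score
    return best_model

confidence = {"small": 0.7, "large": 0.9}      # hypothetical per-request scores
throughput = {"small": 200.0, "large": 50.0}   # hypothetical tokens per second

idle_choice = route(confidence, {"small": 0, "large": 0}, throughput, 500)
busy_choice = route(confidence, {"small": 0, "large": 5000}, throughput, 500)
```

When both models are idle the more capable one wins, and once its queue grows the request falls back to the smaller model, which is the kind of joint latency and accuracy trade-off the paper targets.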
cs.CV Haixi Zhang, Aiyinsi Zuo, Zirui Li et al. · Mar 22, 2026

This paper presents LRHPerception, a unified monocular perception package that addresses the computational burden of multi-camera autonomous driving pipelines by integrating object tracking, trajectory prediction, road segmentation, and depth estimation into a single real-time system processing at 29 FPS on one GPU. The core innovation lies in sharing a Swin Transformer backbone across modules while introducing task-specific optimizations like C-BYTE tracking with camera-motion compensation and a coarse-to-fine depth estimator. This matters because it offers an interpretable middle ground between black-box end-to-end driving and expensive bird's-eye-view mapping systems.

Amidst the rapid advancement of camera-based autonomous driving technology, effectiveness is often prioritized with limited attention to computational efficiency. To address this issue, this paper introduces LRHPerception, a real-time monocular perception package for autonomous driving that uses single-view camera video to interpret the surrounding environment. The proposed system combines the computational efficiency of end-to-end learning with the rich representational detail of local mapping methodologies. With significant improvements in object tracking and prediction, road segmentation, and depth estimation integrated into a unified framework, LRHPerception processes monocular image data into a five-channel tensor consisting of RGB, road segmentation, and pixel-level depth estimation, augmented with object detection and trajectory prediction. Experimental results demonstrate strong performance, achieving real-time processing at 29 FPS on a single GPU, representing a 555% speedup over the fastest mapping-based approach.
cs.CV Kaiqiang Li, Gang Li, Mingle Zhou et al. · Mar 23, 2026

Zero-shot 3D anomaly detection enables industrial inspection without target-category training data, but existing methods discard geometric details by projecting point clouds to 2D images. This paper proposes BTP (Back To Point), the first framework to apply pre-trained Point-Language Models directly on 3D point clouds. By aligning multi-granularity patch features with text embeddings and incorporating geometric descriptors, BTP achieves fine-grained anomaly localization while avoiding view-dependent projection artifacts.

Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection. Code will be available at https://github.com/wistful-8029/BTP-3DAD.
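Aligning patch features with text embeddings typically reduces to comparing each patch against a "normal" and an "anomalous" prompt embedding. The sketch below shows that generic zero-shot scoring pattern; it is an assumption about the general recipe, not BTP's code, and the temperature value and function names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def anomaly_score(patch_feat, normal_text, anomalous_text, temperature=0.07):
    """Two-way softmax over text similarities; returns the 'anomalous' probability."""
    s_n = cosine(patch_feat, normal_text) / temperature
    s_a = cosine(patch_feat, anomalous_text) / temperature
    m = max(s_n, s_a)  # subtract the max for numerical stability
    e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
    return e_a / (e_n + e_a)

# Toy 2D embeddings standing in for point-patch and text features.
normal_text, anomalous_text = [1.0, 0.0], [0.0, 1.0]
defect_patch = anomaly_score([0.1, 0.9], normal_text, anomalous_text)
clean_patch = anomaly_score([0.9, 0.1], normal_text, anomalous_text)
```

Scoring every patch this way gives a per-patch anomaly map directly on the point cloud, with no rendering to 2D views.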
cs.CV Zelin Liu, Xiangfu Yu, Jie Huang et al. · Mar 23, 2026

Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors with 15–25% metastatic risk and poor survival. Manual GAPP scoring for metastatic risk is labor-intensive and subjective, while critical genotype information (e.g., SDHB mutations conferring 35–75% metastatic risk) is often missed in clinical practice. This paper introduces PPGL-Swarm, an agentic diagnostic system that decomposes diagnosis into specialized WSI, gene, and table agents coordinated via reinforcement learning to automate GAPP scoring, predict hereditary mutations (SDHB/VHL/RET) from histology alone, and generate auditable multimodal reports grounded in a structured knowledge graph.

Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors, of which 15-25% develop metastatic disease with 5-year survival rates reported as low as 34%. PPGL may indicate hereditary syndromes requiring stricter, syndrome-specific treatment and surveillance, but clinicians often fail to recognize these associations in routine care. Clinical practice uses the GAPP score for PPGL grading, but several limitations remain for PPGL diagnosis: (1) GAPP scoring places a heavy workload on clinicians because it requires the manual evaluation of six independent components; (2) key components such as cellularity and Ki-67 are often evaluated with subjective criteria; (3) several clinically relevant metastatic risk factors are not captured by GAPP, such as SDHB mutations, which have been associated with reported metastatic rates of 35-75%. Agent-driven diagnostic systems appear promising, but most lack traceable reasoning for decision-making and do not incorporate domain-specific knowledge such as PPGL genotype information. To address these limitations, we present PPGL-Swarm, an agentic PPGL diagnostic system that generates a comprehensive report, including automated GAPP scoring (with quantified cellularity and Ki-67), genotype risk alerts, and a multimodal report with integrated evidence. The system provides an auditable reasoning trail by decomposing diagnosis into micro-tasks, each assigned to a specialized agent. The gene and table agents use knowledge enhancement to better interpret genotype and laboratory findings, and during training we use reinforcement learning to refine tool selection and task assignment.
cs.CV Sopitta Thurachen, Josef Taher, Matti Lehtomäki et al. · Mar 23, 2026

Accurate riverine land cover mapping is essential for river management but challenging due to water penetration issues in 2D imagery and complex 3D structure. This paper applies Point Transformer v2 (PTv2)—using grouped vector attention and partition-based pooling—to multispectral LiDAR point clouds (1550 nm, 905 nm, 532 nm) for semantic segmentation of six land cover classes in Finnish river environments. The authors demonstrate that spectral features (particularly intensity and reflectance) combined with geometric data achieve $0.950$ mean IoU, and propose multi-dataset training with sparse annotations to improve cross-site generalization despite severe class imbalance.

Accurate land cover mapping in riverine environments is essential for effective river management, ecological understanding, and geomorphic change monitoring. This study explores the use of Point Transformer v2 (PTv2), an advanced deep neural network architecture designed for point cloud data, for land cover mapping through semantic segmentation of multispectral LiDAR data in real-world riverine environments. We utilize the geometric and spectral information from the 3-channel LiDAR point cloud to map land cover classes, including sand, gravel, low vegetation, high vegetation, forest floor, and water. The PTv2 model was trained and evaluated on point cloud data from the Oulanka river in northern Finland using both geometry and spectral features. To improve the model's generalization in new riverine environments, we additionally investigate multi-dataset training that adds sparsely annotated data from an additional river dataset. The full-feature configuration achieved a mean Intersection over Union (mIoU) of 0.950, significantly outperforming the geometry baseline. Ablation studies revealed that intensity and reflectance features were key to accurate land cover mapping. The multi-dataset training experiment showed improved generalization performance, suggesting potential for developing more robust models despite limited high-quality annotated data. Our work demonstrates the potential of applying transformer-based architectures to multispectral point clouds in riverine environments. The approach offers new capabilities for monitoring sediment transport and other river management applications.
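For readers unfamiliar with the headline metric, mean Intersection over Union averages the per-class overlap between predicted and true labels. The sketch below is a generic implementation, not the authors' evaluation code; skipping classes that appear in neither the labels nor the predictions is one common convention and is an assumption here.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Average per-class IoU over classes present in the labels or the predictions."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Six points, three classes; one point of class 1 is mislabeled as class 2.
labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 0, 1, 2, 2, 2]
score = mean_iou(labels, predictions, num_classes=3)
```

Here the per-class IoUs are 1, 1/2, and 2/3, so the mean is 13/18. Because every class counts equally, a rare class that is segmented poorly pulls the score down as much as a common one, which matters under the class imbalance the paper mentions.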
cs.CL cs.DB Lars Vogt · Mar 23, 2026

The paper tackles the 'semantic parsing burden'—the effort required to translate natural language into structured RDF/OWL representations for knowledge graphs. It proposes the Semantic Ladder, a five-level framework ($L_1$ to $L_5$) enabling progressive formalization from raw text snippets to higher-order logic. By introducing Rosetta Statements as semantic anchors and emphasizing modular semantic units, the work aims to lower barriers to knowledge graph construction while maintaining semantic continuity.

Semantic data and knowledge infrastructures must reconcile two fundamentally different forms of representation: natural language, in which most knowledge is created and communicated, and formal semantic models, which enable machine-actionable integration, interoperability, and reasoning. Bridging this gap remains a central challenge, particularly when full semantic formalization is required at the point of data entry. Here, we introduce the Semantic Ladder, an architectural framework that enables the progressive formalization of data and knowledge. Building on the concept of modular semantic units as identifiable carriers of meaning, the framework organizes representations across levels of increasing semantic explicitness, ranging from natural language text snippets to ontology-based and higher-order logical models. Transformations between levels support semantic enrichment, statement structuring, and logical modelling while preserving semantic continuity and traceability. This approach enables the incremental construction of semantic knowledge spaces, reduces the semantic parsing burden, and supports the integration of heterogeneous representations, including natural language, structured semantic models, and vector-based embeddings. The Semantic Ladder thereby provides a foundation for scalable, interoperable, and AI-ready data and knowledge infrastructures.
cs.CV Faisal Ahmed · Mar 22, 2026

This paper proposes applying Vision Transformers with colormap-based pseudo-color enhancement to brain tumor classification on the BRISC2025 MRI dataset. The core idea wraps a standard ViT-Base model with a Jet colormap preprocessing step to boost contrast, claiming 98.90% accuracy on four-class tumor classification. While the technique is sound in principle, serious copy-paste errors indicate the manuscript was likely templated from the author's prior Alzheimer's work without adequate revision.

Accurate classification of brain tumors from magnetic resonance imaging (MRI) plays a critical role in early diagnosis and effective treatment planning. In this study, we propose a deep learning framework based on Vision Transformers (ViT) enhanced with colormap-based feature representation to improve multi-class brain tumor classification performance. The proposed approach leverages the ability of transformer architectures to capture long-range dependencies while incorporating color mapping techniques to emphasize important structural and intensity variations within MRI scans. Experiments are conducted on the BRISC2025 dataset, which includes four classes: glioma, meningioma, pituitary tumor, and non-tumor cases. The model is trained and evaluated using standard performance metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The proposed method achieves a classification accuracy of 98.90%, outperforming baseline convolutional neural network models including ResNet50, ResNet101, and EfficientNetB2. In addition, the model demonstrates strong generalization capability with an AUC of 99.97%, indicating high discriminative performance across all classes. These results highlight the effectiveness of combining Vision Transformers with colormap-based feature enhancement for accurate and robust brain tumor classification and suggest strong potential for clinical decision support applications.
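The pseudo-color step can be illustrated with a piecewise-linear approximation of a Jet-style colormap. This is a sketch under assumptions: the formula below is a well-known approximation rather than the exact lookup table of any plotting library, and the paper's actual preprocessing pipeline is not specified in the abstract.

```python
def jet(x):
    """Approximate Jet colormap: map intensity x in [0, 1] to (r, g, b) in [0, 1]."""
    def clamp(v):
        return max(0.0, min(1.0, v))
    return (clamp(1.5 - abs(4.0 * x - 3.0)),
            clamp(1.5 - abs(4.0 * x - 2.0)),
            clamp(1.5 - abs(4.0 * x - 1.0)))

def pseudo_color(gray_image):
    """Convert a 2D list of 8-bit gray values into rows of RGB triples."""
    return [[jet(v / 255.0) for v in row] for row in gray_image]

# A tiny 1x3 "scan": dark, mid-gray, and bright pixels.
colored = pseudo_color([[0, 128, 255]])
```

Dark pixels land in blue, mid intensities in green, and bright pixels in red, so intensity differences that are subtle in grayscale become hue differences across the three input channels a pretrained ViT expects.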
cs.CV Wenhan Wu, Zhishuai Guo, Chen Chen et al. · Mar 22, 2026

Stochastic human motion prediction often suffers from high-frequency jitter and physically implausible poses. This paper proposes KHMP, a framework that combines training-time physical constraints (temporal smoothness and joint angle limits) with a novel inference-time refinement: an adaptive Kalman filter operating in the DCT frequency domain. The key innovation treats high-frequency DCT coefficients as a frequency-indexed noisy signal, recursively filtering them with parameters dynamically adjusted based on estimated Signal-to-Noise Ratio (SNR).

Stochastic human motion prediction aims to generate diverse, plausible futures from observed sequences. Despite advances in generative modeling, existing methods often produce predictions corrupted by high-frequency jitter and temporal discontinuities. To address these challenges, we introduce KHMP, a novel framework featuring an adaptive Kalman filter applied in the DCT domain to generate high-fidelity human motion predictions. By treating high-frequency DCT coefficients as a frequency-indexed noisy signal, the Kalman filter recursively suppresses noise while preserving motion details. Notably, its noise parameters are dynamically adjusted based on estimated Signal-to-Noise Ratio (SNR), enabling aggressive denoising for jittery predictions and conservative filtering for clean motions. This refinement is complemented by training-time physical constraints (temporal smoothness and joint angle limits) that encode biomechanical principles into the generative model. Together, these innovations establish a new paradigm integrating adaptive signal processing with physics-informed learning. Experiments on the Human3.6M and HumanEva-I datasets demonstrate that KHMP achieves state-of-the-art accuracy, effectively mitigating jitter artifacts to produce smooth and physically plausible motions.
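The SNR-adaptive filtering idea can be sketched with a scalar Kalman filter that walks along the high-frequency coefficients. This is an illustrative toy, not KHMP's filter: the random-walk state model, the energy-ratio SNR proxy, the zero-mean prior, and the constants `q` and `base_r` are all assumptions, and the input is taken to be DCT coefficients computed elsewhere.

```python
def estimate_snr(coeffs, split):
    """Crude SNR proxy: low-frequency energy divided by high-frequency energy."""
    low = sum(c * c for c in coeffs[:split])
    high = sum(c * c for c in coeffs[split:])
    return low / (high + 1e-12)

def kalman_smooth_high_freq(coeffs, split, q=1e-3, base_r=1e-2):
    """Filter coefficients from index `split` on; lower SNR means stronger smoothing."""
    snr = max(estimate_snr(coeffs, split), 1e-12)
    r = base_r / snr            # measurement noise grows as the SNR drops
    out = list(coeffs[:split])  # low frequencies pass through unchanged
    x, p = 0.0, 1.0             # zero-mean prior with unit variance
    for z in coeffs[split:]:
        p = p + q               # predict step of a random-walk model
        k = p / (p + r)         # Kalman gain
        x = x + k * (z - x)     # correct with the observed coefficient
        p = (1.0 - k) * p
        out.append(x)
    return out

def high_energy(coeffs, split):
    """Energy carried by the coefficients from index `split` on."""
    return sum(c * c for c in coeffs[split:])

clean = kalman_smooth_high_freq([10.0, 5.0, 0.5, -0.5, 0.5, -0.5], split=2)
jittery = kalman_smooth_high_freq([1.0, 0.5, 0.5, -0.5, 0.5, -0.5], split=2)
```

Both inputs carry the same high-frequency content, but the second has far less low-frequency energy, so its estimated SNR is lower and the filter damps it more aggressively. That matches the behavior the abstract describes: conservative filtering for clean motion, stronger denoising for jittery predictions.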
cs.CV cs.LG Shenghan Zhang, Run Ling, Ke Cao et al. · Mar 23, 2026

This paper addresses federated learning for cross-view video understanding, where heterogeneous camera viewpoints create highly non-IID client distributions that impede generalization to unseen views. FedCVU proposes three complementary modules: VS-Norm preserves client-specific normalization statistics to handle view-dependent feature shifts; CV-Align introduces lightweight prototype-based contrastive learning to align representations across cameras; and SLA employs selective layer aggregation to reduce communication overhead by 40–45%. The work targets an important practical scenario—privacy-preserving multi-camera surveillance where centralizing raw footage is infeasible.

Federated learning (FL) has emerged as a promising paradigm for privacy-preserving multi-camera video understanding. However, applying FL to cross-view scenarios faces three major challenges: (i) heterogeneous viewpoints and backgrounds lead to highly non-IID client distributions and overfitting to view-specific patterns, (ii) local distribution biases cause misaligned representations that hinder consistent cross-view semantics, and (iii) large video architectures incur prohibitive communication overhead. To address these issues, we propose FedCVU, a federated framework with three components: VS-Norm, which preserves normalization parameters to handle view-specific statistics; CV-Align, a lightweight contrastive regularization module to improve cross-view representation alignment; and SLA, a selective layer aggregation strategy that reduces communication without sacrificing accuracy. Extensive experiments on action understanding and person re-identification tasks under a cross-view protocol demonstrate that FedCVU consistently boosts unseen-view accuracy while maintaining strong seen-view performance, outperforming state-of-the-art FL baselines and showing robustness to domain heterogeneity and communication constraints.
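Two of the three components, keeping normalization parameters local and aggregating only selected layers, can be sketched as a filtered federated average. This is a simplified illustration, not FedCVU's implementation: the parameter naming scheme, the prefix-based selection rule, and the unweighted mean are assumptions.

```python
def aggregate(client_states, shared_prefixes):
    """Average only parameters under a shared prefix, skipping normalization layers."""
    def is_shared(name):
        return name.startswith(tuple(shared_prefixes)) and "norm" not in name

    n_clients = len(client_states)
    shared_names = [name for name in client_states[0] if is_shared(name)]
    return {
        name: [sum(vals) / n_clients
               for vals in zip(*(state[name] for state in client_states))]
        for name in shared_names
    }

def apply_global(local_state, global_update):
    """Overwrite shared parameters; everything else stays client-specific."""
    merged = dict(local_state)
    merged.update(global_update)
    return merged

# Hypothetical parameter names and values for two camera clients.
client_a = {"backbone.conv.weight": [1.0, 3.0],
            "backbone.norm.weight": [0.5, 0.5],
            "head.weight": [9.0]}
client_b = {"backbone.conv.weight": [3.0, 5.0],
            "backbone.norm.weight": [1.5, 1.5],
            "head.weight": [7.0]}

update = aggregate([client_a, client_b], shared_prefixes=["backbone."])
new_a = apply_global(client_a, update)
```

Only the shared convolution weights travel to the server, which is where the communication savings come from, while each client keeps the normalization statistics that reflect its own viewpoint.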
0
cs.CV Valentin Wagner, Sebastian Bullinger, Michael Arens et al. · Mar 23, 2026

SatGeo-NeRF addresses wave-like geometric artifacts in satellite neural radiance fields caused by overfitting to multi-temporal imagery with varying lighting and transient objects. The paper proposes three model-agnostic regularizers—gravity-aligned planarity, coarse-to-fine granularity masking, and depth supervision—to stabilize geometry learning. Experiments on the DFC2019 benchmark report 14% lower mean altitude error relative to prior work, though this comparison relies on a reimplemented baseline that underperforms the original reported scores.

We present SatGeo-NeRF, a geometrically regularized NeRF for satellite imagery that mitigates overfitting-induced geometric artifacts observed in current state-of-the-art models using three model-agnostic regularizers. Gravity-Aligned Planarity Regularization aligns depth-inferred, approximated surface normals with the gravity axis to promote local planarity, coupling adjacent rays via a corresponding surface approximation to facilitate cross-ray gradient flow. Granularity Regularization enforces a coarse-to-fine geometry-learning scheme, and Depth-Supervised Regularization stabilizes early training for improved geometric accuracy. On the DFC2019 satellite reconstruction benchmark, SatGeo-NeRF improves the Mean Altitude Error by 13.9% and 11.7% relative to state-of-the-art baselines such as EO-NeRF and EO-GS.
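
One plausible form of a gravity-aligned planarity penalty is sketched below: three adjacent rays are turned into surface points using their predicted depths, a normal is taken from the resulting triangle, and the penalty grows as that normal tilts away from the gravity axis. The loss form `1 - |n · g|`, the three-ray neighborhood, and all function names are assumptions for illustration; the paper's actual surface approximation may differ.

```python
import math

def point_on_ray(origin, direction, depth):
    """Surface point implied by a ray and its predicted depth."""
    return [origin[i] + depth * direction[i] for i in range(3)]

def surface_normal(p0, p1, p2):
    """Unit normal of the triangle (p0, p1, p2), or None if degenerate."""
    u = [p1[i] - p0[i] for i in range(3)]
    v = [p2[i] - p0[i] for i in range(3)]
    n = [u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0]]
    length = math.sqrt(sum(c * c for c in n))
    if length == 0.0:
        return None
    return [c / length for c in n]

def planarity_penalty(p0, p1, p2, gravity=(0.0, 0.0, 1.0)):
    """0 when the local surface is horizontal, approaching 1 when vertical."""
    normal = surface_normal(p0, p1, p2)
    if normal is None:
        return 0.0  # collinear points carry no orientation information
    cosine = abs(sum(normal[i] * gravity[i] for i in range(3)))
    return 1.0 - cosine

# Usage: three neighboring nadir rays cast from 100 units above the ground.
origins = [(0.0, 0.0, 100.0), (1.0, 0.0, 100.0), (0.0, 1.0, 100.0)]
down = (0.0, 0.0, -1.0)
flat = [point_on_ray(o, down, d) for o, d in zip(origins, [90.0, 90.0, 90.0])]
wavy = [point_on_ray(o, down, d) for o, d in zip(origins, [90.0, 89.0, 90.0])]
flat_penalty = planarity_penalty(*flat)
wavy_penalty = planarity_penalty(*wavy)
```

Because the normal depends on the depths of all three rays at once, a gradient on this penalty reaches each neighboring ray, which matches the cross-ray coupling the abstract describes.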
0
cs.CL Ajan Subramanian, Sumukh Bettadapura, Rohan Sathish · Mar 23, 2026

As consumer-grade EEG headphones enter the market, a critical question emerges: can language models adapt to your specific neural signature? This paper demonstrates that frozen LLMs already contain person-specific linear directions in their activation spaces that predict individual brain activity during reading, achieving a ninefold improvement over population averages. The findings suggest that deep neural networks encode stable, individual cognitive fingerprints that could enable future brain-computer interfaces to personalize AI to the user wearing the headset.

Consumer-grade EEG is entering everyday devices, from earbuds to headbands, raising the question of whether language models can be adapted to individual neural responses. We test this by asking whether frozen LLM representations encode person-specific EEG signals, directions in activation space that predict one person's brain activity but not another's. Using word-level EEG from 30 participants reading naturalistic sentences (ZuCo corpus), we train a separate linear probe for each person, mapping hidden states from a frozen Qwen 2.5 7B to that individual's EEG power. Person-specific probes outperform a single population probe on every EEG feature tested; for high-gamma power, the person-specific probe achieves rho = 0.183, a ninefold improvement over the population probe (rho = 0.020, p < 10^-4). A negative control, fixation count, shows no person-specific advantage (p = 0.360); fixation count reflects word length and frequency rather than individual cognition. The individual directions are temporally stable (split-half cosine = 0.824), non-transferable across people (self rho = 0.369 vs. other rho = 0.143, p < 10^-19), and distinct from the shared population signal: person-specific probes retain predictive power after the population component is removed. The person-specific signal concentrates in the model's deep layers, rising consistently with depth and peaking at Layer 24 of 28. The results are consistent across architectures (LLaMA 3.1 8B) and survive word-level confound controls. Frozen language models contain stable, person-specific neural directions in their deep layers, providing a geometric foundation for EEG-driven personalization.
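
The self-versus-other comparison can be illustrated with a small sketch: apply one person's linear probe to the hidden states, then score the predictions against that person's EEG and against someone else's using Spearman's rho. The toy hidden states, probe weights, and EEG values below are invented for illustration and carry no relation to the ZuCo data; only the evaluation pattern follows the abstract. For reference, the reported ninefold figure is the ratio 0.183 / 0.020 ≈ 9.2.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def probe_predict(weights, bias, hidden_states):
    """Linear probe: one scalar EEG prediction per word-level hidden state."""
    return [bias + sum(w * h for w, h in zip(weights, state))
            for state in hidden_states]

# Toy data (hypothetical): four words, two-dimensional hidden states.
hidden = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 2.0]]
probe_a = ([1.0, 0.0], 0.0)              # direction fitted for person A
eeg_a = [0.10, 0.20, 0.35, 0.50]         # person A's EEG power per word
eeg_b = [0.30, 0.10, 0.40, 0.20]         # person B's EEG power per word

predictions = probe_predict(probe_a[0], probe_a[1], hidden)
self_rho = spearman_rho(predictions, eeg_a)
other_rho = spearman_rho(predictions, eeg_b)
```

A probe that encodes a person-specific direction should score clearly higher on its own participant than on others, which is the pattern the abstract reports (self rho = 0.369 vs. other rho = 0.143).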
0
cs.LG quant-ph Oscar Novo, Oscar Bastidas-Jossa, Alberto Calvo et al. · Mar 23, 2026

This paper investigates whether domain knowledge for quantum code generation should be embedded in model parameters through fine-tuning or provided at inference time via retrieval and agents. Comparing a parameter-specialized Granite-20B baseline against modern general-purpose LLMs (OpenAI, Claude, Gemini) on the Qiskit-HumanEval benchmark, the authors find that inference-time augmentation—particularly agentic execution feedback—outperforms fine-tuning by over 35 percentage points, offering a more maintainable path as quantum SDKs evolve.

Recent advances in large language models (LLMs) have enabled the automation of an increasing number of programming tasks, including code generation for scientific and engineering domains. In rapidly evolving software ecosystems such as quantum software development, where frameworks expose complex abstractions, a central question is how best to incorporate domain knowledge into LLM-based assistants while preserving maintainability as libraries evolve. In this work, we study specialization strategies for Qiskit code generation using the Qiskit-HumanEval benchmark. We compare a parameter-specialized fine-tuned baseline introduced in prior work against a range of recent general-purpose LLMs enhanced with retrieval-augmented generation (RAG) and agent-based inference with execution feedback. Our results show that modern general-purpose LLMs consistently outperform the parameter-specialized baseline. While the fine-tuned model achieves approximately 47% pass@1 on Qiskit-HumanEval, recent general-purpose models reach 60-65% under zero-shot and retrieval-augmented settings, and up to 85% for the strongest evaluated model when combined with iterative execution-feedback agents, representing an improvement of more than 20 percentage points over zero-shot general-purpose performance and more than 35 percentage points over the parameter-specialized baseline. Agentic execution feedback yields the most consistent improvements, albeit at increased runtime cost, while RAG provides modest and model-dependent gains. These findings indicate that performance gains can be achieved without domain-specific fine-tuning, instead relying on inference-time augmentation, thereby enabling a more flexible and maintainable approach to LLM-assisted quantum software development.
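
An execution-feedback loop of the kind described can be sketched in a few lines: generate a candidate, run it against the task's tests, and feed any error message back into the next attempt. This is a generic illustration, not the paper's agent: `stub_generate` stands in for a real LLM call, the task is a plain `add` function rather than Qiskit code, and `exec` is used without sandboxing, which would be unsafe for untrusted model output.

```python
def run_candidate(source, test_source):
    """Execute candidate code and its tests; return (passed, feedback)."""
    namespace = {}
    try:
        exec(source, namespace)        # define the candidate's functions
        exec(test_source, namespace)   # run the benchmark's assertions
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""

def solve_with_feedback(generate, test_source, max_rounds=3):
    """Retry generation with the last error as feedback until tests pass."""
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        candidate = generate(feedback)
        passed, feedback = run_candidate(candidate, test_source)
        if passed:
            return candidate, round_idx
    return None, max_rounds

def stub_generate(feedback):
    """Hypothetical model: wrong on the first try, fixed once it sees an error."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"

tests = "assert add(2, 3) == 5\n"
zero_shot, _ = solve_with_feedback(stub_generate, tests, max_rounds=1)
solution, rounds_used = solve_with_feedback(stub_generate, tests, max_rounds=3)
```

With a single round the stub fails, which corresponds to a zero-shot miss; allowing feedback rounds lets it recover, at the cost of extra generations and executions per task.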
0
cs.CV Alex Salvatierra, José Antonio Sanz, Christian Gutiérrez et al. · Mar 23, 2026

This paper benchmarks four deep learning architectures (KPConv, RandLA-Net, Superpoint Transformer, Point Transformer V3) for aerial LiDAR semantic segmentation under real operational flight conditions in Navarre, Spain. The study addresses a critical gap in evaluating models on heterogeneous aerial data with severe class imbalance (vehicles at 0.68%, low vegetation at 1.41%), finding that while all models exceed 93% overall accuracy, mean IoU ranges from 71.98% to 78.51% with persistent failures on minority classes.

Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state-of-the-art architectures on a large-scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA-Net trade off segmentation robustness for computational efficiency.
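
The gap between overall accuracy above 93% and mean IoU in the 70s follows from how the two metrics weight classes. The sketch below computes both from a confusion matrix; the three-class matrix is invented to mimic severe imbalance and is not drawn from the Navarre dataset.

```python
def overall_accuracy(confusion):
    """Correctly labeled points divided by all points."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

def per_class_iou(confusion):
    """IoU per class: TP / (TP + FP + FN), rows = truth, columns = prediction."""
    size = len(confusion)
    ious = []
    for c in range(size):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(size)) - tp
        ious.append(tp / (tp + fp + fn))
    return ious

def mean_iou(confusion):
    """Unweighted mean over classes, so rare classes count as much as common ones."""
    ious = per_class_iou(confusion)
    return sum(ious) / len(ious)

# Hypothetical counts: ground, building, vehicle (vehicle is 1% of points).
confusion = [
    [900, 10, 0],   # true ground
    [10, 70, 0],    # true building
    [5, 3, 2],      # true vehicle
]
oa = overall_accuracy(confusion)      # 972 / 1000
ious = per_class_iou(confusion)       # vehicle IoU is 2 / 10
miou = mean_iou(confusion)
```

Here the dominant ground class keeps overall accuracy at 97.2% even though most vehicle points are missed, while the unweighted mean over classes pulls mean IoU down to roughly 64%.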