CornOrb addresses a persistent gap in ophthalmic AI by providing one of the first large-scale, publicly accessible Orbscan 3 corneal topography datasets. The collection comprises 1,454 eyes from 744 Algerian patients and offers four standardized corneal maps per eye (axial curvature, anterior elevation, posterior elevation, and pachymetry) alongside structured clinical parameters including Kmax, astigmatism, and asphericity. By releasing this multimodal resource in standardized PNG and CSV formats, the authors aim to enable robust AI-driven detection of keratoconus using device-specific data from an underrepresented African population.
PGR-Net addresses brain tumor MRI segmentation by tackling the challenge of spatial sparsity—where lesions occupy only ~10.7% of the image volume—through explicit data-driven spatial priors. The framework introduces a hierarchical Top-K ROI selection mechanism and a Windowed Gaussian–Spatial Decay (WinGS-ROI) module to concentrate computational resources on lesion-relevant regions rather than background. This yields competitive Dice scores (89.02–91.82% on Whole Tumor across benchmarks) with only 8.64M parameters, offering a lightweight alternative to contemporary Transformer and Mamba architectures.
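To make the Top-K ROI idea concrete, here is a minimal sketch of selecting the highest-scoring windows from a coarse score map. It is not the PGR-Net implementation: the function name `top_k_windows`, the non-overlapping square windows, and the toy score map are all assumptions for illustration, and the WinGS-ROI Gaussian decay is not modelled.

```python
def top_k_windows(score_map, window, k):
    """Return the k windows with the highest mean score.

    score_map: 2D list of floats (H x W), e.g. a coarse lesion-likelihood map.
    window:    side length of the square, non-overlapping windows (assumed).
    k:         number of windows to keep.
    Each result is a tuple (mean_score, row_start, col_start).
    """
    height, width = len(score_map), len(score_map[0])
    scored = []
    for r in range(0, height - window + 1, window):
        for c in range(0, width - window + 1, window):
            total = sum(
                score_map[i][j]
                for i in range(r, r + window)
                for j in range(c, c + window)
            )
            scored.append((total / (window * window), r, c))
    # Highest mean score first; ties are broken by position for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return scored[:k]


# Hypothetical 4x4 map with one bright 2x2 region in the lower-right corner.
toy_map = [
    [0.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.9, 0.8],
    [0.1, 0.0, 0.7, 1.0],
]
selected = top_k_windows(toy_map, window=2, k=1)
```

With sparse lesions, only the selected windows would be passed to the more expensive processing stages, which is where the compute saving would come from.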
This paper tackles robotic optical coherence tomography (OCT) scanning of curved tissue surfaces, addressing the limitation that existing approaches restrict motion to pure translations to avoid challenging hand-eye calibration. The core contribution is a custom ChArUco calibration pattern enabling full six-degree-of-freedom hand-eye calibration, allowing the OCT probe to rotate and follow curved surfaces. This matters because pure translational scanning accumulates registration errors on curved geometries, whereas full 6D motion enables accurate, large-area surface reconstruction.
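For context, hand-eye calibration is commonly posed as follows; this is the standard textbook formulation, not necessarily the exact one used in the paper. With $A_i$ the relative motion of the robot flange between two poses, $B_i$ the corresponding relative motion of the sensor estimated from the calibration pattern, and $X$ the unknown flange-to-sensor transform:

$$
A_i X = X B_i
\quad\Longleftrightarrow\quad
R_{A_i} R_X = R_X R_{B_i},
\qquad
R_{A_i}\, t_X + t_{A_i} = R_X\, t_{B_i} + t_X .
$$

If every motion is a pure translation, then $R_{A_i} = R_{B_i} = I$ and the second equation reduces to $t_{A_i} = R_X\, t_{B_i}$, so $t_X$ cancels and cannot be recovered. Rotational motions are therefore needed to identify the full six-degree-of-freedom transform.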
The paper addresses the novel challenge of aligning independent 3D Gaussian Splatting models across different object instances within the same category—a task beyond existing same-object registration methods. The core innovation is a two-stage pipeline: first, a coarse alignment using a feature-guided iterative absolute orientation solver that handles extreme initializations (180° rotations, 10× scale differences); second, a fine alignment that enforces multi-view feature consistency via an inverse-radiance-field formulation generalized to the similarity group $\text{Sim}(3)$. This enables the first viable category-level 3DGS registration, unlocking applications like geometrically-consistent object replacement.
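As a reminder of the notation, an element of $\text{Sim}(3)$ acts on a point $x \in \mathbb{R}^3$ through a scale $s > 0$, a rotation $R \in SO(3)$, and a translation $t \in \mathbb{R}^3$. The classical absolute orientation problem, given correspondences $(x_i, y_i)$, is the least-squares fit below; how the paper weights or selects correspondences using features is not specified here, so this is only the generic form:

$$
x \mapsto s R x + t,
\qquad
(s^\ast, R^\ast, t^\ast) = \arg\min_{s,\,R,\,t} \sum_i \bigl\lVert y_i - (s R x_i + t) \bigr\rVert^2 .
$$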
This paper presents LRHPerception, a unified monocular perception package that addresses the computational burden of multi-camera autonomous driving pipelines by integrating object tracking, trajectory prediction, road segmentation, and depth estimation into a single real-time system processing at 29 FPS on one GPU. The core innovation lies in sharing a Swin Transformer backbone across modules while introducing task-specific optimizations like C-BYTE tracking with camera-motion compensation and a coarse-to-fine depth estimator. This matters because it offers an interpretable middle ground between black-box end-to-end driving and expensive bird's-eye-view mapping systems.
Zero-shot 3D anomaly detection enables industrial inspection without target-category training data, but existing methods discard geometric details by projecting point clouds to 2D images. This paper proposes BTP (Back To Point), the first framework to apply pre-trained Point-Language Models directly on 3D point clouds. By aligning multi-granularity patch features with text embeddings and incorporating geometric descriptors, BTP achieves fine-grained anomaly localization while avoiding view-dependent projection artifacts.
Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors with 15–25% metastatic risk and poor survival. Manual GAPP scoring for metastatic risk is labor-intensive and subjective, while critical genotype information (e.g., SDHB mutations conferring 35–75% metastatic risk) is often missed in clinical practice. This paper introduces PPGL-Swarm, an agentic diagnostic system that decomposes diagnosis into specialized WSI, gene, and table agents coordinated via reinforcement learning to automate GAPP scoring, predict hereditary mutations (SDHB/VHL/RET) from histology alone, and generate auditable multimodal reports grounded in a structured knowledge graph.
Accurate riverine land cover mapping is essential for river management but challenging due to water penetration issues in 2D imagery and complex 3D structure. This paper applies Point Transformer v2 (PTv2)—using grouped vector attention and partition-based pooling—to multispectral LiDAR point clouds (1550 nm, 905 nm, 532 nm) for semantic segmentation of six land cover classes in Finnish river environments. The authors demonstrate that spectral features (particularly intensity and reflectance) combined with geometric data achieve $0.950$ mean IoU, and propose multi-dataset training with sparse annotations to improve cross-site generalization despite severe class imbalance.
This paper proposes applying Vision Transformers with colormap-based pseudo-color enhancement to brain tumor classification on the BRISC2025 MRI dataset. The core idea wraps a standard ViT-Base model with a Jet colormap preprocessing step to boost contrast, claiming 98.90% accuracy on four-class tumor classification. While the technique is sound in principle, serious copy-paste errors indicate the manuscript was likely templated from the author's prior Alzheimer's work without adequate revision.
Stochastic human motion prediction often suffers from high-frequency jitter and physically implausible poses. This paper proposes KHMP, a framework that combines training-time physical constraints (temporal smoothness and joint angle limits) with a novel inference-time refinement: an adaptive Kalman filter operating in the DCT frequency domain. The key innovation treats high-frequency DCT coefficients as a frequency-indexed noisy signal, recursively filtering them with parameters dynamically adjusted based on estimated Signal-to-Noise Ratio (SNR).
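The inference-time refinement can be illustrated with a small sketch: transform one joint coordinate's trajectory with a DCT, run a scalar Kalman filter along the frequency index over the high-frequency coefficients, and invert. This is an illustration under stated assumptions, not KHMP itself: the random-walk state model, the zero initial state, the `cutoff`, `q`, and `r_base` parameters, and the rule that scales measurement noise by an energy-ratio SNR estimate are all hypothetical choices.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1D sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(
            signal[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n)
        ))
    return coeffs


def idct(coeffs):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for t in range(n):
        value = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            value += scale * coeffs[k] * math.cos(math.pi * (t + 0.5) * k / n)
        signal.append(value)
    return signal


def kalman_filter_high_freq(coeffs, cutoff, q=1e-3, r_base=1e-2):
    """Filter coefficients with index >= cutoff; keep the low band untouched."""
    low_energy = sum(c * c for c in coeffs[:cutoff])
    high_energy = sum(c * c for c in coeffs[cutoff:])
    snr = low_energy / (high_energy + 1e-12)
    # Assumed adaptation rule: lower SNR -> larger measurement noise,
    # so noisy coefficients are trusted less.
    r = r_base * (1.0 + 1.0 / (snr + 1e-12))
    x, p = 0.0, 1.0  # prior: high-frequency content of smooth motion is near zero
    out = list(coeffs[:cutoff])
    for z in coeffs[cutoff:]:
        p_pred = p + q                 # predict (random-walk state model)
        gain = p_pred / (p_pred + r)   # Kalman gain, always in (0, 1)
        x = x + gain * (z - x)         # update with the observed coefficient
        p = (1.0 - gain) * p_pred
        out.append(x)
    return out


# Hypothetical trajectory: a linear ramp with alternating frame-to-frame jitter.
trajectory = [0.1 * t + 0.05 * (-1) ** t for t in range(16)]
coeffs = dct(trajectory)
filtered = kalman_filter_high_freq(coeffs, cutoff=4)
smoothed = idct(filtered)
```

Because the gain stays strictly between 0 and 1, each filtered coefficient is a convex combination of the previous estimate and the new observation, so the filter can only shrink the peak magnitude of the high-frequency band.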
This paper addresses federated learning for cross-view video understanding, where heterogeneous camera viewpoints create highly non-IID client distributions that impede generalization to unseen views. FedCVU proposes three complementary modules: VS-Norm preserves client-specific normalization statistics to handle view-dependent feature shifts; CV-Align introduces lightweight prototype-based contrastive learning to align representations across cameras; and SLA employs selective layer aggregation to reduce communication overhead by 40–45%. The work targets an important practical scenario—privacy-preserving multi-camera surveillance where centralizing raw footage is infeasible.
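The interplay between client-specific normalization and selective layer aggregation can be sketched as a server-side averaging step that touches only a chosen subset of layers and never the normalization parameters. This is a simplified illustration, not FedCVU's code: the `aggregate` function, the `"norm"` name prefix, plain unweighted averaging, and the parameter names are assumptions.

```python
def aggregate(client_states, shared_layers, local_prefixes=("norm",)):
    """Average only the selected layers across clients.

    client_states:  list of dicts mapping parameter name -> list of floats.
    shared_layers:  names of layers chosen for communication this round.
    local_prefixes: parameters whose names start with these stay on the client.
    Returns a new list of per-client state dicts.
    """
    n_clients = len(client_states)
    averaged = {}
    for name in shared_layers:
        if name.startswith(local_prefixes):
            continue  # normalization statistics never leave the client
        length = len(client_states[0][name])
        averaged[name] = [
            sum(state[name][i] for state in client_states) / n_clients
            for i in range(length)
        ]
    updated = []
    for state in client_states:
        new_state = {name: list(values) for name, values in state.items()}
        for name, values in averaged.items():
            new_state[name] = list(values)  # overwrite with the global average
        updated.append(new_state)
    return updated


# Two hypothetical clients (camera views) with tiny parameter vectors.
clients = [
    {"conv1.weight": [1.0, 3.0], "norm1.running_mean": [0.0], "head.weight": [10.0]},
    {"conv1.weight": [3.0, 5.0], "norm1.running_mean": [4.0], "head.weight": [20.0]},
]
updated = aggregate(clients, shared_layers=["conv1.weight", "norm1.running_mean"])
```

Here only `conv1.weight` is synchronized. The normalization statistics remain view-specific, and `head.weight` is not transmitted because it was not selected, which is the kind of omission that reduces communication.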
SatGeo-NeRF addresses wave-like geometric artifacts in satellite neural radiance fields caused by overfitting to multi-temporal imagery with varying lighting and transient objects. The paper proposes three model-agnostic regularizers—gravity-aligned planarity, coarse-to-fine granularity masking, and depth supervision—to stabilize geometry learning. Experiments on the DFC2019 benchmark report 14% lower mean altitude error relative to prior work, though this comparison relies on a reimplemented baseline that underperforms the original reported scores.
This paper benchmarks four deep learning architectures (KPConv, RandLA-Net, Superpoint Transformer, Point Transformer V3) for aerial LiDAR semantic segmentation under real operational flight conditions in Navarre, Spain. The study addresses a critical gap in evaluating models on heterogeneous aerial data with severe class imbalance (vehicles at 0.68%, low vegetation at 1.41%), finding that while all models exceed 93% overall accuracy, mean IoU ranges from 71.98% to 78.51% with persistent failures on minority classes.
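The gap between high overall accuracy and much lower mean IoU follows directly from how the two metrics are computed under class imbalance. The sketch below uses a made-up two-class confusion matrix, not figures from the paper, to show the effect.

```python
def overall_accuracy(confusion):
    """Fraction of all points that were labelled correctly."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def mean_iou(confusion):
    """Unweighted mean of per-class intersection over union."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # missed points of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp     # points wrongly given class c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both truth and prediction
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Rows are ground truth, columns are predictions (hypothetical counts).
# Class 0 is a dominant class; class 1 is a rare class such as vehicles.
confusion = [
    [980, 0],
    [15, 5],
]
oa = overall_accuracy(confusion)
miou = mean_iou(confusion)
```

In this toy case the rare class has an IoU of 0.25, yet overall accuracy is 98.5% because the rare class contributes so few points. Mean IoU weights every class equally and therefore exposes the failure.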
4DGS360 addresses the ill-posed challenge of reconstructing dynamic objects from monocular video by tackling a critical failure mode: existing methods rely on 2D-native priors that overfit to visible surfaces and cannot reconstruct occluded regions at extreme viewpoints (>90°). The authors propose AnchorTAP3D, a hybrid 3D tracker that leverages high-confidence 2D track points as spatial-temporal anchors to stabilize long-term tracking and resolve depth ambiguity in occluded areas. Combined with a new iPhone360 benchmark featuring test cameras up to 135° from training views, the method enables coherent 360° 4D reconstruction without diffusion priors.
Omni-WorldBench addresses the gap between passive video generation metrics and active world model evaluation by focusing on interactive response—how actions causally drive state transitions across space and time. It introduces Omni-WorldSuite, a 1,068-prompt hierarchical taxonomy spanning three interaction levels (single-object to global environmental effects), and Omni-Metrics, an agent-based evaluation protocol that aggregates Interaction Effect Fidelity, Generated Video Quality, and Camera-Object Controllability into an adaptive AgenticScore.
AdaEdit tackles the injection dilemma in flow-based image editing, where source feature injection preserves backgrounds but suppresses novel content generation. The authors propose two training-free adaptations: a Progressive Injection Schedule that uses continuous decay functions (sigmoid, cosine, linear) instead of binary cutoffs, and Channel-Selective Latent Perturbation, which applies per-channel AdaIN based on distributional gaps between inverted and random latents. Experiments on PIE-Bench show that AdaEdit reduces background LPIPS by 8.7% relative to ProEdit while maintaining competitive CLIP scores.
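The difference between a binary cutoff and a continuous decay can be sketched as a function that maps the sampling step to an injection weight. The exact parameterization in AdaEdit is not given here, so the `injection_weight` function, the midpoint of 0.5, and the `steepness` value are assumptions.

```python
import math


def injection_weight(step, total_steps, kind="cosine", steepness=10.0):
    """Source-feature injection weight in [0, 1] for a given sampling step."""
    progress = step / max(total_steps - 1, 1)  # 0.0 at the first step, 1.0 at the last
    if kind == "linear":
        return 1.0 - progress
    if kind == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    if kind == "sigmoid":
        return 1.0 / (1.0 + math.exp(steepness * (progress - 0.5)))
    if kind == "binary":
        # Hard cutoff baseline: full injection, then none.
        return 1.0 if progress < 0.5 else 0.0
    raise ValueError(f"unknown schedule kind: {kind}")


total = 11
schedules = {
    kind: [injection_weight(s, total, kind) for s in range(total)]
    for kind in ("linear", "cosine", "sigmoid", "binary")
}
```

One plausible way to consume the weight `w` is a blend such as `w * source_feature + (1 - w) * target_feature`, so that early steps anchor the background and later steps leave room for new content. That blending rule is likewise an assumption.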
This paper presents PhotoBeamSolver, a hybrid system that converts hand-drawn beam diagrams into analytical structural solutions by combining computer vision with large language models. The core idea uses a custom-trained YOLO-based detector to identify supports and loads from images, feeding a symbolic solver that computes shear, moment, and deflection diagrams. While targeted at academic and quick professional verification tasks, the work highlights the challenges of integrating deep learning into safety-critical structural engineering workflows.
This paper challenges the long-held assumption that infrared and visible image fusion (IVIF) requires strictly paired training data. The authors propose UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP), demonstrating that pixel-level self-supervision enables training on unaligned cross-modal combinations. By reformulating the maximum likelihood objective to treat infrared and visible images as independent variables, they show that a base dataset of $N$ pairs can be expanded to $N^2$ trainable combinations, potentially reducing collection costs while improving generalization.
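The $N \to N^2$ expansion is simply the Cartesian product of the infrared and visible image sets. A minimal sketch, with hypothetical file names and a hypothetical `arbitrary_pairs` helper:

```python
from itertools import product


def arbitrary_pairs(infrared, visible):
    """Every infrared image combined with every visible image."""
    return list(product(infrared, visible))


# Hypothetical base dataset of N = 3 aligned pairs, matched by index.
infrared = ["ir_000.png", "ir_001.png", "ir_002.png"]
visible = ["vis_000.png", "vis_001.png", "vis_002.png"]

combos = arbitrary_pairs(infrared, visible)
# The original aligned pairs are the N combinations that share an index.
aligned = [pair for pair in combos if pair[0][3:] == pair[1][4:]]
```

Of the $N^2 = 9$ combinations, only $N = 3$ are the original aligned pairs. The remaining $N^2 - N$ are the unaligned cross-modal combinations that the proposed paradigms claim can still be used for training.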
Articulated object reconstruction typically requires either multi-view capture of discrete states or monocular video with a strict static-base-part assumption, limiting practical deployment. FreeArtGS introduces a "free-moving" setting where both joint angles and object poses vary arbitrarily during capture, using only a monocular RGB-D video. The method combines motion-based part segmentation via point tracking priors with joint estimation and 3D Gaussian Splatting optimization to jointly reconstruct geometry, appearance, and articulation.
Cardiac ultrasound view acquisition is notoriously operator-dependent, limiting reproducibility and access. This paper proposes an anatomical prior (AP)-driven framework that unifies cardiac structure segmentation with autonomous probe adjustment. The core innovation is a spatial-relation graph (SRG) module that injects spatial-topological constraints into YOLO-based segmentation, coupled with an RL formulation where states and rewards are built from quantifiable anatomical features drawn from Gaussian priors. The work matters because it offers an interpretable alternative to black-box end-to-end methods, potentially enabling zero-shot sim-to-real deployment for robotic echocardiography.