MS-CustomNet tackles multi-subject customization for text-to-image diffusion models, where the challenge is to preserve multiple subject identities while controlling their compositional arrangement and spatial relationships. The authors propose a framework built on CustomNet that accepts multiple reference images plus a layout map $M_L$ specifying spatial arrangement, trained on a curated MSI dataset derived from COCO. The work aims to provide explicit deterministic control over subject placement and layering (e.g., "cake inside bowl" vs "cake behind bowl") rather than relying on implicit text-to-image generation.
PAS3R tackles online monocular 3D reconstruction from long video streams, addressing the stability–adaptation dilemma where models must incorporate novel viewpoints without overwriting historical scene structure. The core idea is to dynamically modulate state update intensity based on geometric novelty, which is measured through inter-frame camera displacement (translation and rotation) and image frequency content obtained via Fourier analysis. This enables faster adaptation to abrupt viewpoint changes while preserving accumulated geometry during smooth motion.
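The novelty-driven update described for PAS3R can be illustrated with a small sketch. This is not the authors' implementation: the function names, the weights `w_t`, `w_r`, `w_f`, the radial `cutoff`, and the `1 - exp(-novelty)` squashing are all assumptions chosen only to show how camera displacement and Fourier content could be combined into a single update intensity.

```python
import numpy as np

def rotation_angle(R_prev, R_curr):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    R_rel = R_prev.T @ R_curr
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def high_frequency_ratio(image, cutoff=0.25):
    """Fraction of spectral energy above a normalized radial cutoff (hypothetical choice)."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    h, w = image.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = spectrum.sum()
    return float(spectrum[radius > cutoff].sum() / total) if total > 0 else 0.0

def update_intensity(t_prev, t_curr, R_prev, R_curr, image,
                     w_t=1.0, w_r=1.0, w_f=1.0):
    """Map a weighted novelty score to an update intensity in [0, 1)."""
    novelty = (w_t * np.linalg.norm(t_curr - t_prev)
               + w_r * rotation_angle(R_prev, R_curr)
               + w_f * high_frequency_ratio(image))
    return float(1.0 - np.exp(-novelty))

def blend_state(state, candidate, alpha):
    """Small alpha preserves accumulated state; large alpha adapts quickly."""
    return (1.0 - alpha) * state + alpha * candidate
```

With identical poses and a flat image the intensity stays near zero, so the accumulated state is kept; a large camera jump pushes the intensity toward one.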
EmoTaG tackles few-shot 3D talking-head synthesis with emotional expressiveness using only 5 seconds of target video. The core insight is to predict FLAME parameters (expression and jaw pose) rather than directly deforming 3D Gaussians, providing explicit geometric priors for stability. A Gated Residual Motion Network (GRMN) disentangles phonetic articulation from emotion-driven variations with a learned gate $g \in [0,1]$, while Semantic Emotion Guidance distills knowledge from a pretrained DeepFace recognizer to supervise emotional intensity without manual labels.
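One way to picture a gate $g \in [0,1]$ separating phonetic from emotion-driven motion is an additive gated residual. The sketch below assumes that form; the summary does not state the exact GRMN formulation, and the names `gated_residual_motion` and `gate_logit` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_residual_motion(phonetic, emotion_residual, gate_logit):
    """Assumed form: output = phonetic + g * emotion_residual, with g = sigmoid(gate_logit)."""
    g = sigmoid(gate_logit)          # gate constrained to [0, 1]
    return phonetic + g * emotion_residual, g
```

A gate near zero leaves the phonetic articulation untouched, while a gate near one adds the full emotion-driven offset to the predicted parameters.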
CornOrb addresses a persistent gap in ophthalmic AI by providing one of the first large-scale, publicly accessible Orbscan 3 corneal topography datasets. The collection comprises 1,454 eyes from 744 Algerian patients, offering four standardized corneal maps (axial curvature, anterior/posterior elevation, pachymetry) alongside structured clinical parameters including Kmax, astigmatism, and asphericity. By releasing this multimodal resource in standardized PNG and CSV formats, the authors aim to enable robust AI-driven detection of keratoconus using device-specific data from an underrepresented African population.
PGR-Net addresses brain tumor MRI segmentation by tackling the challenge of spatial sparsity—where lesions occupy only ~10.7% of the image volume—through explicit data-driven spatial priors. The framework introduces a hierarchical Top-K ROI selection mechanism and a Windowed Gaussian–Spatial Decay (WinGS-ROI) module to concentrate computational resources on lesion-relevant regions rather than background. This yields competitive Dice scores (89.02–91.82% on Whole Tumor across benchmarks) with only 8.64M parameters, offering a lightweight alternative to contemporary Transformer and Mamba architectures.
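A rough 2D sketch of the two ingredients named above, Top-K region selection followed by a Gaussian spatial decay, may help. It is an illustration under stated assumptions rather than the WinGS-ROI module itself: the non-overlapping window pooling, the `sigma` parameter, and both function names are invented for this example.

```python
import numpy as np

def top_k_windows(score_map, window, k):
    """Return (row, col) indices of the k windows with the highest mean score."""
    h, w = score_map.shape
    rows, cols = h // window, w // window
    # Average the scores inside each non-overlapping window.
    pooled = (score_map[:rows * window, :cols * window]
              .reshape(rows, window, cols, window)
              .mean(axis=(1, 3)))
    best = np.argsort(pooled.ravel())[::-1][:k]
    return [(int(i // cols), int(i % cols)) for i in best]

def gaussian_decay_mask(shape, windows, window, sigma):
    """Weight map that peaks at each selected window center and decays with distance."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros(shape, dtype=float)
    for r, c in windows:
        cy = (r + 0.5) * window - 0.5
        cx = (c + 0.5) * window - 0.5
        dist2 = (yy - cy) ** 2 + (xx - cx) ** 2
        mask = np.maximum(mask, np.exp(-dist2 / (2.0 * sigma ** 2)))
    return mask
```

Multiplying features by such a mask concentrates capacity on the selected regions while still letting nearby context contribute with reduced weight.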
This paper tackles robotic optical coherence tomography (OCT) scanning of curved tissue surfaces, addressing the limitation that existing approaches restrict motion to pure translations to avoid challenging hand-eye calibration. The core contribution is a custom ChArUco calibration pattern enabling full six-degree-of-freedom hand-eye calibration, allowing the OCT probe to rotate and follow curved surfaces. This matters because pure translational scanning accumulates registration errors on curved geometries, whereas full 6D motion enables accurate, large-area surface reconstruction.
The paper addresses the novel challenge of aligning independent 3D Gaussian Splatting models across different object instances within the same category, a task beyond existing same-object registration methods. The core innovation is a two-stage pipeline: first, a coarse alignment using a feature-guided iterative absolute orientation solver that handles extreme initializations (180° rotations, 10× scale differences); second, a fine alignment that enforces multi-view feature consistency via an inverse-radiance-field formulation generalized to the similarity group $\text{Sim}(3)$. This enables the first viable category-level 3DGS registration, unlocking applications like geometrically consistent object replacement.
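For readers unfamiliar with absolute orientation over $\text{Sim}(3)$, the closed-form step can be sketched with the classical Umeyama solution. This covers only the inner least-squares solve given known point correspondences; the feature-guided correspondence search and the iteration described in the paper are not reproduced, and the function name is an assumption.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~= s * R @ src + t."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)           # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                         # avoid returning a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Because the solution is closed-form, it recovers a 180° rotation or a 10× scale difference exactly when the correspondences are correct; the difficulty in practice lies in obtaining those correspondences.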
Multi-agent applications execute tasks through multi-stage workflows where each stage is an LLM call feeding into the next. While heterogeneous clusters (mixing model sizes/families) enable better latency–performance trade-offs than homogeneous deployments, they introduce complex scheduling challenges: model selection affects both task accuracy and queue congestion. Chimera addresses this by predicting per-model confidence scores, forecasting total workflow output lengths, and estimating real-time load via in-flight token volumes to jointly optimize end-to-end latency and task performance.
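The trade-off between accuracy and congestion can be made concrete with a toy routing rule. The scoring function below is a hypothetical stand-in, not Chimera's scheduler: the `ModelState` fields, the linear latency estimate, and `latency_weight` are assumptions used only to show how confidence, predicted output length, and in-flight tokens might interact.

```python
from dataclasses import dataclass

@dataclass
class ModelState:
    name: str
    confidence: float         # predicted chance this model handles the request well
    tokens_per_second: float  # assumed steady decoding throughput
    in_flight_tokens: int     # tokens already queued or being generated

def estimated_latency(model, predicted_output_tokens):
    """Crude estimate: queued work plus this request, divided by throughput."""
    return (model.in_flight_tokens + predicted_output_tokens) / model.tokens_per_second

def pick_model(models, predicted_output_tokens, latency_weight=0.01):
    """Choose the model with the best confidence-minus-latency score."""
    return max(
        models,
        key=lambda m: m.confidence
        - latency_weight * estimated_latency(m, predicted_output_tokens),
    )
```

Under this rule a larger, more accurate model wins when idle, but a heavily loaded one loses to a smaller model whose queue is empty.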
This paper presents LRHPerception, a unified monocular perception package that addresses the computational burden of multi-camera autonomous driving pipelines by integrating object tracking, trajectory prediction, road segmentation, and depth estimation into a single real-time system processing at 29 FPS on one GPU. The core innovation lies in sharing a Swin Transformer backbone across modules while introducing task-specific optimizations like C-BYTE tracking with camera-motion compensation and a coarse-to-fine depth estimator. This matters because it offers an interpretable middle ground between black-box end-to-end driving and expensive bird's-eye-view mapping systems.
Zero-shot 3D anomaly detection enables industrial inspection without target-category training data, but existing methods discard geometric details by projecting point clouds to 2D images. This paper proposes BTP (Back To Point), the first framework to apply pre-trained Point-Language Models directly on 3D point clouds. By aligning multi-granularity patch features with text embeddings and incorporating geometric descriptors, BTP achieves fine-grained anomaly localization while avoiding view-dependent projection artifacts.
Pheochromocytomas and paragangliomas (PPGLs) are rare neuroendocrine tumors with 15–25% metastatic risk and poor survival. Manual GAPP scoring for metastatic risk is labor-intensive and subjective, while critical genotype information (e.g., SDHB mutations conferring 35–75% metastatic risk) is often missed in clinical practice. This paper introduces PPGL-Swarm, an agentic diagnostic system that decomposes diagnosis into specialized WSI, gene, and table agents coordinated via reinforcement learning to automate GAPP scoring, predict hereditary mutations (SDHB/VHL/RET) from histology alone, and generate auditable multimodal reports grounded in a structured knowledge graph.
Accurate riverine land cover mapping is essential for river management but challenging due to water penetration issues in 2D imagery and complex 3D structure. This paper applies Point Transformer v2 (PTv2)—using grouped vector attention and partition-based pooling—to multispectral LiDAR point clouds (1550 nm, 905 nm, 532 nm) for semantic segmentation of six land cover classes in Finnish river environments. The authors demonstrate that spectral features (particularly intensity and reflectance) combined with geometric data achieve $0.950$ mean IoU, and propose multi-dataset training with sparse annotations to improve cross-site generalization despite severe class imbalance.
The paper tackles the 'semantic parsing burden'—the effort required to translate natural language into structured RDF/OWL representations for knowledge graphs. It proposes the Semantic Ladder, a five-level framework ($L_1$ to $L_5$) enabling progressive formalization from raw text snippets to higher-order logic. By introducing Rosetta Statements as semantic anchors and emphasizing modular semantic units, the work aims to lower barriers to knowledge graph construction while maintaining semantic continuity.
This paper proposes applying Vision Transformers with colormap-based pseudo-color enhancement to brain tumor classification on the BRISC2025 MRI dataset. The core idea wraps a standard ViT-Base model with a Jet colormap preprocessing step to boost contrast, claiming 98.90% accuracy on four-class tumor classification. While the technique is sound in principle, serious copy-paste errors indicate the manuscript was likely templated from the author's prior Alzheimer's work without adequate revision.
Stochastic human motion prediction often suffers from high-frequency jitter and physically implausible poses. This paper proposes KHMP, a framework that combines training-time physical constraints (temporal smoothness and joint angle limits) with a novel inference-time refinement: an adaptive Kalman filter operating in the DCT frequency domain. The key innovation treats high-frequency DCT coefficients as a frequency-indexed noisy signal, recursively filtering them with parameters dynamically adjusted based on estimated Signal-to-Noise Ratio (SNR).
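The idea of filtering DCT coefficients along the frequency index can be sketched for a single joint trajectory. Everything beyond the general idea is an assumption here: the scalar random-walk Kalman model, the global SNR estimate, the rule `meas_var = base_meas_var / snr`, and the parameter defaults are illustrative and not taken from KHMP. The sketch also assumes `0 < cutoff < len(signal)`.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D: D @ x gives coefficients, D.T inverts."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    D[0] /= np.sqrt(2.0)
    return D

def kalman_over_frequency(coeffs, cutoff, base_meas_var=1e-2, process_var=1e-4):
    """Filter coefficients at index >= cutoff with a scalar Kalman recursion."""
    out = coeffs.copy()
    low_power = float(np.mean(coeffs[:cutoff] ** 2))
    high_power = float(np.mean(coeffs[cutoff:] ** 2))
    snr = low_power / (high_power + 1e-12)
    # Assumed adaptation rule: lower SNR means measurements are trusted less.
    meas_var = base_meas_var / max(snr, 1e-6)
    x, p = 0.0, 1.0                      # state estimate and its variance
    for i in range(cutoff, len(coeffs)):
        p += process_var                 # predict step (random-walk model)
        gain = p / (p + meas_var)        # Kalman gain in (0, 1]
        x += gain * (coeffs[i] - x)      # update with the observed coefficient
        p *= (1.0 - gain)
        out[i] = x
    return out

def refine_trajectory(signal, cutoff):
    """DCT, filter the high-frequency band, then invert the transform."""
    D = dct_matrix(len(signal))
    return D.T @ kalman_over_frequency(D @ signal, cutoff)
```

Low-frequency coefficients pass through unchanged, so the overall motion is preserved while the high-frequency band is smoothed.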
This paper addresses federated learning for cross-view video understanding, where heterogeneous camera viewpoints create highly non-IID client distributions that impede generalization to unseen views. FedCVU proposes three complementary modules: VS-Norm preserves client-specific normalization statistics to handle view-dependent feature shifts; CV-Align introduces lightweight prototype-based contrastive learning to align representations across cameras; and SLA employs selective layer aggregation to reduce communication overhead by 40–45%. The work targets an important practical scenario—privacy-preserving multi-camera surveillance where centralizing raw footage is infeasible.
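A minimal sketch of how client-specific normalization and selective aggregation can coexist in one round follows. It assumes model states are dictionaries of NumPy arrays and that normalization parameters can be identified by a `"norm"` substring in their key; both conventions, and all function names, are hypothetical rather than FedCVU's actual interface.

```python
import numpy as np

def select_shared_keys(state, is_local=lambda key: "norm" in key):
    """Keys to send to the server; normalization entries stay on the client."""
    return [key for key in state if not is_local(key)]

def aggregate(client_states, client_sizes, shared_keys):
    """Sample-weighted average (FedAvg style) over the shared keys only."""
    total = float(sum(client_sizes))
    return {
        key: sum((n / total) * state[key]
                 for state, n in zip(client_states, client_sizes))
        for key in shared_keys
    }

def apply_update(state, global_update):
    """Overwrite shared parameters; untouched keys keep their local values."""
    merged = dict(state)
    merged.update(global_update)
    return merged
```

Only the shared keys cross the network, which is where a communication saving would come from, and each client keeps the normalization statistics fitted to its own camera view.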
SatGeo-NeRF addresses wave-like geometric artifacts in satellite neural radiance fields caused by overfitting to multi-temporal imagery with varying lighting and transient objects. The paper proposes three model-agnostic regularizers—gravity-aligned planarity, coarse-to-fine granularity masking, and depth supervision—to stabilize geometry learning. Experiments on the DFC2019 benchmark report 14% lower mean altitude error relative to prior work, though this comparison relies on a reimplemented baseline that underperforms the original reported scores.
As consumer-grade EEG headphones enter the market, a critical question emerges: can language models adapt to an individual user's neural signature? This paper demonstrates that frozen LLMs already contain person-specific linear directions in their activation spaces that predict individual brain activity during reading, achieving a ninefold improvement over population averages. The findings suggest that deep neural networks encode stable, individual cognitive fingerprints that could enable future brain-computer interfaces to personalize AI to the user wearing the headset.
This paper investigates whether domain knowledge for quantum code generation should be embedded in model parameters through fine-tuning or provided at inference time via retrieval and agents. Comparing a parameter-specialized Granite-20B baseline against modern general-purpose LLMs (OpenAI, Claude, Gemini) on the Qiskit-HumanEval benchmark, the authors find that inference-time augmentation—particularly agentic execution feedback—outperforms fine-tuning by over 35 percentage points, offering a more maintainable path as quantum SDKs evolve.
This paper benchmarks four deep learning architectures (KPConv, RandLA-Net, Superpoint Transformer, Point Transformer V3) for aerial LiDAR semantic segmentation under real operational flight conditions in Navarre, Spain. The study addresses a critical gap in evaluating models on heterogeneous aerial data with severe class imbalance (vehicles at 0.68%, low vegetation at 1.41%), finding that while all models exceed 93% overall accuracy, mean IoU ranges from 71.98% to 78.51% with persistent failures on minority classes.
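The gap between overall accuracy and mean IoU under class imbalance follows directly from how the two metrics are defined. The toy example below uses made-up labels, not the paper's data, to show a classifier that ignores a rare class entirely yet still scores high overall accuracy.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

def overall_accuracy(cm):
    return float(np.trace(cm) / cm.sum())

def per_class_iou(cm):
    """IoU = TP / (TP + FP + FN); NaN for classes absent from both labels and predictions."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    return np.where(union > 0, tp / np.maximum(union, 1), np.nan)

def mean_iou(cm):
    return float(np.nanmean(per_class_iou(cm)))

# Synthetic case: 990 points of a dominant class, 10 of a rare class,
# and a model that predicts the dominant class everywhere.
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=np.int64)
cm = confusion_matrix(y_true, y_pred, num_classes=2)
```

Here overall accuracy is dominated by the majority class, while mean IoU averages over classes and is pulled down by the rare class that is never predicted.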