Benchmarking Deep Learning Models for Aerial LiDAR Point Cloud Semantic Segmentation under Real Acquisition Conditions: A Case Study in Navarre
This paper benchmarks four deep learning architectures (KPConv, RandLA-Net, Superpoint Transformer, Point Transformer V3) for aerial LiDAR semantic segmentation under real operational flight conditions in Navarre, Spain. The study addresses a critical gap in evaluating models on heterogeneous aerial data with severe class imbalance (vehicles at 0.68%, low vegetation at 1.41%), finding that while all models exceed 93% overall accuracy, mean IoU ranges from 71.98% to 78.51% with persistent failures on minority classes.
The paper presents a solid empirical benchmark comparing convolutional, MLP-based and transformer architectures on operational aerial LiDAR data. The contribution is primarily incremental (a standard benchmark on a new regional dataset), but the finding that KPConv outperforms newer transformers on this specific data is valuable for practitioners. The restriction to a single region and a narrow five-class taxonomy limits generalizability, and the claim of bridging research-to-practice gaps is overstated without cross-dataset validation.
The experimental protocol is rigorous and fair, with all four models trained under identical conditions using their publicly available implementations. The dataset effectively captures real-world challenges including severe class imbalance and heterogeneous environments spanning urban, rural and industrial landscapes. The analysis properly distinguishes between overall accuracy (dominated by majority classes) and mean IoU ($\mathrm{mIoU}=\frac{1}{N_{c}}\sum_{c=1}^{N_{c}}\mathrm{IoU}_{c}$), revealing that high global metrics mask poor performance on underrepresented categories.
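To make the distinction between overall accuracy and mean IoU concrete, the following is a minimal sketch of how both metrics can be computed from a confusion matrix. The function name `segmentation_metrics` and the two-class toy counts are illustrative assumptions, not the paper's evaluation code or data.

```python
import numpy as np

def segmentation_metrics(conf):
    """Return (overall accuracy, per-class IoU, mean IoU).

    conf[i, j] counts points whose true class is i and predicted class is j.
    Hypothetical helper for illustration; not taken from the paper.
    """
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)                 # correctly labelled points per class
    fp = conf.sum(axis=0) - tp         # predicted as class c, but wrong
    fn = conf.sum(axis=1) - tp         # true class c, but missed
    denom = tp + fp + fn
    # Classes that never appear (denom == 0) get NaN and are skipped in the mean.
    iou = np.divide(tp, denom, out=np.full_like(tp, np.nan), where=denom > 0)
    oa = tp.sum() / conf.sum()
    return oa, iou, np.nanmean(iou)

# Toy example (made-up counts): a dominant class and a rare class.
toy = [[9800, 100],   # true majority class
       [60,    40]]   # true minority class
oa, iou, miou = segmentation_metrics(toy)
print(f"OA={oa:.3f}  IoU={np.round(iou, 3)}  mIoU={miou:.3f}")
```

In this toy case the overall accuracy is 0.984 while the minority-class IoU is only 0.2, which pulls the mean IoU down to roughly 0.59. This is the same masking effect the review attributes to the benchmark's majority classes.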
The study's generalizability is constrained by its limited geographic scope (single region) and narrow semantic granularity (only five classes). The authors acknowledge severe class imbalance but do not investigate architectural modifications, loss reweighting, or sampling strategies to mitigate it, leaving unclear whether poor low-vegetation results reflect fundamental model limitations or suboptimal training procedures. Furthermore, the qualitative analysis references figures (e.g., "Figure LABEL:fig:qualitative_comparison") that are not rendered in the provided text, weakening the visual evidence chain.
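As an illustration of the kind of loss reweighting the authors could have tested, the sketch below derives inverse-frequency class weights from class shares. Only the vehicle (0.68%) and low-vegetation (1.41%) shares come from the paper as summarized above; the remaining three shares, the class ordering, and the helper name `inverse_frequency_weights` are assumptions made purely for illustration.

```python
import numpy as np

def inverse_frequency_weights(fractions):
    """Inverse-frequency class weights, rescaled so that their mean is 1.

    Hypothetical helper for illustration; not the authors' procedure.
    """
    f = np.asarray(fractions, dtype=np.float64)
    if np.any(f <= 0):
        raise ValueError("every class needs a strictly positive share")
    w = 1.0 / f
    return w / w.mean()

# Class shares: the first two are reported in the paper; the last three are
# placeholder values chosen only so that the shares sum to 1.
shares = [0.0068,   # vehicles (reported)
          0.0141,   # low vegetation (reported)
          0.40,     # assumed
          0.33,     # assumed
          0.2491]   # assumed
weights = inverse_frequency_weights(shares)
print(np.round(weights, 3))
```

Weights of this form could, for example, be passed to a weighted cross-entropy loss (such as the `weight` argument of `torch.nn.CrossEntropyLoss`). Whether this would improve the low-vegetation results is exactly the open question raised above.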
The quantitative evidence supports the relative performance rankings, with KPConv achieving the best mean IoU (78.51%) and PTv3 excelling on vehicles (75.11% IoU). However, comparisons to related aerial benchmarks like DALES and FRACTAL remain qualitative rather than quantitative, as the authors do not reproduce baselines from those datasets under identical conditions. The paper would benefit from explicit domain-gap analysis comparing these models' performance on indoor/terrestrial benchmarks versus the aerial data to validate claims about transferability challenges.
Reproducibility is only partial. On the positive side, the authors link to public code repositories for all models, specify the hardware (NVIDIA RTX 6000 Ada, 48GB VRAM), and report results averaged over three seeded runs. However, the Navarre dataset itself is not publicly available, preventing exact reproduction. Hyperparameters follow the original publications with "minor adjustments to accommodate the spatial scale," but specific values (learning rates, augmentation details) and preprocessing parameters (coordinate normalization ranges, tiling overlap percentages) are not documented in enough detail for full replication.
Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state-of-the-art architectures on a large-scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA-Net trade off segmentation robustness for computational efficiency.