Benchmarking Deep Learning Models for Aerial LiDAR Point Cloud Semantic Segmentation under Real Acquisition Conditions: A Case Study in Navarre

cs.CV Alex Salvatierra, José Antonio Sanz, Christian Gutiérrez, Mikel Galar · Mar 23, 2026
What it does
This paper benchmarks four deep learning architectures (KPConv, RandLA-Net, Superpoint Transformer, Point Transformer V3) for aerial LiDAR semantic segmentation under real operational flight conditions in Navarre, Spain. The study...
Why it matters
Mean IoU ranges from 71.98% to 78.51%, with persistent failures on minority classes.
Main concern
The paper presents a solid empirical benchmark comparing convolutional, MLP-based and transformer architectures on operational aerial LiDAR data. While the contribution is primarily incremental—a standard benchmark on a new regional...
AI Review
Plain-language introduction

This paper benchmarks four deep learning architectures (KPConv, RandLA-Net, Superpoint Transformer, Point Transformer V3) for aerial LiDAR semantic segmentation under real operational flight conditions in Navarre, Spain. The study addresses a critical gap in evaluating models on heterogeneous aerial data with severe class imbalance (vehicles at 0.68%, low vegetation at 1.41%), finding that while all models exceed 93% overall accuracy, mean IoU ranges from 71.98% to 78.51% with persistent failures on minority classes.

Critical review
Verdict
Bottom line

The paper presents a solid empirical benchmark comparing convolutional, MLP-based and transformer architectures on operational aerial LiDAR data. While the contribution is primarily incremental (a standard benchmark on a new regional dataset), the finding that KPConv outperforms newer transformers on this specific data is valuable for practitioners. The geographic limitation to a single region and the narrow five-class taxonomy restrict broad generalizability, and the claim of bridging research-to-practice gaps is overstated without cross-dataset validation.

“KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes”
Salvatierra et al. · Abstract
“The persistent difficulties with low vegetation, where IoU ranges from 11.23% to 33.61%”
Salvatierra et al. · Section V
What holds up

The experimental protocol is rigorous and fair, with all four models trained under identical conditions using their publicly available implementations. The dataset effectively captures real-world challenges including severe class imbalance and heterogeneous environments spanning urban, rural and industrial landscapes. The analysis properly distinguishes between overall accuracy (dominated by majority classes) and mean IoU ($\mathrm{mIoU}=\frac{1}{N_{c}}\sum_{c=1}^{N_{c}}\mathrm{IoU}_{c}$), revealing that high global metrics mask poor performance on underrepresented categories.

“All models were implemented using publicly available source code... and trained under the same experimental conditions to ensure comparability”
Salvatierra et al. · Section III-C
“This disparity arises because overall accuracy is dominated by majority classes, while mean IoU weights all categories equally”
Salvatierra et al. · Section IV-A
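
To make the gap between overall accuracy and mean IoU concrete, the following is a minimal sketch that computes both from a confusion matrix. The three-class matrix is a hypothetical toy example, not data from the paper, and the function names are illustrative. The mean follows the mIoU definition quoted above, averaging per-class IoU with equal weight.

```python
def overall_accuracy(conf):
    """Fraction of correctly labelled points (rows = ground truth, cols = prediction)."""
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(len(conf)))
    return correct / total


def per_class_iou(conf):
    """IoU_c = TP / (TP + FP + FN) for every class c that occurs at all."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                       # true class c, predicted otherwise
        fp = sum(conf[r][c] for r in range(n)) - tp  # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:                                # skip classes absent from both
            ious.append(tp / denom)
    return ious


def mean_iou(conf):
    """Unweighted mean of the per-class IoU values."""
    ious = per_class_iou(conf)
    return sum(ious) / len(ious)


# Hypothetical toy matrix: one dominant class and two rare ones.
toy_conf = [
    [9500, 100, 0],  # majority class, mostly correct
    [50, 250, 0],    # rare class, partly confused with the majority
    [80, 0, 20],     # very rare class, mostly missed
]

oa = overall_accuracy(toy_conf)
ious = per_class_iou(toy_conf)
miou = mean_iou(toy_conf)
print(f"OA = {oa:.3f}, per-class IoU = {[round(v, 3) for v in ious]}, mIoU = {miou:.3f}")
```

In this toy case the overall accuracy stays close to 98% while the rarest class reaches an IoU of only 0.2, which pulls the mean IoU down to about 0.60. This is the same masking effect the review describes.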
Main concerns

The study's generalizability is constrained by its limited geographic scope (a single region) and narrow semantic granularity (only five classes). The authors acknowledge severe class imbalance but do not investigate architectural modifications, loss reweighting, or sampling strategies to mitigate it, leaving it unclear whether the poor low-vegetation results reflect fundamental model limitations or suboptimal training procedures. Furthermore, the qualitative analysis references figures (e.g., "Figure LABEL:fig:qualitative_comparison") that are not rendered in the provided text, weakening the visual evidence chain.

“The dataset comprises five semantic classes... highlighting significant class imbalance characteristic of operational aerial data, especially for the vehicle and low vegetation categories, which together account for around 2% of the labeled points”
Salvatierra et al. · Section III-A
“Figure LABEL:fig:qualitative_comparison shows a representative test scene segmented by the four evaluated architectures”
Salvatierra et al. · Section IV-B
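
As an illustration of the loss reweighting mentioned above, here is a minimal sketch of inverse-frequency class weights. This is not something the paper reports doing. The vehicle (0.68%) and low-vegetation (1.41%) shares come from the review; the class names and shares for the other three classes are hypothetical placeholders chosen only so that the proportions sum to one.

```python
def inverse_frequency_weights(freqs):
    """Weight each class by 1 / frequency, rescaled so the weights average to 1."""
    raw = {name: 1.0 / share for name, share in freqs.items()}
    scale = len(raw) / sum(raw.values())
    return {name: weight * scale for name, weight in raw.items()}


class_shares = {
    "vehicle": 0.0068,         # from the review
    "low_vegetation": 0.0141,  # from the review
    "ground": 0.5000,          # hypothetical placeholder
    "vegetation": 0.3000,      # hypothetical placeholder
    "building": 0.1791,        # hypothetical placeholder
}

weights = inverse_frequency_weights(class_shares)
for name, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{name:15s} {weight:.4f}")
```

Such weights would typically be passed to a weighted cross-entropy loss. Pure inverse frequency can over-emphasize very rare classes, so softened variants (for example, the square root of the inverse frequency) are a common alternative worth comparing.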
Evidence and comparison

The quantitative evidence supports the relative performance rankings, with KPConv achieving the best mean IoU (78.51%) and Point Transformer V3 excelling on vehicles (75.11% IoU). However, comparisons to related aerial benchmarks such as DALES and FRACTAL remain qualitative rather than quantitative, as the authors do not reproduce baselines from those datasets under identical conditions. The paper would benefit from an explicit domain-gap analysis comparing these models' performance on indoor/terrestrial benchmarks versus the aerial data to validate claims about transferability challenges.

“Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%)”
Salvatierra et al. · Abstract
“Most of these architectures were developed and benchmarked on indoor or terrestrial datasets... making the direct transferability of conclusions to aerial contexts uncertain”
Salvatierra et al. · Section II
Reproducibility

Reproducibility is only partial. On the positive side, the authors provide links to public code repositories for all models, specify the hardware (NVIDIA RTX 6000 Ada, 48GB VRAM), and report results averaged over three runs with different random seeds. However, the Navarre dataset itself is not publicly available, preventing exact reproduction. Hyperparameters follow the original publications with "minor adjustments to accommodate the spatial scale," but specific values (learning rates, augmentation details) and exhaustive preprocessing parameters (coordinate normalization ranges, tiling overlap percentages) are insufficiently documented for full replication.

“Each model was trained three times with different random seeds, and the reported results correspond to the average performance across these runs”
Salvatierra et al. · Section III-C
“Batch sizes were adjusted according to each architecture's requirements: 24 for RandLA-Net, 10 for KPConv and Point Transformer V3, and 4 for Superpoint Transformer”
Salvatierra et al. · Section III-B
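
The three-seed protocol quoted above reports an average. A minimal sketch of how the run-to-run spread could be reported alongside it follows; the three mIoU values are hypothetical placeholders rather than results from the paper, and `summarize_runs` is an illustrative helper name.

```python
import statistics


def summarize_runs(scores):
    """Return the mean and sample standard deviation of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical mIoU values from three seeds (placeholders, not paper results).
seed_miou = [0.70, 0.72, 0.74]

mean_miou, std_miou = summarize_runs(seed_miou)
print(f"mIoU = {mean_miou:.3f} +/- {std_miou:.3f} over {len(seed_miou)} seeds")
```

With only three runs the standard deviation is a rough indicator, but it helps show whether the gaps between architectures exceed seed-to-seed noise.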
Abstract

Recent advances in deep learning have significantly improved 3D semantic segmentation, but most models focus on indoor or terrestrial datasets. Their behavior under real aerial acquisition conditions remains insufficiently explored, and although a few studies have addressed similar scenarios, they differ in dataset design, acquisition conditions, and model selection. To address this gap, we conduct an experimental benchmark evaluating several state-of-the-art architectures on a large-scale aerial LiDAR dataset acquired under operational flight conditions in Navarre, Spain, covering heterogeneous urban, rural, and industrial landscapes. This study compares four representative deep learning models, including KPConv, RandLA-Net, Superpoint Transformer, and Point Transformer V3, across five semantic classes commonly found in airborne surveys, such as ground, vegetation, buildings, and vehicles, highlighting the inherent challenges of class imbalance and geometric variability in aerial data. Results show that all tested models achieve high overall accuracy exceeding 93%, with KPConv attaining the highest mean IoU (78.51%) through consistent performance across classes, particularly on challenging and underrepresented categories. Point Transformer V3 demonstrates superior performance on the underrepresented vehicle class (75.11% IoU), while Superpoint Transformer and RandLA-Net trade off segmentation robustness for computational efficiency.
