Beyond Matching to Tiles: Bridging Unaligned Aerial and Satellite Views for Vision-Only UAV Navigation

cs.CV cs.AI Kejia Liu, Haoyang Zhou, Ruoyu Xu, Peicheng Wang, Mingli Song, Haofei Zhang · Mar 23, 2026


Plain-language introduction

This paper addresses vision-only UAV navigation in GNSS-denied environments by moving beyond the standard "matching-to-tile" (M2T) paradigm. Instead of retrieving discrete satellite tiles, the proposed Bearing-UAV method jointly regresses continuous position and heading from four neighboring satellite tiles and a UAV view patch, enabling sub-tile localization accuracy while maintaining a lightweight model. The work also introduces Bearing-UAV-90k, a multi-city dataset with heading annotations designed for unaligned cross-view scenarios.

Critical review

Verdict

Bottom line

The paper presents a compelling shift from retrieval-based to regression-based cross-view geo-localization, achieving significantly lower localization errors and enabling end-to-end navigation that prior matching methods cannot support. The multi-task design, which predicts position and heading simultaneously, is well motivated for navigation, where rotational drift is critical. However, the 50% navigation success rate and the reliance on synthetic Google Earth data suggest that real-world deployment remains challenging.

“Our method improves SR@1 by ∼10% and LSR@15 by ∼60%, demonstrating a stronger ability to identify the RST closest to the UVP and accurately localize the UVP”

paper · Section 4.3.2

“Supporting UAV navigation requires not only accurate localization but also reliable heading information, which is largely overlooked by current methods”

paper · Section 1

What holds up

The technical approach is sound: the GLUF module effectively captures multi-scale structural cues via clustering and non-local blocks, while the relative coordinate encoding and cross-attention mechanism explicitly model spatial relationships critical for handling misalignment. Ablation studies rigorously validate that each architectural component contributes measurably to performance, with weather augmentation experiments demonstrating tangible robustness gains.

“Removing GLUF causes a significant drop in performance, indicating that clustering and recombining feature maps helps extract more robust local and global structures”

paper · Section 4.4

“weather-augmented training improves performance, reducing MLE by 1.1m and MHE by 3.3°”

paper · Section 4.3.2

Main concerns

Although the paper reports a 50% navigation success rate versus 0% for baselines, half of all runs still fail, which is a serious limitation for safety-critical applications. The paper also characterizes failure modes only vaguely, as "feature-sparse regions", without quantitative analysis of catastrophic errors. The comparison to retrieval baselines is structurally asymmetric: Bearing-UAV regresses continuous coordinates, whereas baseline methods are constrained by design to tile centers, so part of the error gap is an artifact of discretization rather than a pure difference in matching accuracy. Furthermore, the evaluation relies entirely on synthetic Google Earth data; without validation on real-world UAV imagery, the robustness claims may not transfer to physical deployment.

“Compared with Ours VGG-16, the reduced SR and SPL of the weather-augmented model are partly due to several highly deviated trajectories”

paper · Section 4.3.6

“In Google Earth 3D mode (UAV-view mode), we directly sample UVPs over the same area”

paper · Section 4.1
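
To make the discretization point concrete, here is a minimal sketch, not taken from the paper, of the error floor a retrieval baseline incurs even when it always retrieves the correct tile. It assumes square tiles and a UAV position uniformly distributed inside the tile; `tile_side_m` is a hypothetical parameter, since the tile footprint is not stated in this review.

```python
import math
import random


def center_snap_floor(tile_side_m: float) -> float:
    """Expected distance from a uniformly random point in a square tile
    to the tile center (closed form for a square)."""
    # Mean distance from the center of a unit square to a uniform point:
    # (sqrt(2) + ln(1 + sqrt(2))) / 6, roughly 0.3826.
    unit_mean = (math.sqrt(2) + math.log(1 + math.sqrt(2))) / 6
    return unit_mean * tile_side_m


def simulate_center_snap(tile_side_m: float, n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo check of the closed form above (assumed uniform positions)."""
    rng = random.Random(seed)
    half = tile_side_m / 2
    total = 0.0
    for _ in range(n):
        x = rng.uniform(-half, half)
        y = rng.uniform(-half, half)
        total += math.hypot(x, y)  # distance to the tile center at (0, 0)
    return total / n


if __name__ == "__main__":
    side = 80.0  # hypothetical tile side in meters, for illustration only
    print(center_snap_floor(side), simulate_center_snap(side))
```

Under these assumptions the floor is about 38% of the tile side length, so some part of any gap between tile-center baselines and a continuous regressor would persist regardless of matching quality.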

Evidence and comparison

The evidence supports the core claim that regression outperforms tile matching for localization precision, with a Mean Localization Error (MLE) of $8.61\,\text{m}$ compared to $28$–$33\,\text{m}$ for the University-1652, SUES-200, DenseUAV, and GTA-UAV baselines. The navigation metric SR@20 (success within a $20\,\text{m}$ radius) of $50\%$ for Bearing-UAV versus $0\%$ for baselines demonstrates clear superiority, though vision-only navigation remains challenging. The fairness of the comparison is compromised because retrieval methods cannot predict sub-tile positions by design, yet the paper does not provide fine-tuned baselines that regress offsets from retrieved tiles, which would isolate the architectural benefits from the paradigm shift itself.

“All baselines lack heading estimation ability and exhibit localization errors around 30 m, far larger than our regression error of 8.6 m”

paper · Section 4.3.2

“This is mainly because they treat the retrieved or matched tile center as the final position”

paper · Section 4.3.2
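
As a rough illustration of how error metrics of this kind are typically computed, the sketch below evaluates a mean localization error, a mean heading error with wrap-around, and the fraction of predictions within a radius. The function names and the planar-coordinate input format are assumptions made for this example; the paper's exact definitions of MLE, MHE, SR@k, and LSR@k are not reproduced in this review and may differ.

```python
import math


def localization_errors(pred_xy, true_xy):
    """Euclidean distance in meters between predicted and true planar positions."""
    return [math.hypot(px - tx, py - ty)
            for (px, py), (tx, ty) in zip(pred_xy, true_xy)]


def mean_localization_error(pred_xy, true_xy):
    errs = localization_errors(pred_xy, true_xy)
    return sum(errs) / len(errs)


def heading_error_deg(pred_deg, true_deg):
    """Smallest absolute angular difference in degrees, in [0, 180]."""
    d = abs(pred_deg - true_deg) % 360.0
    return min(d, 360.0 - d)


def mean_heading_error(pred_deg, true_deg):
    errs = [heading_error_deg(p, t) for p, t in zip(pred_deg, true_deg)]
    return sum(errs) / len(errs)


def fraction_within(pred_xy, true_xy, radius_m):
    """Share of predictions whose position error is at most radius_m."""
    errs = localization_errors(pred_xy, true_xy)
    return sum(e <= radius_m for e in errs) / len(errs)


if __name__ == "__main__":
    pred = [(0.0, 0.0), (3.0, 4.0), (30.0, 0.0)]  # made-up positions
    true = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
    print(mean_localization_error(pred, true))
    print(fraction_within(pred, true, 20.0))
    print(heading_error_deg(350.0, 10.0))
```

The wrap-around in `heading_error_deg` matters because a prediction of 350° against a true heading of 10° is off by 20°, not 340°.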

Reproducibility

The paper provides comprehensive implementation details including hyperparameters (Adam, lr=$5\times10^{-5}$, batch size 16, Smooth L1 loss $\mathcal{L}_{sum}=0.8\mathcal{L}_{p}+0.2\mathcal{L}_{h}$), backbone specifications (VGG-16), and component dimensions ($K=4$ clusters, $D=256$). The authors state that the code and dataset will be made publicly available, and the dataset construction process is described in detail, though actual data release depends on Google Earth licensing. One minor gap is the absence of random seeds and exact split indices, though the $7$:$2$:$1$ ratio is specified.

“We adopt VGG-16 pretrained on ImageNet as the visual backbone... We set the number of clusters in the GLUF to $K=4$ and base feature dimension to $D=256$”

paper · Section 4.2

“We use Adam (lr=$5\times10^{-5}$, batch size 16) for 100 epochs with Smooth L1 loss: $\mathcal{L}_{sum}=0.8\mathcal{L}_{p}+0.2\mathcal{L}_{h}$”

paper · Section 4.2
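
The weighted objective quoted above can be sketched in a few lines. This is an illustrative pure-Python version, not the authors' implementation: it assumes a mean-reduced Smooth L1 with threshold `beta = 1.0` (the default of PyTorch's `SmoothL1Loss`), and it treats position and heading targets as plain lists of floats, since the review does not say how either is parameterized or normalized.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic below beta, linear above it."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        if d < beta:
            total += 0.5 * d * d / beta
        else:
            total += d - 0.5 * beta
    return total / len(pred)


def combined_loss(pos_pred, pos_true, head_pred, head_true,
                  w_pos=0.8, w_head=0.2):
    """L_sum = w_pos * L_p + w_head * L_h, with the 0.8 / 0.2 weights from the quote."""
    loss_p = smooth_l1(pos_pred, pos_true)
    loss_h = smooth_l1(head_pred, head_true)
    return w_pos * loss_p + w_head * loss_h


if __name__ == "__main__":
    # Made-up values purely for illustration.
    print(combined_loss([0.5, 2.0], [0.0, 0.0], [0.2], [0.0]))
```

With a 0.8 weight on the position term, position residuals dominate the gradient unless the heading target is scaled to a much larger range, which is one reason the missing normalization details would matter for reproduction.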

Abstract

Recent advances in cross-view geo-localization (CVGL) methods have shown strong potential for supporting unmanned aerial vehicle (UAV) navigation in GNSS-denied environments. However, existing work predominantly focuses on matching UAV views to onboard map tiles, which introduces an inherent trade-off between accuracy and storage overhead, and overlooks the importance of the UAV's heading during navigation. Moreover, the substantial discrepancies and varying overlaps in cross-view scenarios have been insufficiently considered, limiting their generalization to real-world scenarios. In this paper, we present Bearing-UAV, a purely vision-driven cross-view navigation method that jointly predicts UAV absolute location and heading from neighboring features, enabling accurate, lightweight, and robust navigation in the wild. Our method leverages global and local structural features and explicitly encodes relative spatial relationships, making it robust to cross-view variations, misalignment, and feature-sparse conditions. We also present Bearing-UAV-90k, a multi-city benchmark for evaluating cross-view localization and navigation. Extensive experiments show encouraging results that Bearing-UAV yields lower localization error than previous matching/retrieval paradigm across diverse terrains. Our code and dataset will be made publicly available.
