OpenEarth-Agent: From Tool Calling to Tool Creation for Open-Environment Earth Observation

cs.CV Sijie Zhao, Feng Liu, Xueliang Zhang, Hao Chen, Xinyu Gu, Zhe Jiang, Fenghua Ling, Ben Fei, Wenlong Zhang, Junjue Wang, Weihao Xuan, Pengfeng Xiao, Naoto Yokoya, Lei Bai · Mar 23, 2026
AI Review
Plain-language introduction

OpenEarth-Agent tackles the challenge of deploying autonomous Earth Observation (EO) agents in open environments characterized by diverse multi-modal data and heterogeneous tasks. Unlike existing tool-calling agents confined to closed environments with predefined tools, this work introduces a tool-creation paradigm where the agent adaptively generates specialized tools tailored to unseen data and tasks. The paper proposes a multi-agent architecture and OpenEarth-Bench (596 real-world cases across 7 domains) to evaluate this approach.

Critical review
Verdict
Bottom line

The paper presents a compelling conceptual advance by shifting from static tool-calling to dynamic tool-creation for EO tasks. The multi-agent architecture decomposes the complex EO pipeline into manageable stages, and the benchmark construction is comprehensive. However, practical deployment faces significant hurdles from computational overhead and cascading error propagation in end-to-end pipelines. While the tool-creation paradigm shows promising generalization, GPT-5's 58.72% end-to-end accuracy on geospatial analysis leaves substantial room for improvement in long-horizon task execution.

“GPT-5 geospatial analysis accuracy sharply decreases from 76.66% to 58.72%”
“This introduces higher computational overhead and processing latency compared to traditional, static predefined tool-calling pipelines, which may currently limit its deployment in highly time-critical emergency response scenarios.”
OpenEarth-Agent paper · Section 6.1
What holds up

The multi-agent architecture is well designed, with a clear separation of concerns: a Data Summary Agent for real-time data perception, a Planning Agent for DAG-based workflow generation, and a Checking Agent for iterative refinement. The ablation studies provide strong evidence for each component's contribution: in Table 4, removing the Result Check Agent drops end-to-end geospatial analysis accuracy from 78.02% to 34.73%, which supports the necessity of feedback loops. The benchmark is rigorously constructed, with 596 full-pipeline cases spanning data preparation, feature extraction, and geospatial analysis across 7 domains, filling a gap in open-environment evaluation.

“✓ ✓ ✗ (no Result Check) ... 34.73 ... 78.02% (with Result Check) in End-to-End Geospatial Analysis”
OpenEarth-Agent paper · Table 4
“OpenEarth-Bench (Ours) achieves ✓ for Data Preparation, Feature Extraction, Geospatial Analysis, and Open Environment with modalities including RGB, Multi-Spectral, SAR, NTL, and Product, while existing benchmarks lack coverage in these areas.”
OpenEarth-Agent paper · Table 1
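To make "DAG-based workflow generation" concrete, here is a minimal sketch of how a planned workflow could be represented and ordered for execution. It is not the paper's implementation: the step names and the `execution_order` helper are hypothetical, chosen only to mirror the data preparation, feature extraction, and geospatial analysis stages. It relies on the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a step, each value is the set of
# steps it depends on. Step names are illustrative, not from the paper.
workflow = {
    "load_scene": set(),                                  # data preparation
    "load_boundaries": set(),                             # data preparation
    "mask_invalid": {"load_scene"},                       # data preparation
    "compute_index": {"mask_invalid"},                    # feature extraction
    "zonal_stats": {"compute_index", "load_boundaries"},  # geospatial analysis
}

def execution_order(dag):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(workflow)
print(order)
```

Any order returned respects the dependencies, so a failure in an early step (for example `mask_invalid`) affects every step downstream of it, which is one way to picture the cascading errors discussed in this review.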
Main concerns

The primary limitation is computational cost: the iterative feedback loops and dynamic tool creation require multiple LLM inference calls, which the authors acknowledge may limit deployment in time-critical applications. More critically, the end-to-end evaluation reveals cascading errors. Moving from stage-wise to end-to-end evaluation, GPT-5 drops from 85.40% to 82.38% in data preparation (3.02 points) and from 76.66% to 58.72% in geospatial analysis (17.94 points). The much larger loss at the final stage suggests that errors accumulate across the long-horizon pipeline. The claim that created tools are "functionally equivalent" to human-engineered ones lacks rigorous quantitative validation beyond anecdotal examples. Additionally, the comparison on Earth-Bench disables external knowledge integration for OpenEarth-Agent, but it is unclear whether this handicaps the system or whether the baseline Earth-Agent operates under similar constraints.

“GPT-5: Data Preparation 85.40 (Stage-Wise) vs 82.38 (End-to-End); Geospatial Analysis 76.66 (Stage-Wise) vs 58.72 (End-to-End)”
OpenEarth-Agent paper · Table 2
“To ensure a fair comparison, we disabled all external knowledge and tool integration within OpenEarth-Agent.”
OpenEarth-Agent paper · Section 5.2
“The continuous cycle of real-time data perception, dynamic DAG planning, and iterative tool refinement requires multiple LLM inference calls.”
OpenEarth-Agent paper · Section 6.1
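The size of the stage-wise to end-to-end gap is easier to judge as both an absolute and a relative drop. The short sketch below computes both from the Table 2 figures quoted above; the dictionary keys and the `accuracy_drop` helper are names introduced here for illustration.

```python
# GPT-5 accuracies (%) as quoted from Table 2 in this review.
stage_wise = {"data_preparation": 85.40, "geospatial_analysis": 76.66}
end_to_end = {"data_preparation": 82.38, "geospatial_analysis": 58.72}

def accuracy_drop(before, after):
    """Return (absolute drop in points, relative drop in percent of `before`)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)

drops = {stage: accuracy_drop(stage_wise[stage], end_to_end[stage])
         for stage in stage_wise}

for stage, (points, percent) in drops.items():
    print(f"{stage}: -{points} points ({percent}% relative)")
```

Data preparation loses 3.02 points (about 3.5% relative), while geospatial analysis loses 17.94 points (about 23.4% relative), so the loss is concentrated at the end of the pipeline.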
Evidence and comparison

The evidence supports the core claim that tool creation can approach tool-calling performance with far fewer pre-existing tools. Table 3 shows that, with GPT-5 on Earth-Bench, OpenEarth-Agent with only 6 tools reaches 59.92% accuracy against Earth-Agent's 63.16%, a gap of 3.24 points. Given all 104 tools, OpenEarth-Agent reaches 67.61%, 4.45 points above the baseline. The paper also identifies specific robustness advantages: created tools avoid three failure modes of predefined tools under data distribution shift, namely hard-coded sensor parameters, omitted invalid-value masking, and inappropriate numerical processing. However, the comparison assumes Earth-Agent cannot adapt its tool-calling strategy, which may underestimate the baseline. The ablation studies show that knowledge integration ($\mathcal{K}$) and tool integration ($\mathcal{T}$) provide complementary benefits, with combined use performing best (85.40% vs 79.69% with neither on stage-wise data preparation).

“OpenEarth-Agent (6 Tools): 59.92, Earth-Agent: 63.16, OpenEarth-Agent (Full Tools): 67.61 with GPT-5 on Earth-Bench”
OpenEarth-Agent paper · Table 3
“Hard-coded sensor parameters: Numerous tools rigidify parameters for specific sensors... Omission of invalid value masking... Inappropriate numerical processing... In contrast, OpenEarth-Agent overcomes these bottlenecks by leveraging active data perception”
OpenEarth-Agent paper · Section 5.2
“Knowledge Integration ✓, Tool Integration ✓ achieves 85.40 (Data Preparation), 85.27 (Feature Extraction), 76.66 (Geospatial Analysis) in Stage-Wise vs 79.69, 81.21, 72.48 with neither.”
OpenEarth-Agent paper · Table 5
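The failure modes quoted from Section 5.2 can be illustrated with a toy example. The sketch below is not taken from the paper or from Earth-Bench: `normalized_difference`, the pixel values, and the `-9999` nodata marker are all hypothetical. It contrasts a tool that receives the nodata value from the data's own metadata with one that ignores invalid values.

```python
import math

def normalized_difference(nir, red, nodata=None):
    """Per-pixel (nir - red) / (nir + red); invalid pixels become NaN.

    `nodata` is supplied by the caller from the dataset's metadata
    instead of being fixed inside the tool (hypothetical design).
    """
    out = []
    for n, r in zip(nir, red):
        invalid = nodata is not None and (n == nodata or r == nodata)
        if invalid or (n + r) == 0:
            out.append(math.nan)  # mask rather than emit a bogus value
        else:
            out.append((n - r) / (n + r))
    return out

# Toy scene: -9999 marks a missing NIR measurement (assumed convention).
nir = [0.5, -9999, 0.3, 0.0]
red = [0.1, 0.2, 0.3, 0.0]

masked = normalized_difference(nir, red, nodata=-9999)
unmasked = normalized_difference(nir, red)  # tool that omits invalid-value masking

print(masked)
print(unmasked)
```

Without masking, the second pixel yields a value slightly above 1, outside the valid range of a normalized difference index, and it would silently bias any statistic computed downstream.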
Reproducibility

The paper states that "Code and Benchmark will be available" with a GitHub link provided, though the repository's accessibility was not verified at the time of review. The methodology is described in sufficient detail for implementation: the multi-agent architecture, DAG-based workflow planning in Equation (1) $p^{*}=\mathcal{A}(\mathcal{P}), \mathcal{P}=\mathcal{G}(\mathcal{Q},\mathcal{D},\mathcal{K},\mathcal{T})$, and iterative refinement in Equation (2) $O^{(k+1)}=\mathcal{E}(\mathcal{M}(\mathcal{C}^{(k)},\mathcal{F}^{(k)}),\mathcal{D})$ are clearly formalized. However, several hyperparameters remain unspecified: the number of candidate plans $n$ in the aggregation mechanism, thresholds for validity checking in $\mathcal{V}$, and the embedding models used for semantic retrieval. The evaluation protocol is well defined, with three metrics (Accuracy, Debug Rounds, Running Time) across stage-wise and end-to-end settings. Error margins for numerical fidelity checks are mentioned but not quantified precisely.

“Code and Benchmark will be available at https://github.com/walking-shadow/OpenEarth-Agent”
OpenEarth-Agent paper · Abstract
“Equation (1): p^{*}=\mathcal{A}(\mathcal{P}), \mathcal{P}=\mathcal{G}(\mathcal{Q},\mathcal{D},\mathcal{K},\mathcal{T})”
OpenEarth-Agent paper · Section 3.1
“Equation (2): O^{(k+1)}=\mathcal{E}(\mathcal{M}(\mathcal{C}^{(k)},\mathcal{F}^{(k)}),\mathcal{D}), \quad\text{s.t.}\quad \mathcal{V}(O^{(k+1)})=1”
OpenEarth-Agent paper · Section 3.1
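The control flow implied by Equation (2) can be sketched as a simple loop. The reading of the symbols is an assumption of this sketch, since the quoted text does not define them: $\mathcal{C}$ is taken to be the current tool code, $\mathcal{F}$ the feedback, $\mathcal{M}$ a modification step, $\mathcal{E}$ execution on data $\mathcal{D}$, and $\mathcal{V}$ a validity check. The function names, the `max_rounds` cap, and the toy rescaling task are all hypothetical.

```python
def refine_until_valid(code, data, execute, modify, validate, max_rounds=5):
    """Repeat O = E(M(C, F), D) until V(O) holds; return (output, debug rounds)."""
    feedback = None
    for rounds in range(max_rounds + 1):
        if rounds > 0:
            code = modify(code, feedback)   # M(C, F): revise the tool
        output = execute(code, data)        # E(., D): run it on the data
        is_valid, feedback = validate(output)  # V(O) plus feedback F
        if is_valid:
            return output, rounds
    raise RuntimeError("no valid output within max_rounds")

# Toy task: rescale integer-coded reflectance into the range [0, 1].
def execute(code, data):
    return [value * code["scale"] for value in data]

def validate(output):
    if all(0.0 <= value <= 1.0 for value in output):
        return True, None
    return False, "values outside [0, 1]"

def modify(code, feedback):
    # A real agent would rewrite the tool from the feedback; here we shrink the scale.
    return {"scale": code["scale"] / 100}

output, rounds = refine_until_valid(
    {"scale": 1.0}, [2000, 4000], execute, modify, validate
)
print(output, rounds)
```

The returned round count corresponds loosely to the Debug Rounds metric mentioned above, and each round stands for at least one LLM call in a real system, which is where the reported overhead comes from.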
Abstract

Earth Observation (EO) is essential for perceiving dynamic land surface changes, yet deploying autonomous EO in open environments is hindered by the immense diversity of multi-source data and heterogeneous tasks. While remote sensing agents have emerged to streamline EO workflows, existing tool-calling agents are confined to closed environments. They rely on pre-defined tools and are restricted to narrow scope, limiting their generalization to the diverse data and tasks. To overcome these limitations, we introduce OpenEarth-Agent, the first tool-creation agent framework tailored for open-environment EO. Rather than calling predefined tools, OpenEarth-Agent employs adaptive workflow planning and tool creation to generalize to unseen data and tasks. This adaptability is bolstered by an open-ended integration of multi-stage tools and cross-domain knowledge bases, enabling robust execution in the entire EO pipeline across multiple application domains. To comprehensively evaluate EO agents in open environments, we propose OpenEarth-Bench, a novel benchmark comprising 596 real-world, full-pipeline cases across seven application domains, explicitly designed to assess agents' adaptive planning and tool creation capabilities. Only essential pre-trained model tools are provided in this benchmark, devoid of any other predefined task-specific tools. Extensive experiments demonstrate that OpenEarth-Agent successfully masters full-pipeline EO across multiple domains in the open environment. Notably, on the cross-benchmark Earth-Bench, our tool-creating agent equipped with 6 essential pre-trained models achieves performance comparable to tool-calling agents relying on 104 specialized tools, and significantly outperforms them when provided with the complete toolset. In several cases, the created tools exhibit superior robustness to data anomalies compared to human-engineered counterparts.
