Graph Fusion Across Languages using Large Language Models

cs.CL cs.IR Kaung Myat Kyaw, Khush Agarwal, Jonathan Chan · Mar 22, 2026
AI Review
Plain-language introduction

This paper addresses cross-lingual knowledge graph fusion, where heterogeneous KGs in different languages must be unified without expensive manually-curated seed alignments. The core idea is to use Large Language Models as a universal semantic bridge by linearizing graph triplets into natural language sequences and sequentially agglomerating multiple graphs. This matters because it promises zero-shot alignment capability for low-resource languages where traditional embedding-based methods fail due to lack of training data.

Critical review
Verdict
Bottom line

This exploratory study convincingly demonstrates that LLMs can achieve high precision (88% with confidence filtering) in zero-shot entity alignment, but the work remains a preliminary proof-of-concept with severe limitations. The evaluation contradicts the paper's framing: despite claiming coverage of Chinese-English, Japanese-English, and French-English pairs, results are reported only for the Chinese-English subset. With recall at a critically low 23.6% and no quantitative comparison against recent LLM-based baselines like ZeroEA or Seg-Align, the practical utility for N-graph fusion remains unproven.

“To assess the framework's efficacy, we utilize the DBP15K (Chinese-English) Sun et al. ([2017]) dataset”
paper · Section 4.1
“We provide an evaluation on the multilingual DBP15K Sun et al. ([2017]), covering Chinese-English, Japanese-English, and French-English pairs”
paper · Introduction
“Recall: 23.6%”
paper · Table 1
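To put the two headline numbers side by side, here is a minimal sketch that combines them into an F1 score. It assumes the 88% precision and the 23.6% recall describe the same operating point, which the review does not confirm; the function name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the review. Pairing them is an assumption:
# the review does not say they come from the same threshold setting.
precision = 0.88
recall = 0.236

f1 = f1_score(precision, recall)
missed_fraction = 1 - recall  # share of gold alignments never found

print(f"F1 = {f1:.3f}")
print(f"Missed alignments = {missed_fraction:.1%}")
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two figures, so high precision alone cannot offset low recall.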
What holds up

The modular pipeline architecture is well-designed, particularly the entity-centric partitioning that preserves topological neighborhood context within LLM context windows. The finding that LLM-generated confidence scores separate correct from incorrect alignments is valuable: true positives had a mean confidence of $0.980$ versus $0.738$ for false positives, which is what makes precision-recall trade-offs via thresholding workable. The robust response parsing that recovers partial JSON from truncated outputs shows practical engineering awareness.

“True Positives exhibited a mean confidence of 0.980, whereas False Positives averaged 0.738”
paper · Section 4.4
“The parsing function $f_{parse}$ recovers partial JSON objects by identifying the last complete element in the sequence and closing unclosed braces”
paper · Section 3.5
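The two mechanisms praised above, partial-JSON recovery and confidence thresholding, can be sketched together. This is not the paper's $f_{parse}$: it is one possible reading of "identify the last complete element and close unclosed braces", assuming the LLM returns a JSON array of objects. The field names `source`, `target`, and `confidence`, the toy alignments, and the 0.9 threshold are all hypothetical.

```python
import json


def recover_partial_json_array(text: str) -> list:
    """Parse a JSON array, keeping only complete elements if it is truncated.

    Assumes the array elements are objects (or arrays), not bare scalars.
    """
    try:
        parsed = json.loads(text)
        return parsed if isinstance(parsed, list) else [parsed]
    except json.JSONDecodeError:
        pass

    start = text.find("[")
    if start == -1:
        return []

    depth = 0            # nesting depth inside the outer array
    in_string = False
    escaped = False
    last_complete = -1   # index just past the last complete element

    for i in range(start + 1, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            if depth == 0:
                break  # reached the outer array's own closing bracket
            depth -= 1
            if depth == 0:
                last_complete = i + 1

    if last_complete == -1:
        return []
    try:
        # Drop the truncated tail and close the outer array.
        return json.loads(text[start:last_complete] + "]")
    except json.JSONDecodeError:
        return []


def filter_by_confidence(alignments: list, threshold: float) -> list:
    """Keep alignments whose self-reported confidence meets the threshold."""
    return [a for a in alignments if a.get("confidence", 0.0) >= threshold]


# Hypothetical truncated LLM output: three complete objects, one cut off.
raw = (
    '[{"source": "zh:北京", "target": "en:Beijing", "confidence": 0.98}, '
    '{"source": "zh:上海", "target": "en:Shanghai", "confidence": 0.97}, '
    '{"source": "zh:长江", "target": "en:Yellow_River", "confidence": 0.74}, '
    '{"source": "zh:东京", "target": "en:Tok'
)

recovered = recover_partial_json_array(raw)
kept = filter_by_confidence(recovered, threshold=0.9)
print(len(recovered), len(kept))
```

In this toy case the low-confidence pair is also the wrong one, so thresholding raises precision; any correct pair scored below the threshold would be discarded as well, which is the recall cost the review highlights.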
Main concerns

The primary flaw is the evaluation scope mismatch: the paper frames itself as N-graph fusion but evaluates only binary Chinese-English alignment, omitting the Japanese and French results promised in the introduction. Recall is prohibitively low at 23.6%, meaning the system misses more than three-quarters (76.4%) of valid alignments. The 'exhaustive batch pairing' strategy generates $k \times k'$ reasoning tasks (the Cartesian product of the two graphs' partitions, which grows quadratically when both graphs are partitioned at the same granularity) and requires 5–6 hours for a single language pair, which undercuts the claims of scalability. Most critically, the paper cites but does not compare against recent LLM-based aligners like ZeroEA or Seg-Align, making it impossible to assess whether the proposed method offers any advantage over simpler prompt-based baselines.

“This results in $k \times k'$ discrete reasoning tasks. While computationally intensive...”
paper · Section 3.3
“The pipeline completes the exhaustive batch processing in approximately 5–6 hours”
paper · Section 4.1
“Recall: 23.6%”
paper · Table 1
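The growth of the $k \times k'$ pairing can be illustrated with a short sketch. The partition counts below are hypothetical, since the review does not report $k$ or $k'$, and the 5.5-hour figure is simply the midpoint of the quoted 5–6 hour range; the helper names are illustrative.

```python
from itertools import product


def batch_pairs(source_batches, target_batches):
    """Return every (source, target) batch pair, i.e. the Cartesian product."""
    return list(product(source_batches, target_batches))


def seconds_per_task(total_hours: float, n_tasks: int) -> float:
    """Average wall-clock seconds per reasoning task for a given total runtime."""
    return total_hours * 3600 / n_tasks


# Hypothetical partition counts; not taken from the paper.
k, k_prime = 50, 50
pairs = batch_pairs(range(k), range(k_prime))
print(len(pairs))                         # number of k * k' reasoning tasks
print(seconds_per_task(5.5, len(pairs)))  # implied average per task at 5.5 h

# Doubling both partition counts quadruples the number of tasks.
doubled = batch_pairs(range(2 * k), range(2 * k_prime))
print(len(doubled) / len(pairs))
```

Under these assumed counts, halving the batch size to fit a smaller context window would quadruple the number of LLM calls, which is why the pairing step dominates the runtime.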
Evidence and comparison

The evidence supports high-precision zero-shot alignment but undermines scalability claims. The paper asserts 'linear computational complexity $O(N)$ relative to the number of graphs' (Section 3.6), but this ignores the dominant cost of exhaustive batch pairing between partitions. The 23.6% recall suggests the system fails to discover most alignments, likely due to the partition-and-pair strategy missing cross-partition correspondences. No numerical comparison is provided against embedding-based methods (MTransE, GCN-Align) or contemporary LLM approaches (ZeroEA, Seg-Align), leaving the reader unable to assess whether the gains justify the API costs and runtime.

“maintaining a linear computational complexity $O(N)$ relative to the number of graphs”
paper · Section 3.6
“This results in $k \times k'$ discrete reasoning tasks”
paper · Section 3.3
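The tension between the $O(N)$ claim and the $k \times k'$ pairing cost can be made explicit with a rough count of LLM calls. The derivation below is a sketch under assumptions not stated in the paper: $k_t$ denotes the number of partitions of the incoming graph $G_{t}$, $K_{t-1}$ the number of partitions of the fused graph $G_{c}^{(t-1)}$, each batch pair costs one call, and every input graph contributes about $k$ partitions.

```latex
% Total LLM calls for sequentially fusing N graphs (one call per batch pair)
C(N) = \sum_{t=2}^{N} K_{t-1}\, k_t

% Case 1 (assumed): the fused graph does not grow, K_{t-1} \approx k
C(N) \approx (N-1)\, k^{2} = O(N k^{2})

% Case 2 (assumed): the fused graph grows with each merge, K_{t-1} \approx (t-1)\, k
C(N) \approx k^{2} \sum_{t=2}^{N} (t-1) = \frac{N(N-1)}{2}\, k^{2} = O(N^{2} k^{2})
```

In both cases the number of fusion steps is $N-1$, which is linear in $N$. The call count, however, is linear in $N$ only if the fused graph stays roughly constant in size, and it is quadratic in the partition count either way.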
Reproducibility

The anonymous repository link (https://anonymous.4open.science/r/KG-Fusion-1A7D/README.md) provides code access, but critical experimental details are missing: exact partitioning hyperparameters (batch sizes), full prompt templates with system personas, and per-API-call costs. The use of Gemini 2.5 Flash with temperature 0.0 aids reproducibility but ties the results to a single proprietary model whose availability may change; results may not transfer to open-weight models like Llama 3. The 5–6 hour runtime per dataset pair on unspecified hardware limits accessibility for replication.

“Our implementation and experimental framework are publicly available in our anonymized repository: https://anonymous.4open.science/r/KG-Fusion-1A7D/README.md”
paper · Abstract
“The pipeline employs Gemini 2.5 Flash with the temperature set to 0.0”
paper · Section 4.1
Abstract

Combining multiple knowledge graphs (KGs) across linguistic boundaries is a persistent challenge due to semantic heterogeneity and the complexity of graph environments. We propose a framework for cross-lingual graph fusion, leveraging the in-context reasoning and multilingual semantic priors of Large Language Models (LLMs). The framework implements structural linearization by mapping triplets directly into natural language sequences (e.g., [head] [relation] [tail]), enabling the LLM to map relations and reconcile entities between an evolving fused graph ($G_{c}^{(t-1)}$) and a new candidate graph ($G_{t}$). Evaluated on the DBP15K dataset, this exploratory study demonstrates that LLMs can serve as a universal semantic bridge to resolve cross-lingual discrepancies. Results show the successful sequential agglomeration of multiple heterogeneous graphs, offering a scalable, modular solution for continuous knowledge synthesis in multi-source, multilingual environments.
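The structural linearization described in the abstract can be sketched as follows. The exact template is an assumption (a space-separated `head relation tail` string), as are the `Triple` type, the `entity_neighborhoods` helper, and the toy triples; the grouping step is only one plausible reading of the entity-centric partitioning mentioned in the review.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def linearize(triple: Triple) -> str:
    """Render one triple as a plain 'head relation tail' sequence."""
    return f"{triple.head} {triple.relation} {triple.tail}"


def entity_neighborhoods(triples):
    """Group linearized triples under every entity they mention (head or tail)."""
    groups = defaultdict(list)
    for t in triples:
        text = linearize(t)
        groups[t.head].append(text)
        if t.tail != t.head:
            groups[t.tail].append(text)
    return dict(groups)


# Toy triples for illustration; not taken from DBP15K.
triples = [
    Triple("Beijing", "capital_of", "China"),
    Triple("Beijing", "type", "City"),
    Triple("Shanghai", "located_in", "China"),
]

neighborhoods = entity_neighborhoods(triples)
for line in neighborhoods["China"]:
    print(line)
```

Keeping each entity's triples together means a batch sent to the LLM carries the entity's immediate neighborhood, at the cost of repeating a triple under both of its endpoints.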
