KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning

cs.CL cs.AI Shuai Wang, Yinan Yu · Mar 22, 2026


AI Review

Plain-language introduction

KG-Hopper addresses Knowledge Base Question Answering (KBQA) by training compact 7B LLMs to perform multi-hop reasoning over Knowledge Graphs in a single inference round. Unlike sequential multi-step approaches that suffer from error cascades, it embeds the entire KG traversal process into a unified "thinking" stage using reinforcement learning. The core innovation is using GRPO (Group Relative Policy Optimization) with composite rewards to teach models to autonomously invoke retrieval tools via special tokens and reason across multiple hops without predefined pipelines.
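To make the "group relative" part of GRPO concrete, here is a minimal sketch of the standard advantage normalization: each rollout's reward is compared against the other rollouts sampled for the same question. The function name, the example reward values, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    rewards: scalar rewards for several rollouts of the SAME question.
    eps: small constant (an assumption) that avoids division by zero
         when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical composite rewards for four rollouts of one question.
rewards = [2.1, 0.8, 0.8, 1.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts that score above their group's mean get a positive advantage and are reinforced; those below the mean are discouraged. No separate value network is needed, which is one reason the approach suits compact models.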

Critical review

Verdict

Bottom line

KG-Hopper presents a promising approach to KBQA that aligns with recent trends in reasoning LLMs (Chain of Thought, o1-style thinking). The method is well motivated: step-by-step KBQA pipelines do suffer from error propagation and inflexibility. Using RL to train tool-augmented reasoning within a single LLM call is technically sound. However, the paper's central novelty claim is weakened by prior RL-based KG traversal work (DeepPath, MINERVA) that the authors themselves cite. The statement "To the best of our knowledge, this is the first work to apply RL to enable end-to-end KG reasoning within LLMs" is questionable given that the cited works (Xiong et al., 2017; Das et al., 2018) used RL for KG navigation, though not with an LLM as the policy. The empirical results are impressive on the surface, with a 7B model outperforming 70B baselines, but Table I is referenced without being included in the provided text, so the numbers cannot be fully verified.

“To the best of our knowledge, this is the first work to apply RL to enable end-to-end KG reasoning within LLMs”

paper · Introduction, Contributions

“Reinforcement learning (RL) have demonstrated its effectiveness in navigating discrete and combinatorial decision spaces, such as KG traversal by optimizing reasoning policies through exploration and long-term rewards [xiong2017deeppath, das2018go, lin2018multi]”

paper · Introduction

What holds up

The reward design is comprehensive and well thought out. The four-component reward, covering retrieval (R_search), format (R_format), reasoning process (R_reason), and final answer (R_answer), provides fine-grained supervision signals that address different aspects of the KBQA task. Masking retrieved triples in the loss computation is a strong practical detail that keeps the model from merely copying KG content. The history resampling strategy, which removes easy one-hop questions after initial training, shows awareness of curriculum learning principles. Eq. (1) for the retrieval reward, $R_{\text{search}}=\min(0.5\cdot n,0.8)$, rewards tool use while capping the payoff: one call earns 0.5 and two or more calls earn 0.8, so further queries bring no additional reward, although they are not explicitly penalized.

“The retrieval reward is defined as: $R_{\text{search}}=\min(0.5\cdot n,0.8)$ where $n$ denotes the number of times the query tool is invoked. This design incentivizes query usage while discouraging excessive or redundant retrievals”

paper · Section III-C
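The quoted formula can be evaluated directly to see where the cap takes effect. This is a small sketch of Eq. (1) only; the function name and the input check are assumptions for illustration.

```python
def retrieval_reward(n_calls: int) -> float:
    """R_search = min(0.5 * n, 0.8), where n is the number of query-tool calls."""
    if n_calls < 0:
        raise ValueError("n_calls must be non-negative")
    return min(0.5 * n_calls, 0.8)


for n in range(5):
    print(n, retrieval_reward(n))
# 0 0.0
# 1 0.5
# 2 0.8
# 3 0.8
# 4 0.8
```

The reward saturates at the second call. Taken on its own, this term removes the incentive for extra retrievals but does not subtract anything for them.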

“We therefore mask tokens enclosed by <triples> and </triples> during loss computation to prevent the model from learning to reproduce retrieved content”

paper · Section III-D
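A minimal sketch of how such a mask could be built is shown below. It assumes that `<triples>` and `</triples>` each appear as a single token and that the tag tokens themselves are also masked; neither detail is stated in the quoted passage, and the function name is hypothetical.

```python
def loss_mask(tokens, open_tag="<triples>", close_tag="</triples>"):
    """Return 1 for tokens that contribute to the loss and 0 for masked ones.

    Assumption: the opening and closing tags are single tokens and are
    masked together with the retrieved content between them.
    """
    mask = []
    inside = False
    for tok in tokens:
        if tok == open_tag:
            inside = True
            mask.append(0)
        elif tok == close_tag:
            mask.append(0)
            inside = False
        else:
            mask.append(0 if inside else 1)
    return mask


tokens = ["<think>", "query", "<triples>", "(A,", "r,", "B)",
          "</triples>", "so", "B", "</think>"]
print(loss_mask(tokens))
# [1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
```

Masked positions still sit in the model's context, so the model can condition on the retrieved triples when producing later tokens. The mask only excludes those positions from the training objective.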

Main concerns

There are several critical issues. First, the experimental section references Table I and an unresolved LaTeX label (Table LABEL:tab:grouped-llm-comparison), but neither table appears in the provided text, making it impossible to verify the quantitative claims. Second, the paper relies on external LLM judges (Llama-3.3-70B for reasoning quality, Llama-3.2-3B for answer evaluation), which raises circularity concerns: using LLM-as-a-judge to train LLMs can create evaluation artifacts. Third, the claim of being "first to apply RL to end-to-end KG reasoning within LLMs" sits uneasily with their own citations: Xiong et al. (2017) DeepPath and Das et al. (2018) both used RL for KG reasoning. Those systems used small task-specific policy networks rather than LLMs, so the claim may hold narrowly, but it needs qualification. Finally, the abstract claims results on "eight KG reasoning benchmarks", but three of these (T-REx, Zero-Shot RE, Creak) are not standard KBQA datasets; they are slot-filling and fact verification tasks.

“The scoring model $f_r$ for the reasoning process is Llama-3.3-70B, while the model $f_a$ used to determine whether the predicted answer matches the ground truth is Llama-3.2-3B”

paper · Section IV-B

“T-REx and Zero-Shot RE are designed for slot filling tasks, and Creak focuses on factual verification”

paper · Section IV-A

Evidence and comparison

The evidence is incomplete without Table I. The authors claim their 7B model "consistently outperforms larger multi-step systems (up to 70B)" and matches GPT-4o-mini, but without the actual numbers, this cannot be verified. The comparison to prior work is selective: they cite Think-on-Graph and Interactive-KBQA but don't establish clear methodological distinctions beyond the single-call vs. multi-step architecture. The use of "Hits@1" for slot-filling datasets is unconventional (these typically use F1 or exact match). The paper also doesn't explain why masking retrieved triples doesn't hurt the model's ability to learn from retrieved content; there seems to be a tension between masking triples during loss computation and relying on them for answer correctness.

“KG-Hopper, based on a 7B-parameter LLM, consistently outperforms multi-step methods using models up to 70B, and matches or exceeds the performance of GPT-4o-mini + KG”

paper · Section IV-C
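For readers unfamiliar with the metric discussed above, the sketch below shows one common way to compute Hits@1 when a system returns a single answer per question: the prediction counts as a hit if it matches any gold answer. The function names and the lowercase/whitespace normalization are assumptions for illustration; the text available here does not say how the paper computes the metric.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def hits_at_1(predictions, gold_sets):
    """Fraction of questions whose top prediction is among the gold answers."""
    if not predictions or len(predictions) != len(gold_sets):
        raise ValueError("predictions and gold_sets must be non-empty and aligned")
    hits = 0
    for pred, gold in zip(predictions, gold_sets):
        if normalize(pred) in {normalize(g) for g in gold}:
            hits += 1
    return hits / len(predictions)


# Hypothetical toy data, not taken from any benchmark.
predictions = ["Paris", "berlin ", "Rome"]
gold_sets = [["Paris"], ["Berlin", "Berlin, Germany"], ["Milan"]]
print(hits_at_1(predictions, gold_sets))  # 2 of 3 predictions are hits
```

With one prediction and one gold answer per question, this reduces to exact-match accuracy. The choice of metric therefore matters mainly when questions have several valid answers or when partial credit (as in F1) is expected.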

Reproducibility

Reproducibility is mixed. On the positive side, the authors provide hyperparameters (learning rate $1\text{e}{-6}$, 2 epochs, batch size 16, GRPO with clip ratio 0.2, KL penalty $1\text{e}{-5}$) and dataset details. The code is claimed to be at https://github.com/Wangshuaiia/KG-Hopper. However, critical details are missing: (1) The cold-start dataset construction uses "few-shot prompting with a long CoT example" but the exact prompt and the LLM used for generation aren't specified; (2) The LLM-as-judge prompts for computing $R_{\text{reason}}$ and $R_{\text{answer}}$ are not provided, which is crucial since these rewards drive the entire RL training; (3) No ablation studies are shown to isolate the contribution of RL vs SFT vs the cold-start; (4) The history resampling threshold is not quantified.

“We train for 2 epochs with a batch size of 16 and a learning rate of $1\text{e}{-6}$. The rollout temperature is set to 1, the PPO clip ratio is 0.2, and the KL divergence penalty coefficient is $1\text{e}{-5}$”

paper · Section IV-B

“The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper”

paper · Abstract

Abstract

Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs), exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified "thinking" stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper.
