KG-Hopper: Empowering Compact Open LLMs with Knowledge Graph Reasoning via Reinforcement Learning
KG-Hopper addresses Knowledge Base Question Answering (KBQA) by training compact 7B LLMs to perform multi-hop reasoning over Knowledge Graphs in a single inference round. Unlike sequential multi-step approaches that suffer from error cascades, it embeds the entire KG traversal process into a unified "thinking" stage using reinforcement learning. The core innovation is using GRPO (Group Relative Policy Optimization) with composite rewards to teach models to autonomously invoke retrieval tools via special tokens and reason across multiple hops without predefined pipelines.
KG-Hopper presents a promising approach to KBQA that aligns with recent trends in reasoning LLMs (Chain of Thought, o1-style thinking). The method is well-motivated: step-by-step KBQA pipelines do suffer from error propagation and inflexibility. The use of RL to train tool-augmented reasoning within a single LLM call is technically sound. However, the paper's central claim of novelty is weakened by prior RL-based KG traversal work (DeepPath, MINERVA) that the authors themselves cite. The statement "To the best of our knowledge, this is the first work to apply RL to enable end-to-end KG reasoning within LLMs" is questionable, given that the cited works (Xiong et al., 2017; Das et al., 2018) used RL for KG navigation, though perhaps not with modern LLMs as the policy. The empirical results are impressive on the surface, with a 7B model outperforming 70B baselines, but without Table I (referenced but not provided in the text), full verification is impossible.
The reward design is comprehensive and well thought out. The four-component reward, covering retrieval ($R_{\text{search}}$), format ($R_{\text{format}}$), reasoning process ($R_{\text{reason}}$), and final answer ($R_{\text{answer}}$), provides fine-grained supervision signals that address different aspects of the KBQA task. The masking of retrieved triples in the loss computation is a strong practical detail that keeps the model from merely copying KG content. The history resampling strategy, which removes easy one-hop questions after initial training, demonstrates awareness of curriculum learning principles. Eq. (1) for the retrieval reward, $R_{\text{search}}=\min(0.5\cdot n,0.8)$, rewards tool use while capping the gain from additional queries at 0.8.
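To make the shape of Eq. (1) concrete, here is a minimal sketch of the retrieval reward. It assumes that $n$ counts the retrieval-tool calls in one rollout (the review text does not define $n$), and the function name `retrieval_reward` is hypothetical rather than taken from the paper's code.

```python
def retrieval_reward(n_calls: int) -> float:
    """R_search = min(0.5 * n, 0.8), following Eq. (1) as quoted in the review.

    Assumption: n_calls is the number of retrieval-tool invocations
    in a single rollout; the source does not define n explicitly.
    """
    if n_calls < 0:
        raise ValueError("n_calls must be non-negative")
    return min(0.5 * n_calls, 0.8)


# Reward for 0..4 retrieval calls.
rewards_by_calls = {n: retrieval_reward(n) for n in range(5)}
```

Under this reading, the reward is 0 with no retrieval, 0.5 after one call, and reaches the 0.8 cap from the second call onward. Further calls earn nothing extra, but the formula by itself does not penalize them.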
There are several critical issues. First, the experimental section references Table LABEL:tab:grouped-llm-comparison and Table I, but neither appears in the provided text, making it impossible to verify the quantitative claims. Second, the paper relies on external LLM judges (Llama-3.3-70B for reasoning quality, Llama-3.2-3B for answer evaluation), which raises circularity concerns: using LLM-as-a-judge to train LLMs can create evaluation artifacts. Third, the claim of being "first to apply RL to end-to-end KG reasoning within LLMs" contradicts the paper's own citations: Xiong et al. (2017) (DeepPath) and Das et al. (2018) both used RL for KG reasoning. While those used LSTMs rather than transformers, the core claim needs qualification. Finally, the abstract claims results on "eight KG reasoning benchmarks", but at least three of these (T-REx, Zero-Shot RE, Creak) are not standard KBQA datasets; they are slot-filling and fact verification tasks.
The evidence is incomplete without Table I. The authors claim their 7B model "consistently outperforms larger multi-step systems (up to 70B)" and matches GPT-4o-mini, but without the actual numbers, this cannot be verified. The comparison to prior work is selective: they cite Think-on-Graph and Interactive-KBQA but do not establish clear methodological distinctions beyond the single-call vs. multi-step architecture. The use of "Hits@1" for slot-filling datasets is unconventional (these typically use F1 or exact match). The paper does not address why masking retrieved triples does not hurt the model's ability to learn from retrieved content; there seems to be a tension between masking triples in SFT and relying on them for answer correctness.
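The masking question can be made more concrete with a small sketch of one common way such masking is implemented. This is an assumption about the general technique, not the paper's confirmed method: tokens inside a retrieved-triples span are excluded from the loss but stay in the input sequence. The tag names `<triples>` and `</triples>` and both function names are hypothetical.

```python
def build_loss_mask(tokens, open_tag="<triples>", close_tag="</triples>"):
    """Return 1 for tokens that contribute to the loss, 0 for retrieved content.

    Tokens from open_tag through close_tag (inclusive) are masked out.
    The tag names are hypothetical placeholders.
    """
    mask, inside = [], False
    for tok in tokens:
        if tok == open_tag:
            inside = True
        mask.append(0 if inside else 1)
        if tok == close_tag:
            inside = False
    return mask


def masked_nll(token_logprobs, loss_mask):
    """Mean negative log-likelihood over the unmasked tokens only."""
    if len(token_logprobs) != len(loss_mask):
        raise ValueError("token_logprobs and loss_mask must have equal length")
    kept = [lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return -sum(kept) / len(kept) if kept else 0.0


# Toy sequence: model-generated tokens plus one injected retrieval result.
tokens = ["<think>", "<search>", "born_in(X)", "</search>",
          "<triples>", "(X, born_in, Y)", "</triples>", "Y", "</think>"]
logprobs = [-0.5, -0.5, -0.5, -0.5, -2.0, -2.0, -2.0, -0.5, -0.5]

mask = build_loss_mask(tokens)
loss_masked = masked_nll(logprobs, mask)
loss_unmasked = masked_nll(logprobs, [1] * len(tokens))
```

In this sketch the masked tokens are only dropped from the loss average; they remain in the sequence that later tokens are conditioned on. Whether the paper's implementation behaves this way is exactly the detail the review asks the authors to clarify.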
Reproducibility is mixed. On the positive side, the authors provide hyperparameters (learning rate $1\times10^{-6}$, 2 epochs, batch size 16, GRPO with clip ratio 0.2, KL penalty $1\times10^{-5}$) and dataset details. The code is claimed to be at https://github.com/Wangshuaiia/KG-Hopper. However, critical details are missing: (1) the cold-start dataset construction uses "few-shot prompting with a long CoT example", but the exact prompt and the LLM used for generation are not specified; (2) the LLM-as-judge prompts for computing $R_{\text{reason}}$ and $R_{\text{answer}}$ are not provided, which is crucial since these rewards drive the entire RL training; (3) no ablation studies are shown to isolate the contribution of RL vs. SFT vs. the cold-start; (4) the history resampling threshold is not quantified.
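For readers unfamiliar with how the reported clip ratio enters GRPO, the following is a minimal sketch of the two core pieces of the standard formulation: group-relative advantages and the PPO-style clipped surrogate term. It is not the authors' implementation. The function names are hypothetical, the use of the population standard deviation is an assumption (implementations differ), and the KL penalty term is omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its sampling group.

    Assumption: population standard deviation; eps avoids division by
    zero when every rollout in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO-style clipped term; clip=0.2 matches the reported clip ratio.

    ratio is the new-to-old policy probability ratio for a token.
    """
    clipped_ratio = max(1.0 - clip, min(ratio, 1.0 + clip))
    return min(ratio * advantage, clipped_ratio * advantage)


# Four rollouts for one question, two of which earned the full reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are computed relative to the group, a question on which every rollout receives the same reward contributes no learning signal. That is one plausible motivation for resampling away easy questions, though the review text does not state the authors' rationale.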
Large Language Models (LLMs) demonstrate impressive natural language capabilities but often struggle with knowledge-intensive reasoning tasks. Knowledge Base Question Answering (KBQA), which leverages structured Knowledge Graphs (KGs), exemplifies this challenge due to the need for accurate multi-hop reasoning. Existing approaches typically perform sequential reasoning steps guided by predefined pipelines, restricting flexibility and causing error cascades due to isolated reasoning at each step. To address these limitations, we propose KG-Hopper, a novel Reinforcement Learning (RL) framework that empowers compact open LLMs with the ability to perform integrated multi-hop KG reasoning within a single inference round. Rather than reasoning step-by-step, we train a Reasoning LLM that embeds the entire KG traversal and decision process into a unified "thinking" stage, enabling global reasoning over cross-step dependencies and dynamic path exploration with backtracking. Experimental results on eight KG reasoning benchmarks show that KG-Hopper, based on a 7B-parameter LLM, consistently outperforms larger multi-step systems (up to 70B) and achieves competitive performance with proprietary models such as GPT-3.5-Turbo and GPT-4o-mini, while remaining compact, open, and data-efficient. The code is publicly available at: https://github.com/Wangshuaiia/KG-Hopper.