Can LLMs Fool Graph Learning? Exploring Universal Adversarial Attacks on Text-Attributed Graphs

cs.AI Zihui Chen, Yuling Wang, Pengfei Jiao, Kai Wu, Xiao Wang, Xiang Ao, Dalin Zhang · Mar 22, 2026

Plain-language introduction

BadGraph proposes the first universal adversarial attack on text-attributed graphs (TAGs) that works across both GNN and LLM backbones. The core idea: use an LLM agent to perturb topology and text jointly, creating 'cross-modal shortcuts' to mislead models without gradient access. This matters because TAG security is understudied and existing attacks fail when models use rich text encoders like SBERT or TAPE.

Critical review

Verdict

Bottom line

BadGraph identifies a real gap: no prior attack effectively targets text and structure simultaneously across heterogeneous backbones. It answers that gap with a practical, API-driven solution. The empirical results are strong, with accuracy drops of up to 76.3% on DeepSeek-V3 and consistent degradation across GNN architectures (Table 1, Table 2). However, the 'universal' claim is tempered by limited backbone diversity (only two LLM reasoners, both decoder-only), and the theoretical analysis in Section 5 provides intuition rather than tight guarantees. The paper is a solid contribution to TAG security but oversells its theoretical grounding.

“BadGraph reduces DeepSeek's prediction accuracy by 76.3%, whereas the best-performing baseline, WTGIA, results in only an 11.3% drop”

BadGraph paper, Section 6.2.2 · Table 2

What holds up

The cross-modal alignment insight is the paper's strongest conceptual contribution. The authors demonstrate that attacking both modalities together significantly outperforms isolated attacks (Table 1: 'Our-text' and 'Our-struct' variants vs. full BadGraph). The ablation in Table 4 ('Anchor Mis.') confirms this: misaligning the influencer anchor between the textual and structural attacks reduces attack performance by up to 22.66%. The homophily stability observation is well supported empirically: Table 3 shows node- and edge-level homophily remain nearly unchanged (0.8184 → 0.8181) even as accuracy plummets from 64.6% to 22.6%. This supports the claim that the perturbations are localized and stealthy.

“Replacing the shared influencer node between textual and structural attacks leads to a sharp performance drop (up to 22.66%), demonstrating that the attack's effectiveness stems from semantic alignment across modalities”

BadGraph paper, Section 6.3.3 · Table 4
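
The homophily check discussed above can be illustrated with a minimal sketch. It assumes edge homophily is defined as the fraction of edges whose endpoints share a label, which is a common definition but may differ from the exact measure in Table 3. The toy graph, the labels, and the rewiring are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]


def edge_homophily(edges: Sequence[Edge], labels: Sequence[int]) -> float:
    """Return the fraction of edges whose two endpoints share a label."""
    if not edges:
        raise ValueError("edge_homophily is undefined for a graph with no edges")
    same_label = sum(1 for u, v in edges if labels[u] == labels[v])
    return same_label / len(edges)


# Hypothetical toy graph (not the paper's data): six nodes, two classes.
labels: List[int] = [0, 0, 0, 1, 1, 1]
clean_edges: List[Edge] = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]

# Hypothetical local perturbation: rewire node 3 from neighbor 4 to neighbor 5.
# Both the removed and the added edge connect same-class nodes.
perturbed_edges: List[Edge] = [(0, 1), (1, 2), (0, 2), (3, 5), (4, 5), (2, 3)]

h_clean = edge_homophily(clean_edges, labels)
h_perturbed = edge_homophily(perturbed_edges, labels)
print(f"clean={h_clean:.4f} perturbed={h_perturbed:.4f}")
# prints: clean=0.8333 perturbed=0.8333
```

Because the statistic is a global average, a flat value only shows that a detector keyed on homophily would not flag the edit. It says nothing by itself about how much a given node's prediction changed.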

Main concerns

The theoretical framework (Section 5) has significant gaps. The homophily-preserving bound (Eq. 11) assumes that the text encoder $\Phi$ is $L_\Phi$-Lipschitz continuous, an assumption that fails for discrete token embeddings and holds only approximately for neural encoders. The 'Cross-Modal Shortcut Theory' (Eq. 13) posits a synergy effect $\Delta_{\mathrm{joint}} > \Delta_{\delta_A} + \Delta_{\delta_S}$ but provides no formal proof, only an empirical correlation in Table 4. The LLM-as-reasoner setting (Table 2) raises eyebrows: Mistral-7B shows clean accuracy as low as 9.3% on Arxiv, suggesting either severe distribution shift from the training data or prompt sensitivity that the attack may be exploiting rather than 'fooling'. Finally, the claim that BadGraph requires 'only two LLM queries per node' (Section 4.2.2) omits the retrieval encoder training cost and the influencer selection queries, making the cost analysis incomplete.

“Let $x_i = \Phi(S_i)$ denote the textual embedding, where we assume that the encoder $\Phi$ is $L_\Phi$-Lipschitz continuous”

BadGraph paper, Section 5.1 · Eq. 11
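
The synergy inequality from Eq. 13 can at least be checked empirically, which is the kind of evidence the paper offers in place of a proof. The sketch below assumes that each $\Delta$ is an accuracy drop relative to the clean model, that $\delta_A$ denotes the structure-only perturbation, and that $\delta_S$ denotes the text-only perturbation. These readings are inferred from the notation and are not confirmed by the paper. The function names and all numbers are hypothetical.

```python
def attack_drop(clean_acc: float, attacked_acc: float) -> float:
    """Accuracy lost to an attack, in the same units as the inputs."""
    return clean_acc - attacked_acc


def synergy_gap(
    clean_acc: float,
    acc_struct_only: float,
    acc_text_only: float,
    acc_joint: float,
) -> float:
    """Joint drop minus the sum of the single-modality drops.

    A positive value means the joint attack is super-additive, i.e. the
    inequality Delta_joint > Delta_A + Delta_S holds on these numbers.
    """
    delta_a = attack_drop(clean_acc, acc_struct_only)
    delta_s = attack_drop(clean_acc, acc_text_only)
    delta_joint = attack_drop(clean_acc, acc_joint)
    return delta_joint - (delta_a + delta_s)


# Hypothetical accuracies in percentage points, NOT results from the paper.
gap = synergy_gap(
    clean_acc=80.0, acc_struct_only=70.0, acc_text_only=72.0, acc_joint=50.0
)
print(gap)  # 30.0 - (10.0 + 8.0) = 12.0
```

A positive gap on a handful of dataset and backbone pairs is consistent with the claimed synergy but does not establish it in general, which is why the review treats Eq. 13 as a hypothesis rather than a theorem.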

Evidence and comparison

The comparison to WTGIA is partially misleading. WTGIA (Lei et al., 2024) is a graph injection attack, which is a different threat model, whereas BadGraph modifies existing nodes. The paper correctly notes that WTGIA's 'effectiveness decreases as text interpretability increases,' but grouping the two methods under a shared 'text-level' framing remains an apples-to-oranges comparison. Against proper structural baselines (NETTACK, PGD, SGAttack), BadGraph mostly dominates, though the gap narrows on TAPE-encoded graphs where structure-only attacks already struggle. The LLM-as-reasoner experiments lack critical baselines: no gradient-based attacks are adapted for this setting, leaving open whether simpler methods could achieve similar results. The defense evaluation (Appendix B.1) is cursory: only RUNG and vanilla adversarial training are tested, and the adversarial training setup is unspecified.

“text interpretability, a factor previously overlooked at the embedding level, plays a crucial role in attack strength”

WTGIA paper, Lei et al., Sec. 1 · Abstract
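
The threat-model distinction raised above can be made concrete with a small sketch. It contrasts an injection-style edit, which adds a new attacker-controlled node, with a modification-style edit, which changes a link between nodes that already exist. The adjacency-set representation and the function names `inject_node` and `flip_edge` are hypothetical illustrations and do not reproduce either WTGIA or BadGraph.

```python
from typing import Dict, Iterable, Set

Graph = Dict[int, Set[int]]  # undirected graph stored as adjacency sets


def _copy(graph: Graph) -> Graph:
    """Return a deep copy so the input graph is never mutated."""
    return {node: set(neighbors) for node, neighbors in graph.items()}


def inject_node(graph: Graph, new_id: int, neighbors: Iterable[int]) -> Graph:
    """Injection-style edit: add a NEW node and connect it to existing nodes."""
    if new_id in graph:
        raise ValueError(f"node {new_id} already exists")
    out = _copy(graph)
    out[new_id] = set()
    for v in neighbors:
        out[new_id].add(v)
        out[v].add(new_id)  # raises KeyError if v is not an existing node
    return out


def flip_edge(graph: Graph, u: int, v: int) -> Graph:
    """Modification-style edit: toggle the edge between two EXISTING nodes."""
    out = _copy(graph)
    if v in out[u]:
        out[u].remove(v)
        out[v].remove(u)
    else:
        out[u].add(v)
        out[v].add(u)
    return out


# Hypothetical three-node path graph 0 - 1 - 2.
clean: Graph = {0: {1}, 1: {0, 2}, 2: {1}}
injected = inject_node(clean, new_id=3, neighbors=[0, 2])
modified = flip_edge(clean, 0, 2)

print(len(clean), len(injected), len(modified))  # 3 4 3
```

Under injection the node set grows while the original nodes keep their own text, whereas modification keeps the node set fixed. An attack budget or a defense calibrated for one setting therefore does not transfer directly to the other.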

Reproducibility

Reproducibility is a major concern. The attack relies on commercial API calls to DeepSeek-V3 and on pretrained encoders (SBERT, TAPE), without disclosed prompts or full hyperparameters. The paper reports a cost of '$0.0009 per node' but does not specify the API version or the temperature settings that govern stochastic LLM outputs. The retrieval module uses different GNN encoders per dataset (GAT for Cora/Products, GIN for Arxiv) with no justification, and Table 9 shows attack success varies significantly with encoder choice (GCN vs. R-GCN vs. GAT). Most critically, the LLM-based node classification pipelines for DeepSeek and Mistral are zero-shot with undisclosed prompts, making the 'Clean' baseline potentially unstable. That is a real concern given the extremely low baseline accuracies on some splits (e.g., 9.3%). Code is promised but not linked in the main paper.

“Mistral-7B and DeepSeek are employed in a zero-shot manner to perform node classification based on textual descriptions and their 2-hop neighborhoods”

BadGraph paper, Appendix A.2 · LLM-as-Reasoner description
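
One way to close the gaps listed above is to log a complete run configuration alongside every reported number. The sketch below is a hypothetical record, not something the paper provides: the class `LLMRunConfig`, its field names, the placeholder values, and the prompt template are all assumptions. Only the 2-hop neighborhood setting comes from the quoted appendix.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LLMRunConfig:
    """Hypothetical record of the settings a zero-shot LLM run depends on."""

    model: str
    model_version: str
    temperature: float
    top_p: float
    seed: int
    prompt_template: str
    neighbor_hops: int


def fingerprint(cfg: LLMRunConfig) -> str:
    """Stable short hash of the config, suitable for tagging result files."""
    payload = json.dumps(asdict(cfg), sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Placeholder values only; none of these settings are disclosed in the paper.
cfg = LLMRunConfig(
    model="example-llm",
    model_version="unspecified",
    temperature=0.0,
    top_p=1.0,
    seed=0,
    prompt_template="Classify the node.\nText: {text}\nNeighbors: {neighbors}",
    neighbor_hops=2,  # the quoted appendix describes 2-hop neighborhoods
)

print(fingerprint(cfg))
```

Publishing such a record with the 'Clean' and attacked accuracies would let readers tell whether a low baseline reflects the model or the prompt. Repeating each run over several seeds would additionally expose how much of the reported drop is sampling noise.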

Abstract

Text-attributed graphs (TAGs) enhance graph learning by integrating rich textual semantics and topological context for each node. While boosting expressiveness, they also expose new vulnerabilities in graph learning through text-based adversarial surfaces. Recent advances leverage diverse backbones, such as graph neural networks (GNNs) and pre-trained language models (PLMs), to capture both structural and textual information in TAGs. This diversity raises a key question: How can we design universal adversarial attacks that generalize across architectures to assess the security of TAG models? The challenge arises from the stark contrast in how different backbones (GNNs and PLMs) perceive and encode graph patterns, coupled with the fact that many PLMs are only accessible via APIs, limiting attacks to black-box settings. To address this, we propose BadGraph, a novel attack framework that deeply elicits large language models' (LLMs') understanding of general graph knowledge to jointly perturb both node topology and textual semantics. Specifically, we design a target influencer retrieval module that leverages graph priors to construct cross-modally aligned attack shortcuts, thereby enabling efficient LLM-based perturbation reasoning. Experiments show that BadGraph achieves universal and effective attacks across GNN- and LLM-based reasoners, with up to a 76.3% performance drop, while theoretical and empirical analyses confirm its stealthy yet interpretable nature.
