GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

cs.AI, cs.CL · Iordanis Fostiropoulos, Muhammad Rafay Azhar, Abdalaziz Sawwan, Boyu Fang, Yuchen Liu, Jiayi Liu, Hanchao Yu, Qi Guo, Jianyu Wang, Fei Liu, Xiangjun Fan · Mar 31, 2026
AI Review
Plain-language introduction

GISTBench evaluates whether LLMs can accurately extract user interests from behavioral interaction histories in recommendation systems. Unlike traditional benchmarks that optimize for item prediction accuracy, it verifies if predicted interests are actually grounded in engagement signals using two novel metrics: Interest Groundedness ($IG$) and Interest Specificity ($IS$). The authors find that current LLMs struggle primarily with recall—discovering all verifiable interests—rather than hallucination, revealing critical bottlenecks in evidence counting across heterogeneous signal types.

Critical review
Verdict
Bottom line

The paper presents a methodologically rigorous benchmark that fills a genuine gap in evaluating LLM user understanding. The decomposition of Interest Groundedness into precision ($IG_P$) and recall ($IG_R$) components effectively separates hallucination from incomplete coverage, revealing that "coverage, not hallucination, is the primary bottleneck" (Table 4). The validation against user surveys ($\rho = 0.67$) provides credible external validity, though the reliance on an ensemble oracle for recall computation makes $IG_R$ dependent on the specific set of models evaluated. The framework successfully shifts evaluation from plausibility-based judging to verifiable evidence-based auditing.

“coverage, not hallucination, is the primary bottleneck across all models”
Paper · Table 4 caption
“The Spearman correlation between survey F1 and the geometric mean of IGF1 and IS is $\rho = 0.67$”
Paper · Section 6.1
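The validation statistic quoted above can be illustrated with a minimal sketch. It assumes that each unit of analysis (for example a model or a user cohort) has an $IG_{F1}$ and an $IS$ score in $[0, 1]$, combines them with a geometric mean, and rank-correlates the result with survey F1. The function names and all numbers are hypothetical and are not taken from the paper or its code release.

```python
from math import sqrt


def combined_score(ig_f1: float, interest_specificity: float) -> float:
    """Geometric mean of IG F1 and IS (both assumed to lie in [0, 1])."""
    return sqrt(ig_f1 * interest_specificity)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers, not results from the paper.
survey_f1 = [0.30, 0.45, 0.50, 0.62, 0.70]
ig_f1_and_is = [(0.20, 0.50), (0.40, 0.40), (0.35, 0.60), (0.55, 0.70), (0.50, 0.65)]
benchmark = [combined_score(f, s) for f, s in ig_f1_and_is]
rho = spearman_rho(survey_f1, benchmark)
```

A geometric mean penalizes a profile that scores well on only one of the two metrics, which is presumably why it is used as the combined score here.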
What holds up

The taxonomy normalization approach is particularly strong, collapsing semantically redundant predictions (e.g., "NBA Trade Rumors" and "NBA Free Agency News") into single categories to prevent gaming via lexical diversity. The evidence-based verification framework, which requires configurable thresholds of explicit ($\geq 2$) and implicit ($\geq 3$) positive signals via $\text{Verified}(I_j) = \phi_{\mathcal{D}}(n_{exp}^+, n_{imp}^+, n_{exp}^-, n_{imp}^-)$, provides principled groundedness evaluation without ground-truth labels. The failure mode analysis is concrete: the finding that "92–99% of failed interests lack enough explicit positive signals" (Table 7) precisely identifies evidence counting as the critical capability gap.

“$\text{Verified}(I_j) = \phi_{\mathcal{D}}(n_{exp}^+, n_{imp}^+, n_{exp}^-, n_{imp}^-)$”
Paper · Section 4.2
“92–99% of failed interests lack enough explicit positive signals, and 74–97% lack enough implicit positive signals”
Paper · Table 6.2 Finding 2
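To make the verification function concrete, here is a minimal sketch of one possible $\phi_{\mathcal{D}}$. The review does not state how the four counts are combined, so the rule below (enough explicit positives or enough implicit positives, with an optional veto on negative signals) is an assumption. The class names, the `max_negatives` field, and the default behavior are hypothetical and not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceCounts:
    explicit_pos: int  # n_exp^+
    implicit_pos: int  # n_imp^+
    explicit_neg: int  # n_exp^-
    implicit_neg: int  # n_imp^-


@dataclass(frozen=True)
class DomainThresholds:
    min_explicit_pos: int = 2  # threshold quoted in the review
    min_implicit_pos: int = 3  # threshold quoted in the review
    # Hypothetical knob: reject when total negatives exceed this; None disables it.
    max_negatives: Optional[int] = None


def is_verified(counts: EvidenceCounts,
                cfg: DomainThresholds = DomainThresholds()) -> bool:
    """Assumed rule: enough explicit OR enough implicit positive signals,
    optionally vetoed by too many negative signals."""
    enough_positive = (
        counts.explicit_pos >= cfg.min_explicit_pos
        or counts.implicit_pos >= cfg.min_implicit_pos
    )
    if cfg.max_negatives is None:
        return enough_positive
    negatives = counts.explicit_neg + counts.implicit_neg
    return enough_positive and negatives <= cfg.max_negatives
```

Keeping the thresholds in a per-domain configuration object mirrors the subscript $\mathcal{D}$ in the formula: the same counts could pass in one dataset and fail in another.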
Main concerns

The benchmark relies on synthetic users constructed via cohort aggregation, raising questions about fidelity to individual user understanding despite validation claims. The verification thresholds, while domain-informed, lack sensitivity analysis: the authors note "a formal threshold sensitivity analysis is left for future work" (Section 4.2). The use of a single LLM judge (Llama-3.3-70B-Instruct) for both evidence filtering and specificity evaluation introduces potential systemic biases; the authors admit they "do not ablate on judge model choice" (Section 7). Additionally, IG Recall scores depend on an oracle constructed from the union of all evaluated models, making comparisons across different evaluation runs non-trivial.

“a formal threshold sensitivity analysis is left for future work”
Paper · Section 4.2
“Both the IG evidence-filtering judge and the IS retrieval judge use a single model (Llama-3.3-70B-Instruct). We provide inter-annotator agreement analysis in Appendix 14, but do not ablate on judge model choice”
Paper · Section 7 Limitations
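The threshold concern can be illustrated with a small sweep. This is only a sketch of what a sensitivity analysis might look like, under the assumption that an interest passes when either positive-signal threshold is met. The counts are made up and `verified_fraction` is a hypothetical helper, not part of the benchmark.

```python
def verified_fraction(counts, min_explicit, min_implicit):
    """Share of predicted interests passing an assumed either-or rule."""
    if not counts:
        return 0.0
    passed = sum(
        1 for exp_pos, imp_pos in counts
        if exp_pos >= min_explicit or imp_pos >= min_implicit
    )
    return passed / len(counts)


# Made-up (explicit+, implicit+) counts, one pair per predicted interest.
toy_counts = [(0, 1), (1, 2), (2, 0), (0, 3), (3, 5), (1, 4)]

# Fraction verified for each (explicit, implicit) threshold pair.
sweep = {
    (e, i): verified_fraction(toy_counts, e, i)
    for e in (1, 2, 3)
    for i in (2, 3, 4)
}
```

On this toy data the verified share drops from 5/6 at thresholds (1, 2) to 2/6 at (3, 4), which is the kind of swing a formal analysis would need to quantify on real data.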
Evidence and comparison

The evaluation across five heterogeneous datasets (KuaiRec, MIND, Amazon Music, Goodreads, and synthetic) strengthens generalizability, particularly the finding that "signal type determines difficulty more than model identity" (Section 6.2). However, the comparison with concurrent ALPBench (Section 2) is cursory, asserting complementarity without empirical validation. The precision-recall decomposition reveals that $IG_P \gg IG_R$ universally (Table 4), indicating models are precise but incomplete. While the distinction between IG (faithfulness to evidence) and IS (plausibility/specificity) is well-articulated following the Jacovi-Goldberg framework, the relationship to downstream recommendation performance remains unexplored by design.

“signal type determines difficulty more than model identity”
Paper · Section 6.2 Finding 2
“$IG_P \gg IG_R$ universally: coverage, not hallucination, is the primary bottleneck”
Paper · Table 4 caption
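The precision-recall decomposition and its dependence on the ensemble oracle can be sketched as follows. The definitions are assumptions inferred from the review: precision is the verified share of a model's predictions, and recall is measured against the union of verified categories across all evaluated models. Function names and category labels are hypothetical.

```python
def build_oracle(verified_by_model):
    """Union of verified interest categories over all evaluated models."""
    return set().union(*verified_by_model.values())


def ig_scores(predicted, verified, oracle):
    """Return (precision, recall, F1) for one model on one user.

    predicted: categories the model predicted
    verified:  categories that passed evidence verification
    oracle:    union of verified categories across all models
    """
    verified = verified & predicted  # only count the model's own predictions
    precision = len(verified) / len(predicted) if predicted else 0.0
    recall = len(verified & oracle) / len(oracle) if oracle else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: model_a predicts three categories, two of which verify.
predicted_a = {"basketball", "cooking", "travel"}
verified_by_model = {
    "model_a": {"basketball", "cooking"},
    "model_b": {"basketball", "jazz", "hiking"},
}
oracle_two = build_oracle(verified_by_model)
p2, r2, f2 = ig_scores(predicted_a, verified_by_model["model_a"], oracle_two)

# Adding a third model enlarges the oracle and lowers model_a's recall.
verified_by_model["model_c"] = {"chess", "gardening"}
oracle_three = build_oracle(verified_by_model)
p3, r3, f3 = ig_scores(predicted_a, verified_by_model["model_a"], oracle_three)
```

In this toy case model_a's precision stays at 2/3 while its recall falls from 2/4 to 2/6 once a third model is added. Nothing about model_a changed, which illustrates why recall from runs with different model sets is not directly comparable.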
Reproducibility

The authors provide detailed methodological descriptions including prompt construction, threshold specifications, and the 325-category taxonomy, with code released at https://github.com/facebookresearch/GISTBench. However, the synthetic dataset relies on proprietary VLM-generated descriptions from Meta platforms that cannot be fully replicated. The use of an internal inference framework and lack of ablation on judge model choice (both IG and IS use Llama-3.3-70B-Instruct) represent barriers to exact reproduction. While all evaluated models are open-weight, the ensemble oracle construction means IG Recall scores "should not be compared across evaluation runs with different model sets" (Section 6), complicating longitudinal benchmarking.

“IGR scores should not be compared across evaluation runs with different model sets”
Paper · Section 6
“https://github.com/facebookresearch/GISTBench”
Paper · Metadata
Abstract

We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.
