GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
GISTBench evaluates whether LLMs can accurately extract user interests from behavioral interaction histories in recommendation systems. Unlike traditional benchmarks that optimize for item prediction accuracy, it verifies whether predicted interests are actually grounded in engagement signals, using two novel metrics: Interest Groundedness ($IG$) and Interest Specificity ($IS$). The authors find that current LLMs struggle primarily with recall (discovering all verifiable interests) rather than with hallucination, which points to critical bottlenecks in evidence counting across heterogeneous signal types.
The paper presents a methodologically rigorous benchmark that fills a genuine gap in evaluating LLM user understanding. The decomposition of Interest Groundedness into precision ($IG_P$) and recall ($IG_R$) components effectively separates hallucination from incomplete coverage, revealing that "coverage, not hallucination, is the primary bottleneck" (Table 4). The validation against user surveys ($\rho = 0.67$) provides credible external validity, though the reliance on an ensemble oracle for recall computation introduces dependencies on the specific model set evaluated. The framework successfully shifts evaluation from plausibility-based judging to verifiable evidence-based auditing.
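To make the oracle dependence concrete, here is a minimal sketch of one plausible reading of the precision/recall decomposition. The exact formulas are not given in this review, so everything here is an assumption: the function names `build_oracle`, `ig_precision`, and `ig_recall`, the toy interest labels, and the definitions themselves (precision as the verified share of a model's predictions, recall as the share of the union of all models' verified interests that the model recovered).

```python
def build_oracle(verified_by_model):
    """Assumed oracle: union of the verified interests of every evaluated model."""
    oracle = set()
    for verified in verified_by_model.values():
        oracle |= set(verified)
    return oracle


def ig_precision(predicted, verified):
    """Assumed IG_P: fraction of predicted interests that passed verification."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & set(verified)) / len(predicted)


def ig_recall(predicted, verified, oracle):
    """Assumed IG_R: fraction of oracle interests the model predicted and verified."""
    if not oracle:
        return 0.0
    return len(set(predicted) & set(verified) & set(oracle)) / len(oracle)


# Hypothetical toy data for a single user.
predicted_a = {"basketball", "cooking", "jazz"}
verified_by_model = {
    "model_a": {"basketball", "cooking"},   # "jazz" failed verification
    "model_b": {"basketball", "travel"},
}

oracle_two = build_oracle(verified_by_model)
precision_a = ig_precision(predicted_a, verified_by_model["model_a"])
recall_a_two = ig_recall(predicted_a, verified_by_model["model_a"], oracle_two)

# Adding a third model enlarges the oracle and lowers model_a's recall,
# even though model_a's own predictions did not change.
verified_by_model["model_c"] = {"gardening"}
oracle_three = build_oracle(verified_by_model)
recall_a_three = ig_recall(predicted_a, verified_by_model["model_a"], oracle_three)

print(precision_a, recall_a_two, recall_a_three)
```

Under these assumed definitions, precision depends only on a model's own predictions, while recall shifts whenever the set of evaluated models changes, which is the comparability concern the review raises.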
The taxonomy normalization approach is particularly strong, collapsing semantically redundant predictions (e.g., "NBA Trade Rumors" and "NBA Free Agency News") into single categories to prevent gaming via lexical diversity. The evidence-based verification framework, which requires configurable thresholds of explicit ($\geq 2$) and implicit ($\geq 3$) positive signals via $\text{Verified}(I_j) = \phi_{\mathcal{D}}(n_{exp}^+, n_{imp}^+, n_{exp}^-, n_{imp}^-)$, provides principled groundedness evaluation without ground-truth labels. The failure mode analysis is concrete: the finding that "92–99% of failed interests lack enough explicit positive signals" (Table 7) precisely identifies evidence counting as the critical capability gap.
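The verification rule can be illustrated with a short sketch. Only the four signal counts and the default thresholds ($\geq 2$ explicit, $\geq 3$ implicit) come from the text above. The names `SignalCounts`, `Thresholds`, and `is_verified` are hypothetical, and combining the two thresholds with a logical OR is an assumption, as is ignoring the negative counts: the review does not say how $\phi_{\mathcal{D}}$ combines them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalCounts:
    """Evidence counts for one predicted interest (mirrors the arguments of phi_D)."""
    explicit_pos: int = 0
    implicit_pos: int = 0
    explicit_neg: int = 0  # carried for completeness; unused in this sketch
    implicit_neg: int = 0  # carried for completeness; unused in this sketch


@dataclass(frozen=True)
class Thresholds:
    """Configurable per-dataset thresholds; defaults follow the values quoted above."""
    min_explicit_pos: int = 2
    min_implicit_pos: int = 3


def is_verified(counts, thresholds=Thresholds()):
    """Assumed rule: an interest is verified if either positive threshold is met.

    The OR combination and the omission of negative signals are assumptions
    made for illustration, not the benchmark's documented behavior.
    """
    return (
        counts.explicit_pos >= thresholds.min_explicit_pos
        or counts.implicit_pos >= thresholds.min_implicit_pos
    )


print(is_verified(SignalCounts(explicit_pos=2)))                  # meets explicit threshold
print(is_verified(SignalCounts(explicit_pos=1, implicit_pos=2)))  # meets neither threshold
print(is_verified(SignalCounts(implicit_pos=3)))                  # meets implicit threshold
```

Keeping the thresholds in a separate object makes it easy to vary them per dataset, which is the kind of sweep a threshold sensitivity analysis would need.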
The benchmark relies on synthetic users constructed via cohort aggregation, raising questions about fidelity to individual user understanding despite validation claims. The verification thresholds, while domain-informed, lack sensitivity analysis: the authors note "a formal threshold sensitivity analysis is left for future work" (Section 4.2). The use of a single LLM judge (Llama-3.3-70B-Instruct) for both evidence filtering and specificity evaluation introduces potential systemic biases; the authors admit they "do not ablate on judge model choice" (Section 7). Additionally, IG Recall scores depend on an oracle constructed from the union of all evaluated models, making comparisons across different evaluation runs non-trivial.
The evaluation across five heterogeneous datasets (KuaiRec, MIND, Amazon Music, Goodreads, and the synthetic dataset) strengthens generalizability, particularly the finding that "signal type determines difficulty more than model identity" (Section 6.2). However, the comparison with concurrent ALPBench (Section 2) is cursory, asserting complementarity without empirical validation. The precision-recall decomposition reveals that $IG_P \gg IG_R$ universally (Table 4), indicating models are precise but incomplete. While the distinction between IG (faithfulness to evidence) and IS (plausibility/specificity) is well-articulated following the Jacovi-Goldberg framework, the relationship to downstream recommendation performance remains unexplored by design.
The authors provide detailed methodological descriptions including prompt construction, threshold specifications, and the 325-category taxonomy, with code released at https://github.com/facebookresearch/GISTBench. However, the synthetic dataset relies on proprietary VLM-generated descriptions from Meta platforms that cannot be fully replicated. The use of an internal inference framework and lack of ablation on judge model choice (both IG and IS use Llama-3.3-70B-Instruct) represent barriers to exact reproduction. While all evaluated models are open-weight, the ensemble oracle construction means IG Recall scores "should not be compared across evaluation runs with different model sets" (Section 6), complicating longitudinal benchmarking.
We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user surveys, and evaluate eight open-weight LLMs spanning 7B to 120B parameters. Our findings reveal performance bottlenecks in current LLMs, particularly their limited ability to accurately count and attribute engagement signals across heterogeneous interaction types.