Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles

cs.CL Adi Gabay, Gabriel Stanovsky, Liat Peterfreund · Mar 22, 2026
Plain-language introduction

This paper tackles the challenge of evaluating whether large language models perform genuine epistemic reasoning—reasoning about knowledge and partial observations in multi-agent systems—or simply rely on memorization of classic puzzles like the Muddy Children problem. The authors persuasively argue that memorization is better understood as a special case of reduction, where models map new instances to known problems. They introduce a reduction ladder with progressively modified puzzle variants to distinguish reductive from epistemic reasoning, finding that while some models succeed through reduction, all struggle when true epistemic reasoning is required. The work reframes how we interpret LLM performance on canonical reasoning benchmarks and highlights that strong accuracy on classic puzzles may mask a lack of genuine reasoning capability.

Critical review
Verdict
Bottom line

The paper makes a valuable conceptual contribution by reframing memorization as reduction, a broader and more cognitively plausible mechanism. The reduction ladder is a well-designed methodological tool that systematically decouples surface form from underlying logical structure. Gemini 2.5 Pro achieves 97.6% accuracy on Rung I but only 66.8% on Rung III (Table 2), a drop of 30.8 percentage points, while the share of reductive chain-of-thought reasoning falls from 62% to 15%. This pattern strongly supports the claim that models rely on reduction rather than epistemic reasoning. The authors also acknowledge a limitation: a model's verbalized CoT may not reflect its actual computation.

“Rung III: 66.8% accuracy, 15% reductive CoT; Rung I: 97.6% accuracy, 62% reductive CoT”
Gabay et al., Table 2 · Section 4.2
“There may exist a disconnect between the model's verbalized reasoning and its actual computation”
paper · Section 6
What holds up

The theoretical distinction between reduction and epistemic reasoning is compelling and better aligned with cognitive science (Gentner, 1983) than the coarse memorization-versus-reasoning dichotomy. The reduction ladder progressively obscures the Muddy Children structure—moving from classic muddy children (Rung I) to Olympic gymnasts with conflicting world knowledge (Rung II) to non-symmetric observation matrices (Rung III)—while preserving the underlying $k$-bound epistemic logic. The manual CoT analysis provides interpretable evidence: the model explicitly references the classic puzzle on Rungs I-II but switches to formal epistemic terminology (Kripke models, possible worlds elimination) on Rung III where reduction fails.

“memorization is the case in which $f(x)=x$”
paper · Section 1
“In Rung III ... reductive language largely disappears and is replaced by formal epistemic terminology, including references to Kripke models”
paper · Section 4.2
Main concerns

The sample sizes are imbalanced: Rung III has only 374 puzzles versus 1,320 each for Rungs I and II, which the authors attribute to the 'technical complexity of generating solvable observation matrices.' The smaller sample reduces statistical confidence in exactly the results that carry the key claim, namely that all models struggle once epistemic reasoning is required. Four models were excluded from the main analysis for failing to beat the majority-vote baseline on Rung I (e.g., OLMo 3-7B-Instruct answered 'No' in 96.7% of instances), which limits how far the findings generalize across model scales. The potential confound in Rung II, where models must override parametric knowledge about famous gymnasts, is interesting but not cleanly isolated from the reduction construct.

“1,320 puzzles for each of Rungs I and II, and 374 puzzles for Rung III (due to the technical complexity of generating solvable observation matrices)”
paper · Section 4.2
“Since some models fail to outperform the majority vote on Rung I ... we exclude them from the main analysis”
paper · Section 4.1
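To make the sample-size concern above concrete, the following is a minimal Python sketch that compares 95% Wilson score intervals for the two reported Gemini 2.5 Pro accuracies. It is an illustration rather than an analysis from the paper: the function name `wilson_interval` is hypothetical, the accuracies are taken as reported percentages rather than raw counts, and treating each puzzle as an independent Bernoulli trial is an assumption.

```python
import math

def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    z2 = z * z
    denom = 1 + z2 / n
    center = (accuracy + z2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z2 / (4 * n * n)) / denom
    return center - half, center + half

# Accuracies and sample sizes as quoted in the review (Table 2, Section 4.2).
# Assumption: puzzles are independent trials, which the paper does not state.
rung1 = wilson_interval(0.976, 1320)
rung3 = wilson_interval(0.668, 374)
print("Rung I:  ", rung1)
print("Rung III:", rung3)
```

Under these assumptions the Rung III interval spans roughly 62% to 71%, several times wider than the Rung I interval but still well below it. The smaller sample therefore widens the uncertainty without by itself erasing the gap for this model. Whether the same holds for the other models cannot be checked from the figures quoted here.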
Evidence and comparison

The evidence supports the core claim about the reduction-epistemic distinction. The correlation between reductive CoT language and accuracy (Table 2) is particularly convincing. Comparisons to prior work are generally fair: the authors correctly note Jiang et al. (2024) found models rely on superficial patterns, and Sileo & Lernould (2023) used dynamic epistemic logic with non-symmetric observations. However, the claim that prior work framed behavior as only 'epistemic reasoning or brittle memorization' slightly oversimplifies the nuanced positions in those papers—Jiang et al. explicitly discuss pattern matching beyond rote memorization, and Sileo & Lernould examine scaling trends rather than dichotomous strategies. The ladder methodology represents a genuine advance over prior perturbation approaches.

“their success largely depends on recognizing superficial patterns with strong token bias”
Jiang et al., 2024 · Abstract
“leverage dynamic epistemic logic to isolate a particular component of ToM and to generate controlled problems”
Sileo & Lernould, 2023
Reproducibility

The paper provides adequate experimental detail: prompts are shown in Figures 6-8 and Appendix A, temperature settings are specified (the default of 1 for the GPT-5 models, 0 for all others), and the symbolic solver used for verification is referenced. The problem parameterization $(n,k,t,q,j,O)$ with observation matrix $O \in \{0,1\}^{n \times n}$ is clearly defined. However, the text provides no code repository URL or dataset link, either of which would be necessary for full reproduction. The majority-vote baseline of 41.7% is mentioned but not explained. Excluding four models from the main analysis without reporting their performance in detail makes it hard to assess whether scale or architecture explains the reduction-epistemic gap.

“default temperature of 1 for the GPT-5 models and a temperature of 0 for all other models”
paper · Section 4.1
“$O \in \{0,1\}^{n \times n}$ is an observation matrix where the entry $O_{i,i'}=1$ if agent $i$ observes the status of agent $i'$”
paper · Appendix A.2
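The observation-matrix definition quoted above lends itself to a brute-force possible-worlds check. The following is a minimal Python sketch of textbook Muddy Children elimination, not the authors' symbolic solver. The function names are hypothetical, and two modeling choices are assumptions: it is publicly announced that at least one agent is muddy, and in every round all agents simultaneously announce whether they know their own status. The paper's remaining parameters $(k,t,q,j)$ are not modeled.

```python
from itertools import product

def knows_own_status(i, world, worlds, O):
    """Agent i knows its status if it is identical in every world i cannot rule out."""
    seen = [j for j in range(len(world)) if O[i][j]]  # agents whose status i observes
    candidates = [w for w in worlds if all(w[j] == world[j] for j in seen)]
    return all(w[i] == world[i] for w in candidates)

def knowers(world, worlds, O):
    """Set of agents who would know their own status if `world` were the real one."""
    return frozenset(
        i for i in range(len(world)) if knows_own_status(i, world, worlds, O)
    )

def first_round_known(actual, O):
    """Map each agent to the first round in which it knows its own status."""
    if not any(actual):
        raise ValueError("sketch assumes at least one muddy agent")
    n = len(actual)
    # Assumed public announcement: at least one agent is muddy.
    worlds = {w for w in product((0, 1), repeat=n) if any(w)}
    first, round_no = {}, 0
    while True:
        round_no += 1
        announced = knowers(actual, worlds, O)
        for i in announced:
            first.setdefault(i, round_no)
        # Keep only worlds consistent with what everyone just announced.
        remaining = {w for w in worlds if knowers(w, worlds, O) == announced}
        if remaining == worlds:  # announcements carry no new information
            return first
        worlds = remaining

# Classic symmetric case: everyone observes everyone except themselves.
classic = [[int(i != j) for j in range(3)] for i in range(3)]
print(first_round_known((1, 1, 0), classic))
```

In the classic three-agent case with two muddy agents, the two muddy agents learn their status in round 2 and the clean agent in round 3. Passing a non-symmetric matrix instead of `classic` gives a rough feel for the Rung III setting, where the familiar round-counting shortcut no longer applies and the worlds have to be eliminated explicitly.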
Abstract

Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents' knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.
