Beyond Memorization: Distinguishing between Reductive and Epistemic Reasoning in LLMs using Classic Logic Puzzles
This paper tackles the challenge of evaluating whether large language models perform genuine epistemic reasoning—reasoning about knowledge and partial observations in multi-agent systems—or simply rely on memorization of classic puzzles like the Muddy Children problem. The authors persuasively argue that memorization is better understood as a special case of reduction, where models map new instances to known problems. They introduce a reduction ladder with progressively modified puzzle variants to distinguish reductive from epistemic reasoning, finding that while some models succeed through reduction, all struggle when true epistemic reasoning is required. The work reframes how we interpret LLM performance on canonical reasoning benchmarks and highlights that strong accuracy on classic puzzles may mask a lack of genuine reasoning capability.
The paper makes a valuable conceptual contribution by reframing memorization as reduction, a broader and more cognitively plausible mechanism. The reduction ladder is a well-designed methodological tool that systematically decouples surface form from underlying logical structure. Gemini 2.5 Pro achieves 97.6% accuracy on Rung I but only 66.8% on Rung III (Table 2), and the share of its chain-of-thought traces that use reductive reasoning drops from 62% to 15%. Together these results strongly support the claim that models rely on reduction rather than epistemic reasoning. The authors also acknowledge a limitation of this evidence: the disconnect between chain-of-thought text and the model's underlying reasoning.
The theoretical distinction between reduction and epistemic reasoning is compelling and better aligned with cognitive science (Gentner, 1983) than the coarse memorization-versus-reasoning dichotomy. The reduction ladder progressively obscures the Muddy Children structure—moving from classic muddy children (Rung I) to Olympic gymnasts with conflicting world knowledge (Rung II) to non-symmetric observation matrices (Rung III)—while preserving the underlying $k$-bound epistemic logic. The manual CoT analysis provides interpretable evidence: the model explicitly references the classic puzzle on Rungs I-II but switches to formal epistemic terminology (Kripke models, possible worlds elimination) on Rung III where reduction fails.
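To make the possible-worlds elimination mentioned above concrete, here is a minimal sketch of a Muddy Children solver that accepts an arbitrary observation matrix. It is not the paper's symbolic solver, and it does not reproduce the paper's full parameterization. It assumes that `O[i][j] = 1` means agent `i` sees agent `j`, that the initial public announcement is "at least one agent is muddy", and that in every round all agents publicly and simultaneously state whether they know their own state. The function names are hypothetical.

```python
from itertools import product


def knows_own_state(i, world, worlds, O):
    """Return True if agent i's own bit is fixed across all worlds i cannot rule out."""
    visible = [j for j in range(len(world)) if O[i][j]]
    # Worlds that agree with `world` on everything agent i can see.
    candidates = [w for w in worlds if all(w[j] == world[j] for j in visible)]
    return all(w[i] == world[i] for w in candidates)


def rounds_until_known(actual, O, max_rounds=None):
    """Map each agent to the first round in which it knows its own state."""
    actual = tuple(actual)
    n = len(actual)
    if not any(actual):
        raise ValueError("the announcement requires at least one muddy agent")
    # Public announcement: at least one agent is muddy.
    worlds = [w for w in product((0, 1), repeat=n) if any(w)]
    first_known = {}
    for t in range(1, (max_rounds or n) + 1):
        # What every agent would answer in every world still considered possible.
        answers = {
            w: tuple(knows_own_state(i, w, worlds, O) for i in range(n))
            for w in worlds
        }
        for i, knows in enumerate(answers[actual]):
            if knows and i not in first_known:
                first_known[i] = t
        # The answers are public, so worlds that would have produced
        # different answers are eliminated.
        worlds = [w for w in worlds if answers[w] == answers[actual]]
    return first_known


# Classic setting: everyone sees everyone else, agents 0 and 1 are muddy.
full = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(rounds_until_known((1, 1, 0), full))  # {0: 2, 1: 2, 2: 3}
```

With the fully connected matrix this recovers the textbook result: with $k$ muddy agents, the muddy ones know in round $k$ and the clean ones one round later. Passing a non-symmetric `O` changes which worlds each agent can rule out, so the textbook shortcut no longer applies directly and the elimination has to be carried out explicitly.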
The sample size imbalance is problematic: Rung III has only 374 puzzles versus 1,320 for Rungs I and II, which the authors attribute to the 'technical complexity of generating solvable observation matrices.' The smaller sample reduces statistical confidence in exactly the results that carry the key claim, namely that all models struggle once epistemic reasoning is required. Four models were excluded from the main analysis for failing to beat majority vote (e.g., OLMo 3-7B-Instruct answered 'No' in 96.7% of instances), which limits the generalizability of the findings across model scales. The potential confound in Rung II, where models must override parametric knowledge about famous gymnasts, is interesting but not cleanly isolated from the reduction construct.
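One rough way to gauge how much the smaller Rung III sample matters is a binomial confidence interval on the reported accuracies. The sketch below uses the Wilson score interval. It assumes each reported accuracy is computed over the full puzzle set for that rung (1,320 for Rung I, 374 for Rung III) and that puzzles can be treated as independent trials; neither assumption is confirmed by the paper.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Gemini 2.5 Pro accuracies quoted from Table 2; sample sizes are assumptions.
rung1 = wilson_interval(0.976, 1320)
rung3 = wilson_interval(0.668, 374)
print("Rung I  :", rung1)
print("Rung III:", rung3)
```

Under these assumptions the Rung III interval is several times wider than the Rung I interval, which is the loss of precision the sample-size concern refers to.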
The evidence supports the core claim about the reduction-epistemic distinction. The correlation between reductive CoT language and accuracy (Table 2) is particularly convincing. Comparisons to prior work are generally fair: the authors correctly note Jiang et al. (2024) found models rely on superficial patterns, and Sileo & Lernould (2023) used dynamic epistemic logic with non-symmetric observations. However, the claim that prior work framed behavior as only 'epistemic reasoning or brittle memorization' slightly oversimplifies the nuanced positions in those papers—Jiang et al. explicitly discuss pattern matching beyond rote memorization, and Sileo & Lernould examine scaling trends rather than dichotomous strategies. The ladder methodology represents a genuine advance over prior perturbation approaches.
The paper provides adequate experimental detail: prompts are shown in Figures 6-8 and Appendix A, temperature settings are specified (1.0 for GPT-5, 0.0 for others), and the symbolic solver for verification is referenced. The problem parameterization $(n,k,t,q,j,O)$ with observation matrix $O \in \{0,1\}^{n \times n}$ is clearly defined. However, no code repository URL or dataset link is provided in the text, which would be necessary for full reproduction. The majority-vote baseline of 41.7% is mentioned but not explained. The exclusion of four models from main analysis without reporting their performance metrics in detail limits the ability to assess whether scale or architecture explains the reduction-epistemic gap.
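For reference, the sketch below shows one common reading of a majority-vote baseline: always predict the most frequent gold answer and report how often that is correct. Whether the paper computes its 41.7% figure this way is an assumption. The labels are invented toy data, not the paper's answer set, chosen only so that the majority share is 5/12. A majority share below 50% does imply at least three distinct answer labels, which is one reason the baseline deserves an explanation.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Toy gold labels (hypothetical): three answer classes, majority share 5/12.
toy_labels = ["A"] * 5 + ["B"] * 4 + ["C"] * 3
label, accuracy = majority_baseline(toy_labels)
print(label, round(accuracy, 3))  # A 0.417
```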
Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents' knowledge. Prior work evaluating LLMs on canonical epistemic puzzles interpreted their behavior through a dichotomy between epistemic reasoning and brittle memorization. We argue that this framing is incomplete: in recent models, memorization is better understood as a special case of reduction, where a new instance is mapped onto a known problem. Instead, we introduce a reduction ladder, a sequence of modifications that progressively move instances away from a canonical epistemic puzzle, making reduction increasingly difficult while preserving the underlying logic. We find that while some large models succeed via reduction, other models fail early, and all models struggle once epistemic reasoning is required.