The Reasoning Error About Reasoning: Why Different Types of Reasoning Require Different Representational Structures

cs.AI cs.CL Yiling Wu · Mar 23, 2026

AI Review

Plain-language introduction

This paper distinguishes different forms of reasoning by the structural properties they demand from underlying representational systems. The core insight is that deduction requires four specific properties (operability, consistency, structural preservation, and compositionality) that cannot be secured through mere statistical scaling. This has significant implications for AI systems and cognitive science, providing a principled boundary between reasoning that can rely on associative approximations versus reasoning requiring structural guarantees.

Critical review

Verdict

Bottom line

The framework is well-conceived and timely. The principal boundary between causal and deductive reasoning is a genuinely useful theoretical contribution that explains persistent failures in LLM reasoning without committing to classical symbolic architecture. The framework's implementation neutrality—focusing on structural properties rather than format—is a strength that separates it from older debates. However, the paper makes several claims that are difficult to verify, including references to future-dated evaluations (ICLR 2025) and the ambiguous 'DeduCE evaluation' for which no independent source appears to exist.

“Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on sophisticated reasoning abilities.”

Mondorf & Plank, 2024 · Abstract

“LLMs are quite capable of making correct individual deduction steps, and so are generally capable of reasoning, even in fictional contexts. However, they have difficulty with proof planning: When multiple valid deduction steps are available, they are not able to systematically explore the different options.”

Saparov & He, 2023 · Section 1

What holds up

The four-property decomposition (operability, consistency, structural preservation, compositionality) is elegant. Each property excludes a distinct failure mode: decomposition failure, semantic drift, relational distortion, and structural indifference respectively. The section connecting these properties to both mental logic and mental model theories (Sec. 3.4) is particularly strong—it shows that historically opposed traditions converge on the same structural constraints, supporting implementation-neutrality. The error-pattern mapping to observed AI failures (content effects → structural preservation; length degradation → consistency+structural preservation; fallacy blindness → compositionality) provides falsifiable structure.

“The convergence of these two historically opposed traditions on the same set of structural requirements is, I submit, strong evidence that these requirements are genuine prerequisites for deduction, not artifacts of any particular theory.”

Wu, Sec. 3.4 · Deductive Reasoning section

Main concerns

First, the paper cites 'ICLR 2025' evaluations and the 'DeduCE evaluation' without these existing as verifiable sources at publication time. This appears to be speculative projection rather than existing evidence. Second, the insufficiency claim is strategically weakened to near-unfalsifiability: if a system learns the structural properties, it has undergone 'structural reorganization,' making the claim vacuous. The framework thus rules out exactly nothing: any future scaled system that succeeds at deduction would simply be classified as having restructured. Third, the developmental and neuroscientific evidence is compatible with a gradient of structural demands but does not diagnose the specific four-property decomposition. Fourth, the predicted compounding degradation ($p^n$, for per-step accuracy $p$ over $n$ steps) appears to conflate independent error probabilities with systematic structural failures: the formula assumes step-wise independence, which does not hold for trained systems.

“If a system, through any process whatsoever, including reinforcement learning, self-supervised training, or any future training method, comes to possess representations that genuinely satisfy operability, consistency, structural preservation, and compositionality, then it has undergone structural reorganization in the sense intended here”

Wu, Sec. 4 · Critical clarification paragraph

“For five steps with p = 0.95, the chain probability is approximately 0.77. For ten steps, 0.60. For twenty, 0.36.”

Wu, Sec. 4.1 · Argument from Validity
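
The figures in the quoted passage follow from multiplying per-step success probabilities. A minimal Python sketch reproduces them, assuming (as the quoted passage does) that every step succeeds independently with the same probability p. The function name `chain_success_probability` is illustrative and does not come from the paper.

```python
def chain_success_probability(p: float, n: int) -> float:
    """Probability that all n steps succeed.

    Assumption: steps are independent and share the same per-step
    success probability p. This is the assumption under review,
    not an established property of trained systems.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    # Chain lengths taken from the quoted passage.
    for n in (5, 10, 20):
        print(f"n={n:2d}  P(chain correct)={chain_success_probability(0.95, n):.2f}")
```

The sketch only reproduces the paper's arithmetic. Whether step errors in a trained system are independent is exactly what the fourth concern above disputes, so correlated errors could make real chain accuracy higher or lower than $p^n$.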

Evidence and comparison

The AI evidence is the strongest, with Saparov & He (2023) confirming greedy reasoning strategies and documented content effects in Dasgupta et al. (2024) exactly matching predictions. The developmental evidence (induction → analogy → causal → deductive ordering) supports the demand gradient but doesn't verify the specific four-property decomposition. The neuroscience evidence (Goel 2007's dissociation among deductive subtypes) shows heterogeneity but doesn't map cleanly onto the specific structural properties. The paper acknowledges this tiered evidential structure explicitly, which is methodologically sound. The comparison to dual-process theory in Section 7.4 notes that the framework provides representational foundations for processing-level distinctions.

“AI evidence provides the most direct diagnostic support...Developmental evidence confirms the predicted ordering...Neuroscience evidence confirms that reasoning above the boundary recruits distinctively structured processing.”

Wu, Sec. 5.4 · Summary

Reproducibility

Philosophy papers typically don't include code, and this is no exception. However, the three core testable predictions (compounding degradation with chain length, selective vulnerability to targeted disruption, irreducibility under scaling) are sufficiently precise to guide empirical research. The author correctly notes that 'structural reorganization' is defined functionally, not architecturally, making it an empirical question whether any given training method produces genuine reorganization. Reproduction would require: (1) synthetic deductive benchmarks with controlled chain length; (2) targeted interventions to disrupt specific structural properties; (3) longitudinal evaluation of scaled systems on novelty-sensitive deductive tasks. The framework does not constrain specific architectures, so verification depends on constructing appropriate test suites rather than replicating a specific system.
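
Requirement (1) could be approached with a generator along the following lines. This is a hypothetical sketch, not a benchmark described in the paper: the names `ChainItem`, `make_chain_item`, and `SYLLABLES`, the sentence templates, and the use of nonsense predicates to limit content effects are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass

# Hypothetical syllable inventory for building nonsense predicate names.
SYLLABLES = ["wum", "dax", "zor", "tim", "lek", "vop", "rin", "gus"]


@dataclass(frozen=True)
class ChainItem:
    premises: tuple[str, ...]  # shuffled premises: one fact plus the rules
    query: str                 # statement whose derivability is asked
    entailed: bool             # True if the query follows from the premises
    chain_length: int          # number of rule applications needed


def make_chain_item(chain_length: int, entailed: bool, rng: random.Random) -> ChainItem:
    """Build one item whose proof needs exactly `chain_length` rule applications."""
    if chain_length < 1:
        raise ValueError("chain_length must be at least 1")
    vocabulary = [a + b for a in SYLLABLES for b in SYLLABLES]
    # chain_length + 1 predicates form the chain; one extra is a distractor.
    # rng.sample raises ValueError if the vocabulary is too small.
    names = rng.sample(vocabulary, chain_length + 2)
    chain, distractor = names[:-1], names[-1]
    premises = [f"Alex is a {chain[0]}."]
    premises += [f"Every {chain[i]} is a {chain[i + 1]}." for i in range(chain_length)]
    rng.shuffle(premises)  # premise order should not reveal the proof order
    target = chain[-1] if entailed else distractor
    return ChainItem(tuple(premises), f"Alex is a {target}.", entailed, chain_length)


if __name__ == "__main__":
    item = make_chain_item(chain_length=3, entailed=True, rng=random.Random(0))
    print("\n".join(item.premises))
    print("Query:", item.query, "| entailed:", item.entailed)
```

Seeding the random generator keeps items reproducible across runs, and varying `chain_length` while holding the templates fixed isolates chain length as the manipulated variable.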

“Any system that does not fully satisfy the four structural properties will show performance on deductive tasks that degrades as a compounding function of chain length, not merely linearly.”

Wu, Sec. 6 · Prediction 1
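
One way to check whether measured accuracy degrades in a compounding rather than a linear fashion is to fit both one-parameter models and compare their residuals. The Python sketch below is an assumption-laden illustration, not a procedure from the paper: `fit_geometric` and `fit_linear` are hypothetical helper names, the geometric model is fitted in log space for simplicity, and the demonstration data are synthetic values generated from $p^n$, not experimental results.

```python
import math
from typing import Sequence


def fit_geometric(lengths: Sequence[int], accuracies: Sequence[float]) -> tuple[float, float]:
    """Fit acc(n) = p**n via least squares on log-accuracy.

    Returns (p, sum of squared errors measured in accuracy space).
    """
    if any(a <= 0.0 for a in accuracies):
        raise ValueError("accuracies must be positive to take logarithms")
    # log acc = n * log p is a line through the origin with slope log p.
    slope = sum(n * math.log(a) for n, a in zip(lengths, accuracies)) / sum(n * n for n in lengths)
    p = math.exp(slope)
    sse = sum((a - p ** n) ** 2 for n, a in zip(lengths, accuracies))
    return p, sse


def fit_linear(lengths: Sequence[int], accuracies: Sequence[float]) -> tuple[float, float]:
    """Fit acc(n) = 1 - c*n by least squares, anchored at acc(0) = 1.

    Returns (c, sum of squared errors).
    """
    c = sum(n * (1.0 - a) for n, a in zip(lengths, accuracies)) / sum(n * n for n in lengths)
    sse = sum((a - (1.0 - c * n)) ** 2 for n, a in zip(lengths, accuracies))
    return c, sse


if __name__ == "__main__":
    # Synthetic placeholder data generated from p**n with p = 0.95.
    lengths = [1, 2, 5, 10, 20]
    accuracies = [0.95 ** n for n in lengths]
    p, sse_geo = fit_geometric(lengths, accuracies)
    c, sse_lin = fit_linear(lengths, accuracies)
    print(f"geometric: p={p:.3f}, SSE={sse_geo:.2e}")
    print(f"linear:    c={c:.4f}, SSE={sse_lin:.2e}")
```

A clearly smaller residual for the geometric model on real evaluation data would be consistent with Prediction 1, though it would not by itself establish which structural property is missing.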

“If a sufficiently scaled system, without structural reorganization in the sense defined, achieves robust generalizable deductive reasoning on entirely novel logical structures of arbitrary chain length, the framework is wrong.”

Wu, Sec. 6 · Prediction 3

Abstract

Different types of reasoning impose different structural demands on representational systems, yet no systematic account of these demands exists across psychology, AI, and philosophy of mind. I propose a framework identifying four structural properties of representational systems: operability, consistency, structural preservation, and compositionality. These properties are demanded to different degrees by different forms of reasoning, from induction through analogy and causal inference to deduction and formal logic. Each property excludes a distinct class of reasoning failure. The analysis reveals a principal structural boundary: reasoning types below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied. Scaling statistical learning without structural reorganization is insufficient to cross this boundary, because the structural guarantees required by deductive reasoning cannot be approximated through probabilistic means. Converging evidence from AI evaluation, developmental psychology, and cognitive neuroscience supports the framework at different levels of directness. Three testable predictions are derived, including compounding degradation, selective vulnerability to targeted structural disruption, and irreducibility under scaling. The framework is a necessary-condition account, agnostic about representational format, that aims to reorganize existing debates rather than close them.

Challenge the Review

User challenge

The current date is 25 March 2026. The AI review thinks it's in the past.

AI response

Partially agrees

The AI review incorrectly labels ICLR 2025 as 'future-dated' and 'speculative projection,' revealing a temporal confusion: from the perspective of March 25, 2026 (the stated current date), ICLR 2025 occurred approximately 10-11 months prior (typical ICLR timing is April/May). Citing a 2025 conference in a March 2026 paper is entirely standard and not speculative. However, the AI review's broader skepticism about verifying the specific 'DeduCE evaluation' source remains a valid methodological concern, as I cannot independently confirm this evaluation's existence or findings through available sources.

Evidence checked

The paper is dated March 23, 2026 (arXiv:2603.21736v1), and explicitly cites 'ICLR 2025' and the 'DeduCE evaluation' from that conference. The paper states: 'The ICLR 2025 DeduCE evaluation confirmed that the primary source of error is the number of reasoning steps' (Section 4.1) and 'The DeduCE evaluation (ICLR 2025) introduced a metric of deductive consistency' (Section 5.1). ICLR is an annual conference typically held in April or May. Therefore, by March 2026, ICLR 2025 would have occurred roughly one year prior, making it a past event, not a future projection.

“Submitted on 23 Mar 2026”

paper under review · arXiv header

“The ICLR 2025 DeduCE evaluation confirmed that the primary source of error is the number of reasoning steps, not input complexity.”

paper under review · Section 4.1

Assessment

The AI review's claim that the paper references 'future-dated evaluations (ICLR 2025)' is factually erroneous given the stated current date of March 25, 2026. The review appears to mistakenly assume it is operating from a perspective prior to 2025. From a March 2026 vantage point, ICLR 2025 is recent history, and citations to it should be treated as claims about published, verifiable work rather than speculation. This undermines the review's credibility on temporal grounds. However, the review's note that 'no independent source appears to exist' for the DeduCE evaluation is a separate claim that I cannot verify or refute due to access limitations; if accurate, this would constitute a valid concern about citation verifiability even if the date itself is not problematic.

“First, the paper cites 'ICLR 2025' evaluations and the 'DeduCE evaluation' without these existing as verifiable sources at publication time. This appears to be speculative projection rather than existing evidence.”

AI review · Main concerns section

Caveats

While the AI review is clearly wrong about ICLR 2025 being 'future-dated' from a March 2026 perspective, I cannot independently verify whether the specific 'DeduCE evaluation' actually took place at ICLR 2025 or whether its reported findings are accurate. By March 2026, ICLR 2025 proceedings should be publicly available, making this a verifiable (or falsifiable) claim. The AI review may be correct that the source is difficult to verify, but if so, this is due to accessibility or search limitations rather than the temporal impossibility of a 2025 conference having occurred. Additionally, the paper's claims about what DeduCE specifically found ('introduced a metric of deductive consistency') should ideally be checked against the actual ICLR 2025 proceedings to confirm accuracy.

Mar 25, 05:49 PM