The Reasoning Error About Reasoning: Why Different Types of Reasoning Require Different Representational Structures
This paper distinguishes different forms of reasoning by the structural properties they demand from underlying representational systems. The core insight is that deduction requires four specific properties (operability, consistency, structural preservation, and compositionality) that cannot be secured through mere statistical scaling. This has significant implications for AI systems and cognitive science, providing a principled boundary between reasoning that can rely on associative approximations and reasoning that requires structural guarantees.
The framework is well-conceived and timely. The principal boundary between causal and deductive reasoning is a genuinely useful theoretical contribution that explains persistent failures in LLM reasoning without committing to classical symbolic architecture. The framework's implementation neutrality—focusing on structural properties rather than format—is a strength that separates it from older debates. However, the paper makes several claims that are difficult to verify, including references to future-dated evaluations (ICLR 2025) and the ambiguous 'DeduCE evaluation' for which no independent source appears to exist.
The four-property decomposition (operability, consistency, structural preservation, compositionality) is elegant. Each property excludes a distinct failure mode: decomposition failure, semantic drift, relational distortion, and structural indifference respectively. The section connecting these properties to both mental logic and mental model theories (Sec. 3.4) is particularly strong—it shows that historically opposed traditions converge on the same structural constraints, supporting implementation-neutrality. The error-pattern mapping to observed AI failures (content effects → structural preservation; length degradation → consistency+structural preservation; fallacy blindness → compositionality) provides falsifiable structure.
First, the paper cites 'ICLR 2025' evaluations and the 'DeduCE evaluation', neither of which appears to exist as a verifiable source at publication time. This appears to be speculative projection rather than existing evidence. Second, the insufficiency claim is strategically weakened to near-unfalsifiability: if a system learns the structural properties, it is said to have undergone 'structural reorganization,' which makes the claim vacuous. The framework thus rules out nothing, since any future scaled system that succeeds at deduction would simply be classified as having restructured. Third, the developmental and neuroscientific evidence is compatible with a demand gradient but does not diagnose the specific four-property decomposition. Fourth, some cited predictions (compounding degradation as $p^n$) appear to conflate independent error probabilities with systematic structural failures: the $p^n$ form assumes that per-step errors are independent, which does not hold for trained systems.
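The fourth objection can be illustrated with a small numerical sketch. It assumes, purely for illustration, a two-group mixture in which some problems are handled reliably at every step and others are not. The function names and all numeric values are hypothetical and are not taken from the paper. Both models share the same average per-step accuracy, yet only the first yields $p^n$ at the chain level:

```python
def independent_chain_accuracy(p: float, n: int) -> float:
    """Probability that all n steps succeed when each step succeeds
    independently with probability p (the p^n model)."""
    return p ** n


def mixture_chain_accuracy(w: float, p_easy: float, p_hard: float, n: int) -> float:
    """Chain accuracy when errors are correlated within a problem.

    Hypothetical model: a fraction w of problems has per-step accuracy
    p_easy, the rest has per-step accuracy p_hard. Steps are independent
    only conditional on which group the problem belongs to.
    """
    return w * p_easy ** n + (1 - w) * p_hard ** n


# Illustrative values only (assumptions, not results from the paper).
w, p_easy, p_hard = 0.8, 0.99, 0.54
p_avg = w * p_easy + (1 - w) * p_hard  # marginal per-step accuracy, about 0.9

for n in (1, 5, 10, 20):
    ind = independent_chain_accuracy(p_avg, n)
    mix = mixture_chain_accuracy(w, p_easy, p_hard, n)
    print(f"n={n:2d}  independent={ind:.3f}  mixture={mix:.3f}")
```

Because $x^n$ is convex on $[0, 1]$ for $n \ge 1$, Jensen's inequality guarantees that the mixture value is never below $p_{avg}^n$. Under this particular kind of correlation, a $p^n$ curve fitted to per-step accuracy would therefore overstate degradation at long chain lengths. Other correlation structures could push the discrepancy in other directions, so the sketch shows only that the independence assumption matters, not which way it errs in real systems.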
The AI evidence is the strongest, with Saparov & He (2023) confirming greedy reasoning strategies and documented content effects in Dasgupta et al. (2024) exactly matching predictions. The developmental evidence (induction → analogy → causal → deductive ordering) supports the demand gradient but doesn't verify the specific four-property decomposition. The neuroscience evidence (Goel 2007's dissociation among deductive subtypes) shows heterogeneity but doesn't map cleanly onto the specific structural properties. The paper acknowledges this tiered evidential structure explicitly, which is methodologically sound. The comparison to dual-process theory in Section 7.4 notes that the framework provides representational foundations for processing-level distinctions.
Philosophy papers typically don't include code, and this is no exception. However, the three core testable predictions (compounding degradation with chain length, selective vulnerability to targeted disruption, irreducibility under scaling) are sufficiently precise to guide empirical research. The author correctly notes that 'structural reorganization' is defined functionally, not architecturally, making it an empirical question whether any given training method produces genuine reorganization. Reproduction would require: (1) synthetic deductive benchmarks with controlled chain length; (2) targeted interventions to disrupt specific structural properties; (3) longitudinal evaluation of scaled systems on novelty-sensitive deductive tasks. The framework does not constrain specific architectures, so verification depends on constructing appropriate test suites rather than replicating a specific system.
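Requirement (1) could be approached along the following lines. This is a minimal sketch, not a benchmark from the paper: the atom names (`P0`, `D0`, `E0`), the functions `make_chain_problem` and `forward_chain`, and the use of inert distractor rules are all assumptions made for illustration. It builds a modus ponens chain of a chosen length and uses forward chaining to compute the ground-truth answer:

```python
import random
from typing import List, Set, Tuple

Rule = Tuple[str, str]  # (premise, conclusion), read as "premise implies conclusion"


def make_chain_problem(
    chain_length: int, n_distractors: int = 0, seed: int = 0
) -> Tuple[Set[str], List[Rule], str]:
    """Build the chain P0 -> P1 -> ... -> Pn plus optional distractor rules.

    Returns (facts, rules, goal). Deriving the goal from the single fact P0
    needs exactly `chain_length` rule applications.
    """
    rng = random.Random(seed)
    atoms = [f"P{i}" for i in range(chain_length + 1)]
    rules: List[Rule] = [(atoms[i], atoms[i + 1]) for i in range(chain_length)]
    # Distractor premises D_j are never established, so these rules cannot fire.
    rules += [(f"D{j}", f"E{j}") for j in range(n_distractors)]
    rng.shuffle(rules)  # rule order should not reveal the chain
    return {atoms[0]}, rules, atoms[-1]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Return every atom derivable from `facts` by repeated modus ponens."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


facts, rules, goal = make_chain_problem(chain_length=5, n_distractors=3, seed=1)
print(goal in forward_chain(facts, rules))   # True: the full chain is present

# Negative control: remove one link so the goal is no longer derivable.
broken = [r for r in rules if r != ("P2", "P3")]
print(goal in forward_chain(facts, broken))  # False
```

Sweeping `chain_length` while holding the vocabulary and distractor count fixed would give the controlled-length condition. The broken-chain variant supplies matched items whose correct answer is "not derivable", which helps separate genuine derivation from a bias toward answering yes.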
Different types of reasoning impose different structural demands on representational systems, yet no systematic account of these demands exists across psychology, AI, and philosophy of mind. I propose a framework identifying four structural properties of representational systems: operability, consistency, structural preservation, and compositionality. These properties are demanded to different degrees by different forms of reasoning, from induction through analogy and causal inference to deduction and formal logic. Each property excludes a distinct class of reasoning failure. The analysis reveals a principal structural boundary: reasoning types below it can operate on associative, probabilistic representations, while those above it require all four properties to be fully satisfied. Scaling statistical learning without structural reorganization is insufficient to cross this boundary, because the structural guarantees required by deductive reasoning cannot be approximated through probabilistic means. Converging evidence from AI evaluation, developmental psychology, and cognitive neuroscience supports the framework at different levels of directness. Three testable predictions are derived, including compounding degradation, selective vulnerability to targeted structural disruption, and irreducibility under scaling. The framework is a necessary-condition account, agnostic about representational format, that aims to reorganize existing debates rather than close them.
The AI review incorrectly labels ICLR 2025 as 'future-dated' and 'speculative projection,' revealing a temporal confusion: from the perspective of March 25, 2026 (the stated current date), ICLR 2025 occurred approximately 10-11 months prior (typical ICLR timing is April/May). Citing a 2025 conference in a March 2026 paper is entirely standard and not speculative. However, the AI review's broader skepticism about verifying the specific 'DeduCE evaluation' source remains a valid methodological concern, as I cannot independently confirm this evaluation's existence or findings through available sources.
The paper is dated March 23, 2026 (arXiv:2603.21736v1), and explicitly cites 'ICLR 2025' and the 'DeduCE evaluation' from that conference. The paper states: 'The ICLR 2025 DeduCE evaluation confirmed that the primary source of error is the number of reasoning steps' (Section 4.1) and 'The DeduCE evaluation (ICLR 2025) introduced a metric of deductive consistency' (Section 5.1). ICLR is an annual conference typically held in April or May. Therefore, by March 2026, ICLR 2025 would have occurred roughly one year prior, making it a past event, not a future projection.
The AI review's claim that the paper references 'future-dated evaluations (ICLR 2025)' is factually erroneous given the stated current date of March 25, 2026. The review appears to mistakenly assume it is operating from a perspective prior to 2025. From a March 2026 vantage point, ICLR 2025 is recent history, and citations to it should be treated as claims about published, verifiable work rather than speculation. This undermines the review's credibility on temporal grounds. However, the review's note that 'no independent source appears to exist' for the DeduCE evaluation is a separate claim that I cannot verify or refute due to access limitations; if accurate, this would constitute a valid concern about citation verifiability even if the date itself is not problematic.
While the AI review is clearly wrong about ICLR 2025 being 'future-dated' from a March 2026 perspective, I cannot independently verify whether the specific 'DeduCE evaluation' actually took place at ICLR 2025 or whether its reported findings are accurate. By March 2026, ICLR 2025 proceedings should be publicly available, making this a verifiable (or falsifiable) claim. The AI review may be correct that the source is difficult to verify, but if so, this is due to accessibility or search limitations rather than the temporal impossibility of a 2025 conference having occurred. Additionally, the paper's claims about what DeduCE specifically found ('introduced a metric of deductive consistency') should ideally be checked against the actual ICLR 2025 proceedings to confirm accuracy.