Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models
Through a large-scale empirical study of 13 models across 6 classical moral dilemmas, this paper investigates whether LLMs exhibit genuine moral reasoning or merely produce convincing moral rhetoric. Using Kohlberg's stages of moral development as a diagnostic framework, the authors evaluate whether model outputs track human developmental patterns or reflect alignment training artifacts. The central claim is "moral ventriloquism": the hypothesis that models acquire post-conventional moral language through RLHF without the underlying cognitive architecture. The evidence offered is a distributional inversion (86% of responses at Stages 5-6, versus Stage 4 dominance in humans), near-robotic cross-dilemma consistency (ICC > 0.90), and "moral decoupling", where stated justifications misalign with action choices.
This is a well-executed empirical study that makes a valuable contribution to the growing literature on reasoning faithfulness in LLMs. The core argument — that post-conventional moral language emerges as a rhetorical register from alignment training rather than genuine developmental progression — is supported by convergent evidence across ten analyses. The moral ventriloquism hypothesis provides a coherent theoretical framing for patterns that might otherwise appear as isolated anomalies. The paper is appropriately cautious in distinguishing empirical findings (distributional patterns) from mechanistic claims (RLHF causation), though the latter would require interpretability methods beyond the current scope.
The cross-dilemma consistency analysis (ICC > 0.90) is methodologically sound, and the result is strikingly anomalous: human moral reasoning typically shows ICC values well below 0.60 due to genuine context-sensitivity. The factorial decomposition cleanly separates scale from training effects, finding that scale effects are small ($\eta^{2}=0.050$, $d=0.55$), with mean stages spanning at most one point (5.00–6.00) even across models from 8B to 235B parameters. The moral decoupling finding, in which mid-tier models such as GPT-OSS-120B produce high-stage justifications inconsistent with their action choices, represents a genuine logical incoherence that supports the ventriloquism interpretation. The LLM-as-judge validation across three architecturally distinct models (GPT-4, Claude Sonnet, Llama-3), with high inter-judge agreement, mitigates concerns about scorer reliability.
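To make the consistency statistic concrete, the following is a minimal sketch of a one-way random-effects ICC computed over a models × dilemmas table of stage scores. The paper does not state which ICC form it uses, so ICC(1,1) is an assumption here, and the function name `icc_oneway` and both score tables are hypothetical illustrations rather than the authors' pipeline or data.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for an n_models x k_dilemmas table.

    Rows are models (the rated targets); columns are dilemmas, treated
    as repeated measurements of each model's stage. This ICC form is an
    assumption; the reviewed paper may use a different variant.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a rectangular table with >= 2 rows and >= 2 columns")

    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]

    # Between-model and within-model sums of squares.
    ss_between = k * sum((m - grand) ** 2 for m in row_means)
    ss_within = sum((v - m) ** 2 for row, m in zip(scores, row_means) for v in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        return float("nan")  # no variance at all: ICC is undefined
    return (ms_between - ms_within) / denom


# Hypothetical stage scores (not taken from the paper).
rigid = [[5, 5, 5], [6, 6, 6], [4, 4, 4]]        # each model repeats one stage
context_sensitive = [[4, 6, 5], [6, 4, 5], [5, 5, 5]]  # stage varies by dilemma

print(icc_oneway(rigid))              # 1.0
print(icc_oneway(context_sensitive))  # -0.5
```

In the first table every model returns the same stage for every dilemma, so all variance lies between models and the ICC reaches its maximum. In the second, all variance lies within models, which drives the ICC negative. Note that an ICC near 1 also requires some between-model variance, which is worth keeping in mind when nearly all scores sit at Stages 5 and 6.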
The LLM-as-judge methodology creates a potential circularity: if scoring models themselves exhibit moral ventriloquism, uniform inflation could produce artifactual consistency. The authors partially address this through inter-judge agreement, but systematic bias across all three judges toward post-conventional language remains possible. The sample of 13 models, while the largest Kohlberg-based evaluation to date, limits power for detecting nuanced training-type effects. The six-dilemma set, though standard in moral psychology, may not capture the full variance of moral reasoning domains. Most critically, the action–reasoning alignment analysis depends on inferring the "stage" of an action choice (itself a classification task), which introduces second-order measurement error — the decoupling finding could reflect classifier noise rather than genuine logical incoherence. The near-ceiling post-conventional rates (86% Stages 5–6) leave little dynamic range for detecting meaningful variation, effectively compressing the developmental spectrum.
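The circularity concern can be illustrated with a small sketch: if two judges share the same upward bias, their agreement with each other can be perfect even though neither agrees with a reference labeling. The example below uses Cohen's kappa as the agreement measure, which is an assumption, since the review does not say which statistic the paper reports. The labels and the +1 stage inflation are hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / n**2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical reference stages, e.g. from human expert scoring.
reference = [3, 4, 4, 5, 4, 3, 5, 4]

# Two hypothetical judges that both inflate every score by one stage (capped at 6).
judge_a = [min(6, s + 1) for s in reference]
judge_b = [min(6, s + 1) for s in reference]

print(cohen_kappa(judge_a, judge_b))    # 1.0: the judges agree perfectly
print(cohen_kappa(judge_a, reference))  # negative: neither matches the reference
```

High inter-judge agreement therefore rules out idiosyncratic scorer noise but not a bias shared by all judges. Detecting the latter would need an external anchor, such as a human-scored subset.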
The paper appropriately situates itself within the reasoning faithfulness literature, accurately citing Turpin et al. (2023) on unfaithful CoT explanations and Chen et al. (2025) on reasoning model hallucination. The comparison to Scherrer et al. (2023), who treat LLM moral responses as "beliefs encoded in LLMs", effectively highlights how the current work complicates such interpretations. However, the contrast with Zhou et al. (2024) regarding theory-invoking prompts deserves scrutiny: Analysis 2 finds that theory prompts do not significantly shift stage distributions, which appears to conflict with Zhou et al.'s finding that explicit moral theory grounding improves accuracy. The discrepancy may reflect different operationalizations (moral stage vs. classification accuracy), but the paper could reconcile these findings more directly. The reliance on Kohlberg's framework as "diagnostic scaffolding" is methodologically defensible given its well-characterized human distribution, though the paper acknowledges that this framework is contested in developmental psychology.
Experimental detail is generally adequate for reproduction: the six dilemmas are standard (Heinz, trolley, lifeboat, doctor truth-telling, stolen food, broken promise), prompt templates are specified (zero-shot, chain-of-thought, roleplay), and statistical analyses report full test statistics with effect sizes. The factorial ANOVA design with 234 observations (consistent with 13 models × 6 dilemmas × 3 prompt templates) across scale × training type groups is well-documented in Appendix A.1. However, critical barriers to independent reproduction remain. The exact LLM-as-judge prompts are not provided in the main text or appendix, and the response-level dataset of >600 classified outputs is not released. The judge models are identified only by family name (e.g., "Claude Sonnet"), without exact versions. Generation hyperparameters (temperature, max tokens) are not stated, and code availability is not mentioned. Full reproducibility would require: (1) open-sourcing the scoring pipeline prompts, (2) releasing the classified response dataset, (3) providing exact model API versions and generation parameters, and (4) releasing analysis code.
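As an illustration of what items (1) and (3) could look like in practice, here is a minimal sketch of a per-run record that pins the generation settings and prompts and derives a stable fingerprint from them. The class name `RunRecord`, its field names, and the placeholder values are all hypothetical; nothing here reflects the authors' actual pipeline.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record of everything needed to regenerate one scored response."""

    model_id: str          # exact API or checkpoint version, not just a family name
    judge_id: str          # exact judge model version
    temperature: float
    max_tokens: int
    prompt_template: str   # full text of the generation prompt
    judge_prompt: str      # full text of the scoring prompt

    def to_json(self) -> str:
        # Sorted keys give a canonical serialization for hashing and diffing.
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self) -> str:
        # SHA-256 of the canonical JSON; changes if any setting changes.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


# Placeholder values only; the paper does not report these settings.
record = RunRecord(
    model_id="<exact-model-version>",
    judge_id="<exact-judge-version>",
    temperature=0.0,
    max_tokens=512,
    prompt_template="<full generation prompt>",
    judge_prompt="<full judge prompt>",
)

changed = replace(record, temperature=0.7)
print(record.fingerprint() == changed.fingerprint())  # False
```

Storing such a record next to each classified response would let a reader verify that two runs used identical settings, or see exactly which setting differed.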
Do large language models reason morally, or do they merely sound like they do? We investigate whether LLM responses to moral dilemmas exhibit genuine developmental progression through Kohlberg's stages of moral development, or whether alignment training instead produces reasoning-like outputs that superficially resemble mature moral judgment without the underlying developmental trajectory. Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and conduct ten complementary analyses to characterize the nature and internal coherence of the resulting patterns. Our results reveal a striking inversion: responses overwhelmingly correspond to post-conventional reasoning (Stages 5-6) regardless of model size, architecture, or prompting strategy, the effective inverse of human developmental norms, where Stage 4 dominates. Most strikingly, a subset of models exhibit moral decoupling: systematic inconsistency between stated moral justification and action choice, a form of logical incoherence that persists across scale and prompting strategy and represents a direct reasoning consistency failure independent of rhetorical sophistication. Model scale carries a statistically significant but practically small effect; training type has no significant independent main effect; and models exhibit near-robotic cross-dilemma consistency producing logically indistinguishable responses across semantically distinct moral problems. We posit that these patterns constitute evidence for moral ventriloquism: the acquisition, through alignment training, of the rhetorical conventions of mature moral reasoning without the underlying developmental trajectory those conventions are meant to represent.