What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators
World models for reinforcement learning learn to simulate environment dynamics, yet what they represent internally remains unclear. This paper probes two architecturally distinct models—IRIS (a discrete token transformer) and DIAMOND (a continuous diffusion UNet)—on Atari Breakout and Pong using linear and MLP probes, causal interventions, and attention analysis to test whether they develop structured, interpretable representations of game state. The core finding is that world models develop approximately linear representations of salient state variables (ball position, score) that are not merely correlated but functionally used during prediction.
The paper provides compelling empirical evidence that learned world models develop approximately linear, functionally causal representations of game state across two distinct architectures. The causal intervention analysis, which shows strong correlations ($r > 0.95$) between activation shifts along probe directions and prediction changes, is methodologically rigorous and moves beyond mere correlation. However, the scope is narrow (two simple 2D games), the authors mischaracterize the findings of Li et al. (2023) regarding linear representations, and single-frame probing misses the temporal structure that world models exist to capture.
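To make the intervention logic concrete, here is a minimal sketch of shifting hidden states along a probe-derived direction and correlating shift size with the change in prediction. Everything in it is an assumption for illustration: the activations are random, `predict` is a hypothetical tanh readout standing in for the downstream model, and the shift range is arbitrary. It is not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 32  # hypothetical hidden size, chosen only for the sketch

# Unit-norm direction standing in for a fitted probe's weight vector.
probe_w = rng.normal(size=dim)
direction = probe_w / np.linalg.norm(probe_w)

# Hypothetical downstream readout: roughly aligned with the probe direction
# and mildly nonlinear, so the relationship is not trivially exact.
readout_w = direction + 0.3 * rng.normal(size=dim) / np.sqrt(dim)

def predict(h):
    """Toy stand-in for the model's prediction from hidden states h."""
    return np.tanh(0.5 * h @ readout_w)

hidden = rng.normal(size=(200, dim))   # synthetic hidden states
shifts = np.linspace(-2.0, 2.0, 9)     # intervention magnitudes

# Mean change in prediction after adding shift * direction to every state.
mean_change = np.array([
    (predict(hidden + s * direction) - predict(hidden)).mean()
    for s in shifts
])

# Pearson correlation between shift magnitude and prediction change.
r = np.corrcoef(shifts, mean_change)[0, 1]
print(f"correlation between shift and prediction change: {r:.3f}")
```

A high correlation in a setup like this shows only that the readout is sensitive to that one direction, which is why a single-direction shift is a coarser test than patching in a targeted counterfactual state.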
The probing protocol is robust: selectivity gaps between linear and MLP probes are consistently small ($|\Delta| \leq 0.06$ for IRIS, $\Delta \leq 0.10$ for DIAMOND), confirming approximately linear representations rather than deeply nonlinear encodings. The architectural comparison revealing IRIS's flat representation profile versus DIAMOND's peaked inverted-V bottleneck pattern provides genuine insight into how different architectures compress state. Attention analysis showing spatial specialization among heads and token ablation experiments with high rank correlation ($\rho > 0.9$) across three distinct baselines (zero, mean, random) strengthen mechanistic interpretability claims by demonstrating robustness to ablation methodology.
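The cross-baseline robustness check can be sketched as a pairwise Spearman rank correlation between per-token importance scores. The scores below are synthetic, and the token count and noise level are assumptions made for illustration. In the paper's setting each score would be a measured effect of ablating one token under the zero, mean, or random baseline.

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_tokens = 16  # hypothetical number of tokens per frame

# Synthetic "true" importance per token, plus baseline-specific noise.
base_importance = rng.gamma(shape=2.0, scale=1.0, size=n_tokens)
scores = {
    name: base_importance + 0.05 * rng.normal(size=n_tokens)
    for name in ("zero", "mean", "random")
}

# Spearman rho for every pair of ablation baselines.
pairwise_rho = {}
for a, b in combinations(scores, 2):
    rho, _ = spearmanr(scores[a], scores[b])
    pairwise_rho[(a, b)] = rho

for pair, rho in pairwise_rho.items():
    print(pair, round(float(rho), 3))
```

Agreement across all three pairs shows that the importance ranking does not depend on the choice of replacement value, which is the property the reported $\rho$ values speak to.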
The paper contains a significant error in its characterization of prior work: it states that Li et al. (2023) showed transformers develop "emergent linear representation of the board state," but the original Othello-GPT paper explicitly states that "Linear probes, however, produce poor results" and emphasizes non-linear probe findings.
The scope is limited to two simple, fully-observable 2D Atari games where state variables are easily extractable from RAM. The single-frame probing protocol explicitly cannot evaluate temporal dynamics—the core capability distinguishing world models from static vision models. The negative $R^2$ values observed in early DIAMOND encoder layers (as low as $-1.45$) indicate performance worse than constant predictors, yet this puzzling phenomenon receives only passing mention. Finally, the causal intervention uses coarse unidirectional shifts rather than targeted counterfactual states, and the authors acknowledge that "activation patching along a single direction is a coarse intervention."
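For readers unfamiliar with negative $R^2$: it follows directly from the definition $R^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$, which drops below zero whenever the residual sum of squares exceeds the variance around the mean. The toy numbers below are invented to show the arithmetic and are unrelated to the paper's data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# A constant predictor at the mean scores exactly zero.
constant_pred = np.full_like(y_true, y_true.mean())
r2_constant = r_squared(y_true, constant_pred)   # 0.0

# An anti-correlated predictor does worse than the constant one.
bad_pred = y_true[::-1].copy()
r2_bad = r_squared(y_true, bad_pred)             # 1 - 40/10 = -3.0

print(r2_constant, r2_bad)
```

Under cross-validation, a probe can land in this regime when the fit learned on the training folds generalizes worse than simply predicting the held-out mean. That would be one plausible reading of the early-encoder values, but the paper would need to confirm it.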
The evidence strongly supports the core claim that representations are approximately linear, with small selectivity gaps ($|\Delta| \leq 0.06$ for IRIS on both games). However, comparisons to related work are marred by the misrepresentation of Li et al. (2023) regarding linear representations. The token-level ablation experiments are rigorous, showing consistent importance rankings across baselines ($\rho > 0.92$), though the correlation between KL divergence and spatial proximity to game objects is surprisingly weak in Pong ($r \approx 0.13$) compared to Breakout ($r \approx 0.56$), suggesting information may be distributed less locally in simpler visual scenes.
The paper provides reasonably detailed experimental protocols: Ridge regression ($\alpha = 1.0$), MLP architecture (256→128→1 with ReLU), 5-fold cross-validation, and ground truth extraction from Atari RAM (10,000 frames per game). Model architectures are specified (IRIS: 10 layers, 4 heads, dim 256; DIAMOND: 4 stages, 64 channels). However, no code repository is mentioned or linked, exact checkpoint training procedures are not specified, and the frame sampling methodology (e.g., train/test splits) is not detailed. Independent reproduction would be feasible from the architecture specifications, provided the training procedures for the base world models are followed exactly.
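As a rough guide to what a reproduction of the probing protocol might look like, the sketch below fits a Ridge probe and an MLP probe with 5-fold cross-validation and reports the gap in $R^2$. It rests on several assumptions: the activations and target are synthetic, the sample count and dimension are reduced to keep it fast, and "256→128→1" is read as two hidden layers of width 256 and 128 followed by a scalar output, which the paper may or may not intend.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for extracted activations (the paper reports 10,000
# frames per game; smaller sizes are used here only for speed).
n_frames, dim = 1000, 32
activations = rng.normal(size=(n_frames, dim))

# Hypothetical state variable that is linear in the activations plus noise.
true_direction = rng.normal(size=dim)
true_direction /= np.linalg.norm(true_direction)
target = activations @ true_direction + 0.1 * rng.normal(size=n_frames)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

linear_probe = Ridge(alpha=1.0)
mlp_probe = MLPRegressor(
    hidden_layer_sizes=(256, 128),  # assumed reading of 256→128→1
    activation="relu",
    max_iter=500,
    random_state=0,
)

linear_r2 = cross_val_score(linear_probe, activations, target, cv=cv, scoring="r2").mean()
mlp_r2 = cross_val_score(mlp_probe, activations, target, cv=cv, scoring="r2").mean()

# Gap between nonlinear and linear probes; a small gap suggests the
# variable is approximately linearly decodable.
gap = mlp_r2 - linear_r2
print(f"linear R^2={linear_r2:.3f}  MLP R^2={mlp_r2:.3f}  gap={gap:.3f}")
```

Even this small sketch has to choose a fold shuffling scheme and a random seed, and those are the details the paper leaves unspecified.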
World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques (including linear and nonlinear probing, causal interventions, and attention analysis) to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher $R^2$, confirming that these representations are approximately linear. Causal interventions (shifting hidden states along probe-derived directions) produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi-baseline token ablation experiments consistently identify object-containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.