What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators

cs.LG cs.AI Xinyu Zhang · Mar 23, 2026
Plain-language introduction

World models for reinforcement learning learn to simulate environment dynamics, yet what they represent internally remains unclear. This paper probes two architecturally distinct models—IRIS (a discrete token transformer) and DIAMOND (a continuous diffusion UNet)—on Atari Breakout and Pong using linear and MLP probes, causal interventions, and attention analysis to test whether they develop structured, interpretable representations of game state. The core finding is that world models develop approximately linear representations of salient state variables (ball position, score) that are not merely correlated but functionally used during prediction.

Critical review
Verdict
Bottom line

The paper provides compelling empirical evidence that learned world models develop approximately linear, functionally causal representations of game state across two distinct architectures. The causal intervention analysis, which shows strong correlations ($r > 0.95$) between the magnitude of activation shifts along probe directions and changes in the model's predictions, is methodologically rigorous and moves beyond mere correlation. However, the scope is narrow (two simple 2D games), the authors mischaracterize the findings of Li et al. (2023) regarding linear representations, and single-frame probing may miss temporal structure, which is precisely what these world models are meant to capture.

“Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher $R^2$”
paper · Abstract
“Correlation between $|\alpha|$ and KL divergence is strong: $r=0.97$ (ball_x), $r=0.97$ (ball_y), $r=0.97$ (player_x)”
paper · Section 3.2
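To make the intervention concrete, here is a minimal sketch of the general recipe the quotes describe: shift a hidden state by $\alpha$ along a probe-derived direction, measure the KL divergence between the original and shifted predictions, and correlate it with $|\alpha|$. Everything in it is an assumption for illustration: the hidden state, the linear `readout` standing in for the model's prediction head, and `probe_weights` are random toy data, not the paper's models or code.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def intervention_curve(hidden, probe_weights, readout, alphas):
    """Shift `hidden` by alpha along the unit probe direction; return one KL per alpha."""
    direction = probe_weights / np.linalg.norm(probe_weights)
    base = softmax(readout @ hidden)
    return np.array([
        kl_divergence(base, softmax(readout @ (hidden + a * direction)))
        for a in alphas
    ])

# Toy stand-ins (hypothetical): a random hidden state, a random linear
# prediction head, and a random vector playing the role of a probe's weights.
rng = np.random.default_rng(0)
dim, vocab = 32, 16
hidden = rng.normal(size=dim)
readout = rng.normal(size=(vocab, dim))
probe_weights = rng.normal(size=dim)

alphas = np.linspace(-3.0, 3.0, 13)
kls = intervention_curve(hidden, probe_weights, readout, alphas)
r = float(np.corrcoef(np.abs(alphas), kls)[0, 1])
print(f"Pearson r between |alpha| and KL: {r:.2f}")
```

In this toy setting any direction the readout is sensitive to yields a positive correlation, which is one reason a high $r$ is more informative when compared against shifts along random control directions.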
What holds up

The probing protocol is robust: the gaps in $R^2$ between MLP and linear probes are consistently small ($|\Delta| \leq 0.06$ for IRIS, $\Delta \leq 0.10$ for DIAMOND), confirming approximately linear representations rather than deeply nonlinear encodings. The architectural comparison, which contrasts IRIS's flat representation profile with DIAMOND's peaked inverted-V bottleneck pattern, provides genuine insight into how different architectures compress state. Attention analysis shows spatial specialization among heads, and token ablation experiments yield importance rankings with high rank correlation ($\rho > 0.9$) across three distinct baselines (zero, mean, random). The latter strengthens the mechanistic interpretability claims by showing that the ablation results do not depend on the choice of baseline.

“Both models dramatically outperform baselines, with raw pixels failing on ball position ($R^2=-1.31$)”
paper · Section 3.1
“Rank correlation across methods is high ($\rho=0.93$ zero/mean, $\rho>0.99$ zero/random)”
paper · Section 3.3
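The multi-baseline check can be sketched as follows: replace each token in turn with a baseline vector, score the token by the KL divergence between the clean and ablated predictions, then compare the resulting importance rankings across baselines with Spearman's $\rho$. This is a toy illustration under stated assumptions, not the paper's implementation: `predict` is a hypothetical stand-in for the world model, the embeddings are random, and the "random" baseline here is a fresh Gaussian vector (the paper's exact choice is not given in this review).

```python
import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def predict(tokens, mixing, readout):
    """Toy predictor: weighted sum of token embeddings, then a linear readout."""
    return softmax(readout @ (mixing @ tokens))

def ablation_importance(tokens, mixing, readout, baseline):
    """KL between the clean prediction and the prediction with each token replaced."""
    clean = predict(tokens, mixing, readout)
    scores = []
    for i in range(len(tokens)):
        ablated = tokens.copy()
        ablated[i] = baseline
        scores.append(kl_divergence(clean, predict(ablated, mixing, readout)))
    return np.array(scores)

def spearman(x, y):
    """Spearman rank correlation (assumes no ties, which holds for generic floats)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

rng = np.random.default_rng(0)
n_tokens, dim, vocab = 16, 8, 10               # 16 tokens mirrors a 4x4 grid
tokens = rng.normal(size=(n_tokens, dim))
mixing = rng.uniform(0.1, 1.0, size=n_tokens)  # hypothetical per-token influence
mixing[[5, 10]] = 3.0                          # pretend tokens 5 and 10 cover objects
readout = rng.normal(size=(vocab, dim))

baselines = {
    "zero": np.zeros(dim),
    "mean": tokens.mean(axis=0),
    "random": rng.normal(size=dim),
}
importance = {
    name: ablation_importance(tokens, mixing, readout, vector)
    for name, vector in baselines.items()
}
rho_zero_mean = spearman(importance["zero"], importance["mean"])
rho_zero_random = spearman(importance["zero"], importance["random"])
print(f"rho zero/mean={rho_zero_mean:.2f}, zero/random={rho_zero_random:.2f}")
```

A high $\rho$ between two baselines means they rank the tokens in nearly the same order even if the absolute KL values differ, which is the sense in which the finding is robust to the ablation method.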
Main concerns

The paper contains a significant error in its characterization of prior work: it states that Li et al. (2023) showed transformers develop "emergent linear representation of the board state," but the original Othello-GPT paper explicitly states that "Linear probes, however, produce poor results" and emphasizes non-linear probe findings.

The scope is limited to two simple, fully observable 2D Atari games where state variables are easily extractable from RAM. As the authors themselves note, the single-frame probing protocol may miss temporal dynamics, the core capability distinguishing world models from static vision models. The negative $R^2$ values observed in early DIAMOND encoder layers (as low as $-1.45$) indicate held-out performance worse than a constant predictor that always outputs the mean, yet this puzzling phenomenon receives only passing mention. Finally, the causal intervention uses coarse unidirectional shifts rather than targeted counterfactual states, and the authors acknowledge that "activation patching along a single direction is a coarse intervention."

“Linear probes, however, produce poor results.”
Li et al., 2023 (Othello-GPT) · Abstract and Section 3
“Li et al. (2023) showed that a transformer trained to predict Othello moves develops an emergent linear representation of the board state”
paper · Section 1
“Single-frame probes may miss temporal structure that the transformer encodes”
paper · Section 4
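For readers unfamiliar with negative $R^2$: under the standard definition $R^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$, a predictor that always outputs the target mean scores exactly 0, so any negative value means the probe did worse than that constant. The following small worked example uses made-up numbers, not values from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
constant_pred = [2.5, 2.5, 2.5, 2.5]  # always predict the mean
reversed_pred = [4.0, 3.0, 2.0, 1.0]  # systematically wrong predictions

print(r_squared(y_true, constant_pred))  # 0.0
print(r_squared(y_true, reversed_pred))  # -3.0
```

With cross-validated probes, values well below zero typically arise when a probe fits noise in the training folds and then generalizes worse than the mean on held-out frames, which is why figures like $-1.45$ merit more discussion than they receive.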
Evidence and comparison

The evidence strongly supports the core claim that representations are approximately linear, with small selectivity gaps ($\Delta \leq 0.06$ for IRIS on both games). However, comparisons to related work are marred by the misrepresentation of Li et al. (2023) regarding linear representations. The token-level ablation experiments are rigorous, showing consistent importance rankings across baselines ($\rho > 0.92$). Even so, the correlation between KL divergence and spatial proximity to game objects is surprisingly weak in Pong ($r \approx 0.13$) compared to Breakout ($r \approx 0.56$), suggesting information may be distributed less locally in simpler visual scenes.

“Pong shows weaker spatial correlation ($r \approx 0.13$), suggesting information is distributed less spatially in simpler scenes”
paper · Section 3.3
“$\Delta_{\text{IRIS}}$: +0.06 for ball_x, +0.01 for ball_y”
paper · Table 1
Reproducibility

The paper provides reasonably detailed experimental protocols: Ridge regression ($\alpha = 1.0$), MLP architecture (256→128→1 with ReLU), 5-fold cross-validation, and ground truth extraction from Atari RAM (10,000 frames per game). Model architectures are specified (IRIS: 10 layers, 4 heads, dim 256; DIAMOND: 4 stages, 64 channels). However, no code repository is mentioned or linked, exact checkpoint training procedures are not specified, and the frame sampling methodology (e.g., train/test splits) is not detailed. Independent reproduction appears feasible from the architecture specifications, provided the base world models are trained by following their original procedures exactly.

“We extract frozen representations from all layers ($N=10,000$ frames per game)... Ridge regression ($\alpha=1.0$) and 2-layer MLP probes (256→128→1, ReLU, Adam), both with 5-fold CV $R^2$”
paper · Section 2.2
“IRIS tokenizes 64×64 observations into 16 discrete tokens (VQ-VAE, codebook 512, 4×4 grid) then predicts sequences with a GPT-2 transformer (10 layers, 4 heads, dim 256)”
paper · Section 2.1
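As a rough guide for anyone attempting reproduction, the linear half of the probing protocol (ridge regression with $\alpha = 1.0$, scored by 5-fold cross-validated $R^2$) might look like the sketch below. It is a minimal NumPy version under stated assumptions: the activations and the `ball_x` target are synthetic, fold shuffling is my choice because the paper's split procedure is not detailed, and the MLP probe is omitted (it would be scored the same way, with the gap taken between the two $R^2$ values).

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return weights, y_mean - x_mean @ weights

def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def cross_validated_r2(X, y, alpha=1.0, n_folds=5, seed=0):
    """Mean held-out R^2 over shuffled folds."""
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        weights, intercept = ridge_fit(X[train], y[train], alpha)
        scores.append(r2_score(y[test], X[test] @ weights + intercept))
    return float(np.mean(scores))

# Synthetic stand-ins (hypothetical): "activations" that linearly encode a
# target, and an unrelated feature matrix as a control.
rng = np.random.default_rng(0)
n_frames, dim = 500, 32
activations = rng.normal(size=(n_frames, dim))
ball_x = activations @ rng.normal(size=dim) + 0.1 * rng.normal(size=n_frames)
unrelated = rng.normal(size=(n_frames, dim))

r2_layer = cross_validated_r2(activations, ball_x)
r2_control = cross_validated_r2(unrelated, ball_x)
print(f"probe R^2: {r2_layer:.3f}, control R^2: {r2_control:.3f}")
```

The control run on unrelated features shows what a non-informative layer looks like under the same protocol, which is the kind of baseline the raw-pixel comparison in the paper provides.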
Abstract

World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques--including linear and nonlinear probing, causal interventions, and attention analysis--to two architecturally distinct world models: IRIS (discrete token transformer) and DIAMOND (continuous diffusion UNet), trained on Atari Breakout and Pong. Using linear probes, we find that both models develop linearly decodable representations of game state variables (object positions, scores), with MLP probes yielding only marginally higher $R^2$, confirming that these representations are approximately linear. Causal interventions--shifting hidden states along probe-derived directions--produce correlated changes in model predictions, providing evidence that representations are functionally used rather than merely correlated. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Multi-baseline token ablation experiments consistently identify object-containing tokens as disproportionately important. Our findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.
