Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors
This paper identifies a subtle but important distinction between two interpretations of the TD error in reinforcement learning: the explicit form (bootstrapped target minus prediction) commonly used in deep RL, and the implicit form (difference between temporally successive predictions) from the original Sutton (1988) formulation. While equivalent in tabular settings, the authors demonstrate that increasingly nonlinear architectures cause these to diverge significantly, with profound implications for average-reward and differential RL algorithms.
The paper presents a compelling argument that deep RL practitioners have been using a potentially misleading estimate of the TD error. The authors formally prove that the equivalence between explicit ($\delta^e$) and implicit ($\delta^i$) TD errors breaks down under function approximation, and demonstrate empirically that this divergence can destabilize average-reward estimates in differential Q-learning when using the explicit form.
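The breakdown of the equivalence can be illustrated with a small numerical sketch. It is not taken from the paper: it assumes that the implicit TD error $\delta^i$ can be read as the change in the prediction for the same state across one update, divided by the step size, and that the update is a single-sample semi-gradient TD(0) step. The function names, feature vectors, and constants are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical constants for illustration only (not from the paper).
GAMMA, ALPHA, REWARD = 0.9, 0.1, 1.0


def tabular_td_errors():
    """Explicit and (assumed) implicit TD errors for one tabular TD(0) update."""
    v = np.zeros(3)
    s, s_next = 0, 1
    # Explicit form: bootstrapped target minus prediction.
    delta_e = REWARD + GAMMA * v[s_next] - v[s]
    v_new = v.copy()
    v_new[s] += ALPHA * delta_e
    # Assumed implicit form: change in the prediction for s, scaled by 1/alpha.
    delta_i = (v_new[s] - v[s]) / ALPHA
    return delta_e, delta_i


def linear_td_errors():
    """Same two quantities under linear function approximation, v(s) = w . x(s)."""
    w = np.zeros(2)
    x = np.array([1.0, 2.0])       # hypothetical features of the current state
    x_next = np.array([0.0, 1.0])  # hypothetical features of the next state
    delta_e = REWARD + GAMMA * (w @ x_next) - (w @ x)
    w_new = w + ALPHA * delta_e * x  # semi-gradient TD(0) step
    # The prediction for the same state moves by alpha * delta_e * (x . x).
    delta_i = (w_new @ x - w @ x) / ALPHA
    return delta_e, delta_i


if __name__ == "__main__":
    print("tabular:", tabular_td_errors())
    print("linear: ", linear_td_errors())
```

Under these assumptions the two values coincide in the tabular case, while in the linear case they differ by the factor $x^\top x$ (the squared norm of the feature vector). This is the simplest setting in which the two interpretations separate; the paper's claim concerns the larger and less predictable gap in nonlinear networks.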
The analytical characterization in Sections 3.1-3.3 provides a rigorous foundation for understanding when and why the two TD error interpretations diverge. The derivation showing that batch updates introduce additional divergence through feature vector inner products (Lemma 3.5) is particularly insightful. Most compelling is the empirical demonstration that using the implicit TD error stabilizes deep differential RL (Figure 2), where the explicit TD error causes average-reward estimates to diverge dramatically depending on initialization.
While the paper identifies the divergence issue clearly, it offers limited theoretical quantification of the divergence magnitude for deep nonlinear networks beyond noting it increases with nonlinearity. The Discussion section explicitly acknowledges this limitation: "quantifying (or bounding) this difference in deep RL settings remains an open research question." Additionally, the performance improvements in A2C and reward centering (Figure 3c) appear modest compared to the dramatic stability differences shown for differential RL, suggesting the practical impact may be limited to specific algorithm classes that use TD errors for auxiliary estimates rather than just critic updates.
The paper correctly situates its contribution relative to prior work, noting that while Exercise 9.6 in Sutton and Barto (2018) hinted at this discrepancy, no prior work has formally characterized the differences or proposed using the implicit TD error in deep RL settings. The comparison to Wan et al. (2021) regarding the convergence proof for differential Q-learning is accurate: the proof relied on an n-step implicit interpretation, yet existing deep RL implementations used the explicit form. The empirical evaluation spans multiple domains (Inverted Pendulum, Atari Breakout/Pong, MuJoCo HalfCheetah) and algorithm classes (DQN, reward centering, A2C), though the sample sizes (4-8 runs per experiment), while standard for deep RL, are modest.
The paper provides extensive experimental details in Appendix B, including complete network architectures (e.g., the "small" network with 0.4M parameters versus "large" with 6.7M), hyperparameter ranges tested (e.g., $\alpha \in \{2\text{e-}6, 2\text{e-}5, 2\text{e-}4, 2\text{e-}3\}$), and the number of runs (4 or 8 per experiment). The pseudocode for all algorithms (including the proposed implicit TD error variants and baselines) is provided in Appendix A and Appendix B.4. However, the provided text mentions no public code repository or implementation link; releasing one would facilitate independent reproduction of the differential Q-learning stability results.
The temporal difference (TD) error was first formalized in Sutton (1988), where it was first characterized as the difference between temporally successive predictions, and later, in that same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly-nonlinear deep RL architectures can cause these interpretations of the TD error to yield increasingly different numerical values. Then, building on this insight, we show how choosing one interpretation of the TD error over the other can affect the performance of deep RL algorithms that utilize the TD error to compute other quantities, such as with deep differential (i.e., average-reward) RL methods. All in all, our results show that the default interpretation of the TD error as the difference between a bootstrapped target and a prediction does not always hold in deep RL settings.