Deep Reinforcement Learning and The Tale of Two Temporal Difference Errors

cs.LG cs.AI Juan Sebastian Rojas, Chi-Guhn Lee · Mar 23, 2026


Plain-language introduction

This paper identifies a subtle but important distinction between two interpretations of the TD error in reinforcement learning: the explicit form (bootstrapped target minus prediction) commonly used in deep RL, and the implicit form (difference between temporally successive predictions) from the original Sutton (1988) formulation. The two are equivalent in tabular settings, but the authors demonstrate that increasingly nonlinear architectures cause them to diverge significantly, with profound implications for average-reward and differential RL algorithms.

Critical review

Verdict

Bottom line

The paper presents a compelling argument that deep RL practitioners have been using a potentially misleading estimate of the TD error. The authors formally prove that the equivalence between explicit ($\delta^e$) and implicit ($\delta^i$) TD errors breaks down under function approximation, and demonstrate empirically that this divergence can destabilize average-reward estimates in differential Q-learning when using the explicit form.

“In the linear function approximation setting, given an update batch size of 1, the explicit and implicit TD errors are only equal to each other at time $t$ if and only if the feature vector satisfies $||\boldsymbol{x}(S_t,A_t)||^2=1$, or if at time $t$ the RL algorithm has converged to a (potentially suboptimal) solution, such that $\delta^e_t=\delta^i_t=0$.”

Rojas and Lee, Lemma 3.4 · Section 3.1
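
The condition in Lemma 3.4 can be checked numerically. The following is a minimal sketch, assuming a linear value estimate $\hat{q} = \boldsymbol{w}^\top\boldsymbol{x}$, a single semi-gradient step $\boldsymbol{w}' = \boldsymbol{w} + \alpha\delta^e\boldsymbol{x}$, and the implicit TD error computed as the rescaled change in prediction (the form quoted from Algorithm 1). The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def explicit_td_error(w, x, x_next, reward, gamma):
    """Bootstrapped target minus current prediction (linear q-hat = w . x)."""
    return reward + gamma * (w @ x_next) - (w @ x)

def implicit_td_error(w, w_new, x, alpha):
    """Change in the prediction for the updated pair, rescaled by the step size."""
    return (w_new @ x - w @ x) / alpha

# Hypothetical toy values, chosen only for illustration.
w = np.array([0.5, -0.2, 0.1, 0.3])
x = np.array([1.0, 2.0, 0.0, -1.0])       # squared norm is 6, not 1
x_next = np.array([0.0, 1.0, 1.0, 0.0])
reward, gamma, alpha = 1.0, 0.9, 0.1

results = {}
for label, feat in [("raw", x), ("unit_norm", x / np.linalg.norm(x))]:
    delta_e = explicit_td_error(w, feat, x_next, reward, gamma)
    w_new = w + alpha * delta_e * feat     # assumed semi-gradient update, batch size 1
    delta_i = implicit_td_error(w, w_new, feat, alpha)
    results[label] = (delta_e, delta_i, feat @ feat)
    print(f"{label}: explicit={delta_e:.4f} implicit={delta_i:.4f} ||x||^2={feat @ feat:.4f}")
```

Under these assumptions the implicit error equals the explicit error scaled by $||\boldsymbol{x}||^2$, so the two agree only for the unit-norm feature vector (or when the explicit error is zero).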

What holds up

The analytical characterization in Sections 3.1-3.3 provides a rigorous foundation for understanding when and why the two TD error interpretations diverge. The derivation showing that batch updates introduce additional divergence through feature vector inner products (Lemma 3.5) is particularly insightful. Most compelling is the empirical demonstration that using the implicit TD error stabilizes deep differential RL (Figure 2), where the explicit TD error causes average-reward estimates to diverge dramatically depending on initialization.
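
The role of feature inner products under batch updates can be illustrated with a small linear example. This is a sketch under stated assumptions: the batch update is taken to be a summed semi-gradient step (a mean-reduced loss would add a further $1/B$ factor), and the features, rewards, and variable names are hypothetical. It is not the paper's statement of Lemma 3.5.

```python
import numpy as np

gamma, alpha = 0.9, 0.05
w = np.array([0.5, -0.2, 0.1])

# Rows are unit-norm feature vectors; rows 0 and 1 overlap, row 2 is orthogonal to both.
X = np.array([[1.0, 0.0, 0.0],
              [0.6, 0.8, 0.0],
              [0.0, 0.0, 1.0]])
X_next = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0]])
rewards = np.array([1.0, 0.0, -1.0])

# Explicit TD errors for every sample in the batch.
delta_e = rewards + gamma * (X_next @ w) - (X @ w)

# Assumed summed semi-gradient update over the whole batch.
w_new = w + alpha * (X.T @ delta_e)

# Implicit TD errors: rescaled change in each sample's prediction.
delta_i = (X @ w_new - X @ w) / alpha

# Gram matrix of feature inner products.
gram = X @ X.T
print("explicit:", delta_e)
print("implicit:", delta_i)
print("gram @ explicit:", gram @ delta_e)
```

In this setup each implicit error is a Gram-matrix-weighted sum of the explicit errors in the batch, so even unit-norm features produce a gap whenever samples have nonzero inner products with one another.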

“$\bar{R}_n = \bar{R}_0 + \eta\sum_{t=0}^{n-1}\alpha_t\delta^i_t + \eta\sum_{t=0}^{n-1}\alpha_t\epsilon_t$”

Equation 21 · Section 4.1

“This compounded error term can hence be viewed as one of the sources of instability in the average-reward estimates displayed in Figure 2 for deep differential algorithms that utilize the explicit TD error to update the average-reward estimate.”

Rojas and Lee · Section 4.1
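
Equation 21 can be read as an unrolled average-reward update. The following is a brief worked sketch, assuming the update takes the differential form $\bar{R}_{t+1} = \bar{R}_t + \eta\alpha_t\delta^e_t$ and that $\epsilon_t$ denotes the gap $\delta^e_t - \delta^i_t$. Both are assumptions inferred from the quoted equation rather than definitions confirmed by the excerpt.

```latex
\begin{align*}
\bar{R}_n &= \bar{R}_0 + \eta\sum_{t=0}^{n-1}\alpha_t\delta^e_t \\
          &= \bar{R}_0 + \eta\sum_{t=0}^{n-1}\alpha_t\left(\delta^i_t + \epsilon_t\right) \\
          &= \bar{R}_0 + \eta\sum_{t=0}^{n-1}\alpha_t\delta^i_t + \eta\sum_{t=0}^{n-1}\alpha_t\epsilon_t
\end{align*}
```

Under these assumptions the final sum is the compounded error term: it accumulates the per-step gap over all $n$ updates, so it can grow even when each individual $\epsilon_t$ is small.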

Main concerns

While the paper identifies the divergence issue clearly, it offers limited theoretical quantification of the divergence magnitude for deep nonlinear networks beyond noting it increases with nonlinearity. The Discussion section explicitly acknowledges this limitation: "quantifying (or bounding) this difference in deep RL settings remains an open research question." Additionally, the performance improvements in A2C and reward centering (Figure 3c) appear modest compared to the dramatic stability differences shown for differential RL, suggesting the practical impact may be limited to specific algorithm classes that use TD errors for auxiliary estimates rather than just critic updates.

“quantifying (or bounding) this difference in deep RL settings remains an open research question.”

Rojas and Lee · Section 5

Evidence and comparison

The paper correctly situates its contribution relative to prior work, noting that while Exercise 9.6 in Sutton and Barto (2018) hinted at this discrepancy, no prior work has formally characterized the differences or proposed using the implicit TD error in deep RL settings. The comparison to Wan et al. (2021) regarding the convergence proof for differential Q-learning is accurate: the proof relied on an n-step implicit interpretation, yet existing deep RL implementations used the explicit form. The empirical evaluation spans multiple domains (Inverted Pendulum, Atari Breakout/Pong, MuJoCo HalfCheetah) and algorithm classes (DQN, reward centering, A2C), although the sample sizes (4 to 8 runs per experiment) are modest, if standard for the field.

“we are hardly the first to notice a discrepancy between the two interpretations of the TD error in function approximation settings (e.g. see Exercise 9.6 in Sutton and Barto [2018]), to the best of our knowledge, this work is the first to provide a formal exploration and characterization of the differences between the two interpretations of the TD error.”

Rojas and Lee · Section 3

Reproducibility

The paper provides extensive experimental details in Appendix B, including complete network architectures (e.g., the "small" network with 0.4M parameters versus "large" with 6.7M), hyperparameter ranges tested (e.g., $\alpha \in \{2\text{e-}6, 2\text{e-}5, 2\text{e-}4, 2\text{e-}3\}$), and the number of runs per experiment (4 or 8). The pseudocode for all algorithms (including the proposed implicit TD error variants and baselines) is provided in Appendix A and Appendix B.4. However, the provided text mentions no public code repository or implementation link; one would facilitate independent reproduction of the differential Q-learning stability results.

“$\delta^i_b = \frac{1}{\alpha}(\hat{q}(S_b,A_b,\boldsymbol{w}') - \hat{q}(S_b,A_b,\boldsymbol{w}))$”

Algorithm 1 · Appendix A
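
The quoted line from Algorithm 1 can be illustrated on a small nonlinear approximator. This is a minimal sketch, assuming a one-hidden-layer tanh network, a single plain-SGD semi-gradient step, and hand-written gradients. The network, inputs, and the names `q_hat` and `grad_q` are hypothetical and are not the authors' implementation.

```python
import numpy as np

def q_hat(params, x):
    """Scalar value estimate from a one-hidden-layer tanh network."""
    W1, w2 = params
    return w2 @ np.tanh(W1 @ x)

def grad_q(params, x):
    """Gradient of q_hat with respect to (W1, w2)."""
    W1, w2 = params
    h = np.tanh(W1 @ x)
    return np.outer(w2 * (1.0 - h**2), x), h

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(8, 3))
w2 = rng.normal(scale=0.5, size=8)
params = (W1, w2)

# Hypothetical transition, for illustration only.
x = np.array([1.0, -0.5, 2.0])
x_next = np.array([0.2, 0.1, -1.0])
reward, gamma, alpha = 1.0, 0.9, 1e-4

# Explicit TD error: bootstrapped target minus prediction.
delta_e = reward + gamma * q_hat(params, x_next) - q_hat(params, x)

# Assumed semi-gradient SGD step on the critic.
gW1, gw2 = grad_q(params, x)
new_params = (W1 + alpha * delta_e * gW1, w2 + alpha * delta_e * gw2)

# Implicit TD error: rescaled change in the prediction, as in the quoted line.
delta_i = (q_hat(new_params, x) - q_hat(params, x)) / alpha

# First-order prediction: explicit error scaled by the squared gradient norm.
grad_sq_norm = np.sum(gW1**2) + np.sum(gw2**2)
first_order = delta_e * grad_sq_norm
print(delta_e, delta_i, first_order)
```

For a small step size, a first-order Taylor expansion gives an implicit error close to the explicit error times the squared gradient norm, which plays the role that $||\boldsymbol{x}||^2$ plays in the linear case. Computing the implicit error this way costs one extra forward pass with the updated weights.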

“A value function step size of $2\text{e-}5$ and an average-reward $\eta$ of $1.0$ yielded the best results and were used to generate the results displayed in Figure 3a).”

Appendix B · Section B.3

Abstract

The temporal difference (TD) error was first formalized in Sutton (1988), where it was first characterized as the difference between temporally successive predictions, and later, in that same work, formulated as the difference between a bootstrapped target and a prediction. Since then, these two interpretations of the TD error have been used interchangeably in the literature, with the latter eventually being adopted as the standard critic loss in deep reinforcement learning (RL) architectures. In this work, we show that these two interpretations of the TD error are not always equivalent. In particular, we show that increasingly-nonlinear deep RL architectures can cause these interpretations of the TD error to yield increasingly different numerical values. Then, building on this insight, we show how choosing one interpretation of the TD error over the other can affect the performance of deep RL algorithms that utilize the TD error to compute other quantities, such as with deep differential (i.e., average-reward) RL methods. All in all, our results show that the default interpretation of the TD error as the difference between a bootstrapped target and a prediction does not always hold in deep RL settings.
