Stochastic approximation in non-markovian environments revisited

stat.ML cs.LG math.PR Vivek Shripad Borkar · Mar 22, 2026

AI Review

Plain-language introduction

This paper extends stochastic approximation (SA) theory to non-Markovian driving noise that is also non-ergodic, establishing that the ergodic decomposition of the original process corresponds to a Doeblin decomposition of an equivalent Markov chain. The core insight is that iterates retain memory of the distant past through the tail $\sigma$-field at $-\infty$, offering a theoretical lens on how learning algorithms might encode long-term dependencies. The author proposes this framework as a paradigm for understanding transformer attention mechanisms and continual learning, where the entire history influences current updates.

Critical review

Verdict

Bottom line

This is a rigorous theoretical extension of the author's prior work on non-Markovian SA, providing a clean probabilistic analysis of how non-ergodicity manifests in the limit ODE. However, the leap from abstract ergodic decompositions to concrete claims about transformer architectures and continual learning is speculative and unsupported by empirical validation or tight technical correspondence.

“viewing large time windows into the past as an approximation to the infinite past, the above picture gives some intuition about the transformer's attention mechanism zeroing on the relevant aspects of the past”

Borkar, Sec. 4

What holds up

The mathematical framework connecting stationary non-Markovian processes to their Markov mimics via ergodic decomposition is solid. The argument that the tail $\sigma$-field $\sigma_{-\infty} := \bigcap_{n=-\infty}^{\infty} \sigma(\widetilde{Z}(k), k \leq n)$ encodes which ergodic class is operative, and thus retains memory of the distant past, follows cleanly from the ergodic theory of Markov chains on general spaces. The Doeblin decomposition correspondence is technically sound and novel in this SA context.

“the ergodic decomposition of invariant measures for $\{Z(n)\}$ leads to an identical ergodic decomposition for $\{\widetilde{Z}(n)\}$”

Borkar, Sec. 3

“the information regarding which precise $\psi_x$ is operative will be contained in the 'tail $\sigma$-field at $-\infty$'”

Borkar, Sec. 3
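
As a rough illustration of the point above (not the paper's construction), the toy simulation below runs a Robbins-Monro iteration driven by a stationary but non-ergodic noise process. An ergodic class is fixed once, standing in for the information carried by the tail $\sigma$-field at $-\infty$, and the noise is then drawn i.i.d. from that class for simplicity. The class means, the step size $1/(n+1)$, and the names `CLASS_MEANS` and `run_sa` are all assumptions made for this sketch.

```python
import random

# Hypothetical per-class means: each ergodic class has its own stationary mean.
CLASS_MEANS = {0: -1.0, 1: 2.0}


def run_sa(ergodic_class, n_steps=20000, seed=0):
    """Iterate x_{n+1} = x_n + a(n) * (Z(n) - x_n) with a(n) = 1 / (n + 1).

    The class is chosen once and never changes, so the driving noise is
    stationary but non-ergodic when the class itself is random.
    """
    rng = random.Random(seed)
    mean = CLASS_MEANS[ergodic_class]
    x = 0.0
    for n in range(n_steps):
        z = rng.gauss(mean, 1.0)   # noise sample from the operative class
        step = 1.0 / (n + 1)       # decreasing step size
        x += step * (z - x)
    return x


limit_class_0 = run_sa(0)
limit_class_1 = run_sa(1)
```

In this toy setting the iterate settles near the mean of whichever class was fixed at the start, rather than near an average over classes, which is the sense in which the limit "remembers" the distant past.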

Main concerns

The application to transformers (Section 4) is underdeveloped and purely analogical: the paper acknowledges that transformers use finite windows rather than the infinite past, and it offers 'intuition' rather than technical insight. The proposal for continual learning via equation (3.3) using exponentially decaying weights $\xi(n) := \sum_{m=0}^{n} \alpha^{\tau_m} \zeta_m$ is sketched in a single paragraph without implementation details, convergence analysis specific to that setting, or empirical verification. The discussion of SGD gradient bias versus finite-difference methods appears tangential to the main non-Markovian thesis and may confuse readers.

“the transformer does not involve the entire past, but a large window into the past. Nevertheless, viewing large time windows into the past as an approximation to the infinite past, the above picture gives some intuition”

Borkar, Sec. 4

“This can model continual learning”

Borkar, Sec. 3, Eq. (3.3)
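
To make the weighted sum in (3.3) concrete, the following sketch transcribes $\xi(n) = \sum_{m=0}^{n} \alpha^{\tau_m} \zeta_m$ directly. The review does not state how $\tau_m$ is defined, so the exponents are taken as an input. The choice $\tau_m = n - m$ in the usage lines is an assumption for illustration only, as is the function name `weighted_history`.

```python
def weighted_history(zetas, taus, alpha):
    """Return sum over m of alpha**taus[m] * zetas[m].

    zetas: past inputs zeta_0, ..., zeta_n
    taus:  nonnegative exponents tau_0, ..., tau_n (their definition is assumed)
    alpha: decay factor in (0, 1)
    """
    if len(zetas) != len(taus):
        raise ValueError("zetas and taus must have the same length")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return sum(alpha ** tau * zeta for tau, zeta in zip(taus, zetas))


# Usage with the assumed choice tau_m = n - m (older inputs decay more).
zetas = [1.0, 1.0, 1.0]
n = len(zetas) - 1
taus = [n - m for m in range(n + 1)]
xi = weighted_history(zetas, taus, alpha=0.5)  # 0.25 + 0.5 + 1.0
```

Under this assumed choice the most recent input has weight 1 and each older input is discounted by a further factor of $\alpha$, so the whole past contributes but with geometrically shrinking influence.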

Evidence and comparison

The theoretical evidence relies heavily on the author's own prior work ([4]) and standard texts (Meyn & Tweedie, Benaïm), which is appropriate for this type of foundational theory. However, comparisons to related work on transformers ([6], [9]) are superficial, citing them merely as 'studied from multiple angles' without engaging with their technical specifics. The Goel et al. work cited ([9]) concerns preconditioned gradient descent for attention training, a distinct optimization perspective, yet the paper positions itself as 'quite distinct' without clarifying how the non-Markovian SA framework complements or contradicts existing optimization analyses.

“Given the enormous current interest in this topic, the latter themes are being studied from multiple angles, see, e.g., [6], [9] to quote two recent contributions. The approach here is, however, quite distinct.”

Borkar, Sec. 4

Reproducibility

This is a purely theoretical paper with no code, data, or experimental protocols provided. Reproduction would require verifying the measure-theoretic arguments regarding ergodic decompositions and Doeblin decompositions on Polish spaces. The lack of empirical validation for the transformer and continual learning applications means there are no hyperparameters, training details, or datasets to reproduce. Assumptions such as compact state space $S$ and Lipschitz continuity of transition kernels are standard but should be checked against specific applications.

“For simplicity, we shall take $S$ to be compact”

Borkar, Sec. 2

Abstract

Based on some recent work of the author on stochastic approximation in non-markovian environments, the situation when the driving random process is non-ergodic in addition to being non-markovian is considered. Using this, we propose an analytic framework for understanding transformer-based learning, specifically, the 'attention' mechanism, and continual learning, both of which depend on the entire past in principle.
