Stochastic Approximation Algorithms (Including Stochastic Gradient Descent)
Last update: 22 Dec 2025 12:44
First version: 2 September 2007
\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \]
Logically, "stochastic approximation" could refer to a great range of things, but in practice it has become something of a technical term for procedures that approximate the solution of an equation, observed through noise, or which try to minimize a function, again observed through noise. The former --- root-finding --- is sometimes known as the Robbins-Monro problem, and the latter --- minimization --- as the Kiefer-Wolfowitz problem. (Tangentially, that Wolfowitz is the father of the Wolfowitz who helped drag us into the invasion of Iraq.) This turns out to be a subject with deep connections to the theory of on-line learning algorithms and recursive estimation, which is really how I became interested in it, but I also like it nowadays because it provides some very cute, yet powerful, probability examples...
A sketch of the basic idea, after teaching it for the nth time (23 October 2025)
\[ \newcommand{\TrueFunc}{m} \newcommand{\NoisyObs}{Y} \]We want to solve the equation \( \TrueFunc(\theta) = c \), with solution (say) \( \theta^* \). (Let's pretend there's only one solution for simplicity.) However, we do not have access to the actual function \( \TrueFunc \). Instead, when we set \( \theta \), we get a random variable \( \NoisyObs(\theta) \), with the guarantee that \( \Expect{\NoisyObs(\theta)} = \TrueFunc(\theta) \), and a further guarantee that successive \( \NoisyObs \)s are independent (and identically distributed, if we keep \( \theta \) fixed). A very naive idea would be to pick some initial \( \theta^{(1)} \), then take many observations there, \( \NoisyObs_1(\theta^{(1)}), \NoisyObs_2(\theta^{(1)}), \ldots, \NoisyObs_n(\theta^{(1)}) \), and average them, since \( \TrueFunc(\theta^{(1)}) \approx \frac{1}{n}\sum_{i=1}^{n}{\NoisyObs_i(\theta^{(1)})} \) (by the law of large numbers). Then we'd pick a different \( \theta \) and repeat the process, trying to build up a picture of the function from noisy samples. (Maybe we then interpolate or something to get a smooth approximation to \( \TrueFunc \), and solve the equation for that approximation?)

The basic, brilliant Robbins-Monro idea is instead to iteratively adjust \( \theta \) after each trial, based on how close \( \NoisyObs(\theta) \) is to the desired level. As with the naive idea, we start with an initial guess \( \theta^{(1)} \), and then adjust: \[ \theta^{(t+1)} = \theta^{(t)} - \eta_t (\NoisyObs(\theta^{(t)}) - c) \] Here \( \eta_t \) is a sequence of (deterministic) control settings, which I'll come back to in a moment. If \( \theta^{(t)} = \theta^* \), i.e., if we've hit the solution exactly, then on average we won't move in any direction. If \( \theta^{(t)} \) is close to the solution \( \theta^* \), and the function \( \TrueFunc \) is reasonably continuous, then we should not move much, and on average we should move in the direction of the solution. Of course, since \( \NoisyObs(\theta) \) is a random variable, even if we chance on a solution there will be some noisy motion.

If we started from a solution, \( \theta^{(1)} = \theta^* \) (don't laugh! it could happen!), and the rates \( \eta_t \) are small, then after \( n \) small steps we'd be looking at \begin{eqnarray} \theta^{(n+1)} & = & \theta^* - \sum_{t=1}^{n}{\eta_t (\NoisyObs(\theta^{(t)}) - c)}\\ & \approx & \theta^* - \sum_{t=1}^{n}{\eta_t(\NoisyObs(\theta^*) - c)} \end{eqnarray} (again, invoking as much smoothness as is convenient to the argument). This will be unbiased, \( \Expect{\theta^{(n+1)}} = \theta^* \), but it will have some variance, \( \Var{\theta^{(n+1)}} = \Var{\NoisyObs(\theta^*)}\sum_{t=1}^{n}{\eta_t^2} \). This suggests that we want the sum of the \( \eta_t^2 \) to be finite, so a convenient choice is \( \eta_t = \eta_0/t \). This gives us a root-mean-square error around the true solution \( \propto \eta_0 \), which we can make arbitrarily small by taking \( \eta_0 \) smaller (and running the approximation for more steps). Since this is evidently a Markov process, we get the same conclusion if we don't start at the solution, but just reach it after some number of steps, only with a little more book-keeping.
(This is all heuristic, not to say a tissue of blatant fallacies, but more rigorous treatments, dating back to Robbins and Monro, confirm the conclusion that we can take \( \eta_t = \eta_0/t \).)
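For concreteness, here is a minimal numerical sketch of the Robbins-Monro iteration in Python. Only the update rule and the choice \( \eta_t = \eta_0/t \) come from the argument above; the particular function \( \TrueFunc(\theta) = 2\theta + 1 \), the level \( c = 5 \), the Gaussian noise, and all the constants are made up for illustration.

```python
# Minimal sketch of the Robbins-Monro iteration on a toy problem.
# Assumed specifics: m(theta) = 2*theta + 1, target level c = 5 (so theta* = 2),
# Gaussian observation noise.  Only the update rule comes from the text above.
import numpy as np

rng = np.random.default_rng(0)

def noisy_obs(theta):
    """Y(theta): an unbiased, noisy observation of m(theta) = 2*theta + 1."""
    return 2 * theta + 1 + rng.normal(scale=1.0)

def robbins_monro(c, theta1=0.0, eta0=0.5, n_steps=10_000):
    theta = theta1
    for t in range(1, n_steps + 1):
        eta_t = eta0 / t                          # step sizes eta_t = eta_0 / t
        theta -= eta_t * (noisy_obs(theta) - c)   # Robbins-Monro update
    return theta

print(robbins_monro(c=5.0))   # should end up close to theta* = 2
```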
The simplest way to turn this into an optimization procedure is to assume that the Optimization Gods are smiling upon us, so the minimum (or maximum, as desired) of \( L(\theta) \) is the point \( \theta^* \) where the gradient is zero, \( \nabla L(\theta^*) = 0 \). Then we decide that \( \TrueFunc(\theta) \) is the gradient of our objective function \( L(\theta) \), and set the desired level \( c = 0 \). The basic stochastic approximation procedure above immediately yields the iteration \[ \theta^{(t+1)} = \theta^{(t)} - \eta_t \NoisyObs(\theta^{(t)}) \] so \[ \Expect{\theta^{(t+1)} \mid \theta^{(t)}} = \theta^{(t)} - \eta_t \nabla L(\theta^{(t)}) \] and we are doing stochastic gradient descent. Today (2025), a truly shocking amount of computing power is devoted to doing this, where we get \( \NoisyObs(\theta) \) by randomly sampling one observation \( X \) from a really big data set, and \( \NoisyObs(\theta) = - \nabla \log{p(X;\theta)} \) for some wishfully-named probability model or other \( p \).
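As a toy illustration of the same update read as stochastic gradient descent, here is a sketch in which each noisy gradient \( \NoisyObs(\theta) \) comes from one randomly sampled observation. The least-squares objective, the linear data-generating model, the noise level, and the step-size constants are all assumptions of the example, not anything taken from the references below.

```python
# Toy single-sample SGD on a made-up least-squares problem:
# L(theta) = E[(x.theta - y)^2] / 2, with the noisy gradient Y(theta)
# computed from one randomly drawn observation per step.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "really big data set" (here just 10,000 rows) from a linear model;
# the model and all constants are made up for the illustration.
theta_star = np.array([1.0, -2.0, 3.0])
X = rng.normal(size=(10_000, 3))
y = X @ theta_star + rng.normal(scale=0.5, size=10_000)

def sgd(X, y, eta0=0.5, n_steps=50_000):
    theta = np.zeros(X.shape[1])
    for t in range(1, n_steps + 1):
        i = rng.integers(len(y))               # sample one observation at random
        grad = (X[i] @ theta - y[i]) * X[i]    # unbiased estimate of grad L(theta)
        theta = theta - (eta0 / t) * grad      # eta_t = eta_0 / t
    return theta

print(sgd(X, y))   # should end up near theta_star = [1, -2, 3]
```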
(This is not quite what Kiefer and Wolfowitz did, perhaps because the idea of having a random sample of the gradient seemed far-fetched to them, perhaps because they were interested in problems where the gradient needn't vanish (or even exist) at the optimum, e.g., their procedure could handle maximizing something like \( L(\theta) = e^{-|\theta-\theta^*|} \).)
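For contrast, here is a rough sketch of a finite-difference scheme in the Kiefer-Wolfowitz spirit, maximizing the toy objective \( L(\theta) = e^{-|\theta - \theta^*|} \) mentioned above by differencing two noisy function values per step. The gain sequences and all the constants are illustrative assumptions, not the exact recipe of the 1952 paper.

```python
# Rough sketch of a Kiefer-Wolfowitz-style finite-difference scheme.
# Objective: L(theta) = exp(-|theta - theta_star|), observed with Gaussian noise.
# The gain sequences a_t = a0/t and c_t = c0/t**(1/3) and all constants are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
theta_star = 2.0

def noisy_L(theta):
    """Noisy observation of L(theta) = exp(-|theta - theta_star|)."""
    return np.exp(-abs(theta - theta_star)) + rng.normal(scale=0.1)

def kiefer_wolfowitz(theta1=0.0, a0=1.0, c0=1.0, n_steps=20_000):
    theta = theta1
    for t in range(1, n_steps + 1):
        a_t, c_t = a0 / t, c0 / t ** (1 / 3)
        # Finite-difference estimate of L'(theta) from two noisy function values.
        grad_est = (noisy_L(theta + c_t) - noisy_L(theta - c_t)) / (2 * c_t)
        theta = theta + a_t * grad_est        # ascend, since we are maximizing
    return theta

print(kiefer_wolfowitz())   # should end up near theta_star = 2
```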
Somebody should write a proper history of stochastic approximation, and the rise and fall and rise again of stochastic gradient descent, because I think it would be a fascinating saga of the Cold War and the post-Cold War. (Could it be a coincidence, comrades, that the first two papers, and Albert and Gardner's monograph from the 1960s, were all funded by the US Office of Naval Research?) If someone has already written that history, or anything like it, I'd love to read it.
- See also:
- Low-Regret Learning
- Markov Processes
- Monte Carlo
- Optimization
- Sequential Decision-Making Under Uncertainty
- Recommended, big picture:
- Michel Benaïm, "Dynamics of stochastic approximation algorithms", Séminaire de probabilités (Strasbourg) 33 (1999): 1--68 [Link to full text, bibliography, etc.]
- Tze Leung Lai, "Stochastic Approximation", Annals of Statistics 31 (2003): 391--406
- M. B. Nevel'son and R. Z. Hasminskii, Stochastic Approximation and Recursive Estimation [A truly excellent book; I don't suppose anyone has a copy they'd like to sell?]
- Robin Pemantle, "A Survey of Random Processes with Reinforcement", math.PR/0610076
- Recommended, the origins:
- Herbert Robbins and Sutton Monro, "A Stochastic Approximation Method", Annals of Mathematical Statistics 22 (1951): 400--407
- J. Kiefer and J. Wolfowitz, "Stochastic Estimation of the Maximum of a Regression Function", Annals of Mathematical Statistics 23 (1952): 462--466
- Recommended, close-ups:
- Arthur E. Albert and Leland A. Gardner, Jr., Stochastic Approximation and Nonlinear Regression
- Léon Bottou and Olivier Bousquet, "The Tradeoffs of Large Scale Learning", pp. 351--368 of Suvrit Sra, Sebastian Nowozin and Stephen J. Wright (eds.), Optimization for Machine Learning [PDF reprint via Dr. Bottou]
- Paul Dupuis, "Large Deviations Analysis of Some Recursive Algorithms with State-Dependent Noise", Annals of Probability 16 (1988): 1509--1536
- Abdelkader Mokkadem, Mariane Pelletier, and Yousri Slaoui
- "The stochastic approximation method for the estimation of a multivariate probability density", arxiv:0807.2960
- "Revisiting Révész's stochastic approximation method for the estimation of a regression function", arxiv:0812.3973
- David Saad (ed.), On-line Learning in Neural Networks
- To read:
- Francis Bach and Eric Moulines, "Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)", arxiv:1306.2119
- Johanna Bertl, Gregory Ewing, Carolin Kosiol, Andreas Futschik, "Approximate Maximum Likelihood Estimation", arxiv:1507.04553
- Vivek S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint
- D. L. Burkholder, "On a Class of Stochastic Approximation Processes", Annals of Mathematical Statistics 27 (1956): 1044--1059
- Peng Chen, Qi-Man Shao, Lihu Xu, "A probability approximation framework: Markov process approach", arxiv:2011.10985
- Xi Chen, Jason D. Lee, Xin T. Tong, and Yichen Zhang, "Statistical inference for model parameters in stochastic gradient descent", Annals of Statistics 48 (2020): 251--273
- Damek Davis, Dmitriy Drusvyatskiy, Liwei Jiang, "Asymptotic normality and optimality in nonsmooth stochastic approximation", arxiv:2301.06632
- Aymeric Dieuleveut, Alain Durmus, and Francis Bach, "Bridging the gap between constant step size stochastic gradient descent and Markov chains", Annals of Statistics 48 (2020): 1348--1382
- Sebastien Gadat and Laurent Younes, "A Stochastic Algorithm for Feature Selection in Pattern Recognition", Journal of Machine Learning Research 8 (2007): 509--547
- Guillaume Garrigos, Robert M. Gower, "Handbook of Convergence Theorems for (Stochastic) Gradient Methods", arxiv:2301.11235
- Tamir Hazan, George Papandreou, and Daniel Tarlow (eds.), Perturbations, Optimization, and Statistics
- Sameer Kamal, "On the convergence, lock-in probability and sample complexity of stochastic approximation", arxiv:1007.4684
- H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications
- Sophie Laruelle and Gilles Pages, "Stochastic Approximation with Averaging Innovation", arxiv:10067.3578
- Jonas Latz, "Analysis of Stochastic Gradient Descent in Continuous Time", Statistics and Computing 31 (2021): 39, arxiv:2004.07177
- Abdelkader Mokkadem and Mariane Pelletier
- "Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms", Annals of Applied Probability 16 (2006): 1671--1702, math.PR/0610329
- "A companion for the Kiefer-Wolfowitz-Blum stochastic approximation algorithm", math.ST/0610487 [Simultaneously estimating both the location and the magnitude of the maximum]
- Deanna Needell, Nathan Srebro, Rachel Ward, "Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm", arxiv:1310.5715
- Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington, "Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions", arxiv:2206.07252
- Mariane Pelletier, "Weak Convergence Rates for Stochastic Approximation with Applications to Multiple Targets and Simulated Annealing", Annals of Applied Probability 8 (1998): 10--44
- Ohad Shamir, Tong Zhang, "Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes", arxiv:1212.1824
- Panos Toulis, Thibaut Horel and Edoardo M. Airoldi, "The proximal Robbins–Monro method", Journal of the Royal Statistical Society B 83 (2021): 188--212