
Stochastic Approximation Algorithms (Including Stochastic Gradient Descent)

Last update: 22 Dec 2025 12:44
First version: 2 September 2007

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \]

Logically, "stochastic approximation" could refer to a great range of things, but in practice it has become something of a technical term for procedures that approximate the solution of an equation, observed through noise, or which try to minimize a function, again observed through noise. The former --- root-finding --- is sometimes known as the Robbins-Monro problem, and the latter --- minimization --- as the Kiefer-Wolfowitz problem. (Tangentially, that Wolfowitz is the father of the Wolfowitz who helped drag us into the invasion of Iraq.) This turns out to be a subject with deep connections to the theory of on-line learning algorithms and recursive estimation, which is really how I became interested in it, but I also like it nowadays because it provides some very cute, yet powerful, probability examples...

A sketch of the basic idea, after teaching it for the nth time (23 October 2025)

\[ \newcommand{\TrueFunc}{m} \newcommand{\NoisyObs}{Y} \]

We want to solve the equation \( \TrueFunc(\theta) = c \), with solution (say) \(\theta^* \). (Let's pretend there's only one solution, for simplicity.) However, we do not have access to the actual function \( \TrueFunc \). Instead, when we set \( \theta \), we get a random variable \( \NoisyObs(\theta) \), with the guarantee that \( \Expect{\NoisyObs(\theta)} = \TrueFunc(\theta) \), and a further guarantee that successive \( \NoisyObs \)s are independent (and identically distributed, if we keep \( \theta \) fixed). A very naive idea would be to pick some initial \( \theta^{(1)} \), then take many observations there, \( \NoisyObs_1(\theta^{(1)}), \NoisyObs_2(\theta^{(1)}), \ldots, \NoisyObs_n(\theta^{(1)}) \), and average them, since \( \TrueFunc(\theta^{(1)}) \approx \frac{1}{n}\sum_{i=1}^{n}{\NoisyObs_i(\theta^{(1)})} \) (by the law of large numbers). Then we'd pick a different \( \theta \) and repeat the process, trying to build up a picture of the function from noisy samples. (Maybe we then interpolate or something to get a smooth approximation to \( \TrueFunc \), and solve the equation for that approximation?)

The basic, brilliant Robbins-Monro idea is instead to iteratively adjust \( \theta \) after each trial, based on how close \( \NoisyObs(\theta) \) is to the desired level. As with the naive idea, we start with an initial guess \( \theta^{(1)} \), and then adjust: \[ \theta^{(t+1)} = \theta^{(t)} - \eta_t (\NoisyObs(\theta^{(t)}) - c) \] Here \( \eta_t \) is a sequence of (deterministic) control settings, which I'll come back to in a moment. If \( \theta^{(t)} = \theta^* \), if we've hit exactly the solution, then on average we won't move in any direction. If \( \theta^{(t)} \) is close to the solution \( \theta^* \), and the function \( \TrueFunc \) is reasonably continuous, then we should not move much, but we should in fact move in the direction of the solution. Of course, since \( \NoisyObs(\theta) \) is a random variable, even if we chance on a solution there will be some noisy motion.

If we started from a solution, \( \theta^{(1)} = \theta^* \) (don't laugh! it could happen!), and the rates \( \eta_t \) are small, then after \( n \) small steps we'd be looking at \begin{eqnarray} \theta^{(n)} & = & \theta^* - \sum_{t=1}^{n}{\eta_t (\NoisyObs(\theta^{(t)}) - c) }\\ & \approx & \theta^* - \sum_{t=1}^{n}{\eta_t(\NoisyObs(\theta^*) - c)} \end{eqnarray} (again, invoking as much smoothness as is convenient to the argument). This will be unbiased, \( \Expect{\theta^{(n)}} = \theta^* \), but it will have some variance, \( \Var{\theta^{(n)}} = \Var{\NoisyObs(\theta^*)}\sum_{t=1}^{n}{\eta_t^2} \). This suggests that we want the sum of the \( \eta_t^2 \) to be finite, so a convenient choice is \( \eta_t = \eta_0/t \). This gives us a root-mean-square error around the true solution \( \propto \eta_0 \), which we can make arbitrarily small by taking \( \eta_0 \) smaller (and running the approximation for more steps). Since this is evidently a Markov process, we get the same conclusion if we don't start at the solution, but just reach it after some number of steps, only with a little more book-keeping.
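To make the iteration concrete, here is a minimal sketch in Python. The "true" function, the noise level, the starting point, and the constant \( \eta_0 \) are all my own inventions for illustration, not anything from the argument above; the point is just that the update only ever sees the noisy oracle \( \NoisyObs \), never \( \TrueFunc \) itself, and that the gains shrink as \( \eta_t = \eta_0/t \).

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy setup (illustrative assumption): the unknown "true" function, which the
# procedure never evaluates directly, and the noisy oracle Y(theta).
def true_m(theta):
    return 2.0 * theta + 1.0          # solving m(theta) = 5 gives theta* = 2

def noisy_Y(theta, noise_sd=1.0):
    return true_m(theta) + rng.normal(scale=noise_sd)

def robbins_monro(c, theta1=0.0, eta0=0.5, n_steps=10_000):
    """Iterate theta_{t+1} = theta_t - eta_t * (Y(theta_t) - c), eta_t = eta0 / t."""
    theta = theta1
    for t in range(1, n_steps + 1):
        eta_t = eta0 / t
        theta = theta - eta_t * (noisy_Y(theta) - c)
    return theta

print(robbins_monro(c=5.0))   # should wander in towards theta* = 2
```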

(This is all heuristic, not to say a tissue of blatant fallacies, but more rigorous treatments, dating back to Robbins and Monro, confirm the conclusion that we can take \( \eta_t = \eta_0/t \).)

The simplest way to turn this into an optimization procedure is to assume that the Optimization Gods are smiling upon us, so the minimum (or maximum, as desired) of \( L(\theta) \) is the point \( \theta^* \) where the gradient is zero, \( \nabla L(\theta^*) = 0 \). Then we decide that \( \TrueFunc(\theta) \) is the gradient of our objective function \( L(\theta) \), and set the desired level \( c = 0 \). The basic stochastic approximation procedure above immediately yields the iteration \[ \theta^{(t+1)} = \theta^{(t)} - \eta_t \NoisyObs(\theta^{(t)}) \] so \[ \Expect{\theta^{(t+1)} \mid \theta^{(t)}} = \theta^{(t)} - \eta_t \nabla L(\theta^{(t)}) \] and we are doing stochastic gradient descent. Today (2025), a truly shocking amount of computing power is devoted to doing this, where we get \( \NoisyObs(\theta) \) by randomly sampling one observation \( X \) from a really big data set, and taking \( \NoisyObs(\theta) = - \nabla \log{p(X;\theta)} \) for some wishfully-named probability model or other \( p \).
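As a concrete (and deliberately trivial) illustration of that recipe, here is a sketch in Python. The data set, the Gaussian-location model, and the step sizes are assumptions made for the example, not anything from the text: with \( p(x;\theta) \) the density of \( N(\theta, 1) \), the per-observation stochastic gradient is \( -\nabla \log p(x;\theta) = \theta - x \), and the minimizer of the expected negative log-likelihood is the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data set (illustrative assumption): model X ~ N(theta, 1), so the
# per-observation stochastic gradient is -d/dtheta log p(x; theta) = theta - x.
data = rng.normal(loc=3.0, scale=1.0, size=100_000)

def sgd_mean(data, theta1=0.0, eta0=1.0, n_steps=50_000):
    theta = theta1
    for t in range(1, n_steps + 1):
        x = data[rng.integers(len(data))]   # sample one observation at random
        grad = theta - x                    # Y(theta) = -grad log p(x; theta)
        theta = theta - (eta0 / t) * grad
    return theta

print(sgd_mean(data), data.mean())   # both should be close to 3
```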

(This is not quite what Kiefer and Wolfowitz did, perhaps because the idea of having a random sample of the gradient seemed far-fetched to them, perhaps because they were interested in problems where the gradient isn't zero at the optimum, e.g., their procedure could handle maximizing something like \( L(\theta) = e^{-|\theta-\theta^*|} \).)
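For what it's worth, here is a sketch of a Kiefer-Wolfowitz-style finite-difference scheme, in my own toy rendering rather than their original notation or tuning: the gradient is never observed, only noisy function values at \( \theta \pm c_t \), and both the step sizes \( a_t \) and the perturbations \( c_t \) shrink over time. The objective echoes the example just mentioned; the gain schedules are conventional choices, assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy objective (illustrative assumption): maximize L(theta) = exp(-|theta - 2|),
# observed only through noisy function values, never through gradients.
def noisy_L(theta, noise_sd=0.05):
    return np.exp(-abs(theta - 2.0)) + rng.normal(scale=noise_sd)

def kiefer_wolfowitz(theta1=0.0, a0=1.0, c0=1.0, n_steps=20_000):
    """Finite-difference stochastic approximation: estimate the slope from two
    noisy evaluations at theta +/- c_t and step uphill (maximization)."""
    theta = theta1
    for t in range(1, n_steps + 1):
        a_t = a0 / t
        c_t = c0 / t ** (1 / 3)
        slope_hat = (noisy_L(theta + c_t) - noisy_L(theta - c_t)) / (2 * c_t)
        theta = theta + a_t * slope_hat
    return theta

print(kiefer_wolfowitz())   # should end up near the maximizer theta* = 2
```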

Somebody should write a proper history of stochastic approximation, and the rise and fall and rise again of stochastic gradient descent, because I think it would be a fascinating saga of the Cold War and the post-Cold War. (Could it be a coincidence, comrades, that the first two papers, and Albert and Gardner's monograph from the 1960s, were all funded by the US Office of Naval Research?) If someone has already written that history, or anything like it, I'd love to read it.

