Optimal Linear Prediction and Estimation

Last update: 07 Jul 2025 12:15
First version: 9 March 2024; $\epsilon$ improvements (literally) 11 March 2024

Attention conservation notice: Re-worked teaching material, plus link-dumps.

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \]

We would like to predict a scalar random variable $Y$ from a $d$-dimensional vector of other random variables $\vec{X}$. We decide that we are only interested in linear predictions, of the form $ b_0 + \vec{b} \cdot \vec{X} $. We want the optimal linear predictor, in the mean-squared sense, i.e., we want to minimize the expected squared error $\Expect{(Y-(b_0 + \vec{b} \cdot \vec{X}))^2}$.

Let's do this first for the case $d=1$, where the book-keeping is simple. We'll use the following basic bits of algebra about random variables:

$\Var{Z} = \Expect{Z^2} - (\Expect{Z})^2$,
$\Var{Z+U} = \Var{Z} + \Var{U} + 2\Cov{U,Z}$,
$\Expect{aZ+bU+c} = a\Expect{Z}+b\Expect{U} + c$,
$\Cov{a Z+b, U} = a\Cov{U,Z}$,
and $\Var{aZ+b} = a^2\Var{Z}$.

Using these rules repeatedly, \begin{eqnarray} \Expect{(Y-(b_0 + b_1 X))^2} & = & \Var{Y-b_0 - b_1 X} + (\Expect{Y - b_0 - b_1 X})^2\\ & = & \Var{Y- b_1 X} + (\Expect{Y} - b_0 - b_1 \Expect{X})^2\\ & = & \Var{Y} + b_1^2 \Var{X} - 2 b_1 \Cov{X,Y} + (\Expect{Y} - b_0 - b_1 \Expect{X})^2 \end{eqnarray} I have left the final squared term alone, because that is the only place the intercept $ b_0 $ shows up. No matter what joint distribution $Y$ and $X$ might have, or what $ b_1 $ we might contemplate, we can always make that square exactly 0 by adjusting $ b_0 $ appropriately. We're thus free to just minimize over $ b_1 $ in the other term [1]: \begin{eqnarray} \left. \frac{\partial}{\partial b_1} \Expect{(Y-(b_0 + b_1 X))^2}\right|_{b_1=\beta_1} & = & 0\\ 2 \beta_1 \Var{X} - 2 \Cov{X,Y} & = & 0 \end{eqnarray} Solving: \begin{eqnarray} \beta_1 & = & \frac{\Cov{X,Y}}{\Var{X}}\\ \beta_0 & = & \Expect{Y} - \beta_1 \Expect{X} \end{eqnarray} The optimal linear predictor of $Y$ from a one-dimensional $X$ is therefore always \[ \beta_0 + \beta_1 X = \Expect{Y} + \frac{\Cov{X,Y}}{\Var{X}} (X - \Expect{X}) \]

The exact same approach works if $d > 1$. We just have to replace scalar variances and covariances with the appropriate matrices: \begin{eqnarray} \Expect{(Y-(b_0 + \vec{b} \cdot \vec{X}))^2} & = & \Var{Y} + \vec{b} \cdot \Var{\vec{X}} \vec{b} - 2 \vec{b} \cdot \Cov{\vec{X}, Y} + \left(\Expect{Y} - b_0 - \vec{b} \cdot \Expect{\vec{X}}\right)^2\\ \vec{\beta} & = & \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\\ \beta_0 & = & \Expect{Y} - \beta \cdot \vec{X} \end{eqnarray} and the optimal linear prediction is \[ \beta_0 + \vec{\beta} \cdot \vec{X} = \Expect{Y} + \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} \cdot (\vec{X} - \Expect{\vec{X}}) \]

Plugging back in to the expected squared error, we get \begin{eqnarray} \Expect{(Y-\beta_0 - \vec{\beta} \cdot \vec{X})^2} & = & \Var{Y - \vec{\beta} \cdot \vec{X}}\\ & = & \Var{Y} + \vec{\beta} \cdot \Var{\vec{X}} \vec{\beta} - 2\vec{\beta} \cdot\Cov{\vec{X}, Y}\\ & = & \Var{Y} + \left(\Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\right) \cdot \Var{\vec{X}} \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\\ & & - 2 \left(\Var{\vec{X}}^{-1} \Cov{\vec{X}, Y}\right) \cdot \Cov{\vec{X}, Y}\\ & = & \Var{Y} - \Cov{\vec{X},Y} \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} \end{eqnarray} Since $ \Var{\vec{X}}$ is a variance matrix, it's positive semi-definite, i.e., all its eigenvalues are $\geq 0$. Therefore so is its inverse, therefore $\Cov{\vec{X},Y} \cdot \Var{\vec{X}}^{-1} \Cov{\vec{X}, Y} \geq 0$, and the expected squared error of the optimal linear predictor is $\leq \Var{Y}$, with equality only if the vector of covariances $\Cov{\vec{X}, Y}$ is $\vec{0}$, or lies in the null space of $\Var{\vec{X}}^{-1}$ (assuming it has one). In general, the optimal linear predictor will reduce the expected squared error below $\Var{Y}$. (Incidentally, if we started from the last line, few would find it obvious that it is $\geq 0$, but since it is the expectation of a square, and a variance, it must be.)

Define $\epsilon \equiv Y - (\beta_0 + \vec{\beta} \cdot \vec{X})$ as the random error of this prediction. We have just worked out $\Expect{\epsilon^2}$. We can also establish that

The prediction is globally unbiased, $\Expect{\epsilon}=0$, because $\Expect{Y} = \Expect{\beta_0 + \vec{\beta} \cdot \vec{X}}$. (Ensuring this is what what the intercept is for.)
The prediction error is uncorrelated with the predictor variables, $\Cov{\epsilon, \vec{X}} = \vec{0}$. (I will leave this one as an exercise for the reader.)

At no point in any of this have we had to assume:

That the true relationship between $Y$ and $\vec{X}$ is linear;
That any of the variables has a Gaussian distribution;
That $\epsilon$ has a Gaussian distribution, or that $\Expect{\epsilon|\vec{X}}=0$, or that $\Var{\epsilon|\vec{X}}$ is constant, or that $\epsilon$ is independent (or even uncorrelated) from one observation to another,

etc., etc., through all the rest of the usual linear-regression-model assumptions. Absolutely none of that matters at this level at all. All that matters is that (I) we decide we want to use linear predictor functions, and (II) the expectations, variances and covariances appearing in the formulas actually exist. (So, you know, no Cauchy distributions.)

Over years and disciplines, this circle of ideas has been re-discovered, or re-purposed, many times. Here are some instances:

Random-design linear regression: There is a $ d+1 $-dimensional random vector, say $Z$. We pick one coordinate of it and call that $Y$, and the other $ d $ coordinates are $\vec{X}$. (Causal order, temporal order, etc., make no difference whatsoever to the math, which is part of why I avoid talking about "independent" and "dependent" variables.)
Time series interpolation: There is a time series $Z(t)$, observed at times $ t_1 < \ldots < t_d $, and we want its value at a time $t^*$ between $ t_1 $ and $ t_d $. Set $\vec{X}=(Z(t_1), Z(t_2), \ldots Z(t_d))$ and $Y=Z(t^*)$.
Time series extrapolation: As before, but now $t^* > t_d$ (extrapolating into the future, i.e., prediction in the strict sense, a.k.a. forecasting) or $t^* < t_1$ (extrapolating into the past, "retrodiction"). While this might sound very different, and much harder, than interpolating, as far as the math is concerned there is no real difference.; Interpolating and extrapolating time series in this way is sometimes called "Wiener-Kolmogorov" (or "Kolmogorov-Wiener") prediction. In truth, however, I suspect that the discrete-time version I am expounding here was known before those Ancestral Worthies, though they did come up with the continuously-observed version. (I am not touching on the latter here since it involves inverting operators rather than just matrices, etc., etc.)
Spatial interpolation / extrapolation: There is a random field $Z(\mathbf{r})$, set $\vec{X} = (Z(\mathbf{r}_1), Z(\mathbf{r}_2), \ldots Z(\mathbf{r}_d))$ and $Y=Z(\mathbf{r}^*)$.; This is now called "kriging", after D. G. Krige.
Spatio-temporal interpolation / extrapolation: You get the picture by now.
Prediction of one time series by another: There are two time series, $Z(t)$ and $U(t)$. If we want to predict $Z$ from $U$, we set $\vec{X} = (Z(t_1), Z(t_2), \ldots Z(t_d))$ and $Y=U(t^*)$. If we want to predict $U$ from $Z$ and its own history, we set $\vec{X} = (Z(t_1), Z(t_2), \ldots Z(t_{d_1}), U(s_1), U(s_2), \ldots U(s_{d_2}))$ and $Y=U(t^*)$.; Obviously, predicting based on three or more time series works the same way.; Obviously, predicting one spatial or spatio-temporal random field based on others works the same way.
De-noising a time series, possibly with extrapolation: The true time series is $S(t)$, but we only observe $Z(t) = S(t) + N(t)$ where $N(t)$ is in some sense noise. Set $\vec{X} = (Z(t_1), Z(t_2), \ldots Z(t_d))$ and $Y=S(t^*)$. This is the "Wiener filter". It can make sense for $t^*$ to equal one of the $ t_i $ (this is what Wiener called "smoothing"), but we can combine de-noising with interpolation or extrapolation.; We can of course do the same thing with spatial or spatio-temporal random fields. (This is also called "kriging".); Obviously, if we have two or more series / fields which are all noisy observations of the same underlying process, we throw them all in to $\vec{X}$ to improve our prediction / estimation of $Y$.
Latent factor score estimation in psychometrics: People are defined (or at least described) by a $q$-dimensional vector of latent "factors", so person $i$ has score $ F_{ij} $ on factor $j$. The observable scores on $d$ different psychometric tests are linear combinations of the factors plus noise, so $ T_{ik} = \sum_{j=1}^{q}{w_{kj} F_{ij}} + \epsilon_{ik} $, or $\vec{T}_i = \mathbf{w} \vec{F}_i + \vec{\epsilon}_i$, $ w_{kj} $ being the "loading" of test $k$ on factor $j$ and $\epsilon$ being the noise. We estimate each factor score separately by setting $\vec{X} = (T_{i1}, T_{i2}, \ldots T_{id})$ and $Y = F_{ij}$. (You might worry about the optimality of this if factors are known to be correlated, but we can always rotate to a coordinate system where factors are uncorrelated.); (This is the "regression" method of factor score estimation, due to my old favorite Godfrey H. Thomson. There are alternatives, for people who like to have more estimation error.)
Bayesian nonparametric regression: We have a prior distribution over the true function $\mu(t)$, and we assume IID observational noise so we observe $R(t) = \mu(t) + \epsilon(t)$. We have as our data input-output pairs, $\mathcal{D} = (t_1, r_1), \ldots (t_d, r_d)$. What we want is the posterior expectation $\Expect{\mu(t^*)|\mathcal{D}}$. Since this is hard to do in general, we settle for a linear approximation. Set $\vec{X} = (R(t_1), R(t_2), \ldots R(t_d))$ and $Y=\mu(t^*)$. Now all of the necessary expectations, variances and covariances are actually implied by the prior.; If our prior distribution is a Gaussian process, this gives the exact posterior expectation, not just the best linear approximation. In that case, too, the formula for the variance of the prediction error gives us the posterior variance, and so we get the whole posterior distribution for $\mu(t^*)$. But all that is a bit of a bonus.

In the applications to stochastic processes, it is not actually necessary, as far as this math goes, that the process be stationary. This however brings me to a point, which alert/suspicious readers may well have anticipated...

How do I actually estimate any of this? Or, where do the covariances come from?

So far, all of this has been at the level of pure probability theory, where expectations, variances, covariances, and other properties of the true distribution are available to us --- no doubt written in characters of gold on marble tablets by a fiery Hand. In the real world of statistics, as opposed to the mythology of probability, we do not have the true distribution, or even any of its moments, what we have are estimates.

If we have observed many pairs of $\vec{X}$ and $Y$, and we assume their distributions are unchanging --- at least up to the second moments! --- then we can just find the sample means, sample variances and covariances, etc., and plug in. Remarkably (or obviously?) enough, this is the same estimate as just doing a linear regression of $Y$ on $\vec{X}$ and estimating by ordinary least squares [2]. It will be consistent when sample covariances converge on true covariances, which requires something rather weaker than independent observations.

If we are looking at a stationary stochastic process, we can of course try to estimate the covariances. Thus in the case of interpolating or extrapolating a time series, if we think the series is (weakly) stationary, we are committing ourselves to $\Cov{Z(t), Z(s)} = \gamma(|t-s|)$ for some autocovariance function $\gamma$. (Then of course $\Var{Z(t)} = \gamma(0)$.) Writing $\overline{z}$ for the sample mean, every pair of observations $z(t), z(s)$ thus gives us a point $(|t-s|, (z(t)-\overline{z})(z(s)-\overline{z}))$ which we can use to estimate $\gamma$. If all else fails, we could use nonparametric regression to do this. If we assume that $\gamma$ has a certain functional form, that will of course improve the efficiency in our estimate of $\gamma$. We can do the same sort of thing for spatial and spatio-temporal processes, where now additional sorts of symmetry (isotropy, "separating" $\Gamma(h, \mathbf{b})$ into $\Gamma_{time}(h) \Gamma_{space}(\mathbf{b})$, etc.) are additional efficiency-enhancing constraints. Of course, if any of those assumptions are off, they are helping us converge efficiently to a systematically-wrong answer, but even that might be better, for prediction, than glacial convergence to the truth. (I am not interested in factor models for high-dimensional time series because I think those models are right.)

Stationarity thus helps prediction, because it helps us learn the covariances. If we were dealing with non-stationary processes where we knew something about how they are non-stationary, we could use that information instead. This is, of course, a tall order!

Everything I have said so far about estimating the covariances presumed that we can, in fact, simultaneously observe both $\vec{X}$ and $Y$. When the role of $Y$ is taken by a latent variable, however, this is harder to swallow. Take the case where we observe a time series $Z(t) = S(t) + N(t)$, with everything stationary, and want to estimate or predict $S(t^*)$ from $(Z(t_1), Z(t_2), \ldots Z(t_d))$. We will need the variances and covariances $\Cov{Z(t_i), Z(t_j)}$; stationarity in principle lets us estimate those, as $\gamma(t_i - t_j)$. But we will also need $\Cov{Z(t_i), S(t^*)}$, and even if we assume stationarity, how are we to get that, without any joint observations of $Z(t)$ and $S(t+h)$? If we assume that the noise process $N$ is uncorrelated with the state/signal process $S$, and that $N$ has no autocorrelation over time (it is "white noise"), then, for $h\neq 0$ \begin{eqnarray} \gamma(h) & = & \Cov{Z(t), Z(t+h)} \\ & = & \Cov{S(t)+N(t), S(t+h) + N(t+h)}\\ & = & \Cov{S(t), S(t+h)} + \Cov{N(t), S(t+h)} + \Cov{S(t), N(t+h)} + \Cov{N(t), N(t+h)}\\ & = & \Cov{S(t), S(t+h)} \end{eqnarray} I bring this up not because we need the autocovariance function of the signal, but because, similarly, \begin{eqnarray} \Cov{Z(t), S(t+h)} & = & \Cov{S(t) + N(t), S(t+h)}\\ & = & \Cov{S(t), S(t+h)} = \gamma(h) \end{eqnarray} That is, if we assume that we are dealing with white noise, we can learn all the necessary covariances. But if we made a different assumption about the noise (perhaps that it has exponentially-decaying autocovariance), we will get different estimates. Worse, there really are not many reliable ways of checking assumptions about the noise, precisely because $S(t)$ is latent. Or, rather, assumption-checking is hard if all we have is the time series $Z(t)$. We might, for some physical measurement processes, try to actually study how measurement works, what distribution of observations we get when we impose known signals (= sample-paths of $S$), etc., but that work, while scientifically important, is outside the scope of pure statistical methodology...

Similarly with the psychometric situation, except worse. (We sadly lack standardized subjects with calibrated levels of narcissism, implicit racism, etc., who can be run through a new psychological test to estimate its loadings.) We can infer loadings from covariances (up to some here-unimportant symmetries), but doing so rests on some assumptions which are hard-to-impossible to check with just the observable test scores.

Now, any sort of statistical inference necessarily rests on assumptions. (They are what let us connect the data we have to other things we do not directly see, rather than just saying "Yes, that's the data alright".) But there are times when the assumptions can fade into the background, and other times when the math (so to speak) rubs them in our faces. I find it interesting that inference-to-latents is one of the latter situations.

[1]: If you insist on taking the derivative of the square with respect to $ b_1 $, you will find that it vanishes at the value of $ b_0 $ which zeroes out the square. (Because $\frac{dx^2}{dx} = 2x$, which is zero at the origin.) ^

[2]: Here is a quick way to convince yourself of this, if perhaps not a very intuitive one. The argument which lead to expressing the optimal linear predictor entirely in terms of the first and second moments of the joint distribution of $\vec{X}$ and $Y$ was completely indifferent to what that distribution might be. Therefore, if we have a bunch of observation pairs $(\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots (\vec{x}_n, y_n)$, we can apply the argument to the resulting empirical distribution. Expectations under the empirical distribution are sample means, and variances and covariances under the empirical distribution are sample variances and covariances. Since (i) those "plug-in" estimates minimize the in-sample mean squared error (by the argument above), and (ii) OLS estimates also minimize the in-sample MSE (that is how we derive OLS!), and (iii) the minimizer of the MSE is unique (exercise!), we can conclude that the plug-in estimates and OLS must coincide. If you are skeptical, you will find it enlightening to write the OLS formula as $\hat{\beta} = (\frac{1}{n}\mathbf{x}^T\mathbf{x})^{-1}(\frac{1}{n}\mathbf{x}^T\mathbf{y})$, with the traditional column of 1s in $\mathbf{x}$, and explicitly work everything out for one or two regressors (i.e., for $ d=1 $ and $d = 2$), including inverting the matrix. ^

I have no conclusions here.

A. N. Kolmogorov, "Interpolation und Extrapolation von stationären zufälligen Folgen" Bulletin of the Academy Sciences, USSR, Mathematical Series 5 (1941): 3--14 [I will confess that I read this in English translation, once, in graduate school, and remember almost nothing about it. But that was back when I was punctilious about checking that famous papers really did say what they are remembered as saying. (Now I presume that they usually do not say any such thing.)]
Norbert Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Application (MIT Press, 1949; first circulated as a classified report, 1942) [A notoriously hard book to read, because (i) he wrote as though everyone were as good at Fourier analysis as he was, and (ii) he insisted on doing everything in continuous time, which made some sense with his analog hardware, but obscures the underlying idea]
Emanuel Parzen [Kolmogorov and Wiener were followed by a large literature which developed along the lines they set out. Much of this is summarized, and surpassed, in a pair of extraordinary papers by Parzen, which, incidentally, may be the first to introduce reproducing-kernel Hilbert space methods into statistics]
- "A New Approach to the Synthesis of Optimal Smoothing and Prediction Systems", pp. 75--108 in Richard Bellman (ed.), Mathematical Optimization Techniques: Papers presented at the Symposium on Mathematical Optimization Techniques, Santa Monica, California, October 18--20, 1960 (Berkeley: University of California Press, 1963)
- "An Approach to Time Series Analysis", The Annals of Mathematical Statistics 32 (1961): 951--989 [JSTOR]
D. G. Krige, Lognormal-de Wijsian Geostatistics for Ore Evaluation [PDF via the publisher, the South African Institute of Mining and Metallurgy. Krige first published his methods in papers in the 1950s, which I have not tracked down; this is his later (1981) summary. The idea was taken up by Matheron, who coined the term "kriging". (Fortunately Wiener did his own publicity, rather than leaving it to the French, or else we would presumably be stuck with "wienering".)]
Georges Matheron, Matheron's Theory of Regionalised Variables (Based on lecture notes from 1970, translated from French into English by Charles Huijbregts, edited by Vera Pawlowsky-Glahn and Jean Serra. Oxford University Press, 2019)
Godfrey H. Thomson, "Some Points of Mathematical Technique in the Factorial Analysis of Ability", Journal of Educational Psychology 27 (1936): 37--54

This is how I've taught linear regression since 2008. The Truth About Linear Regression is several hundred pages of this perspective, including a lot on what the usual Gaussian assumptions do and do not add. (The short version is chapter 2 of Advanced Data Analysis from an Elementary Point of View.)
There is also a lot of this, naturally, in my Data over Space and Time course.

Kristin Kirchner, David Bolin, "Necessary and sufficient conditions for asymptotically optimal linear prediction of random fields on compact metric spaces", Annals of Statistics 50 (2022): 1038--1065, arxiv:2005.08904