Interpolation in Statistical Learning, or, Memorizing the Training Data

24 Mar 2024 13:24

Yet Another Inadequate Placeholder

One of the interesting/surprising phenomena of the current (say 2014--) revival of neural networks, a.k.a. "deep learning", is the phenomenon that models which clearly have the capacity to exactly memorize their training data can nonetheless generalize successfully to new data points. This struck many of us --- myself very much included --- as very weird, even disturbing. In retrospect, we should not have been so surprised: you could see it already with boosting, where continued boosting kept improving generalization performance even past the point where it didn't change the fit to the training data. For that matter, back at what seems like the dawn of time, Cover and Cover and Hart showed that the one-nearest-neighbor method, the original "memorize the training data" approach, has an asymptotic risk \( \leq 2 \) the risk of the optimal decision rule (under some smoothness assumptions).

Obviously I don't think that the phenomenon of interpolating-yet-generalizing actually contradicts anything we proved in learning theory. But I also do not feel altogether easy about how to fit the parts together. Hence this notebook, which for right now is mostly a collection of stuff I should work through with pencil-and-paper. \[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \]

Let me offer a first stab at explaining what's bugging me, which will (I realize) be very inadequate, but it's on my mind because I've just been teaching closely-related stuff. Think about doing a regression of \( Y \) on (multivariate) \( X \), with data points \( (X_1, Y_1), \ldots (X_n, Y_n) \). There's some true regression function $\TrueRegFunc(x) \equiv \Expect{Y|X=x}$, so $Y=\TrueRegFunc(X) + \epsilon$, where $\Expect{\epsilon|X=x}=0$ for all $x$. Let's even say that $\Var{\epsilon|X=x}=\sigma^2$, to simplify the book-keeping below. We take our data and come up with an estimated regression function $\EstRegFunc$. Now we ask about the expected squared error if we see a new data set, where the \( X \) values are the same, but \( Y_i^{\prime} = \TrueRegFunc(x_i) + \epsilon_i^{\prime} \), and $\epsilon_i^{\prime}$ is independently drawn from the same ($x$-conditional) distribution as the original $\epsilon_i$. A little algebra says \[ \begin{eqnarray} \Expect{(Y_i^{\prime} - \EstRegFunc(x_i))^2} & = & (\Expect{Y_i^{\prime} - \EstRegFunc(x_i)})^2 + \Var{Y_i^{\prime} - \EstRegFunc(x_i)}\\ & = & \Var{\epsilon^{\prime}_i} + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + + \Var{\EstRegFunc(x_i)}\\ & = & \sigma^2 + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)}\\ & = & \text{system noise} + \text{bias}^2 + \text{estimation variance} \end{eqnarray} \] On the other hand, if we ask about the performance on the training data, we get \[ \begin{eqnarray} \Expect{(Y_i^{\prime} - \EstRegFunc(x_i))^2} & = & \Var{\epsilon_i} + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + + \Var{\EstRegFunc(x_i)} - 2\Cov{Y_i, \EstRegFunc(x_i)}\\ & = & \sigma^2 + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)} - 2\Cov{\epsilon_i, \EstRegFunc(x_i)}\\ & = & \text{system noise} + \text{bias}^2 + \text{estimation variance} - \text{optimism} \end{eqnarray} \] That is, the in-sample performance is over-optimistic about the performance on new data by an amount that reflects how well the method can memorize the noise. If we average over all the data points, \begin{eqnarray} \text{new-data risk} - \text{in-sample MSE} & = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(Y_i^{\prime} - \EstRegFunc(x_i))^2}} - \frac{1}{n}\sum_{i=1}^{n}{\Expect{(Y_i - \EstRegFunc(x_i))^2}}\\ & = & \frac{2}{n}\sum_{i=1}^{n}{\Cov{\epsilon_i, \EstRegFunc(x_i)}} \end{eqnarray}

For a linear smoother, where $\EstRegFunc(x) = \sum_{j=1}^{n}{w(x, x_j) y_j}$, we can build the hat or influence matrix \( \mathbf{w}_{ij} = w(x_i, x_j) \) and define the number of effective degrees of freedom as \( edf(\EstRegFunc) = \mathrm{tr} \mathbf{w} \); then we end up saying that the expected difference between in-sample and new performance is $\frac{\sigma^2}{n} edf(\EstRegFunc)$.

Now what does this (classical) set of results about effective degrees of freedom, optimism and covariance penalties say about interpolation? Well, if we perfectly match the data points, $\EstRegFunc(x_i) = Y_i = \TrueRegFunc(x_i) + \epsilon_i$. It follows that our risk for predicting the \( Y_i^{\prime} \) will be not 0, or even $\sigma^2$ (the risk of the Oracle who knows $\TrueRegFunc$), but $2\sigma^2$. (As with one-nearest-neighbors.) Which is true! But notice that this doesn't at all distinguish between different interpolators --- they might all have risk $2\sigma^2$ if we hold the \( x_i \) fixed, but if those vary randomly, some interpolators will do decidedly better than others...

(Also, if anyone was sent here when here looking for information about "interpolation" in the curve-fitting / numerical-analysis sense: Sorry! I hope everyone enjoyed the comparatively-brief period when search engines worked well!)