## Interpolation in Statistical Learning, or, Memorizing the Training Data

*24 Mar 2024 13:24*

Yet Another Inadequate Placeholder

One of the interesting/surprising phenomena of the current (say 2014--)
revival of neural networks, a.k.a. "deep
learning", is that models which clearly have the capacity to
exactly memorize their training data can nonetheless generalize successfully to
new data points. This struck many of us --- myself very much included --- as
very weird, even disturbing. In retrospect, we should not have
been *so* surprised: you could see it already with
boosting, where
continued boosting kept improving generalization performance even past the
point where it stopped changing the fit to the training data. For that matter,
back at what seems like the dawn of time, Cover and Hart showed that
the one-nearest-neighbor method, the
original "memorize the training data" approach, has an asymptotic risk at most
twice the risk of the optimal decision rule (under some smoothness assumptions).

Obviously I don't think that the phenomenon of
interpolating-yet-generalizing actually *contradicts* anything we proved
in learning theory. But I also do not feel
altogether easy about how to fit the parts together. Hence this notebook,
which for right now is mostly a collection of stuff I should work through with
pencil-and-paper.
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\TrueRegFunc}{\mu}
\newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\]

Let me offer a first stab at explaining what's bugging me, which will (I realize) be very inadequate, but it's on my mind because I've just been teaching closely-related stuff. Think about doing a regression of \( Y \) on (multivariate) \( X \), with data points \( (X_1, Y_1), \ldots, (X_n, Y_n) \). There's some true regression function $\TrueRegFunc(x) \equiv \Expect{Y|X=x}$, so $Y=\TrueRegFunc(X) + \epsilon$, where $\Expect{\epsilon|X=x}=0$ for all $x$. Let's even say that $\Var{\epsilon|X=x}=\sigma^2$, to simplify the book-keeping below. We take our data and come up with an estimated regression function $\EstRegFunc$. Now we ask about the expected squared error if we see a new data set, where the \( X \) values are the same, but \( Y_i^{\prime} = \TrueRegFunc(x_i) + \epsilon_i^{\prime} \), with $\epsilon_i^{\prime}$ drawn independently from the same ($x$-conditional) distribution as the original $\epsilon_i$. A little algebra says
\[
\begin{eqnarray}
\Expect{(Y_i^{\prime} - \EstRegFunc(x_i))^2} & = & (\Expect{Y_i^{\prime} - \EstRegFunc(x_i)})^2 + \Var{Y_i^{\prime} - \EstRegFunc(x_i)}\\
& = & \Var{\epsilon^{\prime}_i} + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)}\\
& = & \sigma^2 + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)}\\
& = & \text{system noise} + \text{bias}^2 + \text{estimation variance}
\end{eqnarray}
\]
On the other hand, if we ask about the performance on the training data, we get
\[
\begin{eqnarray}
\Expect{(Y_i - \EstRegFunc(x_i))^2} & = & \Var{\epsilon_i} + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)} - 2\Cov{Y_i, \EstRegFunc(x_i)}\\
& = & \sigma^2 + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)} - 2\Cov{\epsilon_i, \EstRegFunc(x_i)}\\
& = & \text{system noise} + \text{bias}^2 + \text{estimation variance} - \text{optimism}
\end{eqnarray}
\]
That is, the in-sample performance is over-optimistic about the performance
on new data by an amount that reflects how well the method can memorize the noise. If we average over all the data points,
\[
\begin{eqnarray}
\text{new-data risk} - \text{in-sample MSE} & = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(Y_i^{\prime} - \EstRegFunc(x_i))^2}} - \frac{1}{n}\sum_{i=1}^{n}{\Expect{(Y_i - \EstRegFunc(x_i))^2}}\\
& = & \frac{2}{n}\sum_{i=1}^{n}{\Cov{\epsilon_i, \EstRegFunc(x_i)}}
\end{eqnarray}
\]
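As a sanity check on that identity, here is a small Monte Carlo sketch. (The sinusoidal $\TrueRegFunc$ and the degree-5 polynomial fit are just arbitrary illustrative choices.) The gap between new-data and in-sample MSE should match $\frac{2}{n}\sum_{i}{\Cov{\epsilon_i, \EstRegFunc(x_i)}}$, with the covariances estimated across replications:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 40, 1.0, 20000
x = np.linspace(0, 1, n)
mu = np.sin(2 * np.pi * x)            # an arbitrary "true" regression function

eps_all = np.empty((reps, n))         # the training noises epsilon_i
fit_all = np.empty((reps, n))         # the fitted values muhat(x_i)
gap = np.empty(reps)                  # new-data MSE minus in-sample MSE
for r in range(reps):
    eps = rng.normal(0, sigma, n)
    y = mu + eps
    yhat = np.polyval(np.polyfit(x, y, 5), x)   # some estimator of mu
    y_new = mu + rng.normal(0, sigma, n)        # same x's, fresh noise
    eps_all[r], fit_all[r] = eps, yhat
    gap[r] = np.mean((y_new - yhat) ** 2) - np.mean((y - yhat) ** 2)

# (2/n) * sum_i Cov[eps_i, muhat(x_i)], covariances estimated over replications:
covs = np.mean((eps_all - eps_all.mean(0)) * (fit_all - fit_all.mean(0)), axis=0)
print(gap.mean(), 2 * covs.sum() / n)  # the two should agree, up to MC error
```

(For ordinary least squares on $p$ features, as here with $p = 6$, the covariance term works out to $\sigma^2 p$, so the gap should also be near $2\sigma^2 p/n$.)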

For a linear smoother, where $\EstRegFunc(x) = \sum_{j=1}^{n}{w(x, x_j) y_j}$, we can build the hat or influence matrix \( \mathbf{w}_{ij} = w(x_i, x_j) \) and *define* the number of effective degrees of freedom as \( \mathrm{edf}(\EstRegFunc) = \mathrm{tr}\,\mathbf{w} \). Since $\Cov{\epsilon_i, \EstRegFunc(x_i)} = \sigma^2 \mathbf{w}_{ii}$, we end up saying that the expected difference between in-sample and new-data performance is $\frac{2\sigma^2}{n} \mathrm{edf}(\EstRegFunc)$.
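To make the linear-smoother case concrete, here is a sketch with ridge regression on polynomial features (the features, the penalty $\lambda$, and the sinusoidal truth are all just illustrative choices): the trace of the hat matrix gives the edf, and by the averaged identity above the simulated gap should come out to $\frac{2\sigma^2}{n}\,\mathrm{tr}\,\mathbf{w}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, lam = 30, 0.5, 0.1
x = np.linspace(-1, 1, n)
X = np.vander(x, 7, increasing=True)   # polynomial features, illustrative
# Ridge is a linear smoother: yhat = H y, with hat matrix
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(7), X.T)
edf = np.trace(H)                      # effective degrees of freedom

reps = 20000
gap = np.empty(reps)
for r in range(reps):
    y = np.sin(3 * x) + rng.normal(0, sigma, n)       # arbitrary true function
    y_new = np.sin(3 * x) + rng.normal(0, sigma, n)   # same x's, fresh noise
    yhat = H @ y
    gap[r] = np.mean((y_new - yhat) ** 2) - np.mean((y - yhat) ** 2)

print(gap.mean(), 2 * sigma**2 * edf / n)  # should agree, up to MC error
```

Note that this holds whether or not the smoother is biased for the true function; the bias terms cancel in the gap.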

Now what does this (classical) set of results about effective degrees of freedom, optimism and covariance penalties say about interpolation? Well, if
we perfectly match the data points, $\EstRegFunc(x_i) = Y_i = \TrueRegFunc(x_i) + \epsilon_i$, so the prediction error is $Y_i^{\prime} - \EstRegFunc(x_i) = \epsilon_i^{\prime} - \epsilon_i$, which has mean 0 and variance $2\sigma^2$. It follows that our risk for predicting the \( Y_i^{\prime} \) will be not 0, or even $\sigma^2$ (the risk of the Oracle who knows $\TrueRegFunc$), but $2\sigma^2$. (As with one-nearest-neighbors.) Which is true! But notice that this doesn't at all distinguish between *different* interpolators --- they might all have risk $2\sigma^2$ if we hold the \( x_i \) fixed, but if those vary randomly, some interpolators will do decidedly better than others...
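The $2\sigma^2$ claim is almost too easy to simulate, but for completeness (again with an arbitrary sinusoidal truth and $\sigma = 1$): an interpolator evaluated at the training $x_i$ just returns $Y_i$, so its in-sample MSE is exactly 0, while its risk on fresh noise at those same points is $2\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, reps = 200, 1.0, 5000
x = np.linspace(0, 1, n)
mu = np.sin(2 * np.pi * x)        # an arbitrary true regression function

risk = np.empty(reps)
for r in range(reps):
    y = mu + rng.normal(0, sigma, n)
    # Any interpolator (1-NN, say) evaluated at the training x's returns
    # y itself, so the in-sample MSE is exactly 0.  Its risk on fresh
    # noise at the same x's:
    y_new = mu + rng.normal(0, sigma, n)
    risk[r] = np.mean((y_new - y) ** 2)

print(risk.mean())                # should be close to 2 * sigma**2 = 2
```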

(Also, if anyone was sent here looking for information about "interpolation" in the curve-fitting / numerical-analysis sense: Sorry! I hope everyone enjoyed the comparatively-brief period when search engines worked well!)

- See also:
  - Learning Theory
  - Model Selection
  - Random Features

- Recommended, big picture:
  - Mikhail Belkin, "Fit without fear: Remarkable mathematical phenomena of deep learning through the prism of interpolation", arxiv:2105.14368
  - Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, "Understanding deep learning (still) requires rethinking generalization", Communications of the ACM **64**(2021): 107--115 [previous version: arxiv:1611.03530]

- Recommended, close-ups:
  - Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov, "Does data interpolation contradict statistical optimality?", arxiv:1806.09471
  - Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation", arxiv:1903.08560 [To be candid, while this made perfect sense when Ryan came from his office to mine to excitedly explain it, I should really re-read the paper and work it through step-by-step...]

- To read:
  - Peter L. Bartlett, Philip M. Long, "Failures of model-dependent generalization bounds for least-norm interpolation", arxiv:2010.08479
  - Peter L. Bartlett, Andrea Montanari, Alexander Rakhlin, "Deep learning: a statistical viewpoint", arxiv:2103.09177
  - Mikhail Belkin, Daniel Hsu, Partha Mitra, "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate", arxiv:1806.05161
  - Mikhail Belkin, Siyuan Ma, Soumik Mandal, "To understand deep learning we need to understand kernel learning", arxiv:1802.01396
  - Sébastien Bubeck, Mark Sellke, "A Universal Law of Robustness via Isoperimetry", arxiv:2105.12806
  - Michael Celentano, Theodor Misiakiewicz, Andrea Montanari, "Minimum complexity interpolation in random features models", arxiv:2103.15996
  - Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi, "Multiple Descent: Design Your Own Generalization Curve", arxiv:2008.01036
  - Chen Cheng, John Duchi, Rohith Kuditipudi, "Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression", arxiv:2202.09889
  - Hung-Hsu Chou, Holger Rauhut, Rachel Ward, "Robust Implicit Regularization via Weight Normalization", arxiv:2305.05448
  - Marina Dubova, "Generalizing with overly complex representations", InfoCog workshop at NeurIPS 2022
  - Marina Dubova and Sabina J. Sloman, "Excess Capacity Learning", Proceedings of the 45th Annual Meeting of the Cognitive Science Society (2023)
  - Weinan E, Chao Ma, Lei Wu, "The Generalization Error of the Minimum-norm Solutions for Over-parameterized Neural Networks", Pure and Applied Functional Analysis **5**(2020): 1145--1460, arxiv:1912.06987
  - Vitaly Feldman, "Does Learning Require Memorization? A Short Tale about a Long Tail", arxiv:1906.05271
  - Liam Hodgkinson, Chris van der Heide, Robert Salomone, Fred Roosta, Michael W. Mahoney
    - "The Interpolating Information Criterion for Overparameterized Models", arxiv:2307.07785
    - "A PAC-Bayesian Perspective on the Interpolating Information Criterion", arxiv:2311.07013
  - Michael Kohler, Adam Krzyzak, "Over-parametrized deep neural networks do not generalize well", arxiv:1912.03925
  - Tengyuan Liang, Alexander Rakhlin, "Just interpolate: Kernel 'Ridgeless' regression can generalize", Annals of Statistics **48**(2020): 1329--1347
  - Tengyuan Liang, Benjamin Recht, "Interpolating Classifiers Make Few Mistakes", arxiv:2101.11815
  - Bo Luan, Yoonkyung Lee, Yunzhang Zhu, "Predictive Model Degrees of Freedom in Linear Regression", arxiv:2106.15682
  - Naren Sarayu Manoj, Nathan Srebro, "Interpolation Learning With Minimum Description Length", arxiv:2302.07263
  - Song Mei, Andrea Montanari, "The generalization error of random features regression: Precise asymptotics and double descent curve", arxiv:1908.05355
  - Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, Anant Sahai, "Harmless interpolation of noisy data in regression", arxiv:1903.09139
  - Jason W. Rocks and Pankaj Mehta, "Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models", Physical Review Research **4**(2022): 013201 [This turns out to be an 18-page abstract of / introduction to the 32-page "supplemental material", which has all the actual calculations, i.e., is the real paper. (I realize this is an increasingly common practice, but I hate it and I will complain about it whenever I get the chance.) So I need to tackle the supplement, because the claimed results are interesting. (But I thought everyone knew linear models interpret nonlinearities as noise?)]
  - Yue Xing, Qifan Song, Guang Cheng, "Benefit of Interpolation in Nearest Neighbor Algorithms", arxiv:1909.11720
  - Zitong Yang, Yu Bai, Song Mei, "Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models", arxiv:2103.04554