Interpolation in Statistical Learning, or, Memorizing the Training Data

Last update: 07 Jul 2025 13:33
First version: 6 March 2024

One of the interesting/surprising phenomena of the current (say 2014--) revival of neural networks, a.k.a. "deep learning", is that models which clearly have the capacity to exactly memorize their training data can nonetheless generalize successfully to new data points. This struck many of us --- myself very much included --- as very weird, even disturbing. In retrospect, we should not have been so surprised: you could see it already with boosting, where continued boosting kept improving generalization performance even past the point where it didn't change the fit to the training data. For that matter, back at what seems like the dawn of time, Cover and Hart showed that the one-nearest-neighbor method, the original "memorize the training data" approach, has an asymptotic risk at most twice the risk of the optimal decision rule (under some smoothness assumptions).

Obviously I don't think that the phenomenon of interpolating-yet-generalizing actually contradicts anything we proved in learning theory. But I also do not feel altogether easy about how to fit the parts together. Hence this notebook, which for right now is mostly a collection of stuff I should work through with pencil-and-paper. \[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \]

Let me offer a first stab at explaining what's bugging me, which will (I realize) be very inadequate, but it's on my mind because I've just been teaching closely-related stuff. Think about doing a regression of $ Y $ on (multivariate) $ X $, with data points $ (X_1, Y_1), \ldots, (X_n, Y_n) $. There's some true regression function $\TrueRegFunc(x) \equiv \Expect{Y|X=x}$, so $Y=\TrueRegFunc(X) + \epsilon$, where $\Expect{\epsilon|X=x}=0$ for all $x$. Let's even say that $\Var{\epsilon|X=x}=\sigma^2$, to simplify the book-keeping below. We take our data and come up with an estimated regression function $\EstRegFunc$. Now we ask about the expected squared error if we see a new data set, where the $ X $ values are the same, but $ Y_i^{\prime} = \TrueRegFunc(x_i) + \epsilon_i^{\prime} $, and $\epsilon_i^{\prime}$ is independently drawn from the same ($x$-conditional) distribution as the original $\epsilon_i$. A little algebra says \[ \begin{eqnarray} \Expect{(Y_i^{\prime} - \EstRegFunc(x_i))^2} & = & (\Expect{Y_i^{\prime} - \EstRegFunc(x_i)})^2 + \Var{Y_i^{\prime} - \EstRegFunc(x_i)}\\ & = & \Var{\epsilon^{\prime}_i} + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)}\\ & = & \sigma^2 + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)}\\ & = & \text{system noise} + \text{bias}^2 + \text{estimation variance} \end{eqnarray} \] On the other hand, if we ask about the performance on the training data, we get \[ \begin{eqnarray} \Expect{(Y_i - \EstRegFunc(x_i))^2} & = & \Var{\epsilon_i} + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)} - 2\Cov{Y_i, \EstRegFunc(x_i)}\\ & = & \sigma^2 + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)} - 2\Cov{\epsilon_i, \EstRegFunc(x_i)}\\ & = & \text{system noise} + \text{bias}^2 + \text{estimation variance} - \text{optimism} \end{eqnarray} \] That is, the in-sample performance is over-optimistic about the performance on new data
by an amount that reflects how well the method can memorize the noise. If we average over all the data points, \[ \begin{eqnarray} \text{new-data risk} - \text{in-sample MSE} & = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(Y_i^{\prime} - \EstRegFunc(x_i))^2}} - \frac{1}{n}\sum_{i=1}^{n}{\Expect{(Y_i - \EstRegFunc(x_i))^2}}\\ & = & \frac{2}{n}\sum_{i=1}^{n}{\Cov{\epsilon_i, \EstRegFunc(x_i)}} \end{eqnarray} \]
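
The covariance-penalty identity above is easy to check by simulation. What follows is only a minimal sketch under assumptions of my own choosing, none of which come from the derivation itself: a fixed grid of $x$ values, a made-up true regression function $\sin(2x)$, Gaussian noise, and a cubic polynomial fit by least squares standing in for $\EstRegFunc$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, degree, reps = 30, 1.0, 3, 20000

x = np.linspace(-1.0, 1.0, n)        # fixed design: the same x_i in every replication
mu = np.sin(2.0 * x)                 # hypothetical true regression function
X = np.vander(x, degree + 1)         # polynomial design matrix, n x (degree + 1)

eps = sigma * rng.standard_normal((reps, n))        # training noise
eps_new = sigma * rng.standard_normal((reps, n))    # independent noise for the new data
Y = mu + eps
Y_new = mu + eps_new

# Least-squares fit for every replication at once; beta has shape (degree + 1, reps).
beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
fitted = (X @ beta).T                # estimated regression function at the x_i

in_sample_mse = np.mean((Y - fitted) ** 2)
new_data_risk = np.mean((Y_new - fitted) ** 2)
gap = new_data_risk - in_sample_mse

# Covariance penalty: (2/n) * sum_i Cov[eps_i, muhat(x_i)], estimated across replications.
cov_i = np.mean(eps * fitted, axis=0) - eps.mean(axis=0) * fitted.mean(axis=0)
penalty = 2.0 * cov_i.mean()

print(f"new-data risk - in-sample MSE: {gap:.3f}")
print(f"covariance penalty:            {penalty:.3f}")
```

Up to Monte Carlo error the two printed numbers should agree, and for this particular (least-squares) choice of estimator both should sit near $2\sigma^2 (\text{degree}+1)/n$.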

For a linear smoother, where $\EstRegFunc(x) = \sum_{j=1}^{n}{w(x, x_j) y_j}$, we can build the hat or influence matrix $ \mathbf{w}_{ij} = w(x_i, x_j) $ and define the number of effective degrees of freedom as $ \mathrm{edf}(\EstRegFunc) = \mathrm{tr} \mathbf{w} $. When the noise terms are uncorrelated across data points, $\Cov{\epsilon_i, \EstRegFunc(x_i)} = \sigma^2 \mathbf{w}_{ii}$, so we end up saying that the expected difference between in-sample and new performance is $\frac{2\sigma^2}{n} \mathrm{edf}(\EstRegFunc)$.
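
To make the hat-matrix bookkeeping concrete, here is a small sketch using ridge regression as the linear smoother. Ridge is purely my illustrative choice; the helper names `ridge_hat_matrix` and `edf`, and the random design, are hypothetical. Setting the penalty to zero recovers ordinary least squares, whose effective degrees of freedom equal the number of coefficients.

```python
import numpy as np

def ridge_hat_matrix(X, lam):
    """Return the matrix w with fitted values = w @ y for ridge regression.

    lam = 0 gives ordinary least squares (assuming X has full column rank).
    """
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

def edf(w):
    """Effective degrees of freedom of a linear smoother: the trace of its hat matrix."""
    return np.trace(w)

rng = np.random.default_rng(1)
n, p, sigma = 50, 5, 1.0
X = rng.standard_normal((n, p))      # made-up design matrix, for illustration only

w_ols = ridge_hat_matrix(X, 0.0)
w_ridge = ridge_hat_matrix(X, 10.0)

edf_ols = edf(w_ols)                 # trace of a projection onto p dimensions
edf_ridge = edf(w_ridge)             # shrinkage lowers the trace

# Expected gap between new-data risk and in-sample MSE, via the covariance penalty.
optimism_ols = 2 * sigma**2 * edf_ols / n
optimism_ridge = 2 * sigma**2 * edf_ridge / n

print(edf_ols, edf_ridge)
print(optimism_ols, optimism_ridge)
```

Note that nothing here depends on the response values: for a linear smoother the optimism is fixed by the design and the tuning parameter alone.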

Now what does this (classical) set of results about effective degrees of freedom, optimism and covariance penalties say about interpolation? Well, if we perfectly match the data points, $\EstRegFunc(x_i) = Y_i = \TrueRegFunc(x_i) + \epsilon_i$. It follows that our risk for predicting the $ Y_i^{\prime} $ will be not 0, or even $\sigma^2$ (the risk of the Oracle who knows $\TrueRegFunc$), but $2\sigma^2$. (As with one-nearest-neighbor.) Which is true! But notice that this doesn't at all distinguish between different interpolators --- they might all have risk $2\sigma^2$ if we hold the $ x_i $ fixed, but if those vary randomly, some interpolators will do decidedly better than others...
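
A toy simulation shows both halves of that paragraph. This is a sketch under assumptions I am making up for the purpose: one-dimensional $x$ uniform on $[0,1]$, true regression function $\sin(2\pi x)$, Gaussian noise, and two interpolators, namely one-nearest-neighbor and piecewise-linear interpolation (via `np.interp`, which holds the end values constant outside the range of the training points).

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, reps, m = 50, 0.5, 500, 400

def mu(x):
    # Hypothetical true regression function.
    return np.sin(2.0 * np.pi * x)

def one_nn(x_train, y_train, x_query):
    # Predict with the response of the closest training point.
    idx = np.abs(x_query[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

def linear_interp(x_train, y_train, x_query):
    # Piecewise-linear interpolation; x_train must be sorted in increasing order.
    return np.interp(x_query, x_train, y_train)

fixed_nn = fixed_lin = rand_nn = rand_lin = 0.0
for _ in range(reps):
    x = np.sort(rng.uniform(0.0, 1.0, n))
    y = mu(x) + sigma * rng.standard_normal(n)
    # Fixed design: new responses at the same x_i.
    y_prime = mu(x) + sigma * rng.standard_normal(n)
    # Random design: new responses at freshly drawn x values.
    x_new = rng.uniform(0.0, 1.0, m)
    y_new = mu(x_new) + sigma * rng.standard_normal(m)

    fixed_nn += np.mean((y_prime - one_nn(x, y, x)) ** 2) / reps
    fixed_lin += np.mean((y_prime - linear_interp(x, y, x)) ** 2) / reps
    rand_nn += np.mean((y_new - one_nn(x, y, x_new)) ** 2) / reps
    rand_lin += np.mean((y_new - linear_interp(x, y, x_new)) ** 2) / reps

print(f"2 sigma^2:         {2 * sigma**2:.3f}")
print(f"fixed-design risk: 1-NN {fixed_nn:.3f}, linear {fixed_lin:.3f}")
print(f"random-design risk: 1-NN {rand_nn:.3f}, linear {rand_lin:.3f}")
```

Both methods reproduce the training responses exactly, so their fixed-design risks coincide and should land near $2\sigma^2$. At new $x$ values the piecewise-linear rule averages two noisy neighbors rather than copying one, so in this setup it should come out ahead.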

Update, 2 April 2025: It turns out someone had actually done something along the lines of the previous paragraph, instead of (like me) just noodling about it, and done it by the time I was writing: see Curth et al. (2023) below. (I should write more about this paper, which is great.)

(Also, if anyone was sent here when looking for information about "interpolation" in the numerical-analysis, draw-a-curve-through-given-points sense: Sorry! I hope everyone enjoyed the comparatively-brief period when search engines worked!)

Mikhail Belkin, "Fit without fear: Remarkable mathematical phenomena of deep learning through the prism of interpolation", arxiv:2105.14368
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, "Understanding deep learning (still) requires rethinking generalization", Communications of the ACM 64 (2021): 107--115 [previous version: arxiv:1611.03530]

Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov, "Does data interpolation contradict statistical optimality?", arxiv:1806.09471
Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation", arxiv:1903.08560 [To be candid, while this made perfect sense when Ryan came from his office to mine to excitedly explain it, I should really work it through step-by-step...]
Alicia Curth, Alan Jeffares, and Mihaela van der Schaar, "A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning", NeurIPS 2023, arxiv:2310.18988

Peter L. Bartlett, Philip M. Long, "Failures of model-dependent generalization bounds for least-norm interpolation", arxiv:2010.08479
Peter L. Bartlett, Andrea Montanari, Alexander Rakhlin, "Deep learning: a statistical viewpoint", arxiv:2103.09177
Reza Bayat, Mohammad Pezeshki, Elvis Dohmatob, David Lopez-Paz, Pascal Vincent, "The Pitfalls of Memorization: When Memorization Hurts Generalization", arxiv:2412.07684
Mikhail Belkin, Daniel Hsu, Partha Mitra, "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate", arxiv:1806.05161
Mikhail Belkin, Siyuan Ma, Soumik Mandal, "To understand deep learning we need to understand kernel learning", arxiv:1802.01396
Sébastien Bubeck, Mark Sellke, "A Universal Law of Robustness via Isoperimetry", arxiv:2105.12806
Michael Celentano, Theodor Misiakiewicz, Andrea Montanari, "Minimum complexity interpolation in random features models", arxiv:2103.15996
Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi, "Multiple Descent: Design Your Own Generalization Curve", arxiv:2008.01036
Chen Cheng, John Duchi, Rohith Kuditipudi, "Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression", arxiv:2202.09889
Hung-Hsu Chou, Holger Rauhut, Rachel Ward, "Robust Implicit Regularization via Weight Normalization", arxiv:2305.05448
Alicia Curth, "Classical Statistical (In-Sample) Intuitions Don't Generalize Well: A Note on Bias-Variance Tradeoffs, Overfitting and Moving from Fixed to Random Designs", arxiv:2409.18842
Verna Dankers, Ivan Titov, Dieuwke Hupkes, "Memorisation Cartography: Mapping out the Memorisation-Generalisation Continuum in Neural Machine Translation", arxiv:2311.05379
Marina Dubova, "Generalizing with overly complex representations", InfoCog workshop at NeurIPS 2022
Marina Dubova and Sabina J. Sloman, "Excess Capacity Learning", Proceedings of the 45th Annual Meeting of the Cognitive Science Society (2023)
Weinan E, Chao Ma, Lei Wu, "The Generalization Error of the Minimum-norm Solutions for Over-parameterized Neural Networks", Pure and Applied Functional Analysis 5 (2020): 1145--1460, arxiv:1912.06987
Vitaly Feldman, "Does Learning Require Memorization? A Short Tale about a Long Tail", arxiv:1906.05271
Liam Hodgkinson, Chris van der Heide, Robert Salomone, Fred Roosta, Michael W. Mahoney
- "The Interpolating Information Criterion for Overparameterized Models", arxiv:2307.07785
- "A PAC-Bayesian Perspective on the Interpolating Information Criterion", arxiv:2311.07013
Michael Kohler, Adam Krzyzak, "Over-parametrized deep neural networks do not generalize well", arxiv:1912.03925
Tengyuan Liang, Alexander Rakhlin, "Just interpolate: Kernel 'Ridgeless' regression can generalize", Annals of Statistics 48 (2020): 1329--1347
Tengyuan Liang, Benjamin Recht, "Interpolating Classifiers Make Few Mistakes", arxiv:2101.11815
Bo Luan, Yoonkyung Lee, Yunzhang Zhu, "Predictive Model Degrees of Freedom in Linear Regression", arxiv:2106.15682
Naren Sarayu Manoj, Nathan Srebro, "Interpolation Learning With Minimum Description Length", arxiv:2302.07263
Song Mei, Andrea Montanari, "The generalization error of random features regression: Precise asymptotics and double descent curve", arxiv:1908.05355
Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, Anant Sahai, "Harmless interpolation of noisy data in regression", arxiv:1903.09139
Ryumei Nakada, Masaaki Imaizumi, "Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks", arxiv:2103.00500
Jason W. Rocks and Pankaj Mehta, "Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models", Physical Review Research 4 (2022): 013201 [This turns out to be an 18 page abstract of / introduction to the 32 page "supplemental material", which has all the actual calculations, i.e., is the real paper. (I realize this is an increasingly common practice, but I hate it and I will complain about it whenever I get the chance.) So I need to tackle the supplement, because the claimed results are interesting. (But I thought everyone knew linear models interpret nonlinearities as noise?)]
Yue Xing, Qifan Song, Guang Cheng, "Benefit of Interpolation in Nearest Neighbor Algorithms", arxiv:1909.11720
Zitong Yang, Yu Bai, Song Mei, "Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models", arxiv:2103.04554