Interpolation in Statistical Learning, or, Memorizing the Training Data
24 Mar 2024 13:24
Yet Another Inadequate Placeholder
One of the interesting/surprising phenomena of the current (say 2014--) revival of neural networks, a.k.a. "deep learning", is that models which clearly have the capacity to exactly memorize their training data can nonetheless generalize successfully to new data points. This struck many of us --- myself very much included --- as very weird, even disturbing. In retrospect, we should not have been so surprised: you could see it already with boosting, where continued boosting kept improving generalization performance even past the point where it no longer changed the fit to the training data. For that matter, back at what seems like the dawn of time, Cover and Hart showed that the one-nearest-neighbor method, the original "memorize the training data" approach, has an asymptotic risk at most twice the risk of the optimal decision rule (under some smoothness assumptions).
Obviously I don't think that the phenomenon of interpolating-yet-generalizing actually contradicts anything we proved in learning theory. But I also do not feel altogether easy about how to fit the parts together. Hence this notebook, which for right now is mostly a collection of stuff I should work through with pencil-and-paper. \[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \]
Let me offer a first stab at explaining what's bugging me, which will (I realize) be very inadequate, but it's on my mind because I've just been teaching closely-related stuff. Think about doing a regression of \( Y \) on (multivariate) \( X \), with data points \( (X_1, Y_1), \ldots, (X_n, Y_n) \). There's some true regression function $\TrueRegFunc(x) \equiv \Expect{Y|X=x}$, so $Y=\TrueRegFunc(X) + \epsilon$, where $\Expect{\epsilon|X=x}=0$ for all $x$. Let's even say that $\Var{\epsilon|X=x}=\sigma^2$, to simplify the book-keeping below. We take our data and come up with an estimated regression function $\EstRegFunc$. Now we ask about the expected squared error if we see a new data set, where the \( X \) values are the same, but \( Y_i^{\prime} = \TrueRegFunc(x_i) + \epsilon_i^{\prime} \), and $\epsilon_i^{\prime}$ is independently drawn from the same ($x$-conditional) distribution as the original $\epsilon_i$. A little algebra, using the fact that $\epsilon_i^{\prime}$ is independent of $\EstRegFunc(x_i)$, says \[ \begin{eqnarray} \Expect{(Y_i^{\prime} - \EstRegFunc(x_i))^2} & = & (\Expect{Y_i^{\prime} - \EstRegFunc(x_i)})^2 + \Var{Y_i^{\prime} - \EstRegFunc(x_i)}\\ & = & \Var{\epsilon^{\prime}_i} + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)}\\ & = & \sigma^2 + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)}\\ & = & \text{system noise} + \text{bias}^2 + \text{estimation variance} \end{eqnarray} \] On the other hand, if we ask about the performance on the training data, where $\epsilon_i$ is not independent of $\EstRegFunc(x_i)$, we get \[ \begin{eqnarray} \Expect{(Y_i - \EstRegFunc(x_i))^2} & = & \Var{\epsilon_i} + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)} - 2\Cov{Y_i, \EstRegFunc(x_i)}\\ & = & \sigma^2 + (\Expect{\EstRegFunc(x_i)} - \TrueRegFunc(x_i))^2 + \Var{\EstRegFunc(x_i)} - 2\Cov{\epsilon_i, \EstRegFunc(x_i)}\\ & = & \text{system noise} + \text{bias}^2 + \text{estimation variance} - \text{optimism} \end{eqnarray} \] That is, the in-sample performance is over-optimistic about the performance on new data by an amount that reflects how well the method can memorize the noise. If we average over all the data points, \begin{eqnarray} \text{new-data risk} - \text{in-sample MSE} & = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(Y_i^{\prime} - \EstRegFunc(x_i))^2}} - \frac{1}{n}\sum_{i=1}^{n}{\Expect{(Y_i - \EstRegFunc(x_i))^2}}\\ & = & \frac{2}{n}\sum_{i=1}^{n}{\Cov{\epsilon_i, \EstRegFunc(x_i)}} \end{eqnarray}
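Here is a quick simulation to check that covariance-penalty identity numerically. This is purely my own toy example, not anything from the literature: the fixed design, the sine regression function, the Gaussian noise, and the degree-5 polynomial least-squares fit are all arbitrary illustrative choices. It estimates the new-data-vs-in-sample gap directly, and compares it to $\frac{2}{n}\sum_{i}{\Cov{\epsilon_i, \EstRegFunc(x_i)}}$ estimated across replications; the two numbers should agree up to Monte Carlo error.

```python
# Toy check of: new-data risk - in-sample MSE = (2/n) * sum_i Cov[eps_i, muhat(x_i)]
import numpy as np

rng = np.random.default_rng(0)
n, sigma, degree, reps = 50, 1.0, 5, 20000
x = np.linspace(0, 1, n)            # fixed design points x_1, ..., x_n
mu = np.sin(2 * np.pi * x)          # an arbitrary "true" regression function

in_mse = np.zeros(reps)
new_mse = np.zeros(reps)
eps_all = np.zeros((reps, n))       # noise draws, one row per replication
fit_all = np.zeros((reps, n))       # fitted values muhat(x_i), one row per replication
for r in range(reps):
    eps = rng.normal(0, sigma, n)
    y = mu + eps                              # training responses
    coef = np.polyfit(x, y, degree)           # polynomial least squares: a linear smoother
    muhat = np.polyval(coef, x)               # fitted values at the x_i
    y_new = mu + rng.normal(0, sigma, n)      # fresh noise at the same x_i
    in_mse[r] = np.mean((y - muhat) ** 2)
    new_mse[r] = np.mean((y_new - muhat) ** 2)
    eps_all[r] = eps
    fit_all[r] = muhat

gap = new_mse.mean() - in_mse.mean()
# Cov[eps_i, muhat(x_i)], estimated across replications, then summed over i:
cov_sum = sum(np.cov(eps_all[:, i], fit_all[:, i])[0, 1] for i in range(n))
print(f"observed gap        : {gap:.4f}")
print(f"(2/n) * sum_i Cov_i : {2 * cov_sum / n:.4f}")
```

Storing every noise draw and fitted value is wasteful, but it keeps the covariance estimate completely transparent.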
For a linear smoother, where $\EstRegFunc(x) = \sum_{j=1}^{n}{w(x, x_j) y_j}$, we can build the hat or influence matrix \( \mathbf{w}_{ij} = w(x_i, x_j) \) and define the number of effective degrees of freedom as \( edf(\EstRegFunc) = \mathrm{tr} \mathbf{w} \). Since (with noise terms independent across data points) $\Cov{\epsilon_i, \EstRegFunc(x_i)} = \sigma^2 w(x_i, x_i)$, we end up saying that the expected difference between in-sample and new-data performance is $\frac{2\sigma^2}{n} edf(\EstRegFunc)$.
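Continuing in the same toy spirit: for polynomial least squares the hat matrix is the usual projection $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$, so its trace is just the number of coefficients, and the predicted gap is $\frac{2\sigma^2}{n} edf$. A minimal, self-contained sketch with the same arbitrary settings as above:

```python
# Effective degrees of freedom of a linear smoother, via the trace of the hat matrix,
# and the implied optimism 2 * sigma^2 * edf / n.
import numpy as np

n, sigma, degree = 50, 1.0, 5
x = np.linspace(0, 1, n)
X = np.vander(x, degree + 1)              # polynomial design matrix (degree+1 columns)
H = X @ np.linalg.solve(X.T @ X, X.T)     # hat / influence matrix: yhat = H y
edf = np.trace(H)                         # = degree + 1 for least squares
print(f"edf = tr(H)                         : {edf:.2f}")
print(f"predicted optimism, 2*sigma^2*edf/n : {2 * sigma**2 * edf / n:.4f}")
```

With these numbers the predicted optimism is $2 \times 1 \times 6 / 50 = 0.24$, which is what the Monte Carlo gap above converges to.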
Now what does this (classical) set of results about effective degrees of freedom, optimism and covariance penalties say about interpolation? Well, if we perfectly match the data points, $\EstRegFunc(x_i) = Y_i = \TrueRegFunc(x_i) + \epsilon_i$, so $Y_i^{\prime} - \EstRegFunc(x_i) = \epsilon_i^{\prime} - \epsilon_i$. It follows that our risk for predicting the \( Y_i^{\prime} \) will be not 0, or even $\sigma^2$ (the risk of the Oracle who knows $\TrueRegFunc$), but $2\sigma^2$. (As with one-nearest-neighbors.) Which is true! But notice that this doesn't at all distinguish between different interpolators --- they might all have risk $2\sigma^2$ if we hold the \( x_i \) fixed, but if those vary randomly, some interpolators will do decidedly better than others...
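One last sanity check, again just my own illustration: any rule that exactly reproduces the training $Y_i$ has, at those same $x_i$, new-data risk $\Expect{(\epsilon_i^{\prime} - \epsilon_i)^2} = 2\sigma^2$, no matter what $\TrueRegFunc$ is.

```python
# Risk of a perfect interpolator at the training points: E[(Y'_i - Y_i)^2] = 2 * sigma^2.
import numpy as np

rng = np.random.default_rng(1)
n, sigma, reps = 200, 1.0, 5000
mu = rng.uniform(-1, 1, n)                  # arbitrary true values mu(x_i)
risks = np.empty(reps)
for r in range(reps):
    y_train = mu + rng.normal(0, sigma, n)  # the interpolator memorizes these exactly
    y_new = mu + rng.normal(0, sigma, n)    # fresh noise at the same x_i
    risks[r] = np.mean((y_new - y_train) ** 2)
print(f"simulated risk of the interpolator : {np.mean(risks):.3f}")
print(f"2 * sigma^2                        : {2 * sigma**2:.3f}")
```

Of course this only checks the risk at the training points themselves; the differences between interpolators only show up away from them.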
(Also, if anyone was sent here looking for information about "interpolation" in the curve-fitting / numerical-analysis sense: Sorry! I hope everyone enjoyed the comparatively-brief period when search engines worked well!)
- See also:
- Learning Theory
- Model Selection
- Random Features
- Recommended, big picture:
- Mikhail Belkin, "Fit without fear: Remarkable mathematical phenomena of deep learning through the prism of interpolation", arxiv:2105.14368
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, "Understanding deep learning (still) requires rethinking generalization", Communications of the ACM 64 (2021): 107--115 [previous version: arxiv:1611.03530]
- Recommended, close-ups:
- Mikhail Belkin, Alexander Rakhlin, Alexandre B. Tsybakov, "Does data interpolation contradict statistical optimality?", arxiv:1806.09471
- Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani, "Surprises in High-Dimensional Ridgeless Least Squares Interpolation", arxiv:1903.08560 [To be candid, while this made perfect sense when Ryan came from his office to mine to excitedly explain it, I should really re-read the paper and work it through step-by-step...]
- To read:
- Peter L. Bartlett, Philip M. Long, "Failures of model-dependent generalization bounds for least-norm interpolation", arxiv:2010.08479
- Peter L. Bartlett, Andrea Montanari, Alexander Rakhlin, "Deep learning: a statistical viewpoint", arxiv:2103.09177
- Mikhail Belkin, Daniel Hsu, Partha Mitra, "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate", arxiv:1806.05161
- Mikhail Belkin, Siyuan Ma, Soumik Mandal, "To understand deep learning we need to understand kernel learning", arxiv:1802.01396
- Sébastien Bubeck, Mark Sellke, "A Universal Law of Robustness via Isoperimetry", arxiv:2105.12806
- Michael Celentano, Theodor Misiakiewicz, Andrea Montanari, "Minimum complexity interpolation in random features models", arxiv:2103.15996
- Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi, "Multiple Descent: Design Your Own Generalization Curve", arxiv:2008.01036
- Chen Cheng, John Duchi, Rohith Kuditipudi, "Memorize to Generalize: on the Necessity of Interpolation in High Dimensional Linear Regression", arxiv:2202.09889
- Hung-Hsu Chou, Holger Rauhut, Rachel Ward, "Robust Implicit Regularization via Weight Normalization", arxiv:2305.05448
- Marina Dubova, "Generalizing with overly complex representations", InfoCog workshop at NeurIPS 2022
- Marina Dubova and Sabina J. Sloman, "Excess Capacity Learning", Proceedings of the 45th Annual Meeting of the Cognitive Science Society (2023)
- Weinan E, Chao Ma, Lei Wu, "The Generalization Error of the Minimum-norm Solutions for Over-parameterized Neural Networks", Pure and Applied Functional Analysis 5 (2020): 1145--1460, arxiv:1912.06987
- Vitaly Feldman, "Does Learning Require Memorization? A Short Tale about a Long Tail", arxiv:1906.05271
- Liam Hodgkinson, Chris van der Heide, Robert Salomone, Fred Roosta, Michael W. Mahoney
- "The Interpolating Information Criterion for Overparameterized Models", arxiv:2307.07785
- "A PAC-Bayesian Perspective on the Interpolating Information Criterion", arxiv:2311.07013
- Michael Kohler, Adam Krzyzak, "Over-parametrized deep neural networks do not generalize well", arxiv:1912.03925
- Tengyuan Liang, Alexander Rakhlin, "Just interpolate: Kernel 'Ridgeless' regression can generalize", Annals of Statistics 48 (2020): 1329--1347
- Tengyuan Liang, Benjamin Recht, "Interpolating Classifiers Make Few Mistakes", arxiv:2101.11815
- Bo Luan, Yoonkyung Lee, Yunzhang Zhu, "Predictive Model Degrees of Freedom in Linear Regression", arxiv:2106.15682
- Naren Sarayu Manoj, Nathan Srebro, "Interpolation Learning With Minimum Description Length", arxiv:2302.07263
- Song Mei, Andrea Montanari, "The generalization error of random features regression: Precise asymptotics and double descent curve", arxiv:1908.05355
- Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, Anant Sahai, "Harmless interpolation of noisy data in regression", arxiv:1903.09139
- Jason W. Rocks and Pankaj Mehta, "Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models", Physical Review Research 4 (2022): 013201 [This turns out to be an 18-page abstract of / introduction to the 32-page "supplemental material", which has all the actual calculations, i.e., is the real paper. (I realize this is an increasingly common practice, but I hate it and I will complain about it whenever I get the chance.) So I need to tackle the supplement, because the claimed results are interesting. (But I thought everyone knew linear models interpret nonlinearities as noise?)]
- Yue Xing, Qifan Song, Guang Cheng, "Benefit of Interpolation in Nearest Neighbor Algorithms", arxiv:1909.11720
- Zitong Yang, Yu Bai, Song Mei, "Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models", arxiv:2103.04554