Attention conservation notice: Re-purposed teaching materials, about a confusing point of terminology in two arcane disciplines. This is amateur history of science, which will not help you learn or practice those disciplines, even if you wanted to (which you don't). Also, it was written for an advanced undergraduate class in one of those disciplines, and presumes some familiarity with the jargon and concepts.
I have written several versions of this over the years, for various classes; this one is from 18 February 2026, for a class which emphasized (Nadaraya-Watson) kernel smoothing, splines, and kernel density estimation. Posted now, for lack of other material, to refer to in the future, rather than re-writing yet again.
TL;DR: In statistics and machine learning, "kernel methods" refer to two different families of methods, one based on convolution, the other on hiding a basis expansion in the guise of a sum over data points. These use two different (but overlapping) sets of kernels. In both cases, the name "kernel" comes from some problems involving integrals in mathematical physics. (TL;DR of the TL;DR: Blame the mathematicians.)
"Kernel" is one of those terms which is used in many distinct-but-related senses across different areas of mathematics (like "normal"). The common metaphor is "the seed (in some sense) from which some larger object or structure grows (in some sense)". In particular, in physics, there are a lot of problems which involve integral operators \( \mathcal{I} \) that map one function, say \( f \), to a new function \( \mathcal{I} f \), by the relationship \[ (\mathcal{I}f)(x) = \int{f(z) K(x, z) dz} \] The inner function \( K(x,z) \) is called the kernel of the operator. (Or at least that's what it came to be called in English; a lot of this was first worked out in German, in the 1800s, and I don't know the original German technical term.)
A particularly important class of integral operators, both in physics and in a lot of other fields, take the form \[ \int{f(z) G(x-z) dz} \] That is, the kernel isn't a two-argument function, of \( x \) and \( z \), but another one-argument function, only involving the difference between \( x \) and \( z \), say \( u = x-z \). This came to be called (in English) the convolution of the two functions \( f \) and \( G \). (The German original was Faltung; as late as 1933, an American mathematician writing about this uses that term because "there is no good English word" (*). I am told one would ordinarily translate this as something like "folding".) It was recognized a long time ago that if \( G(u) \) is a probability density function (pdf), then \( \int{f(z) G(x-z) dz} \) gives a weighted average of all the values of \( f \), with the weight given to \( f(z) \) depending on how close \( x \) is to \( z \). This implies that \( \overline{f}(x) = \int{f(z) G(x-z) dz} \) is a new function, related to \( f(x) \), but smoother, because \( \overline{f} \) is averaging out the oscillations and extremes of \( f \).
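To make the smoothing-by-convolution idea concrete, here is a little numerical sketch (my own toy example, not anything from the historical literature): take a noisy function sampled on a grid, approximate the convolution integral \( \overline{f}(x) = \int{f(z) G(x-z) dz} \) by a Riemann sum with a Gaussian pdf as \( G \), and check that the result wiggles much less than the original. The bandwidth of 0.3 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# A noisy function f sampled on a grid (illustrative choice)
x = np.linspace(0, 10, 1000)
dz = x[1] - x[0]
f = np.sin(x) + rng.normal(scale=0.5, size=x.size)

def gaussian_pdf(u, bandwidth=0.3):
    """The kernel G(u): a Gaussian pdf; the bandwidth sets how much we smooth."""
    return np.exp(-u**2 / (2 * bandwidth**2)) / (bandwidth * np.sqrt(2 * np.pi))

# fbar(x) = integral of f(z) G(x - z) dz, approximated by a Riemann sum over the grid
fbar = np.array([np.sum(f * gaussian_pdf(xi - x)) * dz for xi in x])

# The smoothed curve has much smaller point-to-point wiggles than the raw one
print(np.std(np.diff(f)), np.std(np.diff(fbar)))
```

The smoothed curve tracks the underlying \( \sin(x) \), trading a little bias (the peaks are slightly flattened) for a large reduction in variance, exactly the trade the spectrum-estimation people were making.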
In the 1950s, the leading statisticians were, overwhelmingly, trained as mathematicians, and a lot of that mathematics was stuff which had grown out of mathematical physics, so a great deal of this mathematics was very familiar to them. (They were creating the graduate programs which would train statisticians as statisticians.) In particular, one cluster of statisticians who were doing (what we'd now call) signal processing were very interested in estimating the power spectra of radio (and other) signals --- how much energy was transmitted at each frequency. (This turns out to be very important for prediction and control.) The raw power spectra were extremely noisy, and the statisticians realized that by convolving those raw spectra with Gaussian (or other) kernels, they could get something much smoother and more stable, in effect trading some bias for a very large variance reduction. It was then quickly realized that the same idea could be used to estimate probability densities, convolving the empirical distribution with smooth kernels. And a few years after that, in 1964, Nadaraya and Watson (independently) realized that you could use this to do regression.
So the "kernel" in "kernel density estimation" and "kernel smoothing" is kernel in the sense of function-you-convolve-the-data-with.
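For concreteness, here is a minimal Nadaraya-Watson smoother on simulated data (again a toy example of my own; the bandwidth of 0.5 and the Gaussian weight function are arbitrary illustrative choices). The prediction at each point is just a kernel-weighted average of the observed responses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data: noisy observations of sin(x)
x_train = rng.uniform(0, 10, size=200)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=200)

def nadaraya_watson(x0, x, y, bandwidth=0.5):
    """Estimate the regression function at x0 as a weighted average of the
    observed y's, with Gaussian weights G(x0 - x_i)."""
    w = np.exp(-(x0 - x)**2 / (2 * bandwidth**2))
    return np.sum(w * y) / np.sum(w)

grid = np.linspace(1, 9, 50)
fit = np.array([nadaraya_watson(x0, x_train, y_train) for x0 in grid])
```

Note that the weights depend on the point \( x_0 \) where we predict; there is no fixed set of coefficients being estimated, which is part of what distinguishes this from the other kind of "kernel method" below.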
There is another set of methods, also called "kernel methods", which look rather different. Go back to how I defined integral operators: \[ (\mathcal{I}f)(x) = \int{f(z) K(x,z) dz} \] This is a linear transformation on functions: \( \mathcal{I}(af + bg) = a\mathcal{I}f + b \mathcal{I}g \), for any scalars \( a, b \) and functions \( f, g \). Just as multiplying by a matrix is a linear transformation on vectors, integral operators are (one kind of) linear transformation of functions. Just as matrices have eigenvectors, integral operators have eigenfunctions, where \[ \mathcal{I}\phi = \lambda \phi \] for some scalar \( \lambda \), the eigenvalue. (Eigen- is German again; roughly "self-" or "own-".) For some integral operators, i.e., for some kernels \( K(x,z) \), the eigenfunctions \( \phi_1, \phi_2, \ldots \) actually form a basis, meaning that, for any (well-behaved) function \( f \) \[ f(x) = \sum_{i=1}^{\infty}{c_i \phi_i(x)} \] (Alternately, we can always define a space of functions as "everything we can get by taking linear combinations of these eigenfunctions".)
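One can see the eigenfunctions numerically by discretizing: replace the integral by a Riemann sum on a grid, so the kernel becomes an \( n \times n \) matrix and the eigenfunctions become eigenvectors. A standard textbook example (my choice, not from the discussion above) is \( K(x,z) = \min(x,z) \) on \( [0,1] \), the Brownian-motion covariance kernel, whose eigenvalues are known to be \( 1/((k-\frac{1}{2})\pi)^2 \) with eigenfunctions proportional to \( \sin((k-\frac{1}{2})\pi x) \).

```python
import numpy as np

n = 500
x = (np.arange(n) + 0.5) / n        # midpoint grid on [0, 1]
Kmat = np.minimum.outer(x, x) / n   # the 1/n is the dz in the Riemann sum

# Eigenvectors of the discretized operator approximate the eigenfunctions
eigvals, eigvecs = np.linalg.eigh(Kmat)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort largest-first

# Compare the top eigenvalue with the known answer 1/((1/2) pi)^2
print(eigvals[0], 1 / (0.5 * np.pi)**2)
```

The leading eigenvector, plotted against the grid, traces out (a discretized, rescaled) \( \sin(\pi x / 2) \), as the theory says it should.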
Writing arbitrary functions as weighted sums of basis functions is a very old trick in math. Doing regression in terms of some set of basis functions is almost as old a trick in statistics. So the eigenfunctions of (nice) integral operators can be the basis functions we use for regression.
That doesn't single them out from any other set of basis functions, but, again in the 1950s, people realized that kernels which satisfy some properties (like symmetry and non-negativity) can themselves be expressed in terms of the eigenfunctions and eigenvalues: \[ K(x,z) = \sum_{i=1}^{\infty}{\lambda_i \phi_i(x) \phi_i(z)} \] The usual (and correct!) explanation is to think of mapping \( x \) to the vector of function values ("features") \( (\phi_1(x), \phi_2(x) , \ldots ) \); doing the same thing to \( z \); and then taking the (weighted) inner product between those vectors. As we say: "\( K(x,z) \) is an inner product in feature space". Any linear method you can write in terms of inner products thus has a "kernelized" equivalent. (For example.)
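The kernels in the main discussion have infinitely many eigenfunctions, but the "inner product in feature space" idea is easiest to check in a finite-dimensional case. For scalar inputs, the polynomial kernel \( K(x,z) = (1+xz)^2 \) expands as \( 1 + 2xz + x^2 z^2 \), which is exactly the inner product of the explicit feature vectors \( (1, \sqrt{2}x, x^2) \) and \( (1, \sqrt{2}z, z^2) \) (a standard example, not one used above):

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (1 + x z)^2 with scalar inputs:
    (1 + x z)^2 = 1 + 2 x z + x^2 z^2 = <(1, sqrt(2) x, x^2), (1, sqrt(2) z, z^2)>."""
    return np.array([1.0, np.sqrt(2) * x, x**2])

def K(x, z):
    return (1.0 + x * z)**2

x, z = 0.7, -1.3
print(K(x, z), phi(x) @ phi(z))  # the two numbers agree
```

Evaluating \( K \) takes a couple of multiplications, however high-dimensional the feature space is; that is the whole appeal of the trick.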
This also implies that a weighted sum of kernel functions is equivalent to a weighted sum of eigenfunctions: \[ \sum_{i=1}^{n}{a_i K(x, x_i)} = \sum_{j=1}^{\infty}{b_j \phi_j(x)} \] (EXERCISE: Find an expression for \( b_j \).)
So doing a weighted sum of kernel functions (in this sense) is equivalent to doing an infinite weighted sum of these eigenfunctions (which form a basis). In fact, a lot of the time we can show that the optimal function of the form \( \sum_{j=1}^{\infty}{b_j \phi_j(x)} \) can in fact be written in the form \( \sum_{i=1}^{n}{a_i K(x, x_i)} \), so that we really only have a finite-dimensional optimization problem. (Such a result is called a "representer theorem".) This was important when people, like the statistician Grace Wahba, worked out the mathematical details of smoothing splines. (One can actually write out the kernel, in this sense, that's implicit in spline smoothing, but I do not find it very illuminating.) (**)
What came to be called "kernel methods" or "kernel machines", in the 1990s, were predictive models of the form \( \sum_{i=1}^{n}{a_i K(x, x_i)} \), where, again, \( K \) is one of those two-argument kernels which lead to nice eigenfunctions and eigenvalues when plugged into an integral operator. The goal wasn't really smoothing (as it was with the convolution methods), but to get the power of using a huge --- even an infinite! --- set of basis functions, without having to explicitly estimate a huge set of coefficients, or evaluate a huge set of functions when making predictions. People talked about "the kernel trick" as this way of using two-argument kernels to implicitly use vast function spaces, without explicitly calculating the functions.
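The simplest worked instance of this is kernel ridge regression, where penalizing the squared norm of the coefficients plus a representer theorem reduces everything to solving one \( n \times n \) linear system for the weights \( \alpha \). A toy sketch (my own example; the Gaussian/RBF kernel, its scale of 1, and the penalty \( \lambda = 0.1 \) are all arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: noisy observations of sin(x)
x_train = rng.uniform(0, 10, size=100)
y_train = np.sin(x_train) + rng.normal(scale=0.3, size=100)

def K(a, b, scale=1.0):
    """Gaussian (RBF) kernel matrix between two sets of scalar inputs,
    here playing the inner-product-in-feature-space role."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * scale**2))

# Kernel ridge regression: alpha = (K + lambda I)^{-1} y
lam = 0.1
alpha = np.linalg.solve(K(x_train, x_train) + lam * np.eye(len(x_train)), y_train)

def predict(x_new):
    # R_K(x) = sum_i alpha_i K(x, x_i)
    return K(np.atleast_1d(x_new), x_train) @ alpha

grid = np.linspace(1, 9, 50)
fit = predict(grid)
```

Note the contrast with the Nadaraya-Watson form: here the \( \alpha_i \) are fixed coefficients estimated once from the data, not weights recomputed at each prediction point.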
There are some kernels, in this sense, which also work as kernels, in the smoothing sense (e.g., Gaussians). But the two sets of kernels are distinct, and the two sets of methods are really distinct. If you do kernel smoothing with a Gaussian kernel, there is just no way to write that as a sum of Gaussian kernels with fixed weights. In general, kernel smoothing regression gives us predictions of the form \[ S_G(x) = \sum_{i=1}^{n}{\frac{G(x-x_i)}{\sum_{j=1}^{n}{G(x-x_j)}} y_i} \] when we've seen data points \( (x_1, y_1), \ldots, (x_n, y_n) \). In contrast, kernel regression in the implicit-function-expansion sense gives us predictions of the form \[ R_K(x) = \sum_{i=1}^{n}{\alpha_i K(x, x_i)} \] It is an easy EXERCISE to show that if \( K(x, x_i) = G(x-x_i) \), and \( G(u) \) is a (non-degenerate) pdf, then there is no set of weights \( \alpha_1, \ldots, \alpha_n \) which will make \( S_G(x) = R_G(x) \) for all \( x \). It is a harder EXERCISE to show that there is no combination of weights and two-argument kernel \( K \) which will make \( S_G(x) = R_K(x) \) for all \( x \).
If, back in the 1950s, the one line of work had talked about "convolutional smoothing" and the other "implicit basis function expansions" (or something like that), we would not have this confusion.
Update, 6 March 2026: On the other hand, at least the convolutional smoothing people did not give their technique a misleadingly psychological name...
*: Norbert Wiener, The Fourier Integral and Certain of Its Applications (Cambridge, England: Cambridge University Press, 1933), p. 45. ^
**: The oldest example I have run across of using kernel methods (in this sense) in statistics is Emanuel Parzen's work (1960/1963, 1961) on time series analysis, where it was motivated, in part, as a way around having to estimate power spectra; despite Parzen's eminence in statistics, this approach does not seem to have been much used, at least not then. (I'm sure it's no coincidence that Parzen was Wahba's doctoral adviser!) What makes this extra curious is that Parzen also wrote a 1962 paper where he (more-or-less) introduced convolutional kernel density estimation (in which he cites prior work on estimating power spectra), and that did take off. Someone with enough knowledge of mathematics and statistics, and access to his archives, could write a history-of-science paper on what led Parzen, in particular, to introduce both kinds of kernel methods into statistics. This would fascinate, oh, easily a dozen people other than myself. ^
Posted at February 27, 2026 11:46 | permanent link