2026

Search and Increasing Returns, or, No One Makes You Push to Github

Attention conservation notice: An economistic argument that computer networks are nigh-doomed to effective centralization, by someone who is neither an economist nor a computer scientist. Arcane, speculative, and not actionable even if correct. You would be better off spending your time reading a book.
Drafted in 2018, and deliberately not much updated. Posted now because I found myself re-using the joke of the subtitle in an e-mail.

Twitter is awful for many reasons, but not the least of them is the way it makes its users feel forced to keep using it. (One such complaint among many.) Early visions of how it could contribute to a beneficient or at least harmless ecosystem, such Steven Berlin Johnson's (*), presumed that it would be much less sticky, a more old-fashioned website people could leave. Now, more recently Johnson has written a spectacularly wrong-headed piece hailing blockchain as a blow for re-decentralizing the Web.

Of course, the primary working example of a block chain is Bitcoin, which is heavily centralized in holdings, in processing power, and in exchanges. And there are economies-of-scale reasons to expect this would be true of anything using either proof of work or proof of stake. But suppose the problem with Bitcoin is just that it's not got a very compelling use:

imagine if keeping your car idling 24/7 produced solved Sudokus you could trade for heroin --- @Theophite, 16 August 2018

and that nobody has a use for blockchains, not really.

Git repositories, on the other hand, are the part of the blockchain idea that's actually good for something: a way of tracking changes, with many authors, where it is very, very hard to go back and alter history without being caught. And git repositories are things that are in-principle easy to copy and move around, and could be hosted on any machine running HTTP (or, heck, FTP). So why does Github exist, let alone hold the position it does?

The answer, I think, comes in two parts. The first is that focal points are subject to increasing returns via network effects. The second is that search engines create and/or amplify focal points. Put these two together, and you get a very strong tendency for any particular line of activity --- be it trading solved sudokus for heroin or open-source software development --- to concentrate in just one, or at most a few, online locations.

Incidentally: This doesn't necessarily lock in first-mover advantage, because other forces can overcome this (revulsion, technical superiority), but it does mean that it will be very hard to avoid having at most a few dominant locations at any one time. It also means that the transitions between dominant locations will be brief.

If we really wanted to re-decentralize, we'd have to (1) get rid of focal points, or (2) get rid of search engines as we've known them, or (3) somehow make search point to a distributed, decentralized focal "point" (focal blob? focal rhizome?). I therefore strongly suspect we are not going to re-decentralize. But then I didn't expect the Web would centralize as much as it has, despite having literally learned my Brian Arthur and Paul David at my father's knee, so what do I know?

*: To be clear, I have been a fan of Johnson's books since Interface Culture and Emergence. I am picking on two of his essays, but that's because I think he is too ready to find encouraging signs for a certain vision of how the Internet could transform the world for the better. To be clear, I share the vision, but now hold out little hope for it.

The Dismal Science; Networks; Linkage

Posted at March 06, 2026 22:50 | permanent link

Statistical Complexity of Link Prediction

Attention conservation notice: Link to a two-page fragment of a mathematical paper that was abandoned over two decades ago.
I had not one, but two mentors in graduate school who, when approached with an idea or a question, were apt to go to their filing cabinet and pull out notes, from years or even decades back, which addressed that very issue, or one very close to it, and which they had never gotten around to finishing. (G. and C. had never met, and otherwise had little in common.) As a juvenile scientist, I found this both humbling and intensely irritating: why didn't they publish? As a middle-aged professor with a big directory of partially-finished projects, I have more sympathy, but still want to do better myself. I am therefore going to try to make a point of posting a lot of my fragments, and just ask that if someone decides to build on one, they put me in the acknowledgments.

In 2004, because I was thinking a lot about the statistical complexity of predicting stochastic processes, and learning about networks from Mark, Cris and Aaron, I tried my hand at defining statistical complexity for link prediction. The resulting complexity measure seemed straightforward-in-principle but too hard to calculate for anything very interesting (except maybe exponential-family random graphs, where it ends up being the entropy of increments to the minimal sufficient statistics).

In the ensuing 22 years, I have done literally nothing with the idea, but something reminded me of it the other day, so here's the two-page fragment I abandoned in March 2004.

The one thing I would add to the fragment is to consider a stochastic block model, where each node $ i $ has a latent discrete random variable $ X_i $, IIDly across nodes; $ X_i $ says which "block" (or "community" or "module", etc.) node $ i $ lives in. The probability of an edge between nodes $ i $ and $ j $ is a function of $ X_i $ and $ X_j $ alone, independent of all other dyads or anything else. The pair $ (X_i, X_j) $ is thus the "state" which fixes the distribution of the dyad. Of course, as pair of latent variables, this is not a statistic, a function of the observable graph $ G $. But there are many circumstances where, as we see larger and larger graphs, we can infer all the $ X_i $ from $ G $, with the probability of making any errors tending to zero. (Ed McFowland and I tried to summarize those conditions in our paper, because we wanted to use them as tools for something else.) In these situations, the sufficient statistic for predicting $ G_{ij} $ from the rest of the graph will in fact tend towards $ (X_i, X_j) $ as $ n \rightarrow \infty $, and so the limiting statistical forecasting complexity will be at most $ 2 H[X_i] $ (since $ X_i $ and $ X_j $ are IID). I say "at most" because there could be situations where distinct pairs of blocks have the same edge probability; if that's ruled out, the asymptotic statistical complexity will indeed be twice the entropy of the block variable for one node.

The same argument could extend to any graphon, if the node variables can be asymptotically recovered from the observed graph.

Enigmas of Chance; Networks; Complexity

Posted at March 06, 2026 21:09 | permanent link

How Statistics and Machine Learning Came To Have Two Different Kinds of Kernel Methods

Attention conservation notice: Re-purposed teaching materials, about a confusing point of terminology in two arcane disciplines. This is amateur history of science, which will not help you learn or practice those disciplines, even if you wanted to (which you don't). Also, it was written for an advanced undergraduate class in one of those disciplines, and presumes some familiarity with the jargon and concepts.
I have written several versions of this over the years, for various classes; this one is from 18 February 2026, for a class which emphasized (Nadaraya-Watson) kernel smoothing, splines, and kernel density estimation. Posted now ~~for lack of other material~~ to refer to in the future, rather than re-writing yet again.

TL;DR: In statistics and machine learning, "kernel methods" refer to two different families of methods, one based on convolution, the other on hiding a basis expansion in the guise of a sum over data points. These use two different (but overlapping) sets of kernels. In both cases, the name "kernel" comes from some problems involving integrals in mathematical physics. (TL;DR of the TL;DR: Blame the mathematicians.)

"Kernel" is one of those terms which is used in many distinct-but-related senses across different areas of mathematics (like "normal"). The common metaphor is "the seed (in some sense) from which some larger object or structure grows (in some sense)". In particular, in physics, there are a lot of problems which involve integral operators $ \mathcal{I} $ that map one function, say $ f $, to a new function $ \mathcal{I} f $, by the relationship \[ (\mathcal{I}f)(x) = \int{f(z) K(x, z) dz} \] The inner function $ K(x,z) $ is called the kernel of the operator. (Or at least that's what it came to be called in English; a lot of this was first worked out in German, in the ~~1800s, and I don't know the original German technical term~~ very early 1900s, and the German word was Kern. [See update below.])

A particularly important class of integral operators, both in physics and in a lot of other fields, take the form \[ \int{f(z) G(x-z) dz} \] That is, the kernel isn't a two-argument function, of $ x $ and $ z $, but another one-argument function, only involving the difference between $ x $ and $ z $, say $ u = x-z $. This came to be called (in English) the convolution of the two functions $ f $ and $ G $. (The German original was Faltung; as late as 1933, an American mathematician writing about this uses that term because "there is no good English word" (*). I am told one would ordinarily translate this as something like "folding".) It was recognized a long time ago that if $ G(u) $ is a probability density function (pdf), then $ \int{f(z) G(x-z) dz} $ gives a weighted average of all the values of $ f $, with the weight given to $ f(z) $ depending on how close $ x $ is to $ z $. This implies that $ \overline{f}(x) = \int{f(z) G(x-z) dz} $ is a new function, related to $ f(x) $, but smoother, because $ \overline{f} $ is averaging out the oscillations and extremes of $ f $.

In the 1950s, the leading statisticians were, overwhelmingly, trained as mathematicians, and a lot of that mathematics was stuff which had grown out of mathematical physics, so a great deal of this mathematics was very familiar to them. (They were creating the graduate programs which would train statisticians as statisticians.) In particular, one cluster of statisticians who were doing (what we'd now call) signal processing were very interested in estimating the power spectra of radio (and other) signals --- how much energy was transmitted at each frequency. (This turns out to be very important for prediction and control.) The raw power spectra were extremely noisy, and the statisticians realized that by convolving those raw spectra with Gaussian (or other) kernels, they could get something much smoother and more stable, in effect trading some bias for a very large variance reduction. It was then quickly realized that the same idea could be used to estimate probability densities, convolving the empirical distribution with smooth kernels. And a few years after that, in 1964, Nadaraya and Watson (independently) realized that you could use this to do regression.

So the "kernel" in "kernel density estimation" and "kernel smoothing" is kernel in the sense of function-you-convolve-the-data-with.

There is another set of methods, also called "kernel methods", which look rather different. Go back to how I defined integral operators: \[ (\mathcal{I}f)(x) = \int{f(z) K(x,z) dz} \] This is a linear transformation on functions: $ \mathcal{I}(af + bg) = a\mathcal{I}f + b \mathcal{I}g $, for any scalars $ a, b $ and functions $ f, g $. Just as multiplying by a matrix is a linear transformation on vectors, integral operators are (one kind of) linear transformation of functions. Just as matrices have eigenvectors, integral operators have eigenfunctions, where \[ \mathcal{I}\phi = \lambda \phi \] for some scalar $ \lambda $, the eigenvalue. (Eigen- is German again; roughly "self-" or "own-".) For some integral operators, i.e., for some kernels $ K(x,z) $, the eigenfunctions $ \phi_1, \phi_2, \ldots $ actually form a basis, meaning that, for any (well-behaved) function $ f $ \[ f(x) = \sum_{i=1}^{\infty}{c_i \phi_i(x)} \] (Alternately, we can always define a space of functions as "everything we can get by taking linear combinations of these eigenfunctions".)

Writing arbitrary functions as weighted sums of basis functions is a very old trick in math. Doing regression in terms of some set of basis functions is almost as old as a trick in statistics. So the eigenfunctions of (nice) integral operators can be the basis functions we use for regression.

That doesn't single them out from any other set of basis functions, but, again in the 1950s, people realized that kernels which satisfy some properties (like symmetry and non-negativity) can themselves be expressed in terms of the eigenfunctions and eigenvalues: \[ K(x,z) = \sum_{i=1}^{\infty}{\lambda_i \phi_i(x) \phi_i(z)} \] The usual (and correct!) explanation is to think of mapping $ x $ to the vector of function values ("features") $ (\phi_1(x), \phi_2(x) , \ldots ) $; doing the same thing to $ z $; and then taking the (weighted) inner product between those vectors. As we say: "$ K(x,z) $ is an inner product in feature space". Any linear method you can write in terms of inner products thus has a "kernelized" equivalent. (For example.)

This also implies that a weighted sum of kernel functions is equivalent to a weighted sum of eigenfunctions: \[ \sum_{i=1}^{n}{a_i K(x, x_i)} = \sum_{j=1}^{\infty}{b_j \phi_j(x)} \] (EXERCISE: Find an expression for $ b_j $.)

So doing a weighted sum of kernel functions (in this sense) is equivalent to doing an infinite weighted sum of these eigenfunctions (which form a basis). In fact, a lot of the time we can show that optimal function of the form $ \sum_{j=1}^{\infty}{b_j \phi_j(x)} $ can in fact be written in the form $ \sum_{i=1}^{n}{a_i K(x, x_i)} $, so that we really only have a finite-dimensional optimization problem. (Such a result is called a "representer theorem".) This was important when people, like the statistician Grace Wahba, worked out the mathematical details of smoothing splines. (One can actually write out the kernel, in this sense, that's implicit in spline smoothing, but I do not find it very illuminating.) (**)

What came to be called "kernel methods" or "kernel machines", in the 1990s, were predictive models of the form $ \sum_{i=1}^{n}{a_i K(x, x_i)} $, where, again, $ K $ is one of those two-argument kernels which lead to nice eigenfunctions and eigenvalues when plugged into an integral operator. The goal wasn't really smoothing (as it was with the convolution methods), but to get the power of using a huge --- even an infinite! --- set of basis functions, without having to explicitly estimate a huge set of coefficients, or evaluate a huge set of functions when making predictions. People talked about "the kernel trick" as this way of using two-argument kernels to implicitly use vast function spaces, without explicitly calculating the functions.

There are some kernels, in this sense, which also work as kernels, in the smoothing sense (e.g., Gaussians). But the two sets of kernels are distinct, and the two sets of methods are really distinct. If you do kernel smoothing with a Gaussian kernel, there is just no way to write that as a sum of Gaussian kernels with fixed weights. In general, kernel smoothing regression give us predictions of the form \[ S_G(x) = \sum_{i=1}^{n}{\frac{G(x-x_i)}{\sum_{j=1}^{n}{G(x-x_j)}} y_i} \] when we've seen data points $ (x_1, y_1), \ldots (x_n, y_n) $. In contrast, kernel regression in the implicit-function-expansion sense gives us predictions of the form \[ R_K(x) = \sum_{i=1}^{n}{\alpha_i K(x, x_i)} \] It is an easy EXERCISE to show that if $ K(x, x_i) = G(x-x_i) $, and $ G(u) $ is a (non-degenerate) pdf, then there is no set of weights $ \alpha_1, \ldots \alpha_n $ which will make $ S_G(x) = R_G(x) $ for all $ x $. It is a harder EXERCISE to show that there is no combination of weights and two-argument kernel $ K $ which will make $ S_G(x) = R_K(x) $ for all $ x $.

If, back in the 1950s, the one line of work had talked about "convolutional smoothing" and the other "implicit basis function expansions" (or something like that), we would not have this confusion.

Update, 6 March 2026: On the other hand, at least the convolutional smoothing people did not give their technique a misleadingly psychological name...

Updates, 11 March 2025:

Consulting the German original of Courant and Hilbert shows that the German term was Kern --- or at least, that's what was by 1924. (I should, of course, have thought to check this source before, but I didn't realize it was online!) If I can trust some German-English dictionaries, the primary meaning is something like "the edible part inside a hard-shelled nut or the skin of a fruit", which is very similar to the most basic meaning of the English "kernel". Turning to their citations, Courant and Hilbert refer to a fundamental 1903 publication on integral equations (and so on integral transforms) by Fredholm. This paper does not (so far as I can tell) ever introduce a name for what we'd call the kernel, it's «la fonction $ f(x,y) $» throughout. (Certainly «noyau», the modern French term, does not appear in Fredholm's paper.) Courant and Hilbert also cite Maxime Boôcher's 1909 Introduction to the Theory of Integral Equations (written in English), which flatly states (p. 13) "$ K $ is called the kernel of these equations", and adds, in a footnote, that the term was "first employed by Hilbert" in 1904 (in German). So "kernel" would seem to have been fixed as the English translation of the German Kern (in this context) some time between 1904 and 1909.
The oldest appearance of "kernel" in the Annals of Mathematical Statistics is indeed Parzen (1961). By 1967, people proposing other modes of density estimation refer to the convolutional method as the "'kernel' method" or "'kernel' technique" (with the scare quotes), indicating that the term was in use but still novel.

*: Norbert Wiener, The Fourier Integral and Certain of Its Applications (Cambridge, England: Cambridge University Press, 1933), p. 45. ^

**: The oldest example I have run across of using kernel methods (in this sense) in statistics is Emanuel Parzen's work (1960/1963, 1961) on time series analysis, where it was motivated, in part, as a way around having to estimate power spectra; despite Parzen's eminence in statistics, this approach does not seem to have been much used, at least not then. (I'm sure it's no coincidence that Parzen was Wahba's doctoral adviser!) What makes this extra curious is that Parzen also wrote a 1962 paper where he (more-or-less) introduced convolutional kernel density estimation (in which he cites prior work on estimating power spectra), and that did take off. Someone with enough knowledge of mathematics and statistics, and access to his archives, could write a history-of-science paper on what led Parzen, in particular, to introduce both kinds of kernel methods into statistics. This would fascinate, oh, easily a dozen people other than myself. ^

Enigmas of Chance; Mathematics

Posted at February 27, 2026 11:46 | permanent link

Three-Toed Sloth

March 20, 2026

"Aware of All Internet Traditions: Generative AI as Information Retrieval and Synthesis" (Verbatim Remarks at the Cultural AI Workshop)

March 06, 2026

Search and Increasing Returns, or, No One Makes You Push to Github

Statistical Complexity of Link Prediction

February 27, 2026

How Statistics and Machine Learning Came To Have Two Different Kinds of Kernel Methods