2026 03

Search and Increasing Returns, or, No One Makes You Push to Github

Attention conservation notice: An economistic argument that computer networks are nigh-doomed to effective centralization, by someone who is neither an economist nor a computer scientist. Arcane, speculative, and not actionable even if correct. You would be better off spending your time reading a book.
Drafted in 2018, and deliberately not much updated. Posted now because I found myself re-using the joke of the subtitle in an e-mail.

Twitter is awful for many reasons, but not the least of them is the way it makes its users feel forced to keep using it. (One such complaint among many.) Early visions of how it could contribute to a beneficent or at least harmless ecosystem, such as Steven Berlin Johnson's (*), presumed that it would be much less sticky, a more old-fashioned website people could leave. More recently, Johnson has written a spectacularly wrong-headed piece hailing blockchain as a blow for re-decentralizing the Web.

Of course, the primary working example of a blockchain is Bitcoin, which is heavily centralized in holdings, in processing power, and in exchanges. And there are economies-of-scale reasons to expect this would be true of anything using either proof of work or proof of stake. But suppose the problem with Bitcoin is just that it's not got a very compelling use:

imagine if keeping your car idling 24/7 produced solved Sudokus you could trade for heroin --- @Theophite, 16 August 2018

and that nobody has a use for blockchains, not really.

Git repositories, on the other hand, are the part of the blockchain idea that's actually good for something: a way of tracking changes, with many authors, where it is very, very hard to go back and alter history without being caught. And git repositories are things that are in-principle easy to copy and move around, and could be hosted on any machine running HTTP (or, heck, FTP). So why does Github exist, let alone hold the position it does?
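As an aside on the tamper-evidence property just mentioned, here is a minimal sketch of a hash chain in Python. It is a deliberate simplification: real git hashes richer commit objects (trees, authors, timestamps), and the names `commit_id` and `chain_ids` are made up for this illustration. The only point is that each identifier depends on its parent's identifier, so altering an early entry changes every later one.

```python
import hashlib


def commit_id(parent_id, content):
    """Hash the parent's id together with this commit's content."""
    payload = f"{parent_id}\n{content}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def chain_ids(contents):
    """Return the id of every commit in a linear history, oldest first."""
    ids = []
    parent = ""  # the first commit has no parent
    for content in contents:
        parent = commit_id(parent, content)
        ids.append(parent)
    return ids


history = ["initial import", "fix typo", "add feature"]
tampered = ["initial import (altered)", "fix typo", "add feature"]

original_ids = chain_ids(history)
tampered_ids = chain_ids(tampered)

# Only the first commit's content changed, but every later id changes too,
# so anyone holding the original head id can detect the rewrite.
changed = [a != b for a, b in zip(original_ids, tampered_ids)]
print(changed)
```

Nothing in this construction needs a central host, which is what makes the question that closes the paragraph above a real one.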

The answer, I think, comes in two parts. The first is that focal points are subject to increasing returns via network effects. The second is that search engines create and/or amplify focal points. Put these two together, and you get a very strong tendency for any particular line of activity --- be it trading solved sudokus for heroin or open-source software development --- to concentrate in just one, or at most a few, online locations.
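To make the increasing-returns half of that argument concrete, here is a toy simulation in Python. It is my own illustrative assumption, not a model fitted to anything: newcomers pick one of several sites with probability proportional to (current users)^alpha, a nonlinear urn scheme of the general kind used in the increasing-returns literature. The function names, the five sites, and the parameter values are all hypothetical. With alpha = 0 choices ignore popularity; with alpha > 1 popularity feeds on itself.

```python
import random


def simulate_shares(n_sites, n_arrivals, alpha, seed):
    """Final user shares when each arrival picks a site with
    probability proportional to (current users) ** alpha."""
    rng = random.Random(seed)
    counts = [1] * n_sites  # every site starts with one user
    for _ in range(n_arrivals):
        weights = [c ** alpha for c in counts]
        chosen = rng.choices(range(n_sites), weights=weights)[0]
        counts[chosen] += 1
    total = sum(counts)
    return [c / total for c in counts]


def mean_top_share(alpha, n_runs=20):
    """Average share of the largest site over several seeded runs."""
    tops = [max(simulate_shares(5, 2000, alpha, seed))
            for seed in range(n_runs)]
    return sum(tops) / n_runs


# Compare popularity-blind choice with self-reinforcing choice.
print("alpha = 0:", round(mean_top_share(0.0), 3))
print("alpha = 2:", round(mean_top_share(2.0), 3))
```

Under these assumptions the popularity-blind runs leave the sites roughly equal, while the self-reinforcing runs end with one site holding most of the users. Which site wins varies from seed to seed, which is the sense in which the outcome is concentrated without being predetermined.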

Incidentally: This doesn't necessarily lock in first-mover advantage, because other forces can overcome this (revulsion, technical superiority), but it does mean that it will be very hard to avoid having at most a few dominant locations at any one time. It also means that the transitions between dominant locations will be brief.

If we really wanted to re-decentralize, we'd have to (1) get rid of focal points, or (2) get rid of search engines as we've known them, or (3) somehow make search point to a distributed, decentralized focal "point" (focal blob? focal rhizome?). I therefore strongly suspect we are not going to re-decentralize. But then I didn't expect the Web would centralize as much as it has, despite having literally learned my Brian Arthur and Paul David at my father's knee, so what do I know?

*: To be clear, I have been a fan of Johnson's books since Interface Culture and Emergence. I am picking on two of his essays, but that's because I think he is too ready to find encouraging signs for a certain vision of how the Internet could transform the world for the better. I share the vision, but now hold out little hope for it.

The Dismal Science; Networks; Linkage

Posted at March 06, 2026 22:50 | permanent link

Statistical Complexity of Link Prediction

Attention conservation notice: Link to a two-page fragment of a mathematical paper that was abandoned over two decades ago.
I had not one, but two mentors in graduate school who, when approached with an idea or a question, were apt to go to their filing cabinet and pull out notes, from years or even decades back, which addressed that very issue, or one very close to it, and which they had never gotten around to finishing. (G. and C. had never met, and otherwise had little in common.) As a juvenile scientist, I found this both humbling and intensely irritating: why didn't they publish? As a middle-aged professor with a big directory of partially-finished projects, I have more sympathy, but still want to do better myself. I am therefore going to try to make a point of posting a lot of my fragments, and just ask that if someone decides to build on one, they put me in the acknowledgments.

In 2004, because I was thinking a lot about the statistical complexity of predicting stochastic processes, and learning about networks from Mark, Cris and Aaron, I tried my hand at defining statistical complexity for link prediction. The resulting complexity measure seemed straightforward-in-principle but too hard to calculate for anything very interesting (except maybe exponential-family random graphs, where it ends up being the entropy of increments to the minimal sufficient statistics).

In the ensuing 22 years, I have done literally nothing with the idea, but something reminded me of it the other day, so here's the two-page fragment I abandoned in March 2004.

The one thing I would add to the fragment is to consider a stochastic block model, where each node $ i $ has a latent discrete random variable $ X_i $, IIDly across nodes; $ X_i $ says which "block" (or "community" or "module", etc.) node $ i $ lives in. The probability of an edge between nodes $ i $ and $ j $ is a function of $ X_i $ and $ X_j $ alone, independent of all other dyads or anything else. The pair $ (X_i, X_j) $ is thus the "state" which fixes the distribution of the dyad. Of course, as a pair of latent variables, this is not a statistic, i.e., a function of the observable graph $ G $. But there are many circumstances where, as we see larger and larger graphs, we can infer all the $ X_i $ from $ G $, with the probability of making any errors tending to zero. (Ed McFowland and I tried to summarize those conditions in our paper, because we wanted to use them as tools for something else.) In these situations, the sufficient statistic for predicting $ G_{ij} $ from the rest of the graph will in fact tend towards $ (X_i, X_j) $ as $ n \rightarrow \infty $, and so the limiting statistical forecasting complexity will be at most $ 2 H[X_i] $ (since $ X_i $ and $ X_j $ are IID). I say "at most" because there could be situations where distinct pairs of blocks have the same edge probability; if that's ruled out, the asymptotic statistical complexity will indeed be twice the entropy of the block variable for one node.
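A minimal numerical sketch of that bound, in Python, under assumptions of my own: entropies are in bits, the pair $ (X_i, X_j) $ is treated as ordered, and the "state" is taken to be the set of block pairs sharing one edge probability. The function names and the example matrices are hypothetical, chosen only to show the two extremes.

```python
import math
from collections import defaultdict


def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def complexity_upper_bound(block_probs):
    """The bound 2 H[X_i] for IID block labels."""
    return 2 * entropy_bits(block_probs)


def edge_state_entropy(block_probs, edge_probs):
    """Entropy after merging ordered block pairs that share an edge
    probability (an assumed reading of the 'at most' caveat)."""
    mass = defaultdict(float)
    for a, pa in enumerate(block_probs):
        for b, pb in enumerate(block_probs):
            mass[edge_probs[a][b]] += pa * pb
    return entropy_bits(mass.values())


block_probs = [0.5, 0.5]            # two equally likely blocks
distinct = [[0.9, 0.1],
            [0.2, 0.7]]             # every block pair has its own edge probability
degenerate = [[0.3, 0.3],
              [0.3, 0.3]]           # all pairs alike: blocks tell us nothing

print(complexity_upper_bound(block_probs))
print(edge_state_entropy(block_probs, distinct))
print(edge_state_entropy(block_probs, degenerate))
```

When every pair of blocks has its own edge probability, the merged-state entropy matches the bound; when all pairs share one probability, it collapses to zero, since no block information helps predict the dyad.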

The same argument could extend to any graphon, if the node variables can be asymptotically recovered from the observed graph.

Enigmas of Chance; Networks; Complexity

Posted at March 06, 2026 21:09 | permanent link

Three-Toed Sloth
