Statistical Complexity of Link Prediction

Attention conservation notice: Link to a two-page fragment of a mathematical paper that was abandoned over two decades ago.
I had not one, but two mentors in graduate school who, when approached with an idea or a question, were apt to go to their filing cabinet and pull out notes, from years or even decades back, which addressed that very issue, or one very close to it, and which they had never gotten around to finishing. (G. and C. had never met, and otherwise had little in common.) As a juvenile scientist, I found this both humbling and intensely irritating: why didn't they publish? As a middle-aged professor with a big directory of partially-finished projects, I have more sympathy, but still want to do better myself. I am therefore going to try to make a point of posting a lot of my fragments, and just ask that if someone decides to build on one, they put me in the acknowledgments.

In 2004, because I was thinking a lot about the statistical complexity of predicting stochastic processes, and learning about networks from Mark, Cris and Aaron, I tried my hand at defining statistical complexity for link prediction. The resulting complexity measure seemed straightforward-in-principle but too hard to calculate for anything very interesting (except maybe exponential-family random graphs, where it ends up being the entropy of increments to the minimal sufficient statistics).

In the ensuing 22 years, I have done literally nothing with the idea, but something reminded me of it the other day, so here's the two-page fragment I abandoned in March 2004.

The one thing I would add to the fragment is to consider a stochastic block model, where each node $ i $ has a latent discrete random variable $ X_i $, IIDly across nodes; $ X_i $ says which "block" (or "community" or "module", etc.) node $ i $ lives in. The probability of an edge between nodes $ i $ and $ j $ is a function of $ X_i $ and $ X_j $ alone, independent of all other dyads or anything else. The pair $ (X_i, X_j) $ is thus the "state" which fixes the distribution of the dyad. Of course, as pair of latent variables, this is not a statistic, a function of the observable graph $ G $. But there are many circumstances where, as we see larger and larger graphs, we can infer all the $ X_i $ from $ G $, with the probability of making any errors tending to zero. (Ed McFowland and I tried to summarize those conditions in our paper, because we wanted to use them as tools for something else.) In these situations, the sufficient statistic for predicting $ G_{ij} $ from the rest of the graph will in fact tend towards $ (X_i, X_j) $ as $ n \rightarrow \infty $, and so the limiting statistical forecasting complexity will be at most $ 2 H[X_i] $ (since $ X_i $ and $ X_j $ are IID). I say "at most" because there could be situations where distinct pairs of blocks have the same edge probability; if that's ruled out, the asymptotic statistical complexity will indeed be twice the entropy of the block variable for one node.

The same argument could extend to any graphon, if the node variables can be asymptotically recovered from the observed graph.

Enigmas of Chance; Networks; Complexity

Posted at March 06, 2026 21:06 | permanent link

Three-Toed Sloth

March 06, 2026

Statistical Complexity of Link Prediction