Notebooks

## Graph Limits and Infinite Exchangeable Arrays

27 Aug 2019 12:39

An exchangeable random sequence $X$ is a sequence of random variables, $X_1, X_2, \ldots$, whose distribution is invariant under permutation of the indices. All such sequences are formed by taking mixtures of independent and identically-distributed (IID) sequences. (See Exchangeable Random Sequences.) An exchangeable random array, $G$, is simply a matrix or array of random variables $G_{ij}$ whose distribution is invariant under permutation of row and column indices ($i$ and $j$). I mostly care about what are sometimes called jointly exchangeable arrays, where the same permutation is applied to the rows and the columns. If we can apply different permutations to rows and to columns and still get invariance in distribution, then the process is separately exchangeable; this however does not interest me so much, for reasons which I hope will be clear in a moment.

As I said, infinite-dimensional exchangeable distributions are all formed by taking mixtures of certain basic or extremal distributions, which are the infinite-dimensional IID distributions. To generate an exchangeable sequence, one first randomly draws a probability law from some prior distribution, and then draws from that law independently until the end of time. (Again, see Exchangeable Random Sequences.) Is there an analogous set of extremal distributions for exchangeable arrays? Well, yes, or else I wouldn't have asked the question...

It's easiest to understand what's going on if we restrict ourselves to binary arrays, so $G_{ij}$ must be either 0 or 1. One very important instance of this — or at least one I use a lot — makes $G$ the adjacency matrix (or "sociomatrix") of a network, with $G_{ij} = 1$ if there is an edge linking $i$ and $j$, and $G_{ij}=0$ otherwise.

For each $i$, draw an independent random number $U_i$ uniformly on the unit interval. Now, separately, fix a function $w(u,v)$ from the unit square ${[0,1]}^2$ to the unit interval $[0,1]$, with the symmetry $w(u,v) = w(v,u)$. Finally, set $G_{ij} = 1$ with probability $w(U_i, U_j)$, independently across dyads $ij$. Conditional on the $U_i$, all edges are now independent, though not identically distributed (unless $w$ is constant almost everywhere). Marginally, $G_{ij}$ and $G_{kl}$ are independent when the index pairs share no node — though they can become dependent given, say, $G_{jk}$ — but edges with a node in common are dependent, again unless $w$ is constant almost everywhere. Call the resulting stochastic graph $G$ a $w$-random graph.
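
This recipe is short enough to write down directly. A minimal sketch in NumPy — the function name `sample_w_random_graph` and the example $w(u,v) = uv$ are my own illustrations, not anything standard:

```python
import numpy as np

def sample_w_random_graph(w, n, rng=None):
    """Sample an n-node w-random graph: one latent uniform per node,
    then independent edges with probability w(U_i, U_j)."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=n)                    # latent U_i for each node
    probs = w(u[:, None], u[None, :])          # w(U_i, U_j) for every pair
    upper = rng.uniform(size=(n, n)) < probs   # independent coin flips
    g = np.triu(upper, k=1)                    # one flip per dyad, no self-loops
    return (g | g.T).astype(int)               # symmetrize the adjacency matrix

# Example: w(u, v) = u * v, so high-latent-value nodes attract more edges.
adj = sample_w_random_graph(lambda u, v: u * v, 100, rng=0)
```

Note that only the adjacency matrix is returned; the latent $U_i$ are discarded, just as they are unobserved in the model.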

(Using the unit interval for the $U$ variables is inessential; if we have a measurable mapping $f$ from $[0,1]$ to any other space, with a measurable inverse $f^{-1}$, then set $V_i = f(U_i)$, and $\Pr{\left(G_{ij} = 1\right)} = w^{\prime}(V_i, V_j) = w(f^{-1}(V_i),f^{-1}(V_j)) ~.$ So if you really want to make the variable for each node a 7-dimensional Gaussian rather than a standard uniform, go ahead.)

What are some examples of $w$-random graphs? Well, as I said, setting $w$ to a constant, say $p$, does in fact force the edges to be IID, each edge being present with probability $p$, so the whole family of Erdős-Rényi random graphs, i.e., random graphs in the strict sense, is included. Beyond this, a simple possibility is to partition the unit interval into sub-intervals, and force $w$ to be constant on the rectangles we get by taking products of the sub-intervals. This corresponds exactly to what the sociologists call "stochastic block models", where each node belongs to a discrete type or block of nodes (= sub-interval), and the probability of an edge between $i$ and $j$ depends only on which blocks they are in. Community- or module-discovery in networks is mostly based on the assumption that not only is there some underlying block model, but that the probability of an intra-block connection is greater than that of an inter-block edge, no matter the blocks; that is, $w$ is peaked along the diagonal. Since every measurable function can be approximated arbitrarily closely by piecewise-constant "simple functions", one can in fact conclude that every $w$-random graph can be approximated arbitrarily closely (in distribution) by a stochastic block model, though it might need a truly huge number of blocks to get an adequate approximation. This also gives an easy way to see that two different $w$ functions can give rise to the same distribution on graphs, so we'll ignore the difference between $w$ and $w^{\prime}$ if $w(u,v) = w^{\prime}(T(u), T(v))$, where $T$ is an invertible map from $[0,1]$ onto $[0,1]$ that preserves the length of intervals (i.e., preserves Lebesgue measure). The reason we ignore this difference is that $T$ just "relabels the nodes", without changing the distribution of graphs.
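
The piecewise-constant case is easy to make concrete. A sketch of a two-block stochastic block model as a $w$-function, with made-up block probabilities (0.8 and 0.6 within the blocks, 0.1 between them):

```python
import numpy as np

def sbm_w(u, v, cuts=(0.5,), probs=((0.8, 0.1), (0.1, 0.6))):
    """Piecewise-constant w-function for a stochastic block model:
    a node's block is the sub-interval its latent uniform falls in,
    and w depends only on the pair of blocks."""
    block_u = np.searchsorted(cuts, u)   # sub-interval index for u
    block_v = np.searchsorted(cuts, v)
    return np.asarray(probs)[block_u, block_v]

# Intra-block pairs (both uniforms below the cut) are more likely to link:
sbm_w(0.2, 0.3)   # -> 0.8 (both in block 0)
sbm_w(0.2, 0.9)   # -> 0.1 (different blocks)
```

The `probs` matrix must be symmetric for `sbm_w` to satisfy $w(u,v) = w(v,u)$; community discovery corresponds to the diagonal entries dominating the off-diagonal ones, as above.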

It's not hard to convince yourself that every $w$-random graph is exchangeable. (Remember that we see only the edges $G_{ij}$, and not the node-specific random variables $U_i$.) What is very hard to show, but is in fact true, is that the distribution of every infinite exchangeable random graph is a mixture of $w$-random graph distributions. Symbolically, the way to produce an infinite exchangeable graph is always to go through the recipe $\begin{eqnarray*} W & \sim & p\\ U_i|W & \sim_{\mathrm{IID}} & \mathcal{U}(0,1)\\ G_{ij}| W, U_i, U_j &\sim & \mathrm{Bernoulli}(W(U_i,U_j)) \end{eqnarray*}$ for some prior distribution $p$ over $w$-functions.
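
The three-stage recipe can be sketched directly. Here the prior $p$ over $w$-functions is a deliberately trivial made-up example — a random constant function — so the resulting mixture is just a Beta mixture of Erdős-Rényi graphs:

```python
import numpy as np

def sample_exchangeable_graph(n, prior_draw, rng=None):
    """The three-stage recipe: W ~ p, then U_i IID Uniform(0,1),
    then G_ij | W, U_i, U_j ~ Bernoulli(W(U_i, U_j))."""
    rng = np.random.default_rng(rng)
    w = prior_draw(rng)                        # W ~ p
    u = rng.uniform(size=n)                    # U_i ~ IID Uniform(0,1)
    p = w(u[:, None], u[None, :])
    g = np.triu(rng.uniform(size=(n, n)) < p, k=1)
    return (g | g.T).astype(int)

# Toy prior: a random constant w-function, i.e. a mixture of
# Erdős-Rényi graphs whose edge probability is drawn from Beta(2, 2).
def beta_er_prior(rng):
    p = rng.beta(2.0, 2.0)
    return lambda u, v: np.full(np.broadcast(u, v).shape, p)

adj = sample_exchangeable_graph(50, beta_er_prior, rng=1)
```

Any other prior over $w$-functions — over block models, say — slots into `prior_draw` the same way.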

In the exchangeable-sequence case, if all we have is a single realization of the process, we cannot learn anything about the prior distribution over IID laws. (Similarly, if we have only a single realization of a stationary process, we can only learn about the one ergodic component that realization happens to be in, though in principle we can learn everything about it.) If we have only a single network to learn from, then we cannot learn anything about the prior distribution $p$, but we can learn about the particular $W$ that it generated, and that will let us extrapolate to other, currently-unseen parts of the network.

Here is where a very interesting connection comes in to what at first sight seems like a totally different set of ideas. Suppose I have a sequence of graphs $G^1, G^2, \ldots$, all of finite size. When can I say that this sequence of graphs is converging to a limit, and what kind of object is its limit?

Experience with analysis tells us that we would like converging objects to get more and more similar in their various properties, and one important set of properties for graphs is the appearance of specific sub-graphs, or motifs. For instance, when $G_{ij} = G_{jk} = G_{ki} = 1$, we say that $i,j,k$ form a triangle, and we are often interested in the number of triangles in $G$. More broadly, let $H$ be some graph with fewer nodes than $G$, and define $m(H,G)$ to be the number of ways of mapping $H$ into $G$ --- picking out nodes in $G$ and identifying them with the nodes of $H$, so that nodes in $G$ are linked whenever their counterpart nodes in $H$ are. (In a phrase, the number of homomorphisms from $H$ into $G$; demanding edges in $G$ exactly when there are edges in $H$ would count induced copies instead.) The maximum possible number of such mappings is limited by the number of nodes in the two graphs. The density of $H$ in $G$ is $t(H,G) \equiv \frac{m(H,G)}{{|G| \choose |H|}} ~.$ If $H$ has more nodes than $G$, we define $m(H,G)$ and $t(H,G)$ to be 0. (Actually, there are a couple of different choices for defining the allowed mappings from $H$ to $G$, and so for the normalizing factor in the denominator of $t$, but these end up not making much difference.)
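
For small graphs the density can be computed by brute force. A sketch using the homomorphism convention — count all maps sending edges of $H$ to edges of $G$, normalized by $n^{|H|}$ rather than a binomial coefficient; as noted above, the choice of convention washes out:

```python
from itertools import product
import numpy as np

def hom_density(H_edges, k, adj):
    """Homomorphism density t(H, G): the fraction of all maps from
    H's k nodes into G's nodes that send every edge of H to an edge
    of G. Brute force, so only sensible for small k and small G."""
    n = len(adj)
    hits = 0
    for phi in product(range(n), repeat=k):        # every map V(H) -> V(G)
        if all(adj[phi[i]][phi[j]] for i, j in H_edges):
            hits += 1
    return hits / n ** k

# Triangle density of the 4-cycle 0-1-2-3-0: zero, since C4 is bipartite.
C4 = np.array([[0,1,0,1],[1,0,1,0],[0,1,0,1],[1,0,1,0]])
triangle = [(0,1),(1,2),(2,0)]
hom_density(triangle, 3, C4)   # -> 0.0
```

The edge density `hom_density([(0,1)], 2, C4)` comes out to 0.5, i.e., twice the number of edges over $n^2$, since each edge is hit by two ordered maps.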

We can now at last define convergence of a graph sequence: $G^1, G^2, \ldots$ converge when, for each motif $H$, the density sequence $t(H,G^1), t(H,G^2), \ldots$ converges. There are several points to note about this definition:

1. If, after a certain point $n$, the graph sequence becomes constant, $G^n = G^{n+m}$ for all $m$, then the sequence converges. This is a reasonable sanity-check on our using the word "convergence" here.
2. A sequence of isomorphic graphs (i.e., ones which are the same after some re-labeling of the nodes) has already converged, since they all have the same density for every motif. So the definition of convergence is insensitive to isomorphisms. This is good, in a way, because isomorphic graphs really are the same in a natural sense, but bad, because deciding whether two graphs are isomorphic is computationally non-trivial; the problem is in NP, but famously not known to be either in P or NP-complete.
3. If the sequence of graphs keeps growing, then convergence of the sequence implies convergence not of the number of edges, triangles, four-stars, etc., but of their suitably-normalized densities.
4. The definition is strongly analogous to that of "convergence in distribution" (a.k.a. "weak convergence") in probability theory. A sequence of distributions $P^1, P^2, \ldots$, converges if and only if, for every bounded and continuous function $f$, the sequence of expected values $P^i f \equiv \int{f(x) dP^{i}(x)}$ converges. Densities of motifs act like bounded and continuous "test functions".
5. The limit of a sequence of graphs is not necessarily a graph. Analogously, the limit of a sequence of discrete probability distributions, like our empirical distribution at any $n$, is not necessarily discrete — it might be a distribution with a continuous density, a mixture of a continuous and a discrete part, etc. The people who developed the theory of such graph limits called the limiting objects graphons. Roughly speaking, graphons are to graphs as general probability distributions are to discrete ones.
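
As a sanity check on the definition, Erdős-Rényi graphs give a convergent sequence: the density of any motif $H$ in $G(n,p)$ tends to $p$ raised to the number of edges of $H$. A minimal simulation sketch for triangles, using the fact that $\mathrm{tr}(A^3)/n^3$ is the triangle homomorphism density:

```python
import numpy as np

def er_triangle_density(n, p, rng):
    """Triangle homomorphism density of one draw of G(n, p)."""
    upper = np.triu(rng.uniform(size=(n, n)) < p, k=1)
    a = (upper | upper.T).astype(float)
    # trace(A^3) counts closed walks of length 3, i.e. triangle homomorphisms
    return np.trace(a @ a @ a) / n ** 3

rng = np.random.default_rng(0)
p = 0.3
for n in (50, 200, 800):
    print(n, er_triangle_density(n, p, rng))   # approaches p**3 = 0.027
```

The fluctuations shrink as $n$ grows, in line with the exponential concentration noted in Lovász and Szegedy's theorem 2.5 (cited below).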

How are graphons represented, if they are not graphs? Well, they turn out to be representable as symmetric functions from the unit square to the unit interval, i.e., $w$-functions! It is easy to see how to turn any finite graph's adjacency matrix into a $0-1$-valued $w$-function: divide the unit interval into $n$ equal segments, and make $w$ 0 or 1 on each square depending on whether the corresponding nodes had an edge or not. Call this $w_G$. It turns out, through an argument I do not feel up to even sketching today, that the density $t(H,G)$ can be expressed as an integral, over the unit cube, of a product of the $w$-function derived from $G$, with one factor for each edge of $H$: $t(H,G) = \int_{[0,1]^{|H|}}{\prod_{(i,j)\in H}{w_{G}(u_i,u_j)}\, du_1 \ldots du_{|H|}}$ This carries over to the limit: if the sequence $G^n$ converges, then $\lim_{n\rightarrow\infty}{t(H,G^n)} = \int_{[0,1]^{|H|}}{\prod_{(i,j)\in H}{w(u_i,u_j)}\, du_1 \ldots du_{|H|}}$ for some limiting function $w$. (If you are the kind of person who finds the analogy to convergence in distribution helpful, you can fill in this part of the analogy now.) We identify the limiting object, the graphon, with the limiting $w$-function, or rather with the equivalence class of limiting $w$-functions.
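
The construction of $w_G$ is mechanical, and the integral formula can be checked by Monte Carlo; a sketch for $H$ a single edge, where the integral of $w_G$ should equal $\sum_{ij} G_{ij} / n^2$:

```python
import numpy as np

def empirical_graphon(adj):
    """Turn an n x n adjacency matrix into its step-function w_G:
    [0,1] is cut into n equal segments, and w_G is constant (0 or 1)
    on each of the n^2 resulting squares."""
    n = len(adj)
    def w(u, v):
        i = np.minimum((np.asarray(u) * n).astype(int), n - 1)  # segment of u
        j = np.minimum((np.asarray(v) * n).astype(int), n - 1)  # segment of v
        return adj[i, j]
    return w

# Monte Carlo version of the integral formula for H = a single edge:
adj = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
w = empirical_graphon(adj)
rng = np.random.default_rng(0)
u, v = rng.uniform(size=(2, 100_000))
est = w(u, v).mean()   # should be near sum(adj) / n^2 = 4/9
```

For larger motifs one draws one uniform per node of $H$ and averages the product of $w_G$ over the edges of $H$, exactly as in the displayed integral.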

To sum up: If we start with an infinite exchangeable graph distribution, then what gets realized comes from a (randomly-chosen) extremal distribution. But the limits of sequences of graphs are, precisely, the extremal distributions of the family of exchangeable graphs. So we would seem to have the kind of nice, closed circle which makes statistical inference possible: a sufficiently large realization becomes representative of the underlying process, which lets us infer that process by examining the realization. What I am very much interested in is how to actually use this suggestion to do some concrete, non-parametric statistics for networks. In particular, it would seem that understanding this would open the way to being able to smooth networks and/or bootstrap them, and either one of those would make me very happy.

Specific points of interest:

1. Understand how to metrize graph convergence, and efficiently calculate the metrics; use for tests of network difference.
2. Suppose that the sequence of graphs $G^n$ are sparse, so that the number of edges per node tends to a constant (rather than growing proportional to the number of nodes). Then all motif densities tend to zero and we lose the ability to distinguish between graph sequences. What is the best way of defining convergence of sparse graphs? What does this do to the probabilistic analogs of graphons?
3. How does this relate to the issues of projectibility for exponential-family random graph models?
4. Given a graph sequence, when can we consistently estimate the, or a, limiting $w$-function? Bickel, Chen and Levina (below) define a set of statistics whose expected values characterize the $w$-function and which can be consistently estimated. This was extremely clever, but inverting the mapping from $w$ to those expectations looks totally intractable — and indeed they don't even try. My own feeling is that this is more of a job for smoothing than for the method of moments, but I'm not comfortable saying much more, yet.
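
To make the smoothing suggestion slightly more concrete, here is a deliberately crude sketch in the spirit of the histogram estimators cited below (Olhede and Wolfe; Chan and Airoldi): sort nodes by degree as a stand-in for the unobserved $U_i$, then average the adjacency matrix over blocks. The function name and block size are my own choices, not anyone's published procedure:

```python
import numpy as np

def histogram_graphon_estimate(adj, h):
    """Crude graphon estimate: order nodes by degree (a proxy for the
    latent U_i), then average the adjacency matrix over h x h blocks,
    giving a piecewise-constant estimate of w."""
    order = np.argsort(adj.sum(axis=1))        # degree-sorted node order
    a = adj[np.ix_(order, order)]
    n = len(a)
    k = n // h                                 # number of blocks per side
    what = np.zeros((k, k))
    for r in range(k):
        for c in range(k):
            what[r, c] = a[r*h:(r+1)*h, c*h:(c+1)*h].mean()
    return what

# Quick check on a w(u, v) = u * v graph: after degree sorting, the
# high-latent-value corner should show a higher estimated edge density.
rng = np.random.default_rng(0)
n = 400
u = np.sort(rng.uniform(size=n))
g = np.triu(rng.uniform(size=(n, n)) < np.outer(u, u), k=1)
adj = (g | g.T).astype(int)
what = histogram_graphon_estimate(adj, 100)    # 4 x 4 block estimate
```

Choosing the block size $h$ is of course the whole bandwidth-selection problem of smoothing in another guise.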
Recommended, big picture:
• Christian Borgs, Jennifer Chayes, László Lovász, Vera T. Sós, Balázs Szegedy and Katalin Vesztergombi, "Graph Limits and Parameter Testing", Proceedings of the 38th Annual ACM Symposium on the Theory of Computing [STOC 2006], pp. 261--270 [PDF reprint via Dr. Chayes]
• Christian Borgs, Jennifer T. Chayes, Henry Cohn, and Yufei Zhao, "An $L^p$ Theory of Sparse Graph Convergence I: Limits, Sparse Random Graph Models, and Power Law Distributions", arxiv:1401.2906
• Persi Diaconis and Svante Janson, "Graph Limits and Exchangeable Random Graphs", Rendiconti di Matematica e delle sue Applicazioni 28 (2008): 33--61, arxiv:0712.2749
• Olav Kallenberg, Probabilistic Symmetries and Invariance Principles [Chapter 7 has the best treatment of exchangeable arrays I've seen. The key results are due to Aldous and Hoover in the early 1980s, but their proofs are notoriously hard, and Kallenberg provided the first "natural", probabilistic proofs.]
• Steffen Lauritzen, "Harmonic Analysis of Symmetric Random Graphs", arxiv:1908.06456 [This is an alternative way of getting to graphons and graph limits, by exploiting a correspondence between exchangeable distributions and "characters" on Abelian semi-groups, i.e., functions which act like exponentials. From this point of view, graphons are a more natural (generalized) model for networks than are exponential-family random graphs]
• László Lovász, Large Networks and Graph Limits
• László Lovász, Balázs Szegedy, "Limits of dense graph sequences", arxiv:math/0408173 [The original graph-limits paper. Note especially theorem 2.5, which shows that the probability of $t(H,G^n)$ being very different from the limiting value is exponentially small in $n$.]
• Steffen L. Lauritzen
• "Exchangeable Rasch Matrices", Rendiconti di Matematica e delle sue Applicazioni 28 (2008): 83--95 [PDF reprint via Prof. Lauritzen]
• "Exchangeable Matrices and Random Networks", [PDF slides; earlier lectures (1, 2) probably useful for context]
• Patrick J. Wolfe, Sofia C. Olhede, "Nonparametric graphon estimation", arxiv:1309.5936
Recommended, close-ups:
• Edoardo M. Airoldi, Thiago B. Costa, Stanley H. Chan, "Stochastic blockmodel approximation of a graphon: Theory and consistent estimation", arxiv:1311.1731
• David J. Aldous, "Representations for partially exchangeable arrays of random variables", Journal of Multivariate Analysis 11 (1981): 581--598
• Peter J. Bickel and Aiyou Chen, "A nonparametric view of network models and Newman-Girvan and other modularities", Proceedings of the National Academy of Sciences (USA) 106 (2009): 21068--21073 [This is the paper which introduced me, and many others in the network area, to the possibility of using graph-limit and exchangeable-array theory, but in retrospect it is by no means an easy read.]
• Peter J. Bickel, Aiyou Chen, and Elizaveta Levina, "The method of moments and degree distributions for network models", Annals of Statistics 39 (2011): 38--59, arxiv:1202.5101
• Christian Borgs, Jennifer Chayes and David Gamarnik, "Convergent sequences of sparse graphs: A large deviations approach", arxiv:1302.4615 [Defining the limit of a sequence of sparse graphs in terms of large deviations of random measures on them]
• Christian Borgs, Jennifer Chayes, László Lovász, Vera T. Sós and Katalin Vesztergombi, "Convergent Sequences of Dense Graphs I: Subgraph Frequencies, Metric Properties and Testing", Advances in Mathematics 219 (2008): 1801--1851 [PDF reprint via Dr. Chayes]
• Christian Borgs, Jennifer Chayes, László Lovász, Vera T. Sós and Katalin Vesztergombi, "Convergent Sequences of Dense Graphs II: Multiway Cuts and Statistical Physics" [PDF preprint via Dr. Borgs]
• Francois Caron, Emily B. Fox, "Bayesian nonparametric models of sparse and exchangeable random graphs", arxiv:1401.1137
• Stanley H. Chan, Edoardo M. Airoldi, "A Consistent Histogram Estimator for Exchangeable Graph Models", arxiv:1402.1888
• Sourav Chatterjee, Persi Diaconis and Allan Sly, "Random graphs with a given degree sequence", Annals of Applied Probability 21 (2011): 1400--1435, arxiv:1005.1136 [Interesting application of the new technology of graph limits to a classic model. May not be terribly practical yet but definitely promising.]
• Sourav Chatterjee and S. R. S. Varadhan, "The large deviation principle for the Erdos-Renyi random graph", arxiv:1008.1946 [Ditto]
• David S. Choi, Patrick J. Wolfe, "Co-clustering separately exchangeable network data", arxiv:1212.4093
• J. F. C. Kingman, "Uses of Exchangeability", Annals of Probability 6 (1978): 183--197
• Steffen L. Lauritzen
• "Extreme Point Models in Statistics" (with discussion), Scandinavian Journal of Statistics 11 (1984): 65--91 [JSTOR]
• Extremal Families and Systems of Sufficient Statistics [Mini-review]
• James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani and Daniel M. Roy, "Random function priors for exchangeable arrays with applications to graphs and relational data", NIPS 2012
• M. E. J. Newman and Tiago P. Peixoto, "Generalized communities in networks", Physical Review Letters 115 (2015): 088701, arxiv:1505.07478
• Miklós Abért, Tamás Hubai, "Benjamini-Schramm convergence and the distribution of chromatic roots for sparse graphs", arxiv:1201.3861
• David Aldous, Russell Lyons, "Processes on Unimodular Random Networks", arxiv:math/0603062
• Tim Austin, Dmitry Panchenko, "A hierarchical version of the de Finetti and Aldous-Hoover representations", arxiv:1301.1259
• Itai Benjamini, Russell Lyons, Oded Schramm, "Unimodular Random Trees", arxiv:1207.1752
• Béla Bollobás, Svante Janson and Oliver Riordan, "The Phase Transition in Inhomogeneous Random Graphs"
• Béla Bollobás and Oliver Riordan, "Sparse graphs: metrics and random models", arxiv:0812.2656
• Marián Boguñá and Romualdo Pastor-Satorras, "Class of correlated random networks with hidden variables", Physical Review E 68 (2003): 036112, arxiv:cond-mat/0306072
• Christian Borgs, Jennifer T. Chayes, Souvik Dhara, Subhabrata Sen, "Limits of Sparse Configuration Models and Beyond: Graphexes and Multi-Graphexes", arxiv:1907.01605
• Christian Borgs, Jennifer T. Chayes, Henry Cohn, Shirshendu Ganguly, "Consistent nonparametric estimation for heavy-tailed sparse graphs", arxiv:1508.06675
• Christian Borgs, Jennifer T. Chayes, Henry Cohn, László Miklós Lovász, "Identifiability for graphexes and the weak kernel metric", arxiv:1804.03277
• Christian Borgs, Jennifer Chayes and László Lovász, "Moments of Two-Variable Functions and the Uniqueness of Graph Limits", Geometric and Functional Analysis 19 (2010): 1597--1619 [PDF preprint]
• Sourav Chatterjee, "Matrix estimation by Universal Singular Value Thresholding", arxiv:1212.1247
• Fan Chung, "From quasirandom graphs to graph limits and graphlets", arxiv:1203.2269
• Harry Crane, "Infinitely exchangeable random graphs generated from a Poisson point process on monotone sets and applications to cluster analysis for networks", arxiv:1110.4088
• Persi Diaconis, Susan Holmes and Svante Janson, "Threshold Graph Limits and Random Threshold Graphs", arxiv:0908.2448
• David Gamarnik, "Right-convergence of sparse random graphs", arxiv:1202.3123
• Tue Herlau, Mikkel N. Schmidt, Morten Mørup, "Completely random measures for modelling block-structured networks", arxiv:1507.02925
• Brian Karrer, M. E. J. Newman, "Random graphs containing arbitrary distributions of subgraphs", arxiv:1005.1659 [Not sure if this really connects or not...]
• P. Latouche, S. Robin, "Bayesian Model Averaging of Stochastic Block Models to Estimate the Graphon Function and Motif Frequencies in a W-graph Model", arxiv:1310.6150
• László Lovász, "Very large graphs", arxiv:0902.0132
• Terence Tao, "A correspondence principle between (hyper)graph theory and probability theory, and the (hyper)graph removal lemma", arxiv:math/0602037
• Johan Ugander, Lars Backstrom, Jon Kleinberg, "Subgraph Frequencies: Mapping the Empirical and Extremal Geography of Large Graph Collections", arxiv:1304.1548
To write:
• CRS + co-conspirators to be named later, "Detecting Differences in Network Structure"
• Co-conspirators to be named later + CRS, "Smoothing Networks"