Attention conservation notice: Over 3000 words of technical commentary on a paper on the statistical analysis of networks. Does a poor job of explaining things to those without background knowledge of networks, statistical inference and Markov chains. Includes some geeky jokes and many equations.

Yesterday in the Statistical Learning for Networks seminar, we discussed the following paper; what follows are a mix of my notes from before and after the discussion.

- Carsten Wiuf, Markus Brameier, Oskar Hagberg and Michael P. H. Stumpf, "A likelihood approach to analysis of network data", Proceedings of the National Academy of Sciences (USA) **103** (2006): 7566--7570 [freely available from PNAS, as it is over six months old]. *Abstract*: Biological, sociological, and technological network data are often analyzed by using simple summary statistics, such as the observed degree distribution, and nonparametric bootstrap procedures to provide an adequate null distribution for testing hypotheses about the network. In this article we present a full-likelihood approach that allows us to estimate parameters for general models of network growth that can be expressed in terms of recursion relations. To handle larger networks we have developed an importance sampling scheme that allows us to approximate the likelihood and draw inference about the network and how it has been generated, estimate the parameters in the model, and perform parametric bootstrap analysis of network data. We illustrate the power of this approach by estimating growth parameters for the *Caenorhabditis elegans* protein interaction network.

It's perhaps not completely clear from the abstract that their method works
for a *particular* class of network growth models, which they call
"duplication-attachment" (DA) models. These are pure growth models, where the
network only expands, and never loses nodes or edges. (The network is assumed
to be undirected, without self-loops or multiple edges between a given pair of
nodes.) At each time step, we add one node. This is randomly chosen, with
fixed probability, to be either a duplication or an attachment. If it's an
attachment, the new node attaches to an old one, chosen uniformly over the
graph (possibly with some fixed probability < 1). If it's a duplication event,
we pick an existing node to duplicate, and the new one gets a certain
probability of copying each of its model's links (independently), and a
different probability of being linked to its model. It is entirely possible
that a new node is added with no links to any existing node. Notice that nodes
(and edges) have no intrinsic properties in this model; their names are
arbitrary. Any two isomorphic graphs should therefore be assigned the same
probability.

(The motivation for such models is that gene duplication is apparently fairly common, at least over evolutionary time, which would duplicate the interactions between genes, or their proteins. Attachment, here, is supposed to summarize all the other processes of network growth. There are several models of the evolution of protein interaction networks, e.g. those of Sole et al., 2002, and Vazquez et al., 2003, which are basically of the duplication-attachment type, and yield networks which at least qualitatively match some features of real-world examples, like degree distributions. These papers are not cited here.)

From any starting graph, it is easy to run the model forward and generate new, larger random graphs; the probabilities involved are all pretty simple uniform and binomial things. In fact, the current state of the graph completely determines the distribution of future graphs, so this is a Markov chain. The transition probabilities are fixed by the duplication and attachment parameters, collectively \( \theta \), and these, together with a distribution over starting graphs, give us a fully-specified stochastic model.
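To make the dynamics concrete, here is a minimal Python sketch of one growth step of such a chain. The parameterization and names (`p_dup`, `p_copy`, and so on) are my own invention for illustration, not the paper's notation:

```python
import random

def da_step(adj, p_dup, p_copy, p_link, p_attach):
    """One growth step of a duplication-attachment (DA) chain.

    `adj` maps each node to the set of its neighbors (undirected, no
    self-loops).  With probability p_dup the new node duplicates a
    uniformly chosen model node: each of the model's edges is copied
    independently with probability p_copy, and a link to the model
    itself is added with probability p_link.  Otherwise the new node
    attaches to a uniformly chosen node with probability p_attach.
    Either way, the new node may end up with no links at all.
    """
    new = max(adj) + 1
    adj[new] = set()
    target = random.choice([v for v in adj if v != new])
    if random.random() < p_dup:              # duplication event
        for u in adj[target]:                # copy each of the model's links
            if random.random() < p_copy:
                adj[new].add(u)
                adj[u].add(new)
        if random.random() < p_link:         # possibly link to the model
            adj[new].add(target)
            adj[target].add(new)
    else:                                    # attachment event
        if random.random() < p_attach:
            adj[new].add(target)
            adj[target].add(new)
    return adj

# Grow a graph from a single edge up to 50 nodes:
g = {0: {1}, 1: {0}}
while len(g) < 50:
    da_step(g, p_dup=0.5, p_copy=0.6, p_link=0.1, p_attach=0.9)
```

Iterating `da_step` is the forward simulation; the distribution of `g` after each call depends only on the current graph, which is the Markov property just mentioned.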

Normally, statistical inference for Markov chains is fairly straightforward,
because most of the classical conveniences which make inference for IID data
tractable are still available. (This is, after all, what led Markov to his
chains!) So why then does the paper not end on the second page, with a
citation "See Billingsley (1961)"? Because normally, when we observe a Markov
chain, we observe a *sequence* of values from the chain, and that lets
us estimate the transition parameters. Here, however, we have *only*
the end-state, the final graph, as our data. The Markov chain will let
us assign likelihoods to paths, but we don't know where we started from, and
we don't know where we went from there, just how we ended up here.

Suppose we did know where we started from, some graph of
size \( t_0 \),
\( G_{t_0} \). (Remember here that each step of the chain adds one node,
so \( t \) really counts nodes, not clock-time. This is why it's natural to
start at \( t_0 \), as opposed to 0 or 1. The paper does not seem to
explain this point.) If we knew our initial state, then in principle we could
figure out the probability of reaching our final state from it, as a sum over
all possible paths:
\[
L(G_t,\theta) \equiv \mathrm{Pr}(G_t|G_{t_0};\theta) = \sum_{(G_{t_0+1}, \ldots, G_{t-1}) \in \mathcal{G}(t_0,t)}{\prod_{s=t_0+1}^{t}{\mathrm{Pr}(G_s|G_{s-1};\theta)}}
\]
where \( \mathcal{G}(t_0,t) \) is the set of all growing sequences of graphs which
start at \( G_{t_0} \) and end at \( G_t \). This mathematical expression is a
mouthful, admittedly, but it's probably clearer in a picture.

There are only so many paths from the initial graph \( G_{t_0} \) to the final, observed graph \( G_t \). The chain tells us the probability of each such path. Since we had to take one, and only one, of these paths, the total probability of making the journey is the sum of the probabilities of all the individual paths.

At this point, any physicists in the audience should be nodding their heads; what I've just said is that the likelihood, from a given starting configuration, is a sum over histories or sum over paths. Along with the authors, I'll return to how to evaluate this sum over histories presently, but first we need to figure out how to get that starting configuration.

If we had a known distribution over starting graphs, we could (in principle) just evaluate the likelihood conditional on each starting graph, and then take a weighted sum over graphs. This, however, is not what the authors do. (I'm really not sure where one would find such a distribution, other than another model for graph development. Bayesian practice would suggest picking something which led to easy computations, but this makes a mockery of any pretense to either modeling nature, or to representing incomplete prior knowledge.) Instead, they try to use the known dynamics of the DA model to fix on an unambiguous starting point, and do everything conditional on that.

They observe that you can take any graph, and, for each node, identify the
other nodes it could have been copied from. (If A could have been copied
directly from B, then A's neighbors must be a subset of B's [ignoring the link
between A and B, if any].) So, from any starting graph, you can recursively
remove nodes that could have arisen through simple duplications. In general, at
each stage in this recursion there will be multiple nodes which could be
removed, and their choice is arbitrary. Remarkably enough, no
matter *which* choices one makes, the recursion always terminates at
the *same* graph. (More exactly, any two end-points are isomorphic to
each other, and so identical for statistical purposes.) The proof is basically
a fixed point theorem about a partial order defined on graphs through
deletion-of-duplicates, but they confine it to the supplementary
materials, so you can take it on trust (and they don't use
such lattice-theoretic language even there). This graph --- the
data, minus everything that could be pure duplication --- is what they take as
their starting point. This is the \( G_{t_0} \) to the data's \( G_t \).
Everything is then done conditional on \( G_{t_0} \).
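The reduction is easy to sketch in code (my own sketch, not the authors'; the subset test below is exactly the could-have-been-copied criterion just described):

```python
def removable(adj, v):
    """True if v could have been copied from some other node u, i.e. v's
    neighbors (ignoring u itself) are a subset of u's neighbors."""
    return any((adj[v] - {u}) <= (adj[u] - {v}) for u in adj if u != v)

def irreducible_core(adj):
    """Delete removable nodes until none are left.  By the fixed-point
    result in the supplementary materials, the outcome is the same graph
    (up to isomorphism) no matter which removable node goes first."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}  # don't clobber the input
    while True:
        candidates = [v for v in adj if removable(adj, v)]
        if not candidates:
            return adj
        v = candidates[0]                  # this choice is arbitrary
        for u in adj.pop(v):
            adj[u].discard(v)

# A 5-cycle with one vertex duplicated reduces back to a 5-cycle:
g = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
g[5] = {1, 4}; g[1].add(5); g[4].add(5)
core = irreducible_core(g)
```

In the example, two different nodes are removable at the first stage (the duplicated vertex, or its model), but either deletion leaves a graph isomorphic to the original 5-cycle, in which nothing further is removable.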

OK, we have our initial condition and our final condition, and we have our Markov chain, so all we've got to do is evaluate the sum over histories linking the two.

*Problem*: There are too many paths. In the worst case, the number
of paths is going to grow *factorially* with the number of nodes in the
observed graph. Even though along each path we've just got to do some
straight-forward multiplication, simply enumerating all the paths and summing
over them will take us forever. (The authors discuss some algorithmic tricks
for speeding up the exact calculation, but still get something
super-exponential!) Thus, evaluating the sum over histories for the likelihood
is intractable, even for a single parameter value.

*Solution*: Don't look at all the paths. Rather, *sample*
some paths, say \( N \) of them, evaluate the likelihood along each, and
average. Hope that this converges quickly (in \( N \) ) to the exact value of
the sum. This is, after all, how physicists approach many sums over histories.

*Problem*: Even if \( N \) is fairly small, we need to examine
many settings of the parameter \( \theta \). It could still kill us to
have to sample \( N \) distinct paths for each parameter value.

*Solution*: Use importance sampling. Draw a *single* path,
valid for all parameter values, and evaluate the likelihood in terms of the
value of an "importance weight" along this path. The weight has to be a
function of \( \theta \), but it should be the only thing which is. We do this
here by writing the likelihood, \( L(G_t,\theta) \), as an expectation with
respect to a reference measure, which the authors write \( \theta_0 \). This
reference measure is given by another Markov chain, called the "driving chain";
despite its name, it is *not* a member of the DA family of chains. The
trick here is that *one* sample of possible paths, generated according
to this chain, can be used to (approximately) evaluate the likelihood
at *all* parameter settings of the DA model.

The crucial equation is [3] on p. 7568,
\[
L(\theta,G_t) = \mathbf{E}_{\theta_0}\left[ \prod_{s=t_0}^{t}{S(\theta_0,\theta,G_s,\nu_s)}\right]
\]
where
\[
S(\theta_0,\theta,G_t,\nu) = \frac{1}{t} \omega(\theta,G_t,\nu) \frac{\omega(\theta_0,G_t)}{\omega(\theta_0,G_t,\nu)}
\]
(N.B., the paper writes the second factor as \( \omega(\nu,G_t,\nu) \), but this is wrong.) Let's unpack this a bit.

\( \omega(\theta,G_t,\nu) \) is the probability of producing the graph \(
G_t \) through the addition of the node \( \nu \), with parameter setting \(
\theta \). (N.B., \( \nu \) must be a "removable" node, one which could have
been added by duplication.) \( \omega(\theta,G_t) \) is this transition
probability, summed over all possible \( \nu \). The first two factors
in *S* are what we want, the probability we'd get moving forward along
the path according to the parameter \( \theta \). The third term is the
reciprocal of the transition probabilities according to the driving chain. Its
only job is to cancel those probabilities out.

The algorithm for generating the *i*th sample path is then as
follows. Start with the observed graph \( G_t \). Count backwards, \( s = t,
t-1, \ldots, t_0+1 \). At each step, pick a node \( \nu^{(i)}_s \) to delete, with probability
proportional to \( \omega(\theta_0, G^{(i)}_s, \nu^{(i)}_s) \). (Once again,
this limits us to the "removable" nodes, the ones which could have been
produced by duplication.) Set \( G^{(i)}_{s-1} \) to be the result of deleting
that node. Keep going back until we hit the irreducible core, \( G_{t_0} \).
(We will always hit this core, by the fixed point theorem proved in the
supplementary results.) Then
\[
\begin{eqnarray*}
l^{(i)}(\theta) & = & \prod_{s=t_0}^{t}{S(\theta_0,\theta,G^{(i)}_{s},\nu^{(i)}_s)}\\
\hat{L}(\theta,G_t) & = & \frac{1}{N}\sum_{i=1}^{N}{l^{(i)}(\theta)}
\end{eqnarray*}
\]
To ease the calculation of this for multiple parameter settings, it may be
worth noting (though the authors do not) that
\[
l^{(i)}(\theta) = \left(\frac{(t_0-1)!}{t!}\right)\left(\prod_{s=t_0}^{t}{\omega(\theta,G^{(i)}_s,\nu^{(i)}_s)} \right)\left(\prod_{s=t_0}^{t}{\frac{\omega(\theta_0,G^{(i)}_s)}{\omega(\theta_0,G^{(i)}_s,\nu^{(i)}_s)}}\right)
\]
and the middle factor is the only one which depends on \( \theta \).
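Putting the backward sampler and the weights together, the estimator has roughly this shape. This is a sketch under the assumption that we are handed the transition probabilities as a function `omega(theta, G, v)`, returning the probability that the chain with parameters `theta` produced `G` by adding node `v` (zero unless `v` is removable); nothing here is the authors' actual code:

```python
import random

def estimate_likelihoods(G_t, omega, theta0, thetas, N):
    """Importance-sampling estimate of L(theta, G_t) for every theta in
    `thetas`, using N paths drawn from the driving chain at theta0."""
    total = {th: 0.0 for th in thetas}
    for _ in range(N):
        G = {v: set(nbrs) for v, nbrs in G_t.items()}  # fresh copy of the data
        weight = {th: 1.0 for th in thetas}
        while True:
            probs = {v: omega(theta0, G, v) for v in G}
            z = sum(probs.values())       # omega(theta_0, G), summed over nodes
            if z == 0:                    # no removable nodes: we hit the core
                break
            removable = [v for v in G if probs[v] > 0]
            v = random.choices(removable,
                               weights=[probs[u] for u in removable])[0]
            s = len(G)                    # current number of nodes
            for th in thetas:             # multiply in the factor S
                weight[th] *= omega(th, G, v) * z / (s * probs[v])
            for u in G.pop(v):            # delete v and its edges
                G[u].discard(v)
        for th in thetas:
            total[th] += weight[th]
    return {th: total[th] / N for th in thetas}
```

The point of the construction shows up in the inner loop: the path is drawn once, from the \( \theta_0 \) chain, and the same sequence of deletions updates the running weight for every \( \theta \) simultaneously.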

So, to summarize: We can generate *a* sample of paths connecting the
observed final graph to the unobserved initial graph, according to the driving
chain, and then approximate the likelihood for *any* parameter value by
multiplying the importance weights along those paths and summing over paths.
(The importance weights themselves even factor nicely.) We have thus solved
the problem of evaluating a sum over histories, when we've got only one end of
too many possible paths.

The trick used here to pull this off depended on having a uniquely-defined starting point for all parameter settings, namely the \( G_{t_0} \) defined through undoing duplications. (According to the authors, they took this from papers on the coalescent process in population genetics, but I have not been able to track down their references.) Strictly speaking, everything is conditional on that starting point. Of this, more below.

Left unaddressed by the above is the question of how many paths we need to
sample. Remember, the whole point is to *not* have to look at every
possible path! If it turns out that accurate approximations to the likelihood
require us to sample some substantial fraction, then this is all for nothing.
However, the authors' figures reveal something quite remarkable. Whether
\( N \) is 10 or 1000, the approximate likelihood changes very little (at
least on a log scale), even with real data. This suggests that we don't,
actually, need a lot of paths, but why?

For each *i*, \( l^{(i)}(\theta) \) is an independent
realization of a random variable, whose distribution depends
only on \( \theta \) (holding fixed the driving chain, and the initial and
final graphs). Since they are IID, we can apply the central limit theorem,
which tells us that the standard error of their mean shrinks like
\( \sigma/\sqrt{N} \). That factor is ten times smaller at \( N=1000 \) than
at \( N=10 \), yet the estimates hardly differ, so the variance \( \sigma^2 \)
of the likelihood along the individual sample paths must already be pretty
small. Why?

The lame physics answer is, "the principle of least action". There will be
optimal, most-probable paths, and they will dominate the sum, the others
tending to make negligible contributions. With high probability, a random sample
will pick out the most probable paths. Q.E.D. This argument could, perhaps,
be made less lame through an application of large deviations results,
specifically conditional-limit-theorem- (or "Gibbs's conditioning principle"-)
type results for Markov chains, which roughly say that if something improbable
(a passage from \( G_{t_0} \) to \( G_t \)) happens, it does so in
the least-improbable possible way, and deviations from that trajectory are
exponentially suppressed. <televangelist>In the name of Cramér,
in the name of Varadhan, in the name of Freidlin and Wentzell, I call on you,
Brother Argument, to be *healed!* Arise and prove! And, cf. Eyink
(2000).</televangelist>

An information-theoretic answer is to invoke the asymptotic equipartition principle, a.k.a. the Shannon-McMillan-Breiman theorem. This says that if \( X_1, X_2, \ldots X_t \) are generated according to a (well-behaved) stochastic process, whose distribution is \( \mu \), and \( \theta \) is a sufficiently well-behaved model, then \[ -\frac{1}{t}\log{L(\theta,X_1^t)} \rightarrow h(\mu) + d(\mu,\theta), ~\mu-a.s. \] where \( h(\mu) \) is the entropy rate of the data-generating process \( \mu \), and \( d(\mu,\theta) \) is the relative entropy rate between the data source and the model, i.e., the asymptotic growth rate of the Kullback-Leibler divergence. (For details, see Algoet and Cover, 1988, or Gray, 1990.) So \[ \log{L(\theta)} = - (h+d(\theta))t + o(t), ~\mu-a.s. \] In words, there are only two kinds of long sample paths: those which are never generated at all, and those whose normalized log-likelihoods are equal, at least to first order. It's not clear to me whether \( t \) is big enough in the examples for this effect to kick in.
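As a toy illustration of this equipartition effect (my example, substituting an IID Bernoulli source for the graph chain, with the true parameter as the model so that \( d = 0 \)):

```python
import math, random

def neg_log_lik_rate(bits, q):
    """-(1/t) log-likelihood per symbol of `bits` under a Bernoulli(q) model."""
    t, ones = len(bits), sum(bits)
    return -(ones * math.log(q) + (t - ones) * math.log(1 - q)) / t

random.seed(4)
p, t = 0.7, 5000
# 100 independent sample paths of length t from a Bernoulli(p) source:
rates = [neg_log_lik_rate([random.random() < p for _ in range(t)], p)
         for _ in range(100)]
h = -(p * math.log(p) + (1 - p) * math.log(1 - p))  # entropy rate of the source
# Every sampled path's normalized log-likelihood lands within a whisker of h:
# max(rates) - min(rates) is a few hundredths of a nat, against h of about 0.61.
```

The same concentration, if it applies to the graph chain at the sizes in the paper, would explain why a handful of sampled paths suffice.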

The biggest unclarity of all, probably, is the role of \( G_{t_0} \).
Recall that we reached this by removing nodes which *could have been*
added by pure duplication. There is, however, no particular reason to think
that the actual growth of the graph ever passed through this state. It has the
advantage of giving us a unique starting point for the chain, but there are,
potentially, others. One, of course, is the trivial network consisting of a
single node! Another possibility (which came up in the discussion, I think
mostly due to Anna Goldenberg) is to
first remove potential duplicates, as the authors do, and then remove nodes
which have only one link to them, as clear attachments. This process of
unwinding the attachments could potentially be iterated, until no "danglers"
are left. This, too, is a uniquely-defined point. We can then go back to
removing nodes which are, now, potential duplicates, and so on. Someone (I
forget who) suggested that this might always terminate at the one-node network;
it would be nice to either show this or give a counter-example. But if there
is some principled reason, other than tractability, to use
their \( G_{t_0} \), I can't figure out what it is from this paper.

Only using a growing network, and in particular only focusing on growth
through duplication, certainly simplifies the computational problem, by
reducing the number of possible paths which could terminate in the observed
graph. Deletion of nodes and edges is however going to be very important in
more biologically-plausible models, to say nothing of models of social or
technological networks. Presumably the trick of using a backward-looking chain
which stops when it hits a unique starting configuration could still be used
with deletions --- I think the authors are hinting as much in their
conclusion --- but it's not clear to me that a unique starting
point *is* appropriate. With biological interaction networks, for
example, one might argue that, e.g., metazoans have been around for a long
time, so the distribution of networks ought to be close to stationary, and
so starting configurations should be drawn from an invariant distribution
of the appropriate chain...

This raises two further points, which are not un-related: the asymptotics of
the DA model, and the biological utility of such a model. Run for a long time,
the DA model will produce graphs of unbounded size, but it's not immediately
obvious what these graphs will look like. In particular, what will be their
degree distribution? The Barabasi-Albert model (Albert and Barabasi, 2002)
produces scale-free distributions, <boosterism>because it uses the same
mechanism as Herbert Simon's classic paper</boosterism> (Bornholdt and
Ebel, 2001). This relies on a rich-get-richer dynamic, where nodes with high
degree are more likely to attract new edges. My initial thought was that this
wasn't present in the DA model, because targets for attachment and targets for
duplication are both chosen uniformly. However, someone in the discussion ---
I think, though I may be mis-remembering, that it was Tanzy Love --- pointed
out that while high-degree nodes are no more likely to be copied than
low-degree nodes, *edges* into high-degree nodes are more likely to be
copied than edges into low-degree nodes. This is because if a node has
degree *k*, there are *k* other nodes whose duplicates could end
up linking to it. It may even be the case that this falls under the
theorems in Simon... Presumably the asymptotics would only become harder to
handle if we added events deleting nodes or edges.

As for the biological utility, I'll repeat that none of the nodes have any
identity of their own; only their role in the network of relations represented
by the edges has any bearing on the model. "If this be structuralism, make the
most of it": by turning it into a neutral model for the evolution of biological
networks. After all, there is no *reason* here to duplicate certain
nodes or edges, it's all just uniform chance. One key use of neutral models is
to provide a background against which to detect adaptation; how could we do
that here?

- References:
- Albert, Réka and Albert-László Barabási (2002), "Statistical Mechanics of Complex Networks", Reviews of Modern Physics **74**: 47--97 = cond-mat/0106096
- Algoet, Paul H. and Thomas M. Cover (1988), "A Sandwich Proof of the Shannon-McMillan-Breiman Theorem", The Annals of Probability **16**: 899--909
- Billingsley, Patrick (1961), Statistical Inference for Markov Processes. Statistical Research Monographs, vol. 2. Chicago: University of Chicago Press
- Bornholdt, Stefan and Holger Ebel (2001), "World-Wide Web scaling exponent from Simon's 1955 model", Physical Review E **64**: 035104 = cond-mat/0008465
- Eyink, Gregory L. (2000), "A Variational Formulation of Optimal Nonlinear Estimation", Methodology and Computing in Applied Probability (submitted) = physics/0011049
- Gray, Robert M. (1990), Entropy and Information Theory. Berlin: Springer-Verlag. Full text free online
- Simon, Herbert A. (1955), "On a Class of Skew Distribution Functions", Biometrika **42**: 425--440
- Solé, Ricard V., Romualdo Pastor-Satorras, Eric Smith and Thomas B. Kepler (2002), "A model of large-scale proteome evolution", Advances in Complex Systems **5**: 43--54 = cond-mat/0207311
- Vázquez, A., A. Flammini, A. Maritan and A. Vespignani (2003), "Modeling of protein interaction networks", Complexus **1**: 38--44 = cond-mat/0108043

**Update**, 18 January 2019: Fixed some typos, replaced
potentially-misleading use of "path integral" with "sum over histories";
left in even the worst jokes as a reminder-to-self.

Posted at November 29, 2006 02:30 | permanent link