Recovering Networks from Noisy Observations
29 Sep 2024 15:50
First version 19 November 2021 (as an e-mail to colleagues after a seminar); prepared for posting 29 September 2024 because I ran across it again while cleaning stuff up, and had done nothing with it in the meanwhile.

Suppose $G^{(n)}$ is a simple undirected graph on $n$ nodes, generated from some well-behaved network model, for simplicity here a stochastic block model. As I explain below, I think the argument will carry over to any conditionally-independent-dyad/graphon network model, so long as the edge probabilities are bounded away from 0 and 1, but the jargon is a little easier if I talk about block models.
We observe not $G^{(n)}$ but $A^{(n)}$, where each dyad in $G^{(n)}$ is corrupted by independent noise; whether this takes the form of flipping edge/non-edge status with a constant probability, or of distinct false-positive and false-negative rates, does not, I think, matter.
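For concreteness, here is a minimal simulation sketch of this set-up; the two blocks, the particular edge probabilities, and the noise rates $\alpha$ and $\beta$ are illustrative choices of mine, not anything the argument turns on.

```python
# A minimal simulation of the setup: a 2-block SBM observed through
# independent dyad-level noise. The block count, edge probabilities,
# and noise rates are illustrative choices, nothing more.
import numpy as np

rng = np.random.default_rng(42)

n = 200                           # number of nodes
B = np.array([[0.6, 0.2],         # block-to-block edge probabilities,
              [0.2, 0.5]])        # all strictly between 0 and 1
alpha, beta = 0.05, 0.10          # false-positive / false-negative rates

X = rng.integers(0, 2, size=n)    # latent block assignments

# True graph G^(n): each dyad an independent Bernoulli draw given the blocks
P = B[X[:, None], X[None, :]]
G = np.triu(rng.random((n, n)) < P, k=1).astype(int)
G = G + G.T                       # symmetric, no self-loops

# Observed graph A^(n): each dyad of G flipped independently, with
# probability beta for true edges and alpha for true non-edges
flip = np.triu(rng.random((n, n)) < np.where(G == 1, beta, alpha), k=1)
A = np.abs(G - (flip | flip.T).astype(int))
```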
- Query: When can $G^{(n)}$ be recovered from $A^{(n)}$?
- Conjecture: Never, even if $n \rightarrow \infty$, so long as the level of observational noise is constant in $n$, and $0 < P(G_{ij}=1|X_i=r, X_j=s) < 1$ for all $r, s$.
First argument
Let us focus on a single dyad, \( G_{ij} \). (Strictly speaking this should be $G^{(n)}_{ij}$, but you knew that.) Draw the graphical model here: $A_{ij}$ has one parent, $G_{ij}$. $G_{ij}$ has two parents, $X_i$ and $X_j$, the (latent) block assignments of the nodes. Conditional on its parents, $G_{ij}$ is independent of its non-descendants, and conditioning on $A_{ij}$ as well doesn't activate any additional paths. So $G_{ij}$ is independent of all other $A_{kl}$, given $X_i, X_j, A_{ij}$. But the entropy of $G_{ij}$ given those three variables is plainly going to be $> 0$, and constant in $n$: \[ H[G_{ij}|A^{(n)}, X_i, X_j] = H[G_{ij}|A_{ij}, X_i, X_j] > 0 \] So even if conditioning on all of $A^{(n)}$ gave us, asymptotically, perfect knowledge of $X_i$ and $X_j$, we still couldn't recover $G_{ij}$. More exactly, our conditional entropy for $G_{ij}$ given $A^{(n)}$ will tend to a constant $> 0$: \[ H[G_{ij}|A^{(n)}] \rightarrow H[G_{ij}|A_{ij}, X_i, X_j] > 0 \] (Note by the way that this is under the optimistic, but not crazy, assumption that seeing a large but noisy graph asymptotically lets us pick out each node's block.) The constant will depend on $X_i$, $X_j$ and $A_{ij}$, say \[ H[G_{ij}|A_{ij} = a, X_i = r, X_j = s] = c(a, r, s) \]
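To make the constant concrete: take, say, the version of the noise model with false-positive rate $\alpha$ and false-negative rate $\beta$ (as in the sketch above), and write $p_{rs} = P(G_{ij}=1|X_i=r, X_j=s)$. Then Bayes's rule gives, for instance with $a=1$, \[ P(G_{ij}=1|A_{ij}=1, X_i=r, X_j=s) = \frac{(1-\beta)p_{rs}}{(1-\beta)p_{rs} + \alpha(1-p_{rs})} \] so $c(1, r, s)$ is the entropy of a coin with this bias, which is strictly positive so long as $0 < p_{rs} < 1$ and the noise rates are themselves strictly between 0 and 1.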
Now, by Fano's inequality (e.g.), the probability of mis-classifying $G_{ij}$ cannot approach zero, because a positive conditional entropy implies a lower bound on the mis-classification probability.
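To spell the Fano step out for a binary variable: for any estimator $\hat{G}_{ij}$ computed from $A^{(n)}$, \[ h\left(P(\hat{G}_{ij} \neq G_{ij})\right) \geq H[G_{ij}|A^{(n)}] \] where $h$ is the binary entropy function; so the mis-classification probability is bounded below by the inverse of $h$ (taking the branch on $[0, 1/2]$) applied to a quantity tending to a positive constant.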
This will be true for all dyads, and given the $X$s the dyads are conditionally independent, so their entropies add: \[ H[G^{(n)}|A^{(n)}] = \sum_{i < j}{H[G_{ij}|A^{(n)}]} \rightarrow \sum_{i < j}{H[G_{ij}|A_{ij}, X_i, X_j]} \] This will in turn approach \[ \binom{n}{2} \sum_{r, s}{\sum_{a=0}^{1}{c(a, r, s) P(A_{ij}=a|X_i=r, X_j=s) P(X_i=r, X_j=s)}} + O_P(n) \] because the independence assumptions built into the model give us a weak law of large numbers for the proportion of dyads belonging to any given pair of blocks.
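As a sanity check on the leading-order constant, here is the expected per-dyad conditional entropy under the same illustrative parameters as before, assuming equal-sized blocks (again my choice, not the argument's):

```python
# The leading-order constant: the expected per-dyad conditional entropy
# E[c(A_ij, X_i, X_j)], under illustrative parameters with equal blocks.
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

B = np.array([[0.6, 0.2], [0.2, 0.5]])
pi = np.array([0.5, 0.5])            # block membership probabilities
alpha, beta = 0.05, 0.10             # false-positive / false-negative rates

expected_c = 0.0
for r in range(2):
    for s in range(2):
        p = B[r, s]
        for a in (0, 1):
            # Likelihoods P(A_ij = a | G_ij = 1) and P(A_ij = a | G_ij = 0)
            like1 = (1 - beta) if a == 1 else beta
            like0 = alpha if a == 1 else (1 - alpha)
            pa = like1 * p + like0 * (1 - p)   # P(A_ij = a | blocks)
            post = like1 * p / pa              # P(G_ij = 1 | A_ij = a, blocks)
            expected_c += h2(post) * pa * pi[r] * pi[s]

print(expected_c)  # bits per dyad; H[G^(n)|A^(n)] ~ (n choose 2) times this
```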
In particular, $H[G^{(n)}]$ and $H[G^{(n)}|A^{(n)}]$ both grow $\simeq n^2$, though the latter will be smaller than the former. So it's hard to see any sense in which we could be consistently recovering $G^{(n)}$ from $A^{(n)}$.
Second argument
Let's go back to our favorite dyad $(i, j)$. Recovering the state of this dyad would mean that the posterior log-odds ratio for $G_{ij}=1$ vs. $G_{ij}=0$, conditional on $A^{(n)}$, would have to go to $\pm \infty$. (In fact it would entail going to the right infinity, but we'll see that it doesn't go to either one at all.) Now the posterior log odds are equal to the prior log odds plus the log likelihood ratio. So suppose that the Oracle told us the block assignments of nodes $i$ and $j$, and the structural probability of an edge between those blocks. This gives us the prior log odds, and they're finite, whatever they might be. (Again, I assume none of the block-block edge probabilities are zero or one.) Moreover, the log likelihood ratio for edge vs. no edge is set by the observational noise levels, and those don't change as we get larger and larger data sets, because of the independent-noise assumption. Hence the posterior log odds can grow in magnitude, but only up to a finite value, meaning we never approach any firm decision about whether or not a given edge exists.

Of course we don't have the Oracle, so we don't really know the true block assignments or structural parameters, but that could only make things worse, and anyway if the Inference Gods are kind we'll asymptotically approach the Oracle.
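A sketch of the Oracle calculation, again with illustrative numbers:

```python
# The Oracle calculation from the second argument: the posterior log odds
# are the prior log odds plus a log likelihood ratio, and both terms are
# finite constants, whatever n is. Parameters are illustrative as before.
import numpy as np

p = 0.2                     # Oracle-given P(G_ij = 1 | X_i, X_j)
alpha, beta = 0.05, 0.10    # false-positive / false-negative rates

prior_log_odds = np.log(p / (1 - p))
# log P(A_ij = a | G_ij = 1) - log P(A_ij = a | G_ij = 0), for a = 0, 1
llr = {0: np.log(beta / (1 - alpha)), 1: np.log((1 - beta) / alpha)}

for a in (0, 1):
    print(f"A_ij = {a}: posterior log odds = {prior_log_odds + llr[a]:+.2f}")
# Neither value is infinite, and neither changes as n grows, so the
# posterior probability of an edge stays bounded away from 0 and 1.
```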
Remark
This argument implies a limitation on link prediction, even if edges are observed without noise: under any such model, $G_{ij}$ is conditionally independent of all the other dyads given $X_i$ and $X_j$, so predicting a held-out dyad from the rest of the graph runs into the same irreducible conditional entropy.

Escaping this
- As I've said, I don't see how switching from a block model to any other graphon would make any difference, so long as the graphon keeps the edge probabilities bounded away from 0 and 1.
- We could drop the graphon assumption. For example, in an ERGM with non-dyadic terms, we might get more information about $ G_{ij} $ from pinning down other dyads. Of course, the asymptotics here get tricky, to put it kindly. In terms of the second argument, this might let us send the prior log-odds for or against a given dyad towards infinity.
- We might drop the independent-noise assumption: $G^{(n)}$ would still be generated by a graphon, but $A^{(n)}$ wouldn't be, perhaps because of correlated noise among the different dyads involving node $i$. I don't immediately see how this would help, but it might let the log likelihood ratio go towards infinity. Even then, however, if each dyad is subject to even a little bit of noise that's independent of what's going on in other dyads, and the magnitude of that noise doesn't go to zero as $n$ grows, we seem to be in for a version of the same trouble.