## Projectivity in Statistical Models

*25 Apr 2022 10:11*

Suppose we consider a sequence of statistical observations, where we keep gathering more and more data. (Perhaps we're running more and more replications of an experiment, or doing larger and larger surveys, or sequentially extending a time-series.) We'll get a sequence of sample spaces, where each one contains all the previous spaces, plus some new variables for the additional information. If we're at the $n^{\mathrm{th}}$ sample space, we can recover an earlier one, $m < n$, by just dropping the extra data. Mathematically, this amounts to "projecting" on to the first $m$ coordinates. Let's write $\pi_{n\mapsto m}$ for the function which does this projection. The inverse of this, $\pi_{n\mapsto m}^{-1}$, will be a set-valued function, i.e., $\pi_{n\mapsto m}^{-1}(a)$ will consist of all size-$n$ data values which would be mapped down to $a$ when we just look at their first $m$ coordinates.
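A minimal computational sketch of the projection and its set-valued inverse, for discrete sample spaces (my own illustration; the function names are made up for exposition):

```python
# Sketch of pi_{n -> m} and its set-valued inverse pi_{n -> m}^{-1},
# for discrete data taking values in a finite set.
from itertools import product

def project(x, m):
    """pi_{n -> m}: keep the first m coordinates of a length-n tuple."""
    return x[:m]

def preimage(a, n, values):
    """pi_{n -> m}^{-1}(a): all length-n tuples whose first len(a)
    coordinates equal a, i.e., all extensions of the smaller data set."""
    m = len(a)
    return [a + tail for tail in product(values, repeat=n - m)]

# Example: binary observations, a = (1, 0) extended to n = 3.
print(preimage((1, 0), 3, [0, 1]))  # [(1, 0, 0), (1, 0, 1)]
```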

Suppose also that we have a sequence of probability distributions, one for
each sample space, say $ P_n $ for the $n^{\mathrm{th}}$ space. Then we say the
distributions are **projective**, or **form a projective
family**, when, for any (measurable) set $A$,
\[
P_m(X_m \in A) = P_n(X_n \in \pi_{n\mapsto m}^{-1}(A))
\]
We also write this as $ P_m = \pi_{n \mapsto m} P_n$.
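As a concrete sanity check, here is a small numerical sketch (my own illustration, not from the original) verifying the projectivity identity for an IID Bernoulli family, where it holds exactly:

```python
# Check that IID Bernoulli(p) satisfies P_m({a}) = P_n(pi_{n->m}^{-1}({a}))
# by brute-force enumeration of the extensions of each small sample.
from itertools import product

def joint_prob(x, p):
    """Probability of a binary tuple under IID Bernoulli(p)."""
    prob = 1.0
    for xi in x:
        prob *= p if xi == 1 else 1 - p
    return prob

p, m, n = 0.3, 2, 4
for a in product([0, 1], repeat=m):
    lhs = joint_prob(a, p)                 # P_m({a})
    rhs = sum(joint_prob(a + tail, p)      # P_n(pi^{-1}({a}))
              for tail in product([0, 1], repeat=n - m))
    assert abs(lhs - rhs) < 1e-12
print("projectivity identity verified for the IID family")
```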

(If you wonder about sample spaces which aren't in a sequence, or different
projections --- what if you wanted to ignore the *first* observation?
--- you can work out how to extend the notation.)

You might think that this is too trivial a property to need a name, let
alone to have to worry about. The point of giving this a name comes from the
Kolmogorov extension theorem: If the $ P_n $ are a projective family for all
finite $n$, then there exists a well-defined probability measure
on *infinite* sequences, of which all the finite-dimensional
distributions are projections.

That's probability theory. The statistical issue comes when we specify
models through their distributions over different sample spaces. Often, in
surveys or in regression, we just give a marginal distribution for samples of
size 1, and say "and we assume the data are IID", which means the joint
distributions over larger samples are products of the one-sample distribution, and
projectivity is automatic. In time series, we often specify the model
in *conditional* form, e.g., "Here's $P(X_t|X_{t-1})$" for a Markov
model, and then again, projectivity is automatic. But it turns out there are
many situations in network data
analysis and relational learning
where we specify models in a way which gives us a $ P_n $ directly for each $n$,
and then it seems to me to be important to know if those specifications are
projective, because otherwise, what on Earth do those distributions even mean?

Alessandro Rinaldo and I were able to give necessary and sufficient conditions for projectivity in exponential families. The conditions have to do with the sufficient statistics of the family, and have to do with how the values of those statistics can be altered by the additional data you get at the larger sample space. (The exact conditions are too algebraic-combinatorial for me to summarize pithily.) Applying those conditions to exponential-family random graph models shows that many popular specifications are not, in fact, projective, so that the distribution they give you on social networks of (say) 2499 people is not what you'd get by summing over networks of 2500 people. (There was important prior art here by Tom Snijders, which we didn't cite because we weren't aware of it, and we should have been.)
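To make the failure concrete, here is a toy brute-force calculation of my own (not the paper's conditions, just an illustration): for a two-parameter edge-and-triangle exponential-family random graph model, the marginal of the 4-node distribution on an induced 3-node subgraph does not match the 3-node distribution at the same parameter values.

```python
# Edge-triangle ERGM by exhaustive enumeration: P(g) proportional to
# exp(th_edge * #edges(g) + th_tri * #triangles(g)).
from itertools import combinations, product
from math import exp

def ergm(n, th_edge, th_tri):
    """Return a dict mapping graphs (frozensets of node pairs) to probabilities."""
    pairs = list(combinations(range(n), 2))
    weights = {}
    for pattern in product([0, 1], repeat=len(pairs)):
        edges = frozenset(p for p, b in zip(pairs, pattern) if b)
        tris = sum(1 for t in combinations(range(n), 3)
                   if all(pq in edges for pq in combinations(t, 2)))
        weights[edges] = exp(th_edge * len(edges) + th_tri * tris)
    z = sum(weights.values())
    return {g: w / z for g, w in weights.items()}

th_edge, th_tri = -0.5, 1.0
p3 = ergm(3, th_edge, th_tri)
p4 = ergm(4, th_edge, th_tri)

# pi_{4 -> 3} P_4: marginalize onto the subgraph induced by nodes {0, 1, 2}.
marginal = {}
for g, pr in p4.items():
    sub = frozenset(e for e in g if max(e) <= 2)
    marginal[sub] = marginal.get(sub, 0.0) + pr

empty = frozenset()
print(p3[empty], marginal[empty])  # these differ: the family is not projective
```

(With the triangle parameter set to zero the model is a Bernoulli graph and the two numbers agree, which matches the IID story above.)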

Some queries I have, to which I am not devoting a lot of time at the moment, but want to keep track of:

- Can the result be extended beyond exponential families to *all* distributions with sufficient statistics? (I've tried showing this, using the Neyman factorization criterion / characterization for sufficiency, and made some headway, but haven't been able to make it work.)
- Can we characterize projective families of models even if they *don't* have sufficient statistics? (This'd be great, but I'd be very surprised.)
- When projectivity fails, so that $ P_m(\cdot;\theta) \neq \pi_{n \mapsto m} P_n(\cdot;\theta) $, there is presumably some least-false effective parameter value at the smaller size, i.e., a $\pi_{n \mapsto m}(\theta)$ (you should excuse the expression) such that $ P_m(\cdot; \pi_{n \mapsto m}(\theta)) $ comes closest to $\pi_{n \mapsto m}(P_n(\cdot; \theta))$, perhaps in Kullback-Leibler divergence. Can we characterize those least-false parameter values?
- When a family isn't projective, might we rescue something useful by seeing if it is, in some sense, *asymptotically* projective? Of course, we'd have to fix exactly what that meant. (Perhaps: the distance between $ P_m $ and $\pi_{n \mapsto m} P_{n}$ is upper-bounded for all $n > m$, and the upper bound is decreasing in $m$ towards 0?)
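On the least-false-parameter query, here is a minimal numerical sketch. Everything in it is my own hypothetical construction, not from the post: a deliberately non-projective toy family on $\{0,1\}^n$ with sufficient statistic $(\sum_i x_i)^2$, and a grid search for the parameter at the smaller size that minimizes the Kullback-Leibler divergence from the marginalized larger-size distribution.

```python
# Grid-search sketch of a "least-false" parameter pi_{n -> m}(theta),
# using a toy non-projective family P_n(x; th) prop. to exp(th * (sum x)^2).
from itertools import product
from math import exp, log

def family(n, th):
    """Toy exponential family on {0,1}^n with statistic (sum x)^2."""
    w = {x: exp(th * sum(x) ** 2) for x in product([0, 1], repeat=n)}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) on a shared discrete support."""
    return sum(pr * log(pr / q[x]) for x, pr in p.items() if pr > 0)

m, n, th = 2, 4, 0.3
p_big = family(n, th)

# pi_{n -> m} P_n: marginalize onto the first m coordinates.
marginal = {}
for x, pr in p_big.items():
    marginal[x[:m]] = marginal.get(x[:m], 0.0) + pr

# Least-false parameter at size m, by grid search over [0, 2].
grid = [t / 100 for t in range(0, 201)]
best_th = min(grid, key=lambda t: kl(marginal, family(m, t)))
print("original theta:", th, "least-false theta at size m:", best_th)
```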

- Recommended:
- Tom A. B. Snijders, "Conditional Marginalization for Exponential Random Graph Models", *Journal of Mathematical Sociology* **34** (2010): 239--252 [Reprint]

- Modesty forbids me to recommend:
- CRS and Alessandro Rinaldo, "Consistency under Sampling of Exponential Random Graph Models", *Annals of Statistics* **41** (2013): 508--535, arxiv:1111.3054 [More]

- To read:
- Manfred Jaeger, Oliver Schulte, "Inference, Learning, and Population Size: Projectivity for SRL Models", arxiv:1807.00564
- Ondřej Kuželka, Yuyi Wang, Jesse Davis and Steven Schockaert, "Relational Marginal Problems: Theory and Estimation", in Proceedings of the 32nd AAAI Conference on Artificial Intelligence [AAAI 2018] [Thanks to reader S.M. for bringing this to my attention in connection to the third of my problems above]