\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\zprime}{z^{\prime}} \newcommand{\Zprime}{Z^{\prime}} \newcommand{\Eta}{H} \newcommand{\equdist}{\stackrel{d}{=}} \newcommand{\indep}{\mathrel{\perp\llap{\perp}}} \]

Attention conservation notice: 2700+ words, expounding a mathematical paper on statistical learning theory. Largely written months ago, posted now in default of actual content.

For the CMU statistical learning theory reading group, I decided to present this:

- Ben London and Bert Huang and Benjamin Taskar and Lise Getoor, "Collective Stability in Structured Prediction: Generalization from One Example", in Sanjoy Dasgupta and David McAllester (eds.), Proceedings of the 30th International Conference on Machine Learning [ICML-13] (2013): 828--836
*Abstract*: Structured predictors enable joint inference over multiple interdependent output variables. These models are often trained on a small number of examples with large internal structure. Existing distribution-free generalization bounds do not guarantee generalization in this setting, though this contradicts a large body of empirical evidence from computer vision, natural language processing, social networks and other fields. In this paper, we identify a set of natural conditions — weak dependence, hypothesis complexity and a new measure, collective stability — that are sufficient for generalization from even a single example, without imposing an explicit generative model of the data. We then demonstrate that the complexity and stability conditions are satisfied by a broad class of models, including marginal inference in templated graphical models. We thus obtain uniform convergence rates that can decrease significantly faster than previous bounds, particularly when each structured example is sufficiently large and the number of training examples is constant, even one.

The question being grappled with here is how we can learn from *one*
example, really from one realization of a stochastic process. Our usual
approach in statistics and machine learning is to assume we have many,
independent examples from the same source. It seems very odd to say that if we
see a single big, internally-dependent example, we're as much in the dark about
the data source and its patterns as if we'd observed a single one-dimensional
measurement, but that's really all a lot of our theory can do for us. Since we
know that animals and machines often *can* successfully learn
generalizable patterns from single realizations, there needs to be some
explanation of how the trick is turned...

This paper is thus relevant to my interests
in dependent
learning, time
series and spatio-temporal data,
and networks.
I read it when it first came out, but I wasn't at all convinced that I'd really
understood it, which was why I volunteered to present this. Given this, I
skipped sections 6 and 7, which specialize from pretty general learning theory
to certain kinds
of graphical
models. It's valuable to show that the assumptions of the general
theory *can* be realized, and by a non-trivial class of models at that,
but they're not really my bag.

At a very high level, the strategy used to prove a generalization-error bound here is fairly familiar in learning theory. Start by establishing a deviation inequality for a single well-behaved function. Then prove that the functions are "stable", in the sense that small changes to their inputs can't alter their outputs too much. The combination of point-wise deviation bounds and stability then yields concentration bounds which hold uniformly over all functions. The innovations are in how this is all made to work when we see one realization of a dependent process.

The data here is an $n$-dimensional vector of random variables, $Z = (Z_1,
Z_2, \ldots Z_n)$ is an $n$-dimensional object. N.B., $n$ here is NOT number
of samples, but the dimensionality of our one example. (I might have preferred
something like $p$ here personally.) We do *not* assume that the $Z_i$
are independent, Markov, exchangeable, stationary, etc., just that $Z$ obeys
some stochastic process or other.

We are interested in functions of the whole of $Z$, $g(Z)$. We're going to
assume that they have a "bounded difference" property: that if $z$ and
$\zprime$ are two realizations of $Z$, which differ in only a single
coordinate, then $|g(z) - g(\zprime)| \leq c/n$ for some $c$ which doesn't care
about *which* constant we perturb.

With this assumption, if the $Z_i$ were IID, the ordinary (McDiarmid) bounded differences inequality would say \[ \Prob{g(Z) - \Expect{g(Z)} \geq \epsilon} \leq \exp{\left\{ -\frac{2n\epsilon^2}{c^2} \right\} } \]

This sort of deviation inequality is the bread-and-butter of IID learning theory, but now we need to make it work under dependence. This needs a probabilistic assumption: changing one coordinate alone can't change the function $f$ too much, but it mustn't also imply changes to many other coordinates.

The way London *et al.* quantify this is to use
the $\eta$-dependence coefficients introduced by
Aryeh "Absolutely
Regular" Kontorovich.
Specifically, pick some ordering of the $Z_i$ variables. Then the
$\eta$-dependence between positions $i$ and $j$ is
\[
\eta_{ij} = \sup_{z_{1:i-1}, z_i, \zprime_i}{{\left\|P\left(Z_{j:n}\middle| Z_{1:i-1}= z_{1:i-1}, Z_i = z_i\right) - P\left(Z_{j:n}\middle| Z_{1:i-1}= z_{1:i-1}, Z_i = \zprime_i\right) \right\|}_{TV}}
\]
I imagine that if you are Aryeh, this is transparent, but the rest of us need to take it apart to see how it works...

Fix $z_{1:i-1}$ for the moment. Then the expression above would say how
much can changing $Z_i$ matter for what happens from $j$ onwards; we might call
it how much *influence* $Z_i$ has, in the context $z_{1:i-1}$. Taking
the supremum over $z_{1:i-1}$ shows how much influence $Z_i$ could have, if we
set things up just right.

Now, for book-keeping, set $\theta_{ij} = \eta_{ij}$ if $i < j$, $=1$ if $i=j$, and $0$ if $i > j$. This lets us say that $\sum_{j=1}^{n}{\theta_{ij}}$ is (roughly) how much influence $Z_i$ could exert over the whole future.

Since we have no reason to pick out a particular $Z_i$, we ask how influential the most influential $Z_i$ could get: \[ \|\Theta_n\|_{\infty} = \max_{i\in 1:n}{\sum_{j=1}^{n}{\theta_{ij}}} \] Because this quantity is important and keeps coming up, while the matrix of $\theta$'s doesn't, I will depart from the paper's notation and give it an abbreviated name, $\Eta_n$.

Now we have the tools to assert Theorem 1 of London et al., which is (as they say) essentially Theorem 1.1 of Kontorovich and Ramanan:

That is, the effective sample size is $n/\Eta_n^2$, rather than $n$, because of the dependence between observations. (We have seen similar deflations of the number of effective observations before, when we looked at mixing, and even in the world's simplest ergodic theorem.) I emphasize that we are not assuming any Markov property/conditional independence for the observations, still less that $Z$ breaks up into independent chucks (as in an $m$-dependent sequence). We aren't even assuming a bound or a growth rate for $\Eta_n$. If $\Eta_n = O(1)$, then for each $i$, $\eta_{ij} \rightarrow 0$ as $j \rightarrow \infty$, and we have what Kontorovich and Ramanan call an $\eta$-mixing process. It is not clear whether this is stronger than, say, $\beta$-mixing. (Two nice questions, though tangential here, are whether $\beta$ mixing would be enough, and, if not whether our estimator of $\beta$-mixing be adapted to get $\eta_{ij}$ coefficients?)Theorem 1:Suppose that $g$ is a real-valued function which has the bounded-differences property with constant $c/n$. Then \[ \Prob{g(Z) - \Expect{g(Z)} \geq \epsilon} \leq \exp{\left\{ -\frac{2n\epsilon^2}{c^2 \Eta_n^2} \right\} } \]

To sum up, if we have just *one* function $f$ with the
bounded-difference property, then we have a deviation inequality: we can bound
how far below its mean it should be. Ultimately the functions we're going to
be concerned with are the combinations of models with a loss function, so we
want to control deviations for not just one model but for a whole model class...

(In fact, at some points in the paper London *et al.* distinguish
between the dimension of the data ($n$) and the dimension of the output vector
($N$). Their core theorems presume $n=N$, but I think one could maintain the
distinction, just at some cost in notational complexity.)

Ordinarily, when people make stability arguments in learning theory, they
have the stability
of *algorithms* in mind: perturbing (or omitting) one data point
should lead to only a small change in the algorithm's output. London
*et al.*, in contrast, are interested in the stability of
*hypotheses*: small tweaks to $z$ should lead to only small changes in
the vector $f(z)$.

Definition. A vector-valued function $f$ iscollectively $\beta$-stableiff, when $z$ and $\zprime$ are off-by-one, then $\| f(z) - f(\zprime) \|_1 \leq \beta$. The function class $\mathcal{F}$ isuniformly collectively $\beta$-stableiff every $f \in \mathcal{F}$ is $\beta$-stable.

Now we need to de-vectorize our functions. (Remember, ultimately we're interested in the loss of models, so it would make sense to average their losses over all the dimensions over which we're making predictions.) For any $f$, set \[ \overline{f}(z) \equiv \frac{1}{n}\sum_{i=1}^{n}{f_i(z)} \]

(In what seems to me a truly unfortunate notational choice, London *et
al.* wrote what I'm calling $\overline{f}(z)$ as $F(z)$, and wrote
$\Expect{\overline{f}(Z)}$ as $\overline{F}$. I, and much of the reading-group
audience, found this confusing, so I'm trying to streamline.)

Now notice that if $\mathcal{F}$ is uniformly $\beta$-stable, if we pick any $f$ in $\mathcal{F}$, its sample average $\overline{f}$ must obey the bounded difference property with constant $\beta/n$. So sample averages of collectively stable functions will obey the deviation bound in Theorem 1.

Can we extend this somehow into a concentration inequality,
a deviation bound that holds *uniformly* over $\mathcal{F}$?

Let's look at the worst case deviation: \[ \Phi(z) = \sup_{f \in \mathcal{F}}{\Expect{\overline{f}(Z)} - \overline{f}(z)} \] (Note: Strictly speaking, $\Phi$ is also a function of $\mathcal{F}$ and $n$, but I am suppressing that in the notation. [The authors included the dependence on $\mathcal{F}$.])

To see why controlling $\Phi$ gives us concentration, start with the fact that, by the definition of $\Phi$, \[ \Expect{\overline{f}(Z)} - \overline{f}(Z) \leq + \Phi(Z) \] so \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + \Phi(Z) \] not just with almost-surely but always. If in turn $\Phi(Z) \leq \Expect{\Phi(Z)} + \epsilon$, at least with high probability, then we've got \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + \Expect{\Phi(Z)} + \epsilon \] with the same probability.

There are many ways one could try to show that $\Phi$ obeys a deviation inequality, but the one which suggests itself in this context is that of showing $\Phi$ has bounded differences. Pick any $z, \zprime$ which differ in just one coordinate. Then \begin{eqnarray*} \left|\Phi(z) - \Phi(\zprime)\right| & = & \left| \sup_{f\in\mathcal{F}}{\left\{ \Expect{\overline{f}(Z)} - \overline{f}(z)\right\}} - \sup_{f\in\mathcal{F}}{\left\{ \Expect{\overline{f}(Z)} - \overline{f}(\zprime)\right\}} \right|\\ & \leq & \left| \sup_{f \in \mathcal{F}}{ \Expect{\overline{f}(Z)} - \overline{f}(z) - \Expect{\overline{f}(Z)} + \overline{f}(\zprime)}\right| ~ \text{(supremum over differences is at least difference in suprema)}\\ & = & \left|\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\zprime) - f_i(z)}}\right| \\ &\leq& \sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{|f_i(\zprime) - f_i(z)|}} ~ \text{(Jensen's inequality)}\\ & = & \frac{1}{n}\sup_{f\in \mathcal{F}}{\|f(\zprime) - f(z)\|_1} ~ \text{(definition of} \ \| \|_1) \\ & \leq & \frac{\beta}{n} ~ \text{(uniform collective stability)} \end{eqnarray*} Thus Theorem 1 applies to $\Phi$: \[ \Prob{\Expect{\Phi(Z)} - \Phi(Z) \geq \epsilon} \leq \exp{\left\{ -\frac{2n\epsilon^2}{\beta^2 \Eta_n^2} \right\}} \] Set the right-hand side to $\delta$ and solve for $\epsilon$: \[ \epsilon = \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \] Then we have, with probability at least $1-\delta$, \[ \Phi(Z) \leq \Expect{\Phi(Z)} + \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \] Hence, with the same probability, uniformly over $f \in \mathcal{F}$, \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + \Expect{\Phi(Z)} + \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \]

Our next step is to replace the expected supremum of the empirical process,
$\Expect{\Phi(Z)}$, with something more tractable and familiar-looking.
Really *any* bound on this could be used, but the authors provide a
particularly nice one, in terms of the Rademacher complexity.

Recall how the Rademacher complexity works when we
have a class $\mathcal{G}$ of scalar-valued function $g$ of an IID sequence
$X_1, \ldots X_n$: it's
\[
\mathcal{R}_n(\mathcal{G}) \equiv \Expect{\sup_{g\in\mathcal{G}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i g(X_i)}}}
\]
where we *introduce* the Rademacher random variables $\sigma_i$, which
are $\pm 1$ with equal probability, independent of each other *and* of
the $X_i$. Since the Rademacher variables are the binary equivalent of white
noise, this measures how well our functions can seem to
correlate with noise, and so how well they can seem to match any damn
thing.

What the authors do in Definition 2 is adapt the definition of Rademacher complexity to their setting in the simplest possible way: \[ \mathcal{R}_n(\mathcal{F}) \equiv \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(Z)}}} \] In the IID version of Rademacher complexity, each summand involves applying the same function ($g$) to a different random variable ($X_i$). Here, in contrast, each summand applies a different function ($f_i$) to the same random vector ($Z$). This second form can of course include the first as a special case.

Now we would like to relate the Rademacher complexity somehow to the expectation of $\Phi$. Let's take a closer look at the definition there: \[ \Expect{\Phi(Z)} = \Expect{\sup_{f\in\mathcal{F}}{\Expect{\overline{f}(Z)} - \overline{f}(Z)}} \] Let's introduce an independent copy of $Z$, say $\Zprime$, i.e., $Z \equdist \Zprime$, $Z\indep \Zprime$. (These are sometimes called "ghost samples".) Then of course $\Expect{\overline{f}(Z)} = \Expect{\overline{f}(\Zprime)}$, so \begin{eqnarray} \nonumber \Expect{\Phi(Z)} & = & \Expect{\sup_{f\in\mathcal{F}}{\Expect{\overline{f}(\Zprime)} - \overline{f}(Z)}} \\ \nonumber & \leq & \Expect{\sup_{f\in\mathcal{F}}{\overline{f}(\Zprime) - \overline{f}(Z)}} ~ \text{(Jensen's inequality again)}\\ \nonumber & = & \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\Zprime) - f_i(Z)}}}\\ & = & \Expect{\Expect{ \sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\Zprime) - f_i(Z)}} \middle| \sigma}} ~ \text{(law of total expectation)} \label{eqn:phi-after-symmetrizing} \end{eqnarray} Look at the summands. No matter what $f_i$ might be, $f_i(\Zprime) - f_i(Z) \equdist f_i(Z) - f_i(\Zprime)$, because $Z$ and $\Zprime$ have the same distribution but are independent. Since multiplying something by $\sigma_i$ randomly flips its sign, this suggests we should be able to introduce $\sigma_i$ terms without changing anything. This is true, but it needs a bit of trickery, because of the (possible) dependence between the different summands. Following the authors, but simplifying the notation a bit, define \[ T_i = \left\{ \begin{array}{cc} Z & \sigma_i = +1\\ \Zprime & \sigma_i = -1 \end{array} \right. ~ , ~ T^{\prime}_i = \left\{ \begin{array}{cc} \Zprime & \sigma_i = +1 \\ Z & \sigma_i = -1 \end{array}\right. \] Now notice that if $\sigma_i = +1$, then \[ f_i(\Zprime) - f_i(Z) = \sigma_i(f_i(\Zprime) - f_i(Z)) = \sigma_i(f_i(T^{\prime}_i) - f_i(T_i)) \] On the other hand, if $\sigma_i = -1$, then \[ f_i(\Zprime) - f_i(Z) = \sigma_i(f_i(Z) - f_i(\Zprime)) = \sigma_i(f_i(T^{\prime}_i) - f_i(T_i)) \] Since $\sigma_i$ is either $+1$ or $-1$, we have \begin{equation} f_i(\Zprime) - f_i(Z) = \sigma_i(f_i(T^{\prime}_i) - f_i(T_i)) \label{eqn:symmetric-difference-in-terms-of-rad-vars} \end{equation} Substituting \eqref{eqn:symmetric-difference-in-terms-of-rad-vars} into \eqref{eqn:phi-after-symmetrizing} \begin{eqnarray*} \Expect{\Phi(Z)} & \leq & \Expect{\Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\Zprime) - f_i(Z)}} \middle| \sigma}} \\ & = & \Expect{\Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i (f_i(T^{\prime}_i) - f_i(T_i))}} \middle | \sigma}} \\ & = & \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i(f_i(\Zprime) - f_i(Z))}}}\\ & \leq & \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(\Zprime)}} + \sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(Z)}}}\\ & = & 2\Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(Z)}}}\\ & = & 2\mathcal{R}_n(\mathcal{F}) \end{eqnarray*}

This is, I think, a very nice way to show that Rademacher complexity still
controls over-fitting with dependent data. (This result in fact subsumes our
result in
arxiv:1106.0730, and London *et al.* have, I think, a more elegant proof.)

Now we put everything together.

Suppose that $\mathcal{F}$ is uniformly collectively $\beta$-stable. Then with probability at least $1-\delta$, uniformly over $f \in \mathcal{F}$, \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + 2\mathcal{R}_n(\mathcal{F}) + \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \]This is not quite the Theorem 2 of London

That message, as promised in the abstract, has three parts. The three conditions which are jointly sufficient to allow generalization from a single big, inter-dependent instance are:

- An isolated change to one part of the instance doesn't change the predictions very much (collective stability, $\beta$ exists and is small);
- Very distant parts of the instance are nearly independent ($\eta$-mixing, $\Eta_n = O(1)$); and
- Our hypothesis class isn't so flexible it could seem to fit any damn thing (shrinking Rademacher complexity, $\mathcal{R}_n \rightarrow 0$).

I suspect this trio of conditions is not jointly *necessary* as well,
but that's very much a topic for the future. I
also have some thoughts about
whether, with dependent data, we *really* want to control
$\Expect{\overline{f}(Z)}$, or rather whether the goal shouldn't be something
else, but that'll take another post.

Posted at June 22, 2014 10:54 | permanent link