June 22, 2014

Notes on "Collective Stability in Structured Prediction: Generalization from One Example" (or: Small Pieces, Loosely Joined)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\zprime}{z^{\prime}} \newcommand{\Zprime}{Z^{\prime}} \newcommand{\Eta}{H} \newcommand{\equdist}{\stackrel{d}{=}} \newcommand{\indep}{\mathrel{\perp\llap{\perp}}} \]

Attention conservation notice: 2700+ words, expounding a mathematical paper on statistical learning theory. Largely written months ago, posted now in default of actual content.

For the CMU statistical learning theory reading group, I decided to present this:

Ben London and Bert Huang and Benjamin Taskar and Lise Getoor, "Collective Stability in Structured Prediction: Generalization from One Example", in Sanjoy Dasgupta and David McAllester (eds.), Proceedings of the 30th International Conference on Machine Learning [ICML-13] (2013): 828--836
Abstract: Structured predictors enable joint inference over multiple interdependent output variables. These models are often trained on a small number of examples with large internal structure. Existing distribution-free generalization bounds do not guarantee generalization in this setting, though this contradicts a large body of empirical evidence from computer vision, natural language processing, social networks and other fields. In this paper, we identify a set of natural conditions — weak dependence, hypothesis complexity and a new measure, collective stability — that are sufficient for generalization from even a single example, without imposing an explicit generative model of the data. We then demonstrate that the complexity and stability conditions are satisfied by a broad class of models, including marginal inference in templated graphical models. We thus obtain uniform convergence rates that can decrease significantly faster than previous bounds, particularly when each structured example is sufficiently large and the number of training examples is constant, even one.

The question being grappled with here is how we can learn from one example, really from one realization of a stochastic process. Our usual approach in statistics and machine learning is to assume we have many, independent examples from the same source. It seems very odd to say that if we see a single big, internally-dependent example, we're as much in the dark about the data source and its patterns as if we'd observed a single one-dimensional measurement, but that's really all a lot of our theory can do for us. Since we know that animals and machines often can successfully learn generalizable patterns from single realizations, there needs to be some explanation of how the trick is turned...

This paper is thus relevant to my interests in dependent learning, time series and spatio-temporal data, and networks. I read it when it first came out, but I wasn't at all convinced that I'd really understood it, which was why I volunteered to present this. Given this, I skipped sections 6 and 7, which specialize from pretty general learning theory to certain kinds of graphical models. It's valuable to show that the assumptions of the general theory can be realized, and by a non-trivial class of models at that, but they're not really my bag.

At a very high level, the strategy used to prove a generalization-error bound here is fairly familiar in learning theory. Start by establishing a deviation inequality for a single well-behaved function. Then prove that the functions are "stable", in the sense that small changes to their inputs can't alter their outputs too much. The combination of point-wise deviation bounds and stability then yields concentration bounds which hold uniformly over all functions. The innovations are in how this is all made to work when we see one realization of a dependent process.

Weak Dependence and a Point-wise Deviation Bound

The data here is an $n$-dimensional vector of random variables, $Z = (Z_1, Z_2, \ldots Z_n)$ is an $n$-dimensional object. N.B., $n$ here is NOT number of samples, but the dimensionality of our one example. (I might have preferred something like $p$ here personally.) We do not assume that the $Z_i$ are independent, Markov, exchangeable, stationary, etc., just that $Z$ obeys some stochastic process or other.

We are interested in functions of the whole of $Z$, $g(Z)$. We're going to assume that they have a "bounded difference" property: that if $z$ and $\zprime$ are two realizations of $Z$, which differ in only a single coordinate, then $|g(z) - g(\zprime)| \leq c/n$ for some $c$ which doesn't care about which constant we perturb.

With this assumption, if the $Z_i$ were IID, the ordinary (McDiarmid) bounded differences inequality would say \[ \Prob{g(Z) - \Expect{g(Z)} \geq \epsilon} \leq \exp{\left\{ -\frac{2n\epsilon^2}{c^2} \right\} } \]

This sort of deviation inequality is the bread-and-butter of IID learning theory, but now we need to make it work under dependence. This needs a probabilistic assumption: changing one coordinate alone can't change the function $f$ too much, but it mustn't also imply changes to many other coordinates.

The way London et al. quantify this is to use the $\eta$-dependence coefficients introduced by Aryeh "Absolutely Regular" Kontorovich. Specifically, pick some ordering of the $Z_i$ variables. Then the $\eta$-dependence between positions $i$ and $j$ is \[ \eta_{ij} = \sup_{z_{1:i-1}, z_i, \zprime_i}{{\left\|P\left(Z_{j:n}\middle| Z_{1:i-1}= z_{1:i-1}, Z_i = z_i\right) - P\left(Z_{j:n}\middle| Z_{1:i-1}= z_{1:i-1}, Z_i = \zprime_i\right) \right\|}_{TV}} \] I imagine that if you are Aryeh, this is transparent, but the rest of us need to take it apart to see how it works...

Fix $z_{1:i-1}$ for the moment. Then the expression above would say how much can changing $Z_i$ matter for what happens from $j$ onwards; we might call it how much influence $Z_i$ has, in the context $z_{1:i-1}$. Taking the supremum over $z_{1:i-1}$ shows how much influence $Z_i$ could have, if we set things up just right.

Now, for book-keeping, set $\theta_{ij} = \eta_{ij}$ if $i < j$, $=1$ if $i=j$, and $0$ if $i > j$. This lets us say that $\sum_{j=1}^{n}{\theta_{ij}}$ is (roughly) how much influence $Z_i$ could exert over the whole future.

Since we have no reason to pick out a particular $Z_i$, we ask how influential the most influential $Z_i$ could get: \[ \|\Theta_n\|_{\infty} = \max_{i\in 1:n}{\sum_{j=1}^{n}{\theta_{ij}}} \] Because this quantity is important and keeps coming up, while the matrix of $\theta$'s doesn't, I will depart from the paper's notation and give it an abbreviated name, $\Eta_n$.

Now we have the tools to assert Theorem 1 of London et al., which is (as they say) essentially Theorem 1.1 of Kontorovich and Ramanan:

Theorem 1: Suppose that $g$ is a real-valued function which has the bounded-differences property with constant $c/n$. Then \[ \Prob{g(Z) - \Expect{g(Z)} \geq \epsilon} \leq \exp{\left\{ -\frac{2n\epsilon^2}{c^2 \Eta_n^2} \right\} } \]
That is, the effective sample size is $n/\Eta_n^2$, rather than $n$, because of the dependence between observations. (We have seen similar deflations of the number of effective observations before, when we looked at mixing, and even in the world's simplest ergodic theorem.) I emphasize that we are not assuming any Markov property/conditional independence for the observations, still less that $Z$ breaks up into independent chucks (as in an $m$-dependent sequence). We aren't even assuming a bound or a growth rate for $\Eta_n$. If $\Eta_n = O(1)$, then for each $i$, $\eta_{ij} \rightarrow 0$ as $j \rightarrow \infty$, and we have what Kontorovich and Ramanan call an $\eta$-mixing process. It is not clear whether this is stronger than, say, $\beta$-mixing. (Two nice questions, though tangential here, are whether $\beta$ mixing would be enough, and, if not whether our estimator of $\beta$-mixing be adapted to get $\eta_{ij}$ coefficients?)

To sum up, if we have just one function $f$ with the bounded-difference property, then we have a deviation inequality: we can bound how far below its mean it should be. Ultimately the functions we're going to be concerned with are the combinations of models with a loss function, so we want to control deviations for not just one model but for a whole model class...

Vectorized Functions and Collective Stability

In a lot of contexts with structured data, we might want to make a prediction (assign a label, take an action) for each component of $Z$. If $Z$ is an image, for instance, and we're doing image segmentation, we might want to say which segment each pixel is in. If $Z$ is text, we might want to assign each word to a part of speech. If $Z$ is a social network, we might want to categorize each node (or edge) in some way. We might also want to output probability distributions over categories, rather than making a hard choice of category. So we will now consider functions $f$ which map $Z$ to $\mathcal{Y}^n$, where $\mathcal{Y}$ is some suitable space of predictions or actions. In other words, our functions output vectors.

(In fact, at some points in the paper London et al. distinguish between the dimension of the data ($n$) and the dimension of the output vector ($N$). Their core theorems presume $n=N$, but I think one could maintain the distinction, just at some cost in notational complexity.)

Ordinarily, when people make stability arguments in learning theory, they have the stability of algorithms in mind: perturbing (or omitting) one data point should lead to only a small change in the algorithm's output. London et al., in contrast, are interested in the stability of hypotheses: small tweaks to $z$ should lead to only small changes in the vector $f(z)$.

Definition. A vector-valued function $f$ is collectively $\beta$-stable iff, when $z$ and $\zprime$ are off-by-one, then $\| f(z) - f(\zprime) \|_1 \leq \beta$. The function class $\mathcal{F}$ is uniformly collectively $\beta$-stable iff every $f \in \mathcal{F}$ is $\beta$-stable.

Now we need to de-vectorize our functions. (Remember, ultimately we're interested in the loss of models, so it would make sense to average their losses over all the dimensions over which we're making predictions.) For any $f$, set \[ \overline{f}(z) \equiv \frac{1}{n}\sum_{i=1}^{n}{f_i(z)} \]

(In what seems to me a truly unfortunate notational choice, London et al. wrote what I'm calling $\overline{f}(z)$ as $F(z)$, and wrote $\Expect{\overline{f}(Z)}$ as $\overline{F}$. I, and much of the reading-group audience, found this confusing, so I'm trying to streamline.)

Now notice that if $\mathcal{F}$ is uniformly $\beta$-stable, if we pick any $f$ in $\mathcal{F}$, its sample average $\overline{f}$ must obey the bounded difference property with constant $\beta/n$. So sample averages of collectively stable functions will obey the deviation bound in Theorem 1.

Stability of the Worst-Case Deviation

Can we extend this somehow into a concentration inequality, a deviation bound that holds uniformly over $\mathcal{F}$?

Let's look at the worst case deviation: \[ \Phi(z) = \sup_{f \in \mathcal{F}}{\Expect{\overline{f}(Z)} - \overline{f}(z)} \] (Note: Strictly speaking, $\Phi$ is also a function of $\mathcal{F}$ and $n$, but I am suppressing that in the notation. [The authors included the dependence on $\mathcal{F}$.])

To see why controlling $\Phi$ gives us concentration, start with the fact that, by the definition of $\Phi$, \[ \Expect{\overline{f}(Z)} - \overline{f}(Z) \leq + \Phi(Z) \] so \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + \Phi(Z) \] not just with almost-surely but always. If in turn $\Phi(Z) \leq \Expect{\Phi(Z)} + \epsilon$, at least with high probability, then we've got \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + \Expect{\Phi(Z)} + \epsilon \] with the same probability.

There are many ways one could try to show that $\Phi$ obeys a deviation inequality, but the one which suggests itself in this context is that of showing $\Phi$ has bounded differences. Pick any $z, \zprime$ which differ in just one coordinate. Then \begin{eqnarray*} \left|\Phi(z) - \Phi(\zprime)\right| & = & \left| \sup_{f\in\mathcal{F}}{\left\{ \Expect{\overline{f}(Z)} - \overline{f}(z)\right\}} - \sup_{f\in\mathcal{F}}{\left\{ \Expect{\overline{f}(Z)} - \overline{f}(\zprime)\right\}} \right|\\ & \leq & \left| \sup_{f \in \mathcal{F}}{ \Expect{\overline{f}(Z)} - \overline{f}(z) - \Expect{\overline{f}(Z)} + \overline{f}(\zprime)}\right| ~ \text{(supremum over differences is at least difference in suprema)}\\ & = & \left|\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\zprime) - f_i(z)}}\right| \\ &\leq& \sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{|f_i(\zprime) - f_i(z)|}} ~ \text{(Jensen's inequality)}\\ & = & \frac{1}{n}\sup_{f\in \mathcal{F}}{\|f(\zprime) - f(z)\|_1} ~ \text{(definition of} \ \| \|_1) \\ & \leq & \frac{\beta}{n} ~ \text{(uniform collective stability)} \end{eqnarray*} Thus Theorem 1 applies to $\Phi$: \[ \Prob{\Expect{\Phi(Z)} - \Phi(Z) \geq \epsilon} \leq \exp{\left\{ -\frac{2n\epsilon^2}{\beta^2 \Eta_n^2} \right\}} \] Set the right-hand side to $\delta$ and solve for $\epsilon$: \[ \epsilon = \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \] Then we have, with probability at least $1-\delta$, \[ \Phi(Z) \leq \Expect{\Phi(Z)} + \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \] Hence, with the same probability, uniformly over $f \in \mathcal{F}$, \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + \Expect{\Phi(Z)} + \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \]

Rademacher Complexity

Our next step is to replace the expected supremum of the empirical process, $\Expect{\Phi(Z)}$, with something more tractable and familiar-looking. Really any bound on this could be used, but the authors provide a particularly nice one, in terms of the Rademacher complexity.

Recall how the Rademacher complexity works when we have a class $\mathcal{G}$ of scalar-valued function $g$ of an IID sequence $X_1, \ldots X_n$: it's \[ \mathcal{R}_n(\mathcal{G}) \equiv \Expect{\sup_{g\in\mathcal{G}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i g(X_i)}}} \] where we introduce the Rademacher random variables $\sigma_i$, which are $\pm 1$ with equal probability, independent of each other and of the $X_i$. Since the Rademacher variables are the binary equivalent of white noise, this measures how well our functions can seem to correlate with noise, and so how well they can seem to match any damn thing.

What the authors do in Definition 2 is adapt the definition of Rademacher complexity to their setting in the simplest possible way: \[ \mathcal{R}_n(\mathcal{F}) \equiv \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(Z)}}} \] In the IID version of Rademacher complexity, each summand involves applying the same function ($g$) to a different random variable ($X_i$). Here, in contrast, each summand applies a different function ($f_i$) to the same random vector ($Z$). This second form can of course include the first as a special case.

Now we would like to relate the Rademacher complexity somehow to the expectation of $\Phi$. Let's take a closer look at the definition there: \[ \Expect{\Phi(Z)} = \Expect{\sup_{f\in\mathcal{F}}{\Expect{\overline{f}(Z)} - \overline{f}(Z)}} \] Let's introduce an independent copy of $Z$, say $\Zprime$, i.e., $Z \equdist \Zprime$, $Z\indep \Zprime$. (These are sometimes called "ghost samples".) Then of course $\Expect{\overline{f}(Z)} = \Expect{\overline{f}(\Zprime)}$, so \begin{eqnarray} \nonumber \Expect{\Phi(Z)} & = & \Expect{\sup_{f\in\mathcal{F}}{\Expect{\overline{f}(\Zprime)} - \overline{f}(Z)}} \\ \nonumber & \leq & \Expect{\sup_{f\in\mathcal{F}}{\overline{f}(\Zprime) - \overline{f}(Z)}} ~ \text{(Jensen's inequality again)}\\ \nonumber & = & \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\Zprime) - f_i(Z)}}}\\ & = & \Expect{\Expect{ \sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\Zprime) - f_i(Z)}} \middle| \sigma}} ~ \text{(law of total expectation)} \label{eqn:phi-after-symmetrizing} \end{eqnarray} Look at the summands. No matter what $f_i$ might be, $f_i(\Zprime) - f_i(Z) \equdist f_i(Z) - f_i(\Zprime)$, because $Z$ and $\Zprime$ have the same distribution but are independent. Since multiplying something by $\sigma_i$ randomly flips its sign, this suggests we should be able to introduce $\sigma_i$ terms without changing anything. This is true, but it needs a bit of trickery, because of the (possible) dependence between the different summands. Following the authors, but simplifying the notation a bit, define \[ T_i = \left\{ \begin{array}{cc} Z & \sigma_i = +1\\ \Zprime & \sigma_i = -1 \end{array} \right. ~ , ~ T^{\prime}_i = \left\{ \begin{array}{cc} \Zprime & \sigma_i = +1 \\ Z & \sigma_i = -1 \end{array}\right. \] Now notice that if $\sigma_i = +1$, then \[ f_i(\Zprime) - f_i(Z) = \sigma_i(f_i(\Zprime) - f_i(Z)) = \sigma_i(f_i(T^{\prime}_i) - f_i(T_i)) \] On the other hand, if $\sigma_i = -1$, then \[ f_i(\Zprime) - f_i(Z) = \sigma_i(f_i(Z) - f_i(\Zprime)) = \sigma_i(f_i(T^{\prime}_i) - f_i(T_i)) \] Since $\sigma_i$ is either $+1$ or $-1$, we have \begin{equation} f_i(\Zprime) - f_i(Z) = \sigma_i(f_i(T^{\prime}_i) - f_i(T_i)) \label{eqn:symmetric-difference-in-terms-of-rad-vars} \end{equation} Substituting \eqref{eqn:symmetric-difference-in-terms-of-rad-vars} into \eqref{eqn:phi-after-symmetrizing} \begin{eqnarray*} \Expect{\Phi(Z)} & \leq & \Expect{\Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\Zprime) - f_i(Z)}} \middle| \sigma}} \\ & = & \Expect{\Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i (f_i(T^{\prime}_i) - f_i(T_i))}} \middle | \sigma}} \\ & = & \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i(f_i(\Zprime) - f_i(Z))}}}\\ & \leq & \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(\Zprime)}} + \sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(Z)}}}\\ & = & 2\Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(Z)}}}\\ & = & 2\mathcal{R}_n(\mathcal{F}) \end{eqnarray*}

This is, I think, a very nice way to show that Rademacher complexity still controls over-fitting with dependent data. (This result in fact subsumes our result in arxiv:1106.0730, and London et al. have, I think, a more elegant proof.)

Collective Stability and Generalizing from One Big Example

Now we put everything together.

Suppose that $\mathcal{F}$ is uniformly collectively $\beta$-stable. Then with probability at least $1-\delta$, uniformly over $f \in \mathcal{F}$, \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + 2\mathcal{R}_n(\mathcal{F}) + \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \]
This is not quite the Theorem 2 of London et al., because they go through some additional steps to relate the collective stability of predictions to the collective stability of loss functions, but at this point I think the message is clear.

That message, as promised in the abstract, has three parts. The three conditions which are jointly sufficient to allow generalization from a single big, inter-dependent instance are:

  1. An isolated change to one part of the instance doesn't change the predictions very much (collective stability, $\beta$ exists and is small);
  2. Very distant parts of the instance are nearly independent ($\eta$-mixing, $\Eta_n = O(1)$); and
  3. Our hypothesis class isn't so flexible it could seem to fit any damn thing (shrinking Rademacher complexity, $\mathcal{R}_n \rightarrow 0$).

I suspect this trio of conditions is not jointly necessary as well, but that's very much a topic for the future. I also have some thoughts about whether, with dependent data, we really want to control $\Expect{\overline{f}(Z)}$, or rather whether the goal shouldn't be something else, but that'll take another post.

Enigmas of Chance

Posted at June 22, 2014 10:54 | permanent link

Three-Toed Sloth