Attention conservation notice: 2200+ words expounding on some notions from statistical theory about over-fitting, plus describing a paper on some experiments trying to apply these ideas to UW-Madison undergrads, concluding with some speculative nihilism about common forms of trying to understand social life, going far beyond anything actually supported by either the general theory or the psychological experiments. Contains equations, ugly graphs, and long quotations from an academic paper. Mostly written several years ago, retrieved from the drafts folder for lack of time to write anything new.
"Capacity control" is a central idea for statistical learning. The great challenge of devising a reliable learning procedure is avoiding over-fitting — getting models to memorizing apparent features of the training data which are really just noise, and will not generalize to new data. The "capacity" of a class of models is, roughly speaking, how many different data sets it can (seem to) fit well. High-capacity model classes are more flexible, but also more prone to over-fitting. In particular — a point insufficiently appreciated by many users — getting a really good fit in-sample from a high capacity model class is not strong evidence that the trained model will generalize well. Capacity control is the art of using models that are flexible enough to get good fits, without using ones which are so flexible that they over-fit.
There are various ways of shaping up this rough notion of "capacity" into something more quantitative, which we can work into theorems and practical tools. One nice one is what's called "Rademacher complexity", which goes as follows. Our models are just simple input-output functions; we get an input \( x \), we apply a particular model/function \( f \) to get a prediction \( f(x) \), and we want this to be close to the actual output \( y \). The Rademacher complexity of the class of models \( F \) is \[ R_n(F) \equiv \mathbf{E}_{X}\left[\mathbf{E}_{Z}\left[ \sup_{f\in F}{\left|\frac{1}{n} \sum_{i=1}^{n}{Z_i f(X_i) } \right|} \right]\right] \] where the \( Z_i \) are a sequence of random variables, independent of each other and everything else, and equal to +1 or -1 with equal probability, a.k.a. "Rademacher random variables". (It's customary to include an over-all factor of 2 in this definition, but I have my reasons for leaving it out here.)
To figure out what's going on here, start with the inner-most bit: \[ \left|\frac{1}{n} \sum_{i=1}^{n}{Z_i f(X_i) } \right| \] This is the magnitude of the sample covariance between a particular realization of the binary noise \( Z \), and the predictions of a particular model \( f \) on a particular set of input values \( X \). Next, holding fixed the noise and the inputs, we ask how big that covariance could appear to get over the whole model class: \[ \sup_{f\in F}{\left|\frac{1}{n} \sum_{i=1}^{n}{Z_i f(X_i)} \right|} \] Then we average over realizations of the binary noise: \[ \mathbf{E}_{Z}\left[ \sup_{f\in F}{\left|\frac{1}{n} \sum_{i=1}^{n}{Z_i f(X_i)} \right|} \right] \] This tells us, holding the inputs fixed, how well our model class would typically seem to predict binary noise. This is sometimes called the "empirical Rademacher complexity", because it's conditional on a particular training sample. Finally, we average over input training data. Generally, doing that last step as an actual calculation is hard, but you can show that the probability of the empirical Rademacher complexity fluctuating away from its expected value goes to zero exponentially fast in the sample size \( n \), so we can typically avoid the expectation over \( X \) at a small cost in approximation.
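(For readers who think better in code: here is a little sketch of my own, not anything from the paper discussed below, which Monte-Carlo-estimates the empirical Rademacher complexity of a toy class of threshold classifiers on a fixed sample. The class, the sample size, and all the names are purely illustrative, and the sup over \( F \) is approximated by a max over an explicit, finite enumeration of the class.)

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(predictions, n_noise_draws=2000, rng=rng):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    `predictions` is an (m, n) array: row j holds one model's outputs
    f_j(X_1), ..., f_j(X_n) on the fixed training inputs, so the sup over
    the class F is approximated by a max over the m rows.
    """
    _, n = predictions.shape
    total = 0.0
    for _ in range(n_noise_draws):
        z = rng.choice([-1.0, 1.0], size=n)           # Rademacher noise Z_1..Z_n
        total += np.max(np.abs(predictions @ z) / n)  # sup_f |(1/n) sum_i Z_i f(X_i)|
    return total / n_noise_draws                      # average over noise draws

# Toy class: threshold classifiers f_t(x) = sign(x - t), on n random inputs.
n = 50
x = rng.uniform(size=n)
thresholds = np.linspace(0.0, 1.0, 101)
preds = np.sign(x[None, :] - thresholds[:, None])
preds[preds == 0] = 1.0
print(empirical_rademacher(preds))  # shrinks toward zero as n grows
```

Averaging this over many fresh draws of the inputs would give an estimate of the full Rademacher complexity; as the concentration argument above suggests, with any reasonable \( n \) the answer barely changes from one sample to the next.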
Now, if the class \( F \) just contained a single function, the business with the supremum wouldn't matter, and the Rademacher complexity would just be the expected magnitude of the sample covariance of the function with the noise. Then \( R_n(F) \) has to go to zero as \( n \) grows, because, by the law of large numbers, the sample covariance is converging to the true covariance, which is zero. In fact, if \( F \) contains only finitely many functions, \( R_n(F) \) has to go to zero, because all the sample covariances become, with arbitrarily high probability, arbitrarily close to the true covariance, which is zero. But if there are infinitely many functions in \( F \), there might be trouble. How much trouble depends, precisely, on how many different sets of outputs the class can produce on typical inputs.
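(Again in code, and again just my own illustration rather than anything from the paper: for a single, fixed binary classifier, the magnitude of its sample covariance with the Rademacher noise dies off like \( 1/\sqrt{n} \), which is the law-of-large-numbers point just made.)

```python
import numpy as np

rng = np.random.default_rng(1)

# One fixed binary classifier: f(x) = +1 if x > 0.5, else -1.
def f(x):
    return np.where(x > 0.5, 1.0, -1.0)

for n in [10, 100, 1000, 10000]:
    # Average, over many draws of inputs and noise, of |(1/n) sum_i Z_i f(X_i)|.
    vals = []
    for _ in range(500):
        x = rng.uniform(size=n)
        z = rng.choice([-1.0, 1.0], size=n)
        vals.append(abs(np.mean(z * f(x))))
    print(n, np.mean(vals))  # decays roughly like 1/sqrt(n)
```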
Why this is a useful way of quantifying model capacity is a bit more involved than I feel like explaining, but the intuition, hopefully, is clear: a class of models which could give a good fit to pure noise is a class of models ripe for over-fitting. If on the other hand the Rademacher complexity of your class of models is low, and you nonetheless got a good fit to your training data, that is actually evidence that you will continue to fit well on new data. There's a simple bound if the data are independent draws from a constant (but otherwise arbitrary) distribution, the functions \( f \) also have to output +1 or -1 (i.e., are binary classifiers), and we measure error by the probability of mis-classification. With probability at least \( 1-h \) under the true data-generating distribution, for all \( f \), \[ \mathrm{error}(f) - \widehat{\mathrm{error}}(f) \leq R_n(F) + \sqrt{\frac{\ln{1/h}}{2n}} \] The left-hand side of this bound is the amount by which the true error rate exceeds the in-sample error rate, i.e., the amount by which the model over-fits. The right-hand side combines the Rademacher complexity, which says how well the model class can seem to fit noise, plus a term which grows as we demand a higher level of confidence (smaller \( h \)). The lower the Rademacher complexity, the tighter the bounds we can put on the amount of over-fitting. Turned around, the probability that the amount of over-fitting exceeds the Rademacher complexity by any given amount is exponentially small.
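(Plugging numbers into this bound is a one-liner; the helper below is hypothetical and only there to fix ideas. Note that it uses the factor-of-2-free definition of \( R_n(F) \) adopted above.)

```python
import numpy as np

def overfit_bound(rademacher, n, h):
    """Upper bound on error(f) minus in-sample error(f), holding
    simultaneously for all f in F with probability at least 1 - h
    (Rademacher complexity defined without the customary factor of 2,
    as in the text)."""
    return rademacher + np.sqrt(np.log(1.0 / h) / (2.0 * n))

# For example, with R_n(F) = 0.1, n = 40 training points, and 95% confidence:
print(overfit_bound(0.1, 40, 0.05))  # about 0.29
```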
These bounds make no reference to the actual learning process — they just depend on the kind of models available, and on how much flexibility those models have when presented with typical data from the distribution. The benefit of this is that we don't have to worry about how our learner works, just about the kind of things it can learn. The drawback is that the bounds might be too conservative — maybe this learning procedure, on this sort of data, is much less prone to over-fitting than the bounds indicate.
All of which brings us to today's paper:

Xiaojin Zhu, Timothy T. Rogers and Bryan R. Gibson, "Human Rademacher Complexity", in Advances in Neural Information Processing Systems 22 (NIPS 2009)
The procedure for estimating the Rademacher complexity of a human learner is ingenious. They gave each subject \( n \) labeled examples drawn from some domain, but with labels ("A" and "B" rather than -1 and +1) which were assigned completely randomly --- that is, the labels were Rademacher noise. After studying the examples, the subjects were distracted by being asked to do some arithmetic problems involving 10-digit numbers, and then given their \( n \) training examples again, but unlabeled, in a different, randomized order, and without being told these were the same examples. The subjects were asked to classify them, and the empirical Rademacher complexity for each subject was taken to be the sample covariance between guesses and labels; this was then averaged over subjects.
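(A sketch of the bookkeeping as I understand it from their description; the \( \pm 1 \) coding of the "A"/"B" responses and the function name are my own, purely illustrative choices.)

```python
import numpy as np

def subject_complexity(labels, guesses):
    """One subject's contribution to the empirical Rademacher complexity:
    |(1/n) sum_i z_i f_i|, with the random training labels z and the
    subject's later guesses f on the same (shuffled) items both coded
    as +1 ("A") / -1 ("B"), following the formula above."""
    to_pm1 = lambda s: np.where(np.asarray(s) == "A", 1.0, -1.0)
    z, f = to_pm1(labels), to_pm1(guesses)
    return abs(np.mean(z * f))

# A subject who reproduces all n = 5 random labels perfectly scores 1.0;
# one guessing at random scores about 1/sqrt(n) on average.
print(subject_complexity(["A", "B", "B", "A", "A"],
                         ["A", "B", "B", "A", "A"]))  # 1.0
```

Averaging these per-subject numbers then gives the estimated human empirical Rademacher complexity for that domain and sample size.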
This task was inflicted on 80 undergrads (presumably taking Intro. Psych. at Madison), with two different domains. Here I'll just quote:
The "Shape" Domain ... consists of 321 computer-generated 3D shapes ... parametrized by a real number \( x \in [0, 1] \), such that small \( x \) produces spiky shapes, while large \( x \) produces smooth ones.... It is important to note that this internal structure is unnecessary to the definition or measurement of Rademacher complexity per se. .... The "Word" Domain ... consists of 321 English words. We start with the Wisconsin Perceptual Attribute Ratings Database ... which contains words rated by 350 undergraduates for their emotional valence. We sort the words by their emotion valence, and take the 161 most positive and the 160 most negative ones for use in the study. ... Participants have rich knowledge about this domain. The size of the domain for shapes and words was matched to facilitate comparison.
While I'm at it, here's Figure 1 (the error bars being 95% confidence intervals):
As they note, in both domains, Rademacher complexity shrinks as \( n \) grows, so learning should be possible.
when \( n = 5 \), our interviews show that, in both domains, 9 out of 10 participants offered some spurious rules of the random labels. For example, one subject thought the shape categories were determined by whether the shape "faces" downward; another thought the word categories indicated whether the word contains the letter T. Such beliefs, though helpful in learning the particular training samples, amount to over-fitting the noise. In contrast, when \( n = 40 \), about half the participants indicated that they believed the labels to be random, as spurious "rules" are more difficult to find.
Also, the students had higher Rademacher complexity for words, with which they are very familiar, than for strange synthetic shapes:
The higher complexity indicates that, for the same sample sizes, participants are better able to find spurious explanations of the training data for the Words than for the Shapes. Two distinct strategies were apparent in the Word domain interviews: (i) Some participants created mnemonics. For example, one subject received the training sample (grenade, B), (skull, A), (conflict, A), (meadow, B), (queen, B), and came up with the following story: "a queen was sitting in a meadow and then a grenade was thrown (B = before), then this started a conflict ending in bodies & skulls (A = after)." (ii) Other participants came up with idiosyncratic, but often imperfect, rules. For instance, whether the item "tastes good," "relates to motel service," or "physical vs. abstract."
Having measured the Rademacher complexity of Madison undergrads, Zhu et al. then took the same domains and gave the students some actual prediction problems where generalization was possible: separating spiky from not-so-spiky shapes, separating shapes of medium spikiness from those which are either very spiky or very smooth, separating positive-valence words from those of negative valence, and separating words which are over five letters long from shorter words. (They did not tell the students that these were the categories, of course.) This lets them compare the actual generalization performance --- i.e., the error rate on new data, not seen during training --- with the bounds implied by the Rademacher complexity.
It is not interesting that the humans' generalization error is lower than the theoretical bound on the generalization error; that goes along with being an upper bound. What's interesting is that Rademacher complexity nonetheless predicts how much over-fitting the students will do. It's not surprising that they do better at guessing the rules when they've seen 40 examples than when they've seen 5, but that seeing 40 word examples leads to more error than seeing 40 shapes is remarkable. What would be very nice, but doesn't appear here, would be experiments where domain, distribution and sample size were all different, but adjusted so that the human Rademacher complexity was equal, which would let us tease out how much that complexity as such predicts human learning.
While we await such experiments, I will speculate irresponsibly. I have long been skeptical about the sort of journalism or scholarship which looks at some trend and purports to tell us what it Really Means — often, in the more scholarly versions, the Deep Social Forces behind it, or the Hidden Ideological Motives. (See, e.g., the "Trees" section here.) Even when the trend actually exists, this seems to me far too vulnerable to Just So story-telling, in that the explainers could provide equally compelling stories no matter what actually happened. To put it in terms of this paper, everyone is really, really familiar with the social life of the East African Plains Ape, yet we have very few truly independent historical examples, so our capacity to spin such stories is very large and \( n \) is very small, so \( R_n \) has to be very large, and none of our stories is going to generalize well. This is also part of what Duncan Watts is saying in Everything Is Obvious, Once You Know the Answer, about how, once they hear social-scientific findings, people find it easy to rationalize them as obvious --- but would be equally capable of coming up with equally-convincing rationalizations for the opposite findings. Hence the complete lack of substance in most of what passes for commentary on our common life. (Thomas Friedman is clearly a queen-in-a-meadow-with-grenades-and-skulls guy, while David Brooks is more "relates to motel service".) Some of these stories may even be right, but pointing to how compellingly they explain everything we see is actually not very good evidence for them.
Update, 13 November 2014: updated NIPS URLs.
Posted at July 25, 2012 14:25 | permanent link