Learning Your Way to Maximum Power

Attention conservation notice: 2300 words about a paper other people wrote on learning theory and hypothesis testing. Mostly written last year as part of a never-used handout for 350, and rescued from the drafts folder ~~as an exercise in structured procrastination~~ so as to avoid a complete hiatus while I work on my own manuscripts.
P.S. Nauroz mubarak.

In a previous installment, we recalled the Neyman-Pearson lemma of statistical hypothesis testing: If we are trying to discriminate between signal and noise, and know the distribution of our data (x) both for when a signal is present (q) and when there is just noise (p), then the optimal test says "signal" when the likelihood ratio q(x)/p(x) exceeds a certain threshold, and "noise" otherwise. This is optimal in that, for any given probability of thinking noise is signal ("size"), it maximizes the power, the probability of detecting a signal when there is one.

The problem with just applying the Neyman-Pearson lemma directly to problems of interest is the bit about knowing the exact distributions of signal and noise. We should, forgive the expression, be so lucky. The traditional approach in theoretical statistics, going back to Neyman and Pearson themselves, has been to look for circumstances where we can get a single test of good power against a whole range of alternatives, no matter what they are. The assumptions needed for this are often rather special, and teaching this material means leading students through some of the more arid sections of books like these; the survivors are generally close to insensible by the time they reach the oases of confidence regions.

At the other extreme, a large part of modern statistics, machine learning and data mining is about classification problems, where we take feature-vectors x and assign them to one of a finite number of classes. Generally, we want to do this in a way which matches a given set of examples, which are presumed to be classified correctly. (This is obviously a massive assumption, but let it pass.) When there are only two classes, however, this is exactly the situation Neyman and Pearson contemplated; a binary classification rule is just a hypothesis test by another name. Indeed, this really the situation Neyman discussed in his later work (like his First Course in Probability and Statistics [1950]), where he advocated dropping the notion of "inductive inference" in favor of that of "inductive behavior", asking, in effect, what rule of conduct a learning agent should adopt so as to act well in the future.

The traditional approach in data-mining is to say that one should either (i) minimize the total probability of mis-classification, or (ii) assign some costs to false positives (noise taken for signal) and false negatives (signal taken for noise) and minimize the expected cost. Certainly I've made this recommendations plenty of times in my teaching. But this is not what Neyman and Perason would suggest. After all, the mis-classification rate, or any weighted combination of the error rates, will depend on what proportions of the data we look at actually are signal and noise. Which decision rule minimizes the chance of error depends on the actual proportion of instance of "signal" to those of "noise". If that ratio changes, a formerly optimal decision rule can become arbitrarily bad. (To give a simple but extreme example, suppose that 99% of all cases used to be noise. Then a decision rule which always said "noise" would be right 99% of the time. The minimum-error rule would be very close to "always say 'noise'". If the proportion of signal to noise should increase, the formerly-optimal decision rule could become arbitrarily bad. — The same is true, mutatis mutandis, of a decision rule which minimizes some weighted cost of mis-classifications.) But a Neyman-Pearson rule, which maximizes power subject to a constraint on the probability of false positives, is immune to changes in the proportions of the two classes, since it only cares about the distribution of the observables given the classes. But (and this is where we came in) the Neyman-Pearson rule depends on knowing the exact distribution of observables for the two classes...

This brings us to tonight's reading.

Clayton Scott and Robert Nowak, "A Neyman-Pearson Approach to Statistical Learning", IEEE Transactions on Information Theory 51 (2005): 3806--3819 [PDF reprint via Prof. Scott, PDF preprint via Prof. Nowak]: Abstract: The Neyman-Pearson (NP) approach to hypothesis testing is useful in situations where different types of error have different consequences or a priori probabilities are unknown. For any α>0, the NP lemma specifies the most powerful test of size α, but assumes the distributions for each hypothesis are known or (in some cases) the likelihood ratio is monotonic in an unknown parameter. This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed (i.i.d.) training examples from each hypothesis. Building on a "fundamental lemma" of Cannon et al., we demonstrate that several concepts from statistical learning theory have counterparts in the NP context. Specifically, we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM), and prove performance guarantees for both. General conditions are given under which NP-SRM leads to strong universal consistency. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.

Statistical learning methods take in data and give back predictors --- here, classifiers. Showing that a learning method works generally means first showing that one can estimate the performance of any individual candidate predictor (with enough data), and then extending that to showing that the method will pick a good candidate.

The first step is an appeal to some sort of stochastic limit theorem, like the law of large numbers or the ergodic theorem: the data-generating process is sufficiently nice that if we fix any one prediction rule, its performance on a sufficiently large sample shows how it will perform in the future. (More exactly: by taking the sample arbitrarily large, we can have arbitrarily high confidence that in-sample behavior is arbitrarily close to the expected future behavior.) Here we can represent every classifier by the region R of x values where it says "signal". P(R) is the true false positive rate, or size, of the classifier, and Q(R) is the power. If we fix R in advance of looking at the data, then we can apply the law of large numbers separately to the "signal" and "noise" training samples, and conclude that, with high P-probability, the fraction of "noise" data points falling into R is close to P(R), and likewise with high Q-probability the fraction of "signal" points in R is about Q(R). In fact, we can use results like Hoeffding's inequality to say that, after n samples (from the appropriate source), the probability that either of these empirical relative frequencies differs from their true probabilities by as much as ±h is at most 2 e^{-2 nh²}. The important point is that the probability of an error of fixed size goes down exponentially in the number of samples.

(Except for the finite-sample bound, this is all classical probability theory of the sort familiar to Neyman and Pearson, or for that matter Laplace. Neyman might well have known Bernstein's inequality, which gives similar though weaker bounds here than Hoeffding's; and even Laplace wouldn't've been surprised at the form of the result.)

Now suppose that we have a finite collection of classifier rules, or equivalently of "say 'signal'" regions R₁, R₂, ... R_m. The training samples labeled "noise" give us an estimate of the P(R_i), the false positive rates, and we just saw above that the probability of any of these estimates being very far from the truth is exponentially small; call this error probability c. The probability that even one of the estimates is badly off is at most cm. So we take our sample data and throw out all the classifiers whose false positive rate exceeds α (plus a small, shrinking fudge factor), and with at least probability 1-cm all the rules we're left with really do obey the size constraint. Having cut down the hypothesis space, we then estimate the true positive rates or powers Q(R_i) from the training samples labeled "signal". Once again, the probability that any one of these estimates is far from the truth is low, say d, and by the union bound again the probability that any of them are badly wrong is at most dm. This means that the sample maximum has to be close to the true maximum, and picking the R_i with the highest true positive rate then is (probabilistically) guaranteed to give us a classifier with close to the maximum attainable power. This is the basic strategy they call "NP empirical risk minimization". Its success is surprising: I would have guessed that in adapting the NP approach we'd need to actually estimate the distributions, or at least the likelihood ratio as a function of x, but Scott and Nowak show that's not true, that all we need to learn is the region R. So long as M is finite and fixed, the probability of making a mistake (of any given magnitude ±h) shrinks to zero exponentially (because c and d do), so by the Borel-Cantelli lemma we will only ever make finitely many mistakes. In fact, we could even let the number of classifiers or regions we consider grow with the number of samples, so long as it grows sub-exponentially, and still come to the same conclusion.

Notice that we've gone from a result which holds universally over the objects in some collection to one which holds uniformly over the collection. Think of it as a game between me and the Adversary, in which the Adversary gets to name regions R and I try to bound their performance; convergence means I can always find a bound. But it matters who goes first. Universal convergence means the Adversary picks the region first, and then I can tailor my convergence claim to the region. Uniform convergence means I need to state my convergence claim first, and then the Adversary is free to pick the region to try to break my bound. What the last paragraph showed is that for finite collections which don't grow too fast, I can always turn a strategy for winning at universal convergence into one for winning at uniform convergence. [1]

Nobody, however, wants to use just a finite collection of classifier rules. The real action is in somehow getting uniform convergence over infinite collections, for which the simple union bound won't do. There are lots of ways of turning this trick, but they all involve restricting the class of rules we're using, so that their outputs are constrained to be more or less similar, and we can get uniform convergence by approximating the whole collection with a finite number of representatives. Basically, we need to count not how many rules there are (infinity), but how many rules we can distinguish based on their output (at most 2ⁿ). As we get more data, we can distinguish more rules. Either this number keeps growing exponentially, in which case we're in trouble, or it ends up growing only polynomially, with the exponent being called the "Vapnik-Chervonenkis dimension". As any good book on the subject will explain, this is not the same as the number of adjustable parameters.

So, to recap, here's the NP-ERM strategy. We have a collection of classifier rules, which are equivalent to regions R, and this class is of known, finite VC dimension. One of these regions or classifiers is the best available approximation to the Neyman-Pearson classifier, because it maximizes power at fixed size. We get some data which we know is noise, and use it to weed out all the regions whose empirical size (false positive rate) is too big. We then use data which we know is signal to pick the region/classifier whose empirical power (true positive rate) is maximal. Even though we are optimizing over infinite spaces, we can guarantee that, with high probability, the size and power of the resulting classifier will come arbitrarily close to those of the best rule, and even put quantitative bounds on the approximation error given the amount of data and our confidence level. The strictness of the approximation declines as the VC dimension grows. Scott and Nowak also show that you can also pull the structural risk minimization trick here: maximize the the in-sample true positive rate, less a VC-theory bound on the over-fitting, and you still get predictive consistency, even if you let the capacity of the set of classifiers you'll use grow with the amount of data you have.

What's cool here is that this is a strategy for learning classifiers which gives us some protection against changes in the distribution, specifically against changes in the proportion of classes, and we can do this without having to learn the two probability density functions p and q, one just learns R. Such density estimation is certainly possible, but densities are much more complicated and delicate objects than mere sets, and the demands for data are correspondingly more extreme. (An interesting question, to which I don't know the answer, is how much we can work out about the ratio q(x)/p(x) by looking at the estimated maximum power as we vary the size α.) While Scott and Nowak work out detailed algorithms for some very particular families of classifier rules, their idea isn't tied to them, and you could certainly use it with, say, support vector machines.

[1] I learned this trick of thinking about quantifiers as games with the Adversary from Hintikka's Principles of Mathematics Revisited, but don't remember whether it was original to him or he'd borrowed it in turn. — Gustavo tells me that game semantics for logic began with Paul Lorenzen.

Enigmas of Chance

Posted at March 21, 2010 15:00 | permanent link

Three-Toed Sloth

March 21, 2010

Learning Your Way to Maximum Power