Attention conservation notice: 2300 words about a paper other people wrote on learning theory and hypothesis testing. Mostly written last year as part of a never-used handout for 350, and rescued from the drafts folder ~~as an exercise in structured procrastination~~ so as to avoid a complete hiatus while I work on my own manuscripts.

P.S. Nauroz mubarak.

In a previous installment, we recalled the
Neyman-Pearson lemma of statistical hypothesis testing: If we are trying to
discriminate between signal and noise, and know the distribution of our data
(*x*) both for when a signal is present (*q*) and when there is just
noise (*p*), then the optimal test says "signal" when the likelihood
ratio *q*(*x*)/*p*(*x*) exceeds a certain threshold, and
"noise" otherwise. This is optimal in that, for any given probability of
thinking noise is signal ("size"), it maximizes the **power**, the
probability of detecting a signal when there is one.
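In miniature (with toy Gaussian densities of my own choosing, not anything from tonight's reading), the lemma's test looks like this:

```python
# Sketch of the Neyman-Pearson test with known densities: noise p is a
# standard normal, signal q is a normal with mean 2 (toy choices).
from math import exp, pi, sqrt

def p(x):    # noise density: N(0, 1)
    return exp(-x**2 / 2) / sqrt(2 * pi)

def q(x):    # signal density: N(2, 1)
    return exp(-(x - 2)**2 / 2) / sqrt(2 * pi)

def np_test(x, c):
    """Say 'signal' iff the likelihood ratio q(x)/p(x) exceeds the threshold c."""
    return q(x) / p(x) > c

# For these densities q(x)/p(x) = exp(2x - 2), monotone in x, so the test
# reduces to a simple cutoff on x itself (here x > 1 when c = 1):
print(np_test(0.5, 1.0), np_test(1.5, 1.0))   # False True
```

The threshold *c* is then chosen to make the false positive rate under *p* equal to the desired size.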

The problem with just applying the Neyman-Pearson lemma directly to problems
of interest is the bit about knowing the *exact* distributions of signal
and noise. We should, forgive the expression, be so lucky. The traditional
approach in theoretical statistics, going back to Neyman and Pearson
themselves, has been to look for circumstances where we can get a single test
of good power against a whole *range* of alternatives, no matter what
they are. The assumptions needed for this are often rather special, and
teaching this material means leading students through some of the more arid
sections of
books like these;
the survivors are generally close to insensible by the time they reach the
oases of confidence regions.

At the other extreme, a large part of modern statistics, machine learning
and data mining is about classification problems, where we take
feature-vectors *x* and assign them to one of a finite number of classes.
Generally, we want to do this in a way which matches a given set of examples,
which are presumed to be classified *correctly*. (This is obviously a
massive assumption, but let it pass.) When there are only two classes,
however, this is *exactly* the situation Neyman and Pearson
contemplated; a binary classification rule is just a hypothesis test by another
name. Indeed, this is really the situation Neyman discussed in his later work
(like his First Course in Probability and Statistics [1950]),
where he advocated dropping the notion of "inductive inference" in favor of
that of "inductive behavior", asking, in effect, what rule of conduct a
learning agent should adopt so as to act well in the future.

The traditional approach in data-mining is to say that one should either (i)
minimize the total probability of mis-classification, or (ii) assign some costs
to false positives (noise taken for signal) and false negatives (signal taken
for noise) and minimize the expected cost. Certainly I've made these
recommendations plenty of times in
my teaching. But this
is *not* what Neyman and Pearson would suggest. After all, the
mis-classification rate, or any weighted combination of the error rates, will
depend on what proportions of the data we look at actually *are* signal
and noise. Which decision rule minimizes the chance of error depends on the
actual proportion of instances of "signal" to those of "noise". If that ratio
changes, a formerly optimal decision rule can become arbitrarily bad. (To give
a simple but extreme example, suppose that 99% of all cases used to be noise.
Then a decision rule which *always* said "noise" would be right 99% of
the time. The minimum-error rule would be *very* close to "always say
'noise'". If the proportion of signal to noise should increase, the
formerly-optimal decision rule could become arbitrarily bad. — The same
is true, *mutatis mutandis*, of a decision rule which minimizes some
weighted cost of mis-classifications.) But a Neyman-Pearson rule, which
maximizes power subject to a constraint on the probability of false positives,
is immune to changes in the *proportions* of the two classes, since it
only cares about the distribution of the observables *given* the
classes. But (and this is where we came in) the Neyman-Pearson rule depends on
knowing the *exact* distribution of observables for the two classes...
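To put numbers on this fragility (a toy calculation of my own, with Gaussian noise and signal, not an example from the paper): tune the minimum-error rule when signal is rare, then shift the class proportions.

```python
# Noise ~ N(0,1), signal ~ N(2,1).  The minimum-error rule tuned when
# signal is rare degrades badly if the class proportions shift; the
# Neyman-Pearson rule's false-positive rate is fixed by construction.
from math import log, sqrt, erf

def Phi(x):              # standard normal CDF
    return 0.5 * (1 + erf(x / sqrt(2)))

mu = 2.0                 # signal mean; noise mean is 0, both unit variance

def error_rate(t, pi_signal):
    """Overall misclassification rate of the rule 'say signal iff x > t'."""
    fp = 1 - Phi(t)          # P(x > t | noise)
    fn = Phi(t - mu)         # P(x <= t | signal)
    return (1 - pi_signal) * fp + pi_signal * fn

# Minimum-error threshold when only 1% of cases are signal (from the
# monotone likelihood ratio exp(mu*x - mu^2/2)):
pi_old = 0.01
t_min_err = mu / 2 + log((1 - pi_old) / pi_old) / mu

# Neyman-Pearson threshold with size alpha = 0.05, found by bisection;
# note it depends only on the noise distribution p:
alpha = 0.05
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if 1 - Phi(mid) > alpha else (lo, mid)
t_np = (lo + hi) / 2

print(error_rate(t_min_err, 0.01))   # small: rule was tuned for this mix
print(error_rate(t_min_err, 0.50))   # ~0.45: nearly useless after the shift
print(1 - Phi(t_np))                 # stays at 0.05 whatever the class mix
```

The minimum-error rule, nearly "always say 'noise'", becomes close to a coin-flip once the classes are balanced; the NP rule's size never budges.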

This brings us to tonight's reading.

- Clayton Scott and Robert Nowak, "A Neyman-Pearson Approach to Statistical Learning",
IEEE Transactions on Information Theory
  **51** (2005): 3806--3819 [PDF reprint via Prof. Scott, PDF preprint via Prof. Nowak]. *Abstract:* The Neyman-Pearson (NP) approach to hypothesis testing is useful in situations where different types of error have different consequences or a priori probabilities are unknown. For any α>0, the NP lemma specifies the most powerful test of size α, but assumes the distributions for each hypothesis are known or (in some cases) the likelihood ratio is monotonic in an unknown parameter. This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed (i.i.d.) training examples from each hypothesis. Building on a "fundamental lemma" of Cannon et al., we demonstrate that several concepts from statistical learning theory have counterparts in the NP context. Specifically, we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM), and prove performance guarantees for both. General conditions are given under which NP-SRM leads to strong universal consistency. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.

Statistical learning methods take in data and give back predictors --- here, classifiers. Showing that a learning method works generally means first showing that one can estimate the performance of any individual candidate predictor (with enough data), and then extending that to showing that the method will pick a good candidate.

The first step is an appeal to some sort of stochastic limit theorem, like
the law of large numbers or the ergodic theorem: the data-generating process is
sufficiently nice that if we fix any *one* prediction rule, its
performance on a sufficiently large sample shows how it will perform in the
future. (More exactly: by taking the sample arbitrarily large, we can have
arbitrarily high confidence that in-sample behavior is arbitrarily close to the
expected future behavior.) Here we can represent every classifier by the
region *R* of *x* values where it says "signal". *P*(*R*)
is the true false positive rate, or size, of the classifier,
and *Q*(*R*) is the power. If we fix *R* in advance of looking
at the data, then we can apply the law of large numbers separately to the
"signal" and "noise" training samples, and conclude that, with
high *P*-probability, the fraction of "noise" data points falling
into *R* is close to *P*(*R*), and likewise with
high *Q*-probability the fraction of "signal" points in
*R* is about *Q*(*R*). In fact, we can use results like
Hoeffding's
inequality to say that, after *n* samples (from the appropriate
source), the probability that either of these empirical relative frequencies
differs from its true probability by as much as ±*h* is at
most 2 *e*^{-2*nh*^{2}}. The important point is
that the probability of an error of fixed size goes down exponentially in the
number of samples.
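A quick numerical sanity check of the Hoeffding bound just quoted (my own simulation, with an arbitrary choice of *P*(*R*)):

```python
# For a fixed region R, the chance that the empirical frequency of R
# misses P(R) by more than h is at most 2*exp(-2*n*h**2).  Simulate
# Bernoulli draws and compare the observed deviation rate to the bound.
import random
from math import exp

random.seed(0)
p_R = 0.3          # true probability of the "say signal" region R under P
n, h = 200, 0.1
trials = 10000

bad = 0
for _ in range(trials):
    freq = sum(random.random() < p_R for _ in range(n)) / n
    if abs(freq - p_R) > h:
        bad += 1

bound = 2 * exp(-2 * n * h**2)
print(bad / trials, "<=", bound)   # observed deviation rate vs Hoeffding bound
```

The bound here is about 0.037; the observed deviation rate is far smaller, as Hoeffding's inequality is distribution-free and so rarely tight.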

(Except for the finite-sample bound, this is all classical probability theory of the sort familiar to Neyman and Pearson, or for that matter Laplace. Neyman might well have known Bernstein's inequality, which gives similar though weaker bounds here than Hoeffding's; and even Laplace wouldn't've been surprised at the form of the result.)

Now suppose that we have a finite collection of classifier rules, or
equivalently of "say 'signal'"
regions *R*_{1}, *R*_{2}, ... *R*_{m}.
The training samples labeled "noise" give us an estimate of
the *P*(*R*_{i}), the false positive rates, and we
just saw above that the probability of any of these estimates being very far
from the truth is exponentially small; call this error probability
*c*. The probability that *even one* of the estimates is badly off
is at most *cm*. So we take our sample data and throw out all the
classifiers whose false positive rate exceeds α (plus a small, shrinking
fudge factor), and with at least probability 1-*cm* all the rules we're
left with really do obey the size constraint. Having cut down the hypothesis
space, we then estimate the true positive rates or
powers *Q*(*R*_{i}) from the training samples labeled
"signal". Once again, the probability that any one of these estimates is far
from the truth is low, say *d*, and by
the union bound again
the probability that any of them are badly wrong is at most *dm*. This
means that the sample maximum has to be close to the true maximum, and picking
the *R*_{i} with the highest *empirical* true positive rate then is
(probabilistically) guaranteed to give us a classifier with close to the
maximum attainable power. This is the basic strategy they call "NP empirical
risk minimization". Its success is surprising: I would have guessed that in
adapting the NP approach we'd need to actually estimate the distributions, or
at least the likelihood ratio as a function of *x*, but Scott and Nowak
show that's not true, that all we need to learn is the region *R*. So
long as *M* is finite and fixed, the probability of making a mistake (of
any given magnitude ±*h*) shrinks to zero exponentially
(because *c* and *d* do), so by
the Borel-Cantelli
lemma we will only ever make finitely many mistakes. In fact, we could
even let the number of classifiers or regions we consider grow with the number
of samples, so long as it grows sub-exponentially, and still come to the same
conclusion.
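Here is a toy run of that two-step procedure over a finite collection, where the candidate regions are simple thresholds (my own illustration; the paper's algorithms use much richer classes like dyadic trees):

```python
# NP empirical risk minimization over a finite collection of regions
# R_i = {x > t_i}, with noise ~ N(0,1) and signal ~ N(2,1) (toy setup).
import random
from math import sqrt, log

random.seed(1)
alpha, n = 0.10, 5000
noise  = [random.gauss(0, 1) for _ in range(n)]   # sample from p
signal = [random.gauss(2, 1) for _ in range(n)]   # sample from q
thresholds = [i / 10 for i in range(-20, 41)]     # finite collection R_1..R_m

def frac_above(sample, t):
    return sum(x > t for x in sample) / len(sample)

# Step 1: keep only regions whose empirical size is within alpha plus a
# shrinking fudge factor (one standard choice from Hoeffding + union bound,
# at confidence 95%):
fudge = sqrt(log(2 * len(thresholds) / 0.05) / (2 * n))
feasible = [t for t in thresholds if frac_above(noise, t) <= alpha + fudge]

# Step 2: among the survivors, pick the region with the highest
# empirical power on the signal sample:
best = max(feasible, key=lambda t: frac_above(signal, t))
print(best, frac_above(noise, best), frac_above(signal, best))
```

With enough data the selected threshold sits close to the true NP threshold (about 1.28 for these densities at α = 0.1), without ever estimating *p*, *q*, or their ratio.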

Notice that we've gone from a result which holds *universally* over
the objects in some collection to one which holds *uniformly* over the
collection. Think of it as a game between me and the Adversary, in which the
Adversary gets to name regions *R* and I try to bound their performance;
convergence means I can always find a bound. But it matters who goes first.
Universal convergence means the Adversary picks the region first, and then I
can tailor my convergence claim to the region. Uniform convergence means I
need to state my convergence claim first, and then the Adversary is free to
pick the region to try to break my bound. What the last paragraph showed is
that for finite collections which don't grow too fast, I can always turn a
strategy for winning at universal convergence into one for winning at uniform
convergence. [1]

Nobody, however, wants to use just a finite collection of classifier rules.
The real action is in somehow getting uniform convergence over infinite
collections, for which the simple union bound won't do. There are lots of ways
of turning this trick, but they all involve restricting the class of rules
we're using, so that their outputs are constrained to be more or less similar,
and we can get uniform convergence by approximating the whole collection with a
finite number of representatives. Basically, we need to count not how many
rules there are (infinity), but how many rules we can distinguish based on
their output (at most 2^{n}). As we get more data, we can
distinguish more rules. Either this number keeps growing exponentially, in
which case we're in trouble, or it ends up growing only polynomially, with the
exponent being called the "Vapnik-Chervonenkis dimension". As
any good book on the subject will
explain, this is *not* the same as the number of adjustable parameters.
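To see the counting argument concretely (a toy example of mine, not the paper's): one-dimensional threshold rules have a continuum of parameters but induce only *n*+1 distinct labelings on *n* points, so their VC dimension is 1.

```python
# Count the distinct labelings that threshold rules "signal iff x > t"
# can produce on n points: n + 1, not 2**n.
import random

random.seed(2)
xs = sorted(random.random() for _ in range(8))

labelings = set()
# One threshold below all points, one between each adjacent pair, one above:
candidates = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
for t in candidates:
    labelings.add(tuple(x > t for x in xs))

print(len(labelings), "distinct labelings out of", 2 ** len(xs), "possible")
```

Polynomial growth of this count is what lets a finite set of representative rules stand in for the whole infinite class.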

So, to recap, here's the NP-ERM strategy. We have a collection of
classifier rules, which are equivalent to regions *R*, and this class is
of known, finite VC dimension. One of these regions or classifiers is the best
available approximation to the Neyman-Pearson classifier, because it maximizes
power at fixed size. We get some data which we know is noise, and use it to
weed out all the regions whose empirical size (false positive rate) is too big.
We then use data which we know is signal to pick the region/classifier whose
empirical power (true positive rate) is maximal. Even though we are optimizing
over infinite spaces, we can guarantee that, with high probability, the size
and power of the resulting classifier will come arbitrarily close to those of
the best rule, and even put quantitative bounds on the approximation error
given the amount of data and our confidence level. The tightness of the
approximation guarantee declines as the VC dimension grows. Scott and Nowak also show
that you can pull the structural risk
minimization trick here: maximize the in-sample true positive
rate, less a VC-theory bound on the over-fitting, and you still get predictive
consistency, even if you let the capacity of the set of classifiers you'll use
grow with the amount of data you have.
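A rough sketch of the NP-SRM idea in code (the penalty form below is my own simplification for finite nested classes, not the paper's exact bound):

```python
# NP structural risk minimization, toy version: across nested classes of
# threshold rules of growing resolution, pick the rule maximizing
# empirical power minus a class-dependent complexity penalty, subject to
# the empirical size constraint.  Noise ~ N(0,1), signal ~ N(2,1).
import random
from math import sqrt, log

random.seed(3)
alpha, n = 0.10, 2000
noise  = [random.gauss(0, 1) for _ in range(n)]
signal = [random.gauss(2, 1) for _ in range(n)]

def frac_above(sample, t):
    return sum(x > t for x in sample) / len(sample)

best_score, best_rule = -1.0, None
for k in range(1, 6):                         # nested classes F_1 ⊂ F_2 ⊂ ...
    grid = [i / 2**k for i in range(-2 * 2**k, 4 * 2**k + 1)]
    # Penalty from Hoeffding + union bound, with extra weight 2**k spread
    # across the classes (one simple choice):
    pen = sqrt(log(2 * len(grid) * 2**k / 0.05) / (2 * n))
    for t in grid:
        if frac_above(noise, t) <= alpha + pen:        # size constraint
            score = frac_above(signal, t) - pen        # penalized power
            if score > best_score:
                best_score, best_rule = score, t

print(best_rule, frac_above(noise, best_rule))
```

Richer classes get larger penalties, so capacity is only spent when the extra empirical power pays for it; this is what lets the class grow with the sample while keeping consistency.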

What's cool here is that this is a strategy for learning classifiers which
gives us some protection against changes in the distribution, specifically
against changes in the proportion of classes, and we can do
this *without* having to learn the two probability density
functions *p* and *q*; one just learns *R*. Such density
estimation is certainly possible, but densities are much more complicated
and delicate objects than mere sets, and the demands for data are
correspondingly more extreme. (An interesting question, to which I don't know
the answer, is how much we can work out about the ratio
*q*(*x*)/*p*(*x*) by looking at the estimated maximum power
as we vary the size α.) While Scott and Nowak work out detailed
algorithms for some very particular families of classifier rules, their idea
isn't tied to them, and you could certainly use it with, say, support vector
machines.

[1] I learned this trick of thinking about quantifiers as games with the Adversary from Hintikka's Principles of Mathematics Revisited, but don't remember whether it was original to him or he'd borrowed it in turn. — Gustavo tells me that game semantics for logic began with Paul Lorenzen.

Posted at March 21, 2010 15:00 | permanent link