The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi   159


Foundations and Algorithms

by Robert E. Schapire

and Yoav Freund

MIT Press, 2012

Weak Learners of the World, Unite!

"Boosting" is a machine learning technique for amplifying the power of a learning algorithm which is "weak": not very good, but always at least a little better than chance. We start with a data set and our black-box weak learner. We feed the learner the data and it gives us a prediction function, or model. We look at where that model goes wrong (and, as appropriate, by how much it errs). We then draw a weighted sample of the data, giving more weight to the examples where the first model was wrong. This weighted sample is fed to the weak learner, which gives us a new model or prediction function. We now repeat the cycle, increasing the weight of examples which the latest model got wrong, and decreasing the weight of the examples it got right. We continue this boosting for some number of rounds, and, when we want to make a prediction, we use the average of all the predictive models we've gotten out of the weak learner. (We might weight them so those which did better on their training data count for more in the average.) Finis.

The final ensemble or average of all the weaker learner's models is equivalent to a single, generally very complicated, model. A naive view of Occam's Razor says that it should over-fit horribly. Instead, boosting often works extremely well — and much better than the individual models. It doesn't help when the weak learners give answers which are no better than chance, but if they have even a little bit of an edge over chance, even a little bit of predictive power, boosting can indeed vastly amplify that. Why?

Schapire and Freund's Boosting is a sustained attempt, by two of the founders of the subject, to answer this surprisingly controversial question. One thing they are quite persuasive about is that this is not just a story about the bias-variance trade-off, with the individual models being noisy and unstable, but the ensemble getting stabilized by averaging. (This is the principle of the great Leo Breiman's "bagging", which he tried to use to account for boosting.) This can't be the whole story, because boosting can lead to great improvements even when the individuals models are very stable, so there's no variance to be killed off by averaging. It's also not simply an optimization story, since while there definitely is an objective function which boosting is trying to minimize, there are other algorithms which also target this objective function and do not work anywhere nearly as well.

Schapire and Freund's preferred explanation for the effectiveness of boosting has to do with the "margin" of an ensemble of predictors. Roughly speaking, this is how confident the over-all ensemble is in its predictions (as opposed to individual models within the ensemble). Slightly less roughly, the margin of an individual prediction indicates how much the ensemble would have to be modified to change that prediction. There is always some distribution of margins over training examples; as boosting proceeds, these margins tend to get bigger and bigger. And, as the authors show, it is possible to bound how much an ensemble over-fits in terms of its distribution of margins --- the more that's skewed towards high-margin cases, the lower the out-of-sample error. Requiring a large margin in effect seriously reduces the capacity of the ensemble model to fit arbitrary data, which is where over-fitting comes from.

The first part of the book deals with the fundamental algorithms of boosting, the essential notions of statistical learning theory, and the margin-based explanation for why boosting works. There then follows an quite extraordinary Part II, on different ways of understanding boosting. The first is to look at as a zero-sum game between the weak learner (which is trying to classify points correctly) and the boosting algorithm (which is trying to stump the weak learner). This leads to a truly ingenious proof of von Neumann's minimax theorem. The second interpretation understands boosting in terms of online learning and regret minimization. The third views boosting as an constrained optimization problem, a fascinating perspective that draws equally on information geometry and linear programming. These chapters are to my mind the theoretical high-light of the book, a masterful demonstration of how to connect many different areas of math illuminatingly.

Parts III and IV go beyond the core boosting approach for classifiers to consider multi-class prediction, ranking, and more refined theoretical problems, ending with a stochastic process which is, in a sense, the limit of doing infinitely-many infinitely-small boosting rounds in a finite time. These chapters are also very good, but more specialized and not so spectacular as Parts I and II.

There are, of course, omissions. There are versions of boosting for regression and related continuous-valued prediction problems, which have nice interpretations in terms of functional analysis; these are barely touched on. (There's a lot more in Bühlmann and van de Geer's Statistics for High-Dimensional Data.) There's also no attempt to draw connections to collective problem-solving, except, interestingly, rhetorically.

Boosting is, quite simply, one of the best-written books I've read on machine learning, even if it doesn't definitely settle the central question of why boosting works. It is formally self-contained, and could in principle be approached by any reader with reasonable mathematical maturity whether or not they have prior knowledge of machine learning. In practice I imagine some prior hands-on experience with machine learning tools would be very helpful. With such a background it would be great for self-study, and I'd give a lot to teach a class based on the book.

Hardback, ISBN 978-0-262-01718-3

Probability and Statistics

23 June 2012, small fixes 25 June