The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi   163

Foundations of Machine Learning

by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar

MIT Press, 2012

Machine Learning as Normal Science

This is a fairly gentle introduction to what is now the core theory of machine learning, which one might operationally define as "the kind of thing you see at NIPS or ICML". Foundations of Machine Learning focuses on proving that various modern algorithms, used in contemporary problems, will work well with high probability, and showing the tools and techniques used to craft such proofs generally. It is neither polemical nor eclectic, and un-self-consciously so; this just is what the field is about now. I will follow the authors' example and likewise just plunge in.

Chapter 1 introduces the idea of wanting to learn functions that generalize well from old data to new data, and cross-validation as a practical tool of evaluation.

Chapter 2 explains the fundamental idea of a "probably approximately correct" (PAC) algorithm for a classification problem: you have a collection of \( n \) labeled data points, all of them belonging to one class or the other. This training data consists of independent and identically distributed (IID) draws from some fixed but unknown distribution; you can think of it as the outcome of a well-designed survey. You also have a "hypothesis space" \( H \) of possible prediction functions, which take training points and return labels. I demand that you come up with a way of picking one function \( h \) from \( H \) so that, with probability at least \( 1 -\delta \), the risk that \( h \) mis-classifies a new data point is at most \( \epsilon \). (A variant is that \( h \) must have a risk within \( \epsilon \) of the best function in \( H \), which might be imperfect.) You get to respond with the sample size \( n(\delta,\epsilon) \) you would need to do this. I in turn get to make \( \delta \) and \( \epsilon \) (the "probably" and "approximately" in PAC) as small as I like, and your \( n \) can't grow faster than polynomially in \( 1/\delta \) and \( 1/\epsilon \). Moreover, your guarantee has to hold no matter what the true rule is, or the distribution which generates training examples. Finally, your method must be polynomial time in \( n \), \( 1/\delta \) and \( 1/\epsilon \).

Despite all these restrictions, PAC learning is possible. Suppose, for instance, that \( H \) consists of a finite number of rules, one of which is right. By the law of large numbers, the rate at which each rule makes mistakes on the training data converges to its true risk. One can actually strengthen (Hoeffding's inequality) to say that the probability of any one rule's "in-sample error rate" departing from its true risk by as much as \( \epsilon/2 \) is at most \( 2e^{-n\epsilon^2/2} \). If there are \( |H| \) rules in all, the probability of such a large deviation is at most \( 2|H|e^{-n\epsilon^2/2} \). If this bad event doesn't happen, if the sample is representative for all the rules, then the risk of the best-in-sample rule must be within \( \epsilon/2 + \epsilon/2 = \epsilon \) of the risk of the best rule. Setting \( \delta \leq 2|H|e^{-n\epsilon^2/2} \) and solving for \( n \) gives \( n \leq 2\epsilon^{-2}\ln{\frac{2|H|}{\delta}} \), which is polynomial as requested. (This isn't the tightest bound possible here.)

The finite-hypothesis-space setting is no intrinsic interest; teachers like it because it does three things. First, it shows that PAC learning isn't strictly impossible. Second, it warms students up for the work of combining probabilistic inequalities (here, Hoeffding's inequality) with combinatorial arguments (here, the trivial "union bound") to bound generalization error. Third, it suggests that something fancier is going to be needed to handle infinite hypothesis spaces. Chapter 2 includes a simple, explicit example showing that PAC learning can happen even when \( H \) is infinite, but doesn't say how to handle infinite hypothesis spaces in general.

This is what chapter 3 does. It introduces Rademacher complexity, and shows that it bounds the expected largest gap between a classifier's in-sample and out-of-sample performance. Another application of the law of large numbers shows that the actual gap will tend to be close to the expected gap, so true risk can be bounded in terms of in-sample error, Rademacher complexity, and the desired confidence level \( \delta \). When the Rademacher complexity goes to zero as the sample size increases, we can learn. However, the Rademacher complexity varies with the distribution of the data, which is inconvenient, and involves optimization, which is slow. The VC dimension is introduced as a tool for guaranteeing that the Rademacher complexity will go to zero under all distributions. This chapter also shows how, if the Adversary is left free to design the distribution, finite VC dimension is necessary (not just sufficient) for learning.

Chapter 4 introduces linear classifiers, which categorize data points according to which side of a straight line (plane, hyper-plane) they fall on. The difficulty is that if there is one line which manages to classify the training examples correctly, there are infinitely many others, even when one uses measures like VC dimension, so initial bounds are quite pessimistic. This leads, however, to the idea of looking at the "margin" — roughly speaking, how far the training points are from the rule's boundary between classes, since there are vastly fewer rules with large margins than with narrow ones. Margin can be linked to Rademacher complexity, leading to much nicer generalization bounds which tell us to get wide margins. Maximizing the margin of a linear classifier leads naturally to a nice convex optimization problem (a kind of quadratic programming), whose solution is called a "support vector machine". These three ideas — margin analysis, the link between large margins and low Rademacher complexity, and the importance of convex optimization — are recurring themes for the book.

Linear methods works better in high dimensions (if the margin can be controlled), and if one first takes nonlinear transformations in the data and then uses linear methods in this "feature space", one gets nonlinearity in the original data. Actually computing a huge set of non-linear transformations of the data,however, is awkward. Chapter 5 explains the "kernel trick", which short-cuts this by replacing linear operations in the high-dimensional feature space with applying kernel functions to the original data. The theory for linear machines developed already then translates very simply into theory for kernel machines, and instead of worrying about crafting features/transformations, one crafts kernels. Section 5.5, rather unusually, looks at kernels for sequences, and their definition in terms of finite automata.

Chapter 6 covers boosting, one of the leading methods for combining individually-weak predictors. Having just written far too much about boosting, I won't rehash the subject. This book largely agrees with Freund and Schapire in emphasizing margin-based analyses of why boosting works, and its interesting interpretation as a zero-sum game between a player who tries to find data points which are hard to classify, and another who tries to find accurate classifiers.

Chapter 7 turns to on-line learning, where the goal is not to have low expected future risk, but to retrospectively have done almost as well as the in-hindsight-best hypothesis in the space. They properly emphasize that while their earlier (and later) theorems need IID data, one can get low regret without making any probabilistic assumptions about the data at all. They give about equal attention to "prediction with expert advice", and to particular on-line algorithms for classification (such as the perceptron). They also look at risk bounds for on-line algorithms when you do have IID data. This chapter follows Cesa-Bianchi and Lugosi less closely than I'd have guessed.

Chapters 8 and 9 look at, respectively, more-than-binary classification (self-explanatory) and rankings — given two new data points, say which one is preferred to the other. (Actually, the idea of "ranking" used here is something more like "a possibly-asymmetric relation" than even a partial ordering.) Both require discrete predictions, like binary classification, but with much more structure, which really needs to be taken into account. Again, the main tools here are margin bounds and Rademacher complexity; some of the results in these chapters are new.

Chapter 10 turns to regression, i.e., to predicting real-valued variables. They assume that the loss is bounded, which, since they mostly work with the squared-error loss, implies that the predictions have to be bounded. With this restriction, they transfer the machinery of Rademacher complexity and VC qdimension with little trouble, and so analyze ordinary least squares ("Its performance is typically poor in most applications"), as well as kernel ridge regression, support vector regression, the lasso, etc. The case of unbounded range is, as they say, harder, and is left to references in the end-notes.

So far, all the generalization error bounds have only invoked properties of the hypothesis space, not of the algorithm used to search it. Chapter 11 thus shifts focus to look at bounds based on algorithmic stability --- the notion that if we change just one point in the training data, we can't change the risk of the returned hypothesis too much. If algorithms become more and more stable as the training data grows in size, and stabilize fast enough, this alone is enough to bound the generalization error. (The converse, that generalization implies stability, is not true, though it fails in interesting and complicated ways.) The kernel algorithms advocated earlier turn out to be sufficiently stable.

The remaining chapters are more miscellaneous: chapter 12 looks at dimensionality reduction, mostly principal components analysis and its kernelized versions (including some nonlinear manifold-learning techniques) but also a cute proof of the Johnson-Lindenstrauss theorem on random projections. Chapter 13 is a offers a glimpse of the huge area of automata learning and grammatical inference. (The fascinating work by Kontorovich, Cortes, and Mohri on kernel methods for learning grammars is only mentioned in passing.) Chapter 14 looks at reinforcement learning, building up to ideas of stochastic approximation and Q-learning. Appendices review linear algebra (A), convex optimization (B), probability (C), and deviation inequalities (D) — though appendix D will be a review for instructors, not students.

There are some omissions: unsupervised problems like clustering (other than dimensionality reduction); neural networks; graphical models more generally; minimax lower bounds; non-IID data; Bayesian nonparametrics and PAC-Bayesian results*. Some of these topics are absent because they haven't lent themselves to the style of result the authors favor (though Mohri and Rostamizadeh have written important papers on dependent data), others perhaps are just a case of not being able to fit everything into a mere 400 pages.

This is very much a book about the mathematical theory of learning algorithms. (Applications are left to dozen or so of the extensive exercises; those exercises are generally good.) The sole focus is proving theorems. Those theorems are nicely formulated, and the proofs are clear, well-prepared, and (usually) insightful. Moreover, the material goes well beyond the staples of binary classification and regression, covering lots of interesting modern topics from a unified perspective. The implied reader is a new graduate student in theoretical computer science (caring deeply about proofs of correctness and knowing about computational complexity and finite-state machines, but ignorant of statistics); a good teacher could use it with beginning graduate students in statistics, applied math, or many other subjects. It's not the kind of book which would fire the reader with enthusiasm for the subject, but it is solid; it does the job it set out to do. I can't think of any other book now on the market which I'd rather use to teach a first course in learning theory.

Disclaimer: I asked for, and got, an examination copy of this book from the publisher. Perhaps more importantly, I have a collaborator in common with Mohri. But I have no stake in the book's success.

*: I am limiting myself to "things you'd see at NIPS or ICML without boggling" topics. Of course, once upon a time "machine learning" would have embraced much of good old fashioned AI, computational models of human and animal learning, evolutionary search (e.g., genetic programming), inductive logic, etc. These all got pushed to the margins of the field, or even beyond the margin, during the revolution of the 1980s and 1990s. (Somebody really needs to write a history of this, while the participants are mostly still alive and their e-mail is readable.)

Hardback, ISBN 978-0-262-01825-8

Computers and Computing; Probability and Statistics

4 September 2012