The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi   165

Neural Network Learning

Theoretical Foundations

by Martin Anthony

and Peter L. Bartlett

Cambridge University Press, 1999

Mild False Advertising (and a Good Thing Too)

Despite the title, this isn't really about neural networks; or not mostly. (I'll come back to that.) Rather, it's a very good treatise on the mathematical theory of supervised machine learning. It is extremely clear, and largely self-contained given working knowledge of linear algebra, vector calculus, probability and elementary combinatorics. Since I've written about the basic ideas of learning theory at great length elsewhere, I'm going to assume a reader already familiar with much of the jargon.

Part I deals with binary classification, emphasizing that if we want to be able to learn a reliable classification rule under any distribution of data, we need to limit ourselves to machines, or families of rules, which have finite Vapnik-Chervonenkis dimension. Two interesting and unusual features of this part, which come back many times, are results on machines which are built in layers of simpler components, and geometric approaches to bounding VC dimension. The latter, to over-simplify a bit, rest on dividing the parameter space up into regions which all give the same output, and then measuring the complexity of this partition. Another nice feature here is the presumption that no classification rule in the model family is completely correct. The special case where there is a completely correct rule is called "the restricted model"1.

Part II also deals with binary classification, but specifically for probabilistic or margin-based classifiers, since the output of the machine is a number in [0,1], and the goal is to use this number not just to classify correctly, but to do so with high confidence or with large margins. This leads naturally to generalizations of the VC dimension for continuous functions, such as the "pseudo-dimension" (=VC dimension of sets formed by thresholding the functions), which are connected to covering numbers (= how many balls of a given radius are needed to cover the whole of some function space), and so to uniform convergence of sample averages to expectations. It is then very natural to use the same machinery in Part III to learning continuous-valued functions, especially regression functions. Here particularly fast learning rates become possible for convex (or nearly convex) function classes (ch. 20). There's also interesting material on interpolation, and some of the pathologies which could arise if we learned noise-free functions.

Part IV is algorithmic: it defines efficient algorithms for learning as ones which run in polynomial time (in the sample size and the model complexity), and which only need a polynomial number of samples (in terms of the model complexity, the approximation error, and the risk of failing to achieve that approximation). Efficient learning is possible if and only if efficient, randomized error minimization is possible. This means that when efficient learning is possible, error minimization belongs to the class of problems solvable in polynomial time by randomized algorithms, or RP. The last few chapters show that neural network learning cannot in general be done efficiently, by transforming standard NP-hard problems into error minimization for (first) binary perceptrons, and then for a series of more promising-seeming multi-layer networks. Thus unless NP is really just contained within RP, which is almost as implausible as NP=P, neural networks can't be learned efficiently in general. (This is the only part of the book where some prior acquaintance with theoretical computer science would be helpful.) The final chapter provides constructive algorithms for nonlinear networks without hidden layers, one a form of back-fitting and the other of boosting.

There are, naturally, limitations to the book. Having been written in 1999 is not as much of one as might be supposed; the foundational mathematical theory is still there, and indeed I still find this a very useful reference for current research. Rather, I'd point to the following:

  1. The emphasis on VC theory makes a certain amount of sense, since it is fundamental to distribution-free learning (i.e., learning under all possible distributions). But it would be nice, in a modern course, to have some treatement of distribution-dependent bounds (e.g., using Rademacher complexity), or ones based on stability of predictions.
  2. The treatment of model selection, while non-trivial, is not as extensive as I'd have really liked.
  3. As I mentioned at the start, it only considers supervised learning (so, e.g., no reinforcement learning) and then only classification and regression (e.g., no ranking).
  4. While the mathematical material is extremely general, most of the concrete examples are in fact for feed-forward networks.
Some of these, especially the last two and the use of Rademacher complexity, are addressed by the more recent textbook of Mohri, Rostamizadeh and Talwalker, though at a lower mathematical level.

The last of those points deserves more elaboration, however. Neural networks were popular when this book was published, but as a fad they were already starting to be replaced by kernel-based methods. Now, under the label of "deep learning", neural networks are popular again, which is I imagine why the book is back in print. My slightly cynical and under-informed take is that there has been no actual improvement in the theory of neural networks (let alone steps towards biological realism), just decades of dedicated and skillful engineering (and lots and lots of hardware). Soon enough the fad-driven part of ML will latch on to yet another miracle technology2 --- to make something up, using locality-sensitive hashing to re-invent Gershenfeld et al.'s cluster-weighted modeling 3 --- and neural networks will go back to being the technology of the future. When that time comes, this will still be a very valuable book.

[1]: Most texts reverse figure and ground here, treating situations where no rule is exactly right as the marked case of "agnostic learning". Anthony and Bartlett's framing is obviously much more realistic and sensible. ^

[2]: Somebody should write a "Simple Model of the Evolution of Supervised Learning Methods"; an essential requirement would be that there are lots and lots of easy variations to try out on many different domains. But on that front I've paid my dues. ^

[3]: If by some miracle there's something to that, just put me in the acknowledgments. ^

Paperback, ISBN 978-0-521-11862-0

Probability and Statistics

1 June 2016