Despite the title, this isn't *really* about neural networks; or not
mostly. (I'll come back to that.) Rather, it's a very good treatise on the
mathematical theory
of supervised machine learning. It is extremely clear, and largely
self-contained given working knowledge of linear algebra, vector calculus,
probability and elementary combinatorics. Since I've written about the basic
ideas of learning theory at great length elsewhere, I'm going to assume a
reader already familiar with much of the jargon.

Part I deals with binary classification, emphasizing that if we want to be
able to learn a reliable classification rule under *any* distribution of
data, we need to limit ourselves to machines, or families of rules, which have
finite Vapnik-Chervonenkis
dimension. Two interesting and unusual features of this part, which come
back many times, are results on machines which are built in layers of simpler
components, and geometric approaches to bounding VC dimension. The latter, to
over-simplify a bit, rest on dividing the parameter space up into regions which
all give the same output, and then measuring the complexity of this partition.
Another nice feature here is the presumption that no classification rule in the
model family is completely correct. The special case where there is a
completely correct rule is called "the restricted model"^{1}.
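(To make the shattering idea behind VC dimension concrete: a class has VC dimension at least d if some set of d points can be given all 2^d labelings by rules in the class. The brute-force check below is a toy sketch of my own, not from the book; it shows that threshold rules on the real line shatter one point but not two, so their VC dimension is 1.)

```python
def shatters(points, hypotheses):
    """Does the hypothesis class realize every possible labeling of `points`?"""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Threshold rules h_t(x) = 1 if x >= t, for thresholds t on a grid.
# (The default-argument trick pins down t for each lambda.)
thresholds = [lambda x, t=t: int(x >= t) for t in [i * 0.5 for i in range(-10, 11)]]

print(shatters([0.0], thresholds))       # True: one point can get either label
print(shatters([0.0, 1.0], thresholds))  # False: the labeling (1, 0) is unrealizable
```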

Part II also deals with binary classification, but specifically with probabilistic or margin-based classifiers, where the output of the machine is a number in [0,1], and the goal is to use this number not just to classify correctly, but to do so with high confidence or with a large margin. This leads naturally to generalizations of the VC dimension for continuous functions, such as the "pseudo-dimension" (= the VC dimension of the sets formed by thresholding the functions), which are connected to covering numbers (= how many balls of a given radius are needed to cover the whole of some function space), and so to uniform convergence of sample averages to expectations. It is then very natural to apply the same machinery in Part III to learning continuous-valued functions, especially regression functions. Here particularly fast learning rates become possible for convex (or nearly convex) function classes (ch. 20). There is also interesting material on interpolation, and on some of the pathologies which can arise when learning noise-free functions.
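(The pseudo-dimension admits the same kind of brute-force illustration; again, this is a toy of my own, not the book's. A set of points is pseudo-shattered if, for some fixed "witness" thresholds r_i, the sign patterns of f(x_i) - r_i over the class run through all possibilities. On a small grid of functions, constant functions fail to pseudo-shatter two points, while affine functions succeed.)

```python
from itertools import product

def pseudo_shatters(points, funcs, witnesses):
    """Is there a witness vector r so that the patterns (f(x_i) >= r_i)
    over the class realize all 2^n sign patterns?"""
    n = len(points)
    for r in product(witnesses, repeat=n):
        patterns = {tuple(int(f(x) >= ri) for x, ri in zip(points, r))
                    for f in funcs}
        if len(patterns) == 2 ** n:
            return True
    return False

# Small finite stand-ins for two classes: constants f(x) = c, and affine f(x) = a*x + b.
grid = [-2, -1, 0, 1, 2]
constants = [lambda x, c=c: c for c in grid]
affine = [lambda x, a=a, b=b: a * x + b for a in grid for b in grid]
witnesses = [-1.5, -0.5, 0.5, 1.5]

print(pseudo_shatters([0.0, 1.0], constants, witnesses))  # False: pseudo-dim 1
print(pseudo_shatters([0.0, 1.0], affine, witnesses))     # True: pseudo-dim >= 2
```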

Part IV is algorithmic: it defines efficient algorithms for learning as ones
which run in polynomial time (in the sample size and the model complexity), and
which only need a polynomial number of samples (in terms of the model
complexity, the approximation error, and the risk of failing to achieve that
approximation). Efficient learning is possible if and only if efficient,
randomized error minimization is possible; equivalently, error minimization must
lie in RP, the class of problems solvable in polynomial time by randomized
algorithms. The last few
chapters show that neural network learning cannot *in general* be done
efficiently, by reducing standard NP-hard problems to error minimization
for (first) binary perceptrons, and then for a series of more promising-seeming
multi-layer networks. Thus unless NP is really just contained within RP, which
is almost as implausible as NP=P, neural networks can't be learned efficiently
in general. (This is the only part of the book where some prior acquaintance
with theoretical computer science would be helpful.) The final chapter
provides constructive algorithms for nonlinear networks without hidden layers,
one a form of back-fitting and the other a form of boosting.
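(For the flavor of the quantitative bounds involved: a standard distribution-free result of the kind the book proves, quoted here in its general folklore form rather than with the book's exact constants, is that a class of VC dimension d can be learned agnostically to within ε of the best rule in the class, with probability at least 1 − δ, from

```latex
m(\epsilon, \delta) = O\!\left(\frac{d + \log(1/\delta)}{\epsilon^{2}}\right)
```

samples, which is polynomial in d, 1/ε, and log(1/δ), as the definition of sample efficiency requires.)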

There are, naturally, limitations to the book. Having been written in 1999 is not as much of one as might be supposed; the foundational mathematical theory is still there, and indeed I still find this a very useful reference for current research. Rather, I'd point to the following:

- The emphasis on VC theory makes a certain amount of sense, since it is fundamental to distribution-free learning (i.e., learning under all possible distributions). But it would be nice, in a modern course, to have some treatment of distribution-dependent bounds (e.g., using Rademacher complexity), or ones based on stability of predictions.
- The treatment of model selection, while non-trivial, is not as extensive as I'd have really liked.
- As I mentioned at the start, it only considers supervised learning (so, e.g., no reinforcement learning) and then only classification and regression (e.g., no ranking).
- While the *mathematical* material is extremely general, most of the concrete examples are in fact for feed-forward networks.

The last of those points deserves more elaboration, however. Neural
networks were popular when this book was published, but as a fad they were
already starting to be replaced by kernel-based methods. Now, under the label
of "deep learning", neural networks are popular again, which is I imagine why
the book is back in print. My slightly cynical and under-informed take is that
there has been no actual improvement in the theory of neural networks (let
alone steps towards biological realism), just decades of dedicated and skillful
engineering (and lots and lots of hardware). Soon enough the fad-driven part
of ML will latch on to yet another miracle technology^{2} --- to make something up,
using locality-sensitive hashing
to re-invent Gershenfeld et al.'s
cluster-weighted
modeling^{3} --- and neural networks
will go back to being the technology of the future. When that time comes, this
will still be a very valuable book.

[1]: Most texts reverse figure and ground here, treating situations where no rule is exactly right as the marked case of "agnostic learning". Anthony and Bartlett's framing is obviously much more realistic and sensible. ^

[2]: Somebody should write a "Simple Model of the Evolution of Supervised Learning Methods"; an essential requirement would be that there are lots and lots of easy variations to try out on many different domains. But on that front I've paid my dues. ^

[3]: If by some miracle there's something to that, just put me in the acknowledgments. ^

Paperback, ISBN 978-0-521-11862-0

1 June 2016