## December 31, 2013

### Books to Read While the Algae Grow in Your Fur, December 2013

Attention conservation notice: I have no taste.

Julia Spencer-Fleming, Through Evil Days
Continuation of the long-running series. Pulls off the trick of making domestic troubles, ice storms, and not-very-bright criminals equally threatening. (Previously.)
Prudence Shen and Faith Erin Hicks, Nothing Can Possibly Go Wrong
Comic-book mind candy: a school story about evil cheerleaders vs. combat robots. Delightful, though I don't usually care for such unalloyed social realism.
G. E. R. Lloyd, Disciplines in the Making: Cross-Cultural Perspectives on Elites, Learning, and Innovation
A learned comparative examination of how fields like mathematics, art, law, religion, history, philosophy and medicine become marked off as cohesive and (more or less) self-regulating disciplines, and some of the consequences of doing so. There are no great revelations here (disciplines promote competence in accepted skills and concepts, which can deepen understanding within a limited area but may or may not lead to progress), but the pay-off here is Lloyd's generous erudition, extending across multiple ancient classical cultures and ethnographic records.
Stéphane Boucheron, Gábor Lugosi and Pascal Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence
Maxim Raginsky and Igal Sason, "Concentration of Measure Inequalities in Information Theory, Communications and Coding", Foundations and Trends in Communications and Information Theory 10 (2013): 1--246, arxiv:1212.4663
Probability distributions in high-dimensional spaces are deeply weird creatures. Suppose we take a point uniformly from the interior of a high-dimensional hyper-sphere of radius 1. Our intuition, from experience with the disc and the three-dimensional ball, is that most such points will be far from the surface of the hyper-sphere. However, a little bit of calculus is enough to show that as the dimension grows, the ratio of the surface area to the volume also grows. This means that a thin, constant-thickness shell just inside the surface ends up containing almost all of the volume --- almost all random points are very close to the surface.
So far, these are just curious facts about high-dimensional geometry, but they have deeper consequences. This sort of "concentration of measure phenomenon" turns out to be quite generic for high-dimensional distributions where each coordinate follows its own distribution, independent of the other coordinates. There is usually some set which captures almost all of the probability, and other positive-probability sets have to be expanded just a little to overlap with it. If we can figure out how things look on this high-probability set, we know most of what the whole distribution will do.
More specifically, the concentration of the probability measure on a particular set turns out to be equivalent to the following: all sufficiently nice functions are, with very high probability, close to their expectation values or their medians. In fact, one can derive results of the form "for $n$ dimensions, the probability that any function $f$ in a well-behaved class $\mathcal{F}$ differs from its expectation value by $\epsilon$ or more is $\leq C e^{-nr(\epsilon)}$", with explicit values for the constant $C$ and the rate-function $r$. (Often, but not necessarily, $r = O(\epsilon^2)$.) These might depend on the true distribution and on the function class $\mathcal{F}$, but not on the specific function $f$.
To appreciate the significance of this, one needs to think back to more classic parts of probability theory. The laws of large numbers say that the averages of many independent random variables converge, asymptotically, on expectation values. With some work, these results can be extended to other sorts of functions of many variables, not just averages, under conditions which amount to saying "the function can't depend too strongly on any one variable". Such laws do not say how fast the convergence happens. The large deviations principles give asymptotic rates for the convergence of averages, and some other functions of many variables, but say little or nothing about what happens at finite $n$. What sets concentration bounds apart is that they put definite limits on the fluctuations away from typical behavior at finite sizes --- they are non-asymptotic.
The idea of concentration of measure goes back to work by Paul Lévy and A. I. Khinchin before World War II. The field has really grown rapidly over the last two decades, however, partly because of enabling technical developments, and partly because of the growing demand from applications. Machine learning has emerged as a field which really, really wants to be able to give bounds on how well a predictor learned from a finite amount of data will generalize to new data. In- and out-of- sample prediction errors are complicated functions of the training data, and the size of the sample is hardly ever so overwhelming that asymptotics are comforting. Thus concentration bounds are exactly the sort of thing a learning theorist wants [1].
These two works are two excellent introductions to concentration of measure, and its uses for probability theory and its applications. The book of Boucheron et al. is more pedagogical, particularly in its early chapters, and at once more interested in purely-mathematical issues and in machine learning. That of Raginsky and Sason is not so leisurely or so catholic in its exposition (it's half the length of Boucheron et al.), and, understandably, gives a lot more space to applications in information and communication. Both of them, however, do an admirable job of conveying not just the major results, but also the major classes of techniques used to prove them, and so open the way for their readers to go about deriving new results. Boucheron et al. has nothing on how to extend concentration from independent to weakly-dependent random variables; and Raginsky and Sason treat martingales much more extensively (in chapter 2), and have a little, but only a little, on more-interesting-to-me stochastic processes. (Since the latter is what really got me interested in concentration, I probably regret this more than most readers will.) They both also largely punt on the very important, but forbiddingly technical, work of Talagrand and collaborators, though again Raginsky and Sason say a bit more.
I would have absolutely no hesitation in using either to teach a class to advance graduate students in statistics, machine learning, or allied fields, with the choice depending entirely on their background, interests, and funding [2].
Disclaimer: Maxim is a friend, and we have vague plans for a paper together.
[1]: That said, concentration in the technical sense is more of a sufficient condition than a necessary one; cf. this interesting recent paper by Mendelson.
[2]: In the "why oh why can't we have a better academic publishing system" department, I note that Raginsky and Sason's book is free online, while Boucheron et al.'s has a list price of \$125. The latter is a well-produced and sturdy hardback, but that is a very steep price for the printing, binding and distribution services of the Oxford University Press.

Posted at December 31, 2013 23:59 | permanent link