*Attention conservation notice:* I have no taste.

- Julia Spencer-Fleming, Through Evil Days
- Continuation of the long-running series. Pulls off the trick of making domestic troubles, ice storms, and not-very-bright criminals equally threatening. (Previously.)
- Prudence Shen and Faith Erin Hicks, Nothing Can Possibly Go Wrong
- Comic-book mind candy: a school story about evil cheerleaders vs. combat robots. Delightful, though I don't usually care for such unalloyed social realism.
- G. E. R. Lloyd, Disciplines in the Making: Cross-Cultural Perspectives on Elites, Learning, and Innovation
- A learned comparative examination of how fields like mathematics, art, law, religion, history, philosophy and medicine become marked off as cohesive and (more or less) self-regulating disciplines, and some of the consequences of doing so. There are no great revelations here (disciplines promote competence in accepted skills and concepts, which can depend understanding within a limited area but may or may not lead to progress), but the pay-off here is Lloyd's generous erudition, extending across multiple ancient classical cultures and ethnographic records.
- Stéphane Boucheron, Gábor Lugosi and Pascal Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence
- Maxim Raginsky and Igal Sason, "Concentration of Measure
Inequalities in Information Theory, Communications and
Coding", Foundations and Trends in Communications and Information
Theory
**10**(2013): 1--246, arxiv:1212.4663 - Probability distributions in high-dimensional spaces are deeply weird creatures. Suppose we take a point uniformly from the interior of a high-dimensional hyper-sphere of radius 1. Our intuition, from experience with the disc and the three-dimensional ball, is that most such points will be far from the surface of the hyper-sphere. However, a little bit of calculus is enough to show that as the dimension grows, the ratio of the surface area to the volume also grows. This means that a thin, constant-thickness shell just inside the surface ends up containing almost all of the volume --- almost all random points are very close to the surface.
- So far, these are just curious facts about high-dimensional geometry, but they have deeper consequences. This sort of "concentration of measure phenomenon" turns out to be quite generic for high-dimensional distributions where each coordinate follows its own distribution, independent of the other coordinates. There is usually some set which captures almost all of the probability, and other positive-probability sets have to be expanded just a little to overlap with it. If we can figure out how things look on this high-probability set, we know most of what the whole distribution will do.
- More specifically, the concentration of the probability measure on a
particular set turns out to be equivalent to the following: all sufficiently
nice functions are, with very high probability, close to their expectation
values or their medians. In fact, one can derive results of the form "for $n$
dimensions, the probability that any function $f$ in a well-behaved class
$\mathcal{F}$ differs from its expectation value by $\epsilon$ or more is $\leq
C e^{-nr(\epsilon)}$", with
*explicit*values for the constant $C$ and the rate-function $r$. (Often, but not necessarily, $r = O(\epsilon^2)$.) These might depend on the true distribution and on the function class $\mathcal{F}$, but not on the specific function $f$. - To appreciate the significance of this, one needs to think back to more
classic parts of probability theory. The laws of large numbers say that
the
*averages*of many independent random variables converge, asymptotically, on expectation values. With some work, these results can be extended to other sorts of functions of many variables, not just averages, under conditions which amount to saying "the function can't depend too strongly on any one variable". Such laws do not say how fast the convergence happens. The large deviations principles give asymptotic rates for the convergence of averages, and some other functions of many variables, but say little or nothing about what happens at finite $n$. What sets concentration bounds apart is that they put definite limits on the fluctuations away from typical behavior at*finite*sizes --- they are non-asymptotic. - The idea of concentration of measure goes back to work by Paul Lévy
and A. I. Khinchin before World War II. The field has really grown rapidly
over the last two decades, however, partly because of enabling technical
developments, and partly because of the growing demand from applications.
Machine learning has emerged as a field which really, really wants to be able
to give bounds on how well a predictor learned from a finite amount of data
will generalize to new data. In- and out-of- sample prediction errors are
complicated functions of the training data, and the size of the sample is
hardly ever so overwhelming that asymptotics are comforting. Thus
concentration bounds are
*exactly*the sort of thing a learning theorist wants [1]. - These two works are two excellent introductions to concentration of
measure, and its uses for probability theory and its applications. The book of
Boucheron
*et al.*is more pedagogical, particularly in its early chapters, and at once more interested in purely-mathematical issues and in machine learning. That of Raginsky and Sason is not so leisurely or so catholic in its exposition (it's half the length of Boucheron*et al.*), and, understandably, gives a lot more space to applications in information and communication. Both of them, however, do an admirable job of conveying not just the major results, but also the major classes of techniques used to prove them, and so open the way for their readers to go about deriving new results. Boucheron*et al.*has nothing on how to extend concentration from independent to weakly-dependent random variables; and Raginsky and Sason treat martingales much more extensively (in chapter 2), and have a little, but only a little, on more-interesting-to-me stochastic processes. (Since the latter is what really got me interested in concentration, I probably regret this more than most readers will.) They both also largely punt on the very important, but forbiddingly technical, work of Talagrand and collaborators, though again Raginsky and Sason say a bit more. - I would have absolutely no hesitation in using either to teach a class to advance graduate students in statistics, machine learning, or allied fields, with the choice depending entirely on their background, interests, and funding [2].
*Disclaimer*: Maxim is a friend, and we have vague plans for a paper together.- [1]: That said, concentration in the technical sense is more of a sufficient condition than a necessary one; cf. this interesting recent paper by Mendelson.
- [2]: In the "why oh why can't we have a better
academic publishing system" department, I note that Raginsky and Sason's book
is free online, while Boucheron
*et al.*'s has a list price of $125. The latter is a well-produced and sturdy hardback, but that is a very steep price for the printing, binding and distribution services of the Oxford University Press.

Posted at December 31, 2013 23:59 | permanent link