## December 31, 2014

### Books to Read While the Algae Grow in Your Fur, December 2014

Attention conservation notice: I have no taste.

Mike Carey et al., Unwritten, 1: Tommy Taylor and the Bogus Identity
Comic book mind candy, in which the wonderful power of story-telling is wielded with irresponsibly.
Matt Fraction and Gabriel Bà, Casanova, 1: Luxuria
Comic book mind candy. I suspect more familiarity on my part with superhero comics would have revealed even more in-jokes, but it's fun without them. It also raises a question which I am surprised has not received more attention from moralists (ROT-13'd): vf vg ernyyl vaprfg vs vg'f gur irefvba bs lbhe fvoyvat sebz n cnenyyry jbeyq naq abg lbhef? (Actually, is gur pbzovangvba bs vaprfg naq gvzr-yvar ubccvat fbzr fbeg bs gevohgr gb Zbbepbpx'f Wreel Pbearyvhf obbxf? )
Marisa Acocella Marchetto, Cancer Vixen
Comic-book memoir of breast cancer. It's funny and touching, but there were very woo-woo bits which I had to skip over, since they came far too close to the cancer-as-a-hidden-blessing nonsense Ehrenreich skewers so well in her book. (Reading the two books so close together was a coincidence.)
Mary Sisson, Trang and Trust
Mind candy: the trials and tribulations of the mid-level Canadian diplomat who becomes humanity's first accredited representative to the multiple much-more-advanced-than-us alien species with whom we turn out to share a wormhole nexus. I took to it like catnip, and really wish there were more. I especially liked the portions in the second book told from the perspective of one of the aliens, who has a very different set of senses than we do, and (ROT-13'd) jung unq ybbxrq yvxr neovgenel phygheny fghoobeaarff be rira genafyngvba reebef gheaf bhg gb or cynva ovbybtvpny gehguf.
Mind candy; the only swords-and-sorcery fantasy I've encountered about trying to build a functioning state by imposing a uniform system of justice on an unwilling feudal aristocracy. It ends with, if not a literal cliff-hanger, then at least our heroes trapped between two hostile armies, and a truth which has been obvious to the reader for much of the book having just struck the narrator like a revelation. (Admittedly, he's been kind of preoccupied.) I'll look for the inevitable sequel.
Emma Bull, Elizabeth Bear and friends, Shadow Unit
Mind candy, finally read from start to finish. I liked it, except for the ending. Further comment ROT-13'd: Xvyyvat bss zbfg bs gur grnz jnf svar, gubhtu V gubhtug gur jnl vg jnf qbar erdhverq gurz gb or fghcvq. (Vs gur rarzl'f fhcre-cbjre vf znavchyngvat cebonovyvgl, ubj vf fraqvat gur grnz va gb uvf obbol-genccrq ynve fhcrevbe gb whfg sver-obzovat fnvq ynve?) Yvxrjvfr, gheavat Ivyyrggr vagb n fbeg bs nabznybvq Rqjneq Fabjqra frrzrq vzcynhfvoyr.
Barbara Ehrenreich, Brightsided: How the Relentless Promotion of Positive Thinking Has Undermined America
An excellent corrective to one of our characteristic national weaknesses, as well as a shrewd debunking of a vile little ideology. I think she over-reaches a little when she tries to (partly) blame the financial crisis on positive thinking on the part of corporate leaders, and identify it with free-market fundamentalism (at root a very different ideology). But over-all she's sound, she's funny, she's convincing, and she reserves her scorn for those who actually do harm, rather than their dupes. And the chapter about Ehrenreich's own experience of being involuntarily plunged into an environment of 90-proof positive thinking after being diagnosed with breast cancer is very fine (and angry, and annihilating) writing indeed.
Fan R. K. Chung, Spectral Graph Theory
Chung's book is largely about the spectral properties of graph Laplacians, roughly evenly divided in coverage between how to read off geometric or topological properties (e.g., diameter) of arbitrary graphs from their spectra, and results on the spectra of special graphs, like ones where all nodes have the same degree, or expanders. There is also an interesting chapter on "quasi-randomness" in graphs, i.e., properties which guarantee that a particular graph will in many ways look like a typical realization of an Erdos-Renyi process.
What drove me to Chung's book was the importance of spectral graph theory for clustering, specifically spectral clustering, and related methods of dimensionality reduction. I learned a lot from it --- admittedly much of it stuff I should have known already --- but there is actually rather little about those particular applications here. This suggests a gap in the market.
Peter Godfrey-Smith, Complexity and the Function of Mind in Nature
There is a long-standing theme in biology and philosophy which claims that the presence of complex minds in some organisms may be explained, somehow, by the complexity of those organisms' environments. The first half of this book is an attempt to clarify what, exactly, this "environmental complexity thesis" might mean. This in turn involves clearing a lot of ground, and considering a lot of literature, on topics like the varieties of biological explanation, the meaning of terms like "biological function", "adaptation", "complexity", and even "environment". (To use an example which I think Godfrey-Smith doesn't: Only the most ardent panpsychist would argue that our normal skin bacteria have the same kind of minds that we do, so if the environmental complexity thesis is to have any hope of being true, "environment" had better not just amount to "physical surroundings".) This also leads into thickets like the extent to which organisms construct their environments and ecological niches, and just what that means; the correspondence theory of truth; and why, biologically, true beliefs, or belief-like internal states, might be valuable. Two recurring presences here are John Dewey (no surprise) and Herbert Spencer (a considerable surprise); neither are treated as oracles, but at least as representative figures, and Spencer doesn't appear quite as crude as his later reputation suggests.
If part I reached any firm conclusions about the environmental complexity thesis, I cannot, after having put the book aside for a while, recall what they were.
Part II is a straightforward examination of biological models of phenotypic plasticity, and when it will be favored by natural selection. More particularly, he's interested in when there will be selection for flexible behavior in response to environmental cues (as opposed to just randomly doing different things). None of this is earth-shaking (cf.): the different behaviors need to lead to different fitnesses in different environmental states, and the organism needs to be able to attend to sufficiently-reliable cues that tracking them is generally fitness-enhancing. This means that the models all have a pleasing flavor not just of decision theory but of signal detection. (There are tricky issues about the potential costs of evolving to perceive more reliable cues, or use a given cue more reliably, and mean-variance trade-offs in fitness.) So, does a more complex environment encourage the evolution of complex cognition to deal with it? Sometimes, in these models.
Part II is just as abstract as Part I, but in a strikingly different way: each model is extremely stark, with almost all biological detail ignored as irrelevant, but it is, you should forgive the expression, highly specific abstraction, unlike the generalities of Part I. This, along with "it depends", may be part of Godfrey-Smith's moral to his fellow philosophers.
(Thanks to IB and ZMS for giving me a copy of this book.)
Zoe Sharp, The Fifth Victim
Mind candy. Despite the name, this is the ninth book in the series which began with Killer Instinct. (Skipping over the intervening books was an accident of what I could lay hands on.) I liked the crime-thriller elements, and could tolerate the soap-opera parts of the on-going series.
John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis
Most of the procedures of basic rely on linear models; it's built in to the name of things like "linear regression" or "linear discriminants", and just below the surface with stuff like "principal components analysis" or "logistic regression". This is not because of any deep reason why linear models should work, but sheer computational simplicity. Take linear regression; when we want to make a prediction at a new point $x$, a vector in $p$ dimensions, we have a vector of $p$ coefficients $b$, and our prediction is just the inner product, $x \cdot b$. Even estimating the coefficients is easy. If we bundle all the $x$ vectors we observed into a matrix $\mathbf{x}$, and likewise turn all the responses $y$ into a matrix $\mathbf{y}$, then the ordinary least-squares estimate is $\hat{b} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}$, which is just a matter of multiplying and inverting some matrices. Unfortunately, as I said, there is usually no reason to use such models other than their simplicity.
An ancient device (*) to keep some of the computational advantages of linear models while expanding their expressive power is to work not directly with $x$, but with some nonlinear transformation of it, say $\phi(x)$. Here $\phi$ is to be understood to be a function from vectors to vectors. The dimension of $\phi$'s range can be larger, perhaps much larger, than the dimension of $x$. For instance, $\phi$ might calculate all the products of powers of coordinates in $x$, up to some order $d$, which would involve ${p+d-1 \choose p}$ terms. One drawback of this approach is that to get a lot of expressive power and flexibility, the transformation $\phi$ needs to be into a really high-dimensional space. This in turn leads to statistical instability, forces us to estimate and remember lots of coefficients, invert really huge matrices, and compute a large number of nonlinear transformations every time we want to make a prediction. Using an infinite set of transformations would seem to be hopeless.
The "kernel trick" is the recognition that we can often short-cut actually computing the transformations. There are two key steps. The first is to re-write the problem so that instead of looking at a coefficient for each dimension (whether of $x$ or $\phi(x)$), we get a weight $a_i$ for each data point $i$, $x \cdot b = \sum_{i=1}^{n}{ a_i x \cdot \mathbf{x}_{i}}$ Just as we could find the whole $p$-dimensional vector of OLS coefficients, we can find the whole $n$-dimensional vector of weights: $\hat{a} = (\mathbf{x} \mathbf{x}^T)^{-1} \mathbf{y}$ If we replace vectors with their transforms everywhere, we still only have to find $n$ weights, no matter how many dimensions we unfold into.
The second key step is to recognize that in this "dual" form, we can find the weights for the training points, and make the new predictions, using only inner products between data vectors. The actual vectors don't matter, only their inner products. After transformation, we'll be doing things like looking for $\phi(x_i) \cdot \phi(x_j) = \kappa(x_i, x_j)$. If we can find a way to compute the "kernel" $\kappa$ directly, we never need to actually compute $\phi$! In fact, $\phi$ could be infinite-dimensional, so long as there is an efficient way of calculating $\kappa$. As a simple example, take the Gaussian kernel, $\kappa(x_i,x_j) = e^{-\frac{1}{2}\|x_i - x_j\|^2}$. This is clearly easy to calculate, but expanding it in a power series shows that $\phi(x_i)$ maps $x_i$ to infinitely many products of powers of its coordinates, albeit down-weighting the really high powers.
Hence the kernel trick: Take your original linear procedure; re-express it so that it only involves inner products between vectors; replace the inner products with efficiently-computable kernels; and viola, you have a nonlinear statistical method. In fact, it turns out that we can abstract away from an explicit transformation $\phi$ even more than this suggests, because of a powerful result known as Mercer's theorem. This gives conditions under which any function $\kappa(x_i, x_j)$ could have been written $\phi(x_i)\cdot \phi(x_j)$ for some transformation $\phi$. The conditions basically amount to "it wouldn't be perverse to say that $\kappa(x_i, x_j)$ measures how similar $x_i$ and $x_j$ are in some respect". One can therefore think directly about kernels as similarity measures, and leave the higher-dimensional transformation implicit.
Once the kernel trick is grasped, there a couple of directions to go. First, make sure that we haven't lost all statistical reliability in switching from Euclidean inner products to (other) kernels. Since the weights on individual data points are obviously part of our predictive machinery rather than the data source, no one bothers with statistical inference on them. Rather, the goal is to make sure that the kernelized procedure can predict reliably, especially with bounding the error of predictions on new data from the same source. This often involves regularizing the search for optimal weights by adding various penalties.
The second direction to go is, naturally, to find linear statistical procedures and "kernelize" them, taking care that the dual form still presents us with a tractable optimization problem. Hence kernel regression, kernel discriminants, kernel principal components, kernel canonical correlations, kernel logistic regression, ...
The third and final direction is to find kernels which express important notions of similarity in different situations, or for different types of data — not just vectors but text, images, DNA sequences, etc. (One of the great advantages of kernelization is that it adds an abstraction layer, separating "how do we make this sort of prediction?" from "how do we represent the data"?) The kernels must, of course, be functions which we can calculate quickly, and it's helpful to decompose kernels for big, complex data types into kernels for simpler problems, so a lot of attention has been given to formal properties of kernel functions (e.g., when is the product of two kernel functions itself a kernel function?).
This is the standard text on kernel methods, and for good reason. It follows roughly the outline I sketched above: the basic idea of kernelizing linear procedures, their statistical reliability, a slew of different kernel methods, and then kernels for many different types of data, together with formalisms for composing kernels, or extracting them from parametric statistical models. The implied reader seems to be coming from computer science or electrical engineering rather than statistics, but statistics students with a little knowledge of algorithms will find it rewarding. The deeper reaches of the subject (notably reproducing-kernel Hilbert spaces) are carefully skirted.
While the book is over ten years old now, and so misses the last decade of the literature, I think it's fair to say those developments have mostly been technical, especially computational, rather than conceptual or methodological. While a revised edition would be a good thing, I can still recommend this very strongly.
To show just how much of a sloth I am: I picked this up in 2005, read a chunk then, consulted it as a reference from time to time while teaching, and never got around to finishing it until this month.
*: It's mentioned in Norbert Wiener's 1956 paper "Nonlinear Prediction and Dynamics" — a remarkable paper which deserves a post of its own.

Posted at December 31, 2014 23:59 | permanent link