Books to Read While the Algae Grow in Your Fur, December 2014
Attention conservation notice: I have no taste.
- Mike Carey et al., Unwritten, 1: Tommy Taylor and the Bogus Identity
- Comic book mind candy, in which the wonderful power of story-telling is wielded irresponsibly.
- Matt Fraction and Gabriel Bá, Casanova, 1: Luxuria
- Comic book mind candy. I suspect more familiarity on my part with
superhero comics would have revealed even more in-jokes, but it's fun without
them. It also raises a question which I am surprised has not received more
attention from moralists
(ROT-13'd): vf vg ernyyl vaprfg vs vg'f gur irefvba bs lbhe fvoyvat sebz n cnenyyry jbeyq naq abg lbhef? (Actually, is gur pbzovangvba bs vaprfg naq gvzr-yvar ubccvat fbzr fbeg bs gevohgr gb Zbbepbpx'f Wreel Pbearyvhf obbxf?)
- Marisa Acocella Marchetto, Cancer Vixen
- Comic-book memoir of breast cancer.
It's funny and touching, but there were some very woo-woo bits which I had to skip
over, since they came far too close to the cancer-as-a-hidden-blessing nonsense
Ehrenreich skewers so well in her book. (Reading the two books so close
together was a coincidence.)
- Mary Sisson, Trang and Trust
- Mind candy: the trials and tribulations of the mid-level Canadian diplomat
who becomes humanity's first accredited representative to the multiple
much-more-advanced-than-us alien species with whom we turn out to share a
wormhole nexus. I took to it like catnip, and really wish there were more. I
especially liked the portions in the second book told from the perspective of
one of the aliens, who has a very different set of senses than we do, and
(ROT-13'd) jung unq ybbxrq
yvxr neovgenel phygheny fghoobeaarff be rira genafyngvba reebef gheaf bhg gb or
cynva ovbybtvpny gehguf.
- Sequel, after seven (!) years. §
- Sebastien de Castell, Traitor's Blade
- Mind candy; the only swords-and-sorcery fantasy I've encountered
about trying to build a functioning state by imposing a uniform system of
justice on an unwilling feudal aristocracy. It ends with, if not a literal
cliff-hanger, then at least our heroes trapped between two hostile armies, and a
truth which has been obvious to the reader for much of the book having just
struck the narrator like a revelation. (Admittedly, he's been kind of
preoccupied.) I'll look for the inevitable sequel.
- Emma Bull, Elizabeth Bear and friends, Shadow Unit
- Mind candy, finally read from start to finish. I liked it, except for the
ending.
Further comment ROT-13'd:
Xvyyvat bss zbfg bs gur grnz jnf svar, gubhtu V gubhtug gur jnl vg
jnf qbar erdhverq gurz gb or fghcvq. (Vs gur rarzl'f fhcre-cbjre vf
znavchyngvat cebonovyvgl, ubj vf fraqvat gur grnz va gb uvf obbol-genccrq ynve
fhcrevbe gb whfg sver-obzovat fnvq ynve?) Yvxrjvfr, gheavat Ivyyrggr vagb n
fbeg bs nabznybvq Rqjneq Fabjqra frrzrq vzcynhfvoyr.
- Barbara Ehrenreich, Bright-Sided: How the Relentless Promotion of Positive Thinking Has Undermined America
- An excellent corrective to one of our characteristic national weaknesses,
as well as a shrewd debunking of a vile little ideology. I think she
over-reaches a little when she tries to (partly) blame the financial crisis on
positive thinking on the part of corporate leaders, and to identify it with free-market fundamentalism (at root a very different ideology). But overall
she's sound, she's funny, she's convincing, and she reserves her scorn for
those who actually do harm, rather than their dupes. And the chapter about
Ehrenreich's own experience of being involuntarily plunged into an environment
of 90-proof positive thinking after being diagnosed with breast cancer is very
fine (and angry, and annihilating) writing indeed.
- Fan R. K. Chung, Spectral Graph Theory
- Chung's book is largely about the spectral properties of graph Laplacians, roughly evenly divided in coverage between how to read off geometric or topological properties (e.g., diameter) of arbitrary graphs from their spectra, and results on the spectra of special graphs, like regular graphs (where all nodes have the same degree) or expanders. There is also an interesting chapter on "quasi-randomness" in graphs, i.e., properties which guarantee that a particular graph will in many ways look like a typical realization of an Erdős-Rényi process.
- What drove me to Chung's book was the importance of spectral graph theory
for clustering, specifically spectral clustering, and related methods of
dimensionality reduction. I learned a lot from it --- admittedly
much of it stuff I should have known already --- but there is actually
rather little about those particular applications here. This suggests a gap in
the market.
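- Since it was spectral clustering that sent me to the book in the first place, here is a minimal sketch of the idea in Python (numpy only; the toy data, the Gaussian affinities, and the crude thresholding step are my illustrative choices, not anything from Chung's text):

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2}, the normalized Laplacian Chung works with,
    for a symmetric weighted adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(d > 0, d, 1.0))
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def spectral_embedding(A, k=2):
    """Coordinates from the k eigenvectors of L with the smallest eigenvalues."""
    L = normalized_laplacian(A)
    eigvals, eigvecs = np.linalg.eigh(L)   # eigh: L is symmetric
    return eigvecs[:, :k]                  # columns = embedding coordinates

# Toy use: two well-separated blobs, a Gaussian similarity graph, and a crude
# split on the second eigenvector (in practice one would run k-means on the
# embedding coordinates).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
A = np.exp(-np.sum((x[:, None, :] - x[None, :, :])**2, axis=-1))
np.fill_diagonal(A, 0.0)
coords = spectral_embedding(A, k=2)
labels = (coords[:, 1] > np.median(coords[:, 1])).astype(int)
print(labels)   # recovers the two blobs
```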
- Peter Godfrey-Smith, Complexity and the Function of Mind in Nature
- There is a long-standing theme in biology and philosophy which claims that
the presence of complex minds in some organisms may be explained, somehow, by
the complexity of those organisms' environments. The first half of this book
is an attempt to clarify what, exactly, this "environmental complexity thesis"
might mean. This in turn involves clearing a lot of ground, and considering a
lot of literature, on topics like the varieties of biological explanation, the
meaning of terms like "biological function", "adaptation", "complexity", and
even "environment". (To use an example which I think Godfrey-Smith doesn't:
Only the most ardent panpsychist would argue that our normal skin bacteria have
the same kind of minds that we do, so if the environmental complexity thesis is
to have any hope of being true, "environment" had better not just amount to
"physical surroundings".) This also leads into thickets like the extent to
which organisms construct their environments and ecological niches, and just
what that means; the correspondence theory of truth; and why, biologically,
true beliefs, or belief-like internal states, might be valuable. Two recurring
presences here are John Dewey
(no surprise) and Herbert Spencer (a considerable
surprise); neither are treated as oracles, but at least as representative
figures, and Spencer doesn't appear quite as crude as his later reputation
suggests.
- If part I reached any firm conclusions about the environmental complexity
thesis, I cannot, after having put the book aside for a while, recall what they
were.
- Part II is a straightforward examination of biological models of phenotypic
plasticity, and of when it will be favored by natural selection. More
particularly, he's interested in when there will be selection for flexible
behavior in response to environmental cues (as opposed to just randomly doing
different things). None of this is earth-shaking (cf.):
the different behaviors need to lead to different fitnesses in different
environmental states, and the organism needs to be able to attend to
sufficiently-reliable cues that tracking them is generally fitness-enhancing.
This means that the models all have a pleasing flavor not just of decision
theory but of signal detection. (There are tricky issues about the potential
costs of evolving to perceive more reliable cues, or use a given cue more
reliably, and mean-variance trade-offs in fitness.) So, does a more complex
environment encourage the evolution of complex cognition to deal with it?
Sometimes, in these models.
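- To give the flavor of these models concretely, here is a toy calculation of my own (not one of Godfrey-Smith's actual models): two environmental states, two behaviors, and a cue that reports the true state with some reliability; following the cue beats the best fixed behavior only when the cue is reliable enough to cover the (assumed) cost of the sensing machinery.

```python
import numpy as np

# Fitness payoffs: rows = true environmental state (0 or 1),
# columns = behavior chosen (0 or 1); behavior i is best in state i.
W = np.array([[1.0, 0.2],
              [0.3, 1.0]])
p_state1 = 0.5                     # how often the environment is in state 1

def fitness_fixed(behavior):
    """Expected fitness of always doing the same thing."""
    return (1 - p_state1) * W[0, behavior] + p_state1 * W[1, behavior]

def fitness_plastic(reliability, cost=0.05):
    """Expected fitness of following a cue that matches the true state with
    the given reliability, minus a (hypothetical) cost of plasticity."""
    exp_w = ((1 - p_state1) * (reliability * W[0, 0] + (1 - reliability) * W[0, 1])
             + p_state1 * (reliability * W[1, 1] + (1 - reliability) * W[1, 0]))
    return exp_w - cost

best_fixed = max(fitness_fixed(0), fitness_fixed(1))
for r in [0.5, 0.7, 0.9, 0.99]:
    print(f"cue reliability {r:.2f}: plastic {fitness_plastic(r):.3f} "
          f"vs. best fixed {best_fixed:.3f}")
```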
- Part II is just as abstract as Part I, but in a strikingly different way:
each model is extremely stark, with almost all biological detail
ignored as irrelevant, but it is, you should forgive the expression,
highly specific abstraction, unlike the generalities of Part I.
This, along with "it depends", may be part of Godfrey-Smith's moral to
his fellow philosophers.
- (Thanks to IB and ZMS for giving me a copy of this book.)
- Zoë Sharp, The Fifth Victim
- Mind candy. Despite the name, this is the ninth book in the series which
began with Killer Instinct.
(Skipping over the intervening books was an accident of what I laid hands on.)
I liked the crime-thriller elements, and could tolerate the soap-opera parts of
the on-going series.
- John Shawe-Taylor and Nello Cristianini, Kernel Methods for Pattern Analysis
- Most of the procedures of basic statistics rely on linear models; it's built into the name of things like "linear regression" or "linear discriminants", and just below the surface with stuff like "principal components analysis" or "logistic regression". This is not because of any deep reason why linear models should work, but because of sheer computational simplicity. Take
linear regression; when we want to make a prediction at a new point $x$, a
vector in $p$ dimensions, we have a vector of $p$ coefficients $b$, and our
prediction is just the inner product, $x \cdot b$. Even estimating the
coefficients is easy. If we bundle all the $x$ vectors we observed into a
matrix $\mathbf{x}$, and likewise turn all the responses $y$ into a matrix
$\mathbf{y}$, then the ordinary least-squares estimate is $\hat{b} =
(\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}$, which is just a matter
of multiplying and inverting some matrices. Unfortunately, as I said, there is
usually no reason to use such models other than their simplicity.
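- For concreteness, the whole calculation fits in a few lines of Python/numpy (a sketch with made-up data; using solve() is just a numerically safer way of doing the matrix inversion above):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3
X = rng.normal(size=(n, p))                # the matrix of x vectors
b_true = np.array([1.5, -2.0, 0.5])
y = X @ b_true + rng.normal(scale=0.1, size=n)

# Ordinary least squares: b_hat = (X^T X)^{-1} X^T y
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

x_new = rng.normal(size=p)
prediction = x_new @ b_hat                 # a prediction is just an inner product
print(b_hat, prediction)
```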
- An ancient device (*) to keep some of the computational advantages of linear
models while expanding their expressive power is to work not directly with $x$,
but with some nonlinear transformation of it, say $\phi(x)$. Here $\phi$ is to
be understood to be a function from vectors to vectors. The dimension of
$\phi$'s range can be larger, perhaps much larger, than the dimension of $x$.
For instance, $\phi$ might calculate all the products of powers of coordinates in $x$, up to some order $d$, which would involve ${p+d \choose d}$ terms. One drawback of this approach is that to get a
lot of expressive power and flexibility, the transformation $\phi$ needs to be
into a really high-dimensional space. This in turn leads to statistical
instability, forces us to estimate and remember lots of coefficients, invert
really huge matrices, and compute a large number of nonlinear transformations
every time we want to make a prediction. Using an infinite set of
transformations would seem to be hopeless.
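- A sketch of the explicit-transformation approach, with products of coordinates up to a given order (my own illustration; the point is how fast the number of features grows with $p$ and $d$):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(x, degree=2):
    """phi(x): all products of coordinates of x of order 1..degree,
    plus a constant term."""
    feats = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            feats.append(np.prod(x[list(idx)]))
    return np.array(feats)

print(len(poly_features(np.array([2.0, 3.0, 5.0]), degree=2)))  # 10 for p=3, d=2
print(len(poly_features(np.ones(20), degree=5)))                # 53130 for p=20, d=5
```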
- The "kernel trick" is the recognition that we can often short-cut actually
computing the transformations. There are two key steps. The first is to
re-write the problem so that instead of looking at a coefficient for each
dimension (whether of $x$ or $\phi(x)$), we get a weight $a_i$ for each
data point $i$,
\[
x \cdot b = \sum_{i=1}^{n}{ a_i x \cdot \mathbf{x}_{i}}
\]
Just as we could find the whole $p$-dimensional vector of OLS coefficients, we
can find the whole $n$-dimensional vector of weights:
\[
\hat{a} = (\mathbf{x} \mathbf{x}^T)^{-1} \mathbf{y}
\]
If we replace vectors with their transforms everywhere, we still only have to
find $n$ weights, no matter how many dimensions we unfold into.
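- In code, continuing the toy regression above (I use a pseudo-inverse, since the n-by-n Gram matrix $\mathbf{x} \mathbf{x}^T$ need not be invertible when $n > p$):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

G = X @ X.T                           # n x n matrix of inner products
a_hat = np.linalg.pinv(G) @ y         # one weight per data point

x_new = rng.normal(size=p)
dual_prediction = np.sum(a_hat * (X @ x_new))   # sum_i a_i (x_new . x_i)

b_hat = np.linalg.solve(X.T @ X, X.T @ y)       # primal OLS, for comparison
print(dual_prediction, x_new @ b_hat)           # agree up to numerical error
```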
- The second key step is to recognize that in this "dual" form, we can find
the weights for the training points, and make the new predictions,
using only inner products between data vectors. The actual vectors
don't matter, only their inner products. After transformation, we'll be doing
things like looking for $\phi(x_i) \cdot \phi(x_j) = \kappa(x_i, x_j)$. If we
can find a way to compute the "kernel" $\kappa$ directly, we never need to
actually compute $\phi$! In fact, $\phi$ could be infinite-dimensional, so
long as there is an efficient way of calculating $\kappa$. As a simple
example, take the Gaussian kernel, $\kappa(x_i,x_j) = e^{-\frac{1}{2}\|x_i -
x_j\|^2}$. This is clearly easy to calculate, but expanding it in a power
series shows that $\phi(x_i)$ maps $x_i$ to infinitely many products of powers
of its coordinates, albeit down-weighting the really high powers.
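- A minimal implementation of that kernel, computed directly on pairs of points (bandwidth fixed at 1, as in the formula above):

```python
import numpy as np

def gaussian_kernel(X1, X2):
    """kappa(x_i, x_j) = exp(-||x_i - x_j||^2 / 2) for all pairs of rows."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-0.5 * np.clip(sq_dists, 0.0, None))  # clip guards round-off

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = gaussian_kernel(X, X)
print(K.shape)    # (5, 5); the diagonal entries are all 1
```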
- Hence the kernel trick: Take your original linear procedure; re-express it
so that it only involves inner products between vectors; replace the inner
products with efficiently-computable kernels; and voilà, you have a nonlinear
statistical method. In fact, it turns out that we can abstract away from an
explicit transformation $\phi$ even more than this suggests, because of a
powerful result known as Mercer's theorem. This gives conditions under which
any function $\kappa(x_i, x_j)$ could have been written $\phi(x_i)\cdot
\phi(x_j)$ for some transformation $\phi$. The conditions basically amount to
"it wouldn't be perverse to say that $\kappa(x_i, x_j)$ measures how similar
$x_i$ and $x_j$ are in some respect". One can therefore think directly
about kernels as similarity measures, and leave the higher-dimensional
transformation implicit.
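- In practice the Mercer condition comes down to requiring that every kernel matrix be symmetric and positive semi-definite, which is easy to check numerically for a candidate similarity function; a quick sanity check on the Gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-0.5 * sq)                 # Gaussian kernel matrix, as before

# Mercer-style check: symmetric, with no appreciably negative eigenvalues
eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)   # True True
```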
- Once the kernel trick is grasped, there are a few directions to go.
First, make sure that we haven't lost all statistical reliability in switching
from Euclidean inner products to (other) kernels. Since the weights on
individual data points are obviously part of our predictive machinery
rather than the data source, no one bothers with statistical inference on them.
Rather, the goal is to make sure that the kernelized procedure
can predict reliably, especially by bounding the error of its predictions on new data from the same source. This often involves regularizing
the search for optimal weights by adding various penalties.
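- The standard regularized version of the kernelized regression sketched above is kernel ridge regression, where the weights solve $(K + \lambda I) a = \mathbf{y}$; a small sketch (the Gaussian kernel, its bandwidth, and $\lambda$ are arbitrary choices for illustration):

```python
import numpy as np

def gaussian_kernel(X1, X2, bandwidth=1.0):
    sq = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2 * X1 @ X2.T)
    return np.exp(-0.5 * np.clip(sq, 0.0, None) / bandwidth**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)   # a nonlinear target

lam = 0.1
K = gaussian_kernel(X, X)
a_hat = np.linalg.solve(K + lam * np.eye(len(X)), y)    # penalized dual weights

X_test = np.linspace(-3, 3, 7)[:, None]
y_pred = gaussian_kernel(X_test, X) @ a_hat             # sum_i a_i kappa(x, x_i)
print(np.column_stack([np.sin(X_test[:, 0]), y_pred]))  # truth vs. prediction
```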
- The second direction to go is, naturally, to find linear statistical
procedures and "kernelize" them, taking care that the dual form still
presents us with a tractable optimization problem. Hence kernel
regression, kernel discriminants, kernel principal components, kernel
canonical correlations, kernel logistic regression, ...
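- For instance, kernel PCA needs nothing beyond the kernel matrix itself; a minimal sketch (center the kernel matrix in feature space, eigendecompose, and read off the projections, with the Gaussian kernel as the running example):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Kernel PCA from a kernel matrix K: double-center K (centering the
    implicit feature vectors), then project onto the leading eigenvectors."""
    n = len(K)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues first
    # Scores = projections of the training points on the principal components
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-0.5 * sq)
scores = kernel_pca(K, n_components=2)
print(scores.shape)    # (100, 2)
```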
- The third and final direction is to find kernels which express important
notions of similarity in different situations, or for different types of data
— not just vectors but text, images, DNA sequences, etc. (One of the
great advantages of kernelization is that it adds an abstraction layer,
separating "how do we make this sort of prediction?" from "how do we represent
the data"?) The kernels must, of course, be functions which we can calculate
quickly, and it's helpful to decompose kernels for big, complex data types into
kernels for simpler problems, so a lot of attention has been given to formal
properties of kernel functions (e.g., when is the product of two kernel
functions itself a kernel function?).
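- As a small example of both points, kernels can be defined directly on non-vector data (here strings, via a toy "spectrum" kernel that counts shared length-k substrings, a standard construction in the literature), and the pointwise product of two kernels is again a kernel:

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """kappa(s, t) = sum over length-k substrings w of
    (# occurrences of w in s) * (# occurrences of w in t)."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[w] * ct[w] for w in cs.keys() & ct.keys())

def product_kernel(k1, k2):
    """The pointwise product of two kernels is itself a valid kernel."""
    return lambda x, y: k1(x, y) * k2(x, y)

dna1, dna2 = "GATTACAGATTACA", "GATTTACA"
print(spectrum_kernel(dna1, dna2))
print(product_kernel(spectrum_kernel, spectrum_kernel)(dna1, dna2))  # its square
```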
- This is the standard text on kernel methods, and for good reason. It
follows roughly the outline I sketched above: the basic idea of kernelizing
linear procedures, their statistical reliability, a slew of different kernel
methods, and then kernels for many different types of data, together with
formalisms for composing kernels, or extracting them from parametric
statistical models. The implied reader seems to be coming from computer
science or electrical engineering rather than statistics, but statistics
students with a little knowledge of algorithms will find it rewarding. The
deeper reaches of the subject (notably reproducing-kernel Hilbert
spaces) are carefully skirted.
- While the book is over ten years old now, and so misses the last decade of
the literature, I think it's fair to say those developments have mostly been
technical, especially computational, rather than conceptual or methodological.
While a revised edition would be a good thing, I can still recommend this very
strongly.
- To show just how much of a sloth I am: I picked
this up in 2005, read a chunk then, consulted it as a reference from time to
time while teaching, and never got around to finishing it until this
month.
- *: It's mentioned in Norbert Wiener's 1956 paper "Nonlinear Prediction and Dynamics" — a remarkable paper which deserves a post of its own.
Books to Read While the Algae Grow in Your Fur;
Scientifiction and Fantastica;
Enigmas of Chance;
Pleasures of Detection, Portraits of Crime;
Philosophy;
Biology;
Minds, Brains, and Neurons;
Complexity;
Psychoceramica;
The Beloved Republic;
Mathematics;
Networks
Posted at December 31, 2014 23:59 | permanent link