The workhorse problem of applied statistics is regression. You have one variable, \( y \), whose value you want to guess from another variable \( x \), the predictor. Importantly, \( x \) is generally a vector, say of dimension \( p \), so \( x = (x^{(1)}, \ldots x^{(p)}) \). You have \( n \) observations of predictor-response pairs, \( (x_1, y_1), \ldots (x_n, y_n) \). From these, you want to guess a function \( f \) such that \( f(x_i) \) will be close to \( y_i \). Generally, in fact, you want to guess an \( f \) which won't just approximate the data but will also predict well, i.e., where \( f(X) \) will tend to be close to \( Y \) for new, random values of \( X \) and \( Y \), at least when those are produced by the same process as the training data.

The classical solution, which in various forms and guises now goes back
about two centuries, runs as follows. Define "close" as the mean of squared
errors, \( n^{-1}\sum_{i=1}^{n}{{\left(f(x_i) - y_i\right)}^2} \). Restrict
yourself to linear functions of \( x \), ones where \( f(x) = b \cdot x \) for
some \( p \)-dimensional vector of coefficients \( b \). Then the optimal,
least-squares value of \( b \), say \( \hat{\beta} \), is defined by an
optimization problem:
\[
\hat{\beta} = \argmin_{b \in \mathbb{R}^{p}}{\frac{1}{n}\sum_{i=1}^{n}{{\left(y_i - b \cdot x_i\right)}^2}}
\]
\( \hat{\beta} \) can be found through a few steps of linear algebra, and
calculated efficiently, by multiplying and inverting some
matrices^{1}. Under fairly mild
assumptions about the data-generating process (much weaker than positing
Gaussian distributions), the least-squares estimate \( \hat{\beta} \) converges
on the optimal linear prediction function, \( \beta \), the one which will
perform best out of sample^{2}.

This can be extended in many ways: the response variable \( y \) can itself
be made into a vector; the response can be the next observation of the
predictor \( x \); \( y \) can be made into a categorical variable, and the
linear function of \( x \) can be used to predict the probabilities of
different categories. One can even incorporate some kinds of nonlinearity by
replacing \( x \) with its image after applying a library of transformations,
so that one regresses \( y \) on \( r = (r_1(x), r_2(x), \ldots, r_p(x)) \). (One
then has to pick the transformations \( r_j \) somehow, which
is an
art in itself.) It is an altogether tidy and attractive scheme, and it is
not surprising that it has by some people (*cough*
econometricians *cough*) been confused with the whole of statistics or
even the whole of data analysis.

One key limitation to the scheme, however, has to do with the relation
between the number of training observations \( n \) and the dimension of the
predictor variable \( p \). If there are more observations than predictors, \(
n > p \), then there is (in general) a well-defined least-squares solution for
the coefficients. If, however, there are more coefficients to be found than
there are data points, \( p > n \), then the least-squares problem is
ill-posed, and there is no unique solution^{3}. Rather, there are infinitely many solutions, all of
which fit the training data equally well, leaving no clear way to choose between
them. This is the fundamental challenge of high-dimensional statistics.
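To make the non-uniqueness concrete, here is a tiny numerical illustration of my own (not from the text): with \( p = 2 \) predictors but only \( n = 1 \) observation, every coefficient vector whose entries sum to the right value fits the training data perfectly.

```python
# A toy illustration of the ill-posed p > n case: one observation, two predictors.

def mse(b, xs, ys):
    """Training mean squared error of the linear predictor b on data (xs, ys)."""
    return sum((yi - sum(bj * xj for bj, xj in zip(b, xi))) ** 2
               for xi, yi in zip(xs, ys)) / len(ys)

xs, ys = [(1.0, 1.0)], [2.0]   # n = 1, p = 2
# Every b with b1 + b2 = 2 fits the single data point exactly:
for b in [(2.0, 0.0), (0.0, 2.0), (1.0, 1.0), (5.0, -3.0)]:
    print(b, mse(b, xs, ys))   # all achieve exactly zero training error
```

Least squares alone gives no reason to prefer any one of these solutions over the others.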

I imagine that if any of the Ancestors who invented linear regression had been asked for advice about the \( p > n \) situation, their reply would have been on the order of "You should have such troubles". But the advances of instrumentation, and, as usual, computation, have got us to the point where it is now routine. Last year, for instance, a graduate student in my department had to work on a data set, about genetic risks for organ-transplant rejection, where \( n \), the number of patients, was several hundred, but \( p \), the number of genetic measurements ("single-nucleotide polymorphisms") per patient, was about 500,000. There was, of course, no prospect of re-running the same study only with a thousand or ten thousand times as many patients. Similar situations arise not just with genetics but with image data (and so astronomy, neuroscience, climatology...), the large databases our electronic lives generate, and so forth. Yet it seems absurd to throw up one's hands and say that we can learn nothing because our data is too rich and too detailed.

Statisticians have not, of course, thrown up their hands. At the level of
mathematical theory, it's clear that any approach which has a hope of working
will have to impose additional structure on the regression estimates, going
beyond what's dictated by the data, the linearity assumption and the
least-squares principle --- the estimates will have to be "regularized"
somehow. Our estimates will then be some compromise between the data and the
biases of the regularization scheme, but we can hope to design regularization
schemes which can accommodate lots of different data-generating processes, and
which will adaptively decide how much regularization is needed to get good
predictions^{4}.

This book is largely devoted to studying one of the best and most-widely
used of the regularization schemes, the "lasso"^{5}, introduced
by Robert Tibshirani in
1996. The idea here is to look for the coefficient vector \( b \) which
minimizes the squared errors, but subject to a constraint on how big \( b \)
can get. Specifically, the "\( L_1 \) norm", which is just the sum of the
absolute values \( \left|b_1\right| + \ldots + \left|b_p\right| \), is
constrained to be no larger than some pre-set value, say \( c \).
Symbolically,
\[
\begin{eqnarray}
\tilde{\beta} & = & \argmin_{b \in \mathbb{R}^p}{\frac{1}{n}\sum_{i=1}^{n}{\left( y_i - b \cdot x_i\right)^2}}\\
& & \mathrm{such\ that} ~ \sum_{j=1}^{p}{\left|b_j\right|} \leq c
\end{eqnarray}
\]
This does three things for us:

- It successfully regularizes the problem: we get a unique solution to the constrained least-squares problem, not the infinity of unconstrained pseudo-solutions^{6}.

- It induces "sparsity": the constrained solutions tend to be ones where some, maybe lots, of the entries in the coefficient vector go exactly to zero. This is usually illustrated through a figure like the following. Here I have made up some data^{7} where \( p = 2 \) (and \( n = 100 \), so there is not much regularization that needs doing, but it shows the point). The grey curves are contour lines for the squared error, descending to the unconstrained optimum marked by the plus sign. The black diamonds are contours of the \( L_1 \) constraint, centered on the origin (marked by a small black circle). As one can see here for the case where the constraint level \( c \) is small, the constrained optimum tends to be at a corner of the diamond, which means that one of the coefficients is set to exactly zero. Even the non-zero coefficients are "shrunk" towards zero by the constraint, and this continues for larger values of the constraint (here, say, \( c = 2 \)) which do not shrink either coefficient all the way to zero. Eventually, the constraint becomes so weak that the unconstrained optimum is recovered (here, \( c = 2.5 \)). In a really high-dimensional situation, with \( p > n \), we would never get more than \( n \) non-zero coefficients, no matter how much we relaxed the constraint. There are lots of situations where sparsity is going to be a good feature of a statistical model, because while we have lots of variables, most of them are actually irrelevant for predicting the response. (Some of those 500,000 genetic variants have to do with, say, the ability to digest milk as an adult, with no effect on transplant rejection.) If we knew which ones those were, we'd just ignore them, so having our method zero out lots of coefficients automatically seems like a step towards letting the data sort it out. (More cautiously, many predictor variables will be irrelevant in a linear approximation; or, yet more cautiously, make such a small and hard-to-discern contribution to the prediction that, in a noisy world, one is better off ignoring them.)

- The constrained solution can be computed efficiently; this is crucial. After all, if sparsity is so good, why not directly constrain the number of non-zero coefficients? That is, decide that one is willing to work with, say, 10 non-zero coefficients, and then minimize the sum of squares among all such models? There are several reasons, but the decisive one is computational complexity. There are about \( 2.6\times 10^{50} \) ways to pick 10 variables from 500,000, and no algorithm for finding the best is much faster than going through them one after the other. With the \( L_1 \) constraint, on the other hand, there *are* very fast solutions. The trick is to use Lagrange multipliers to turn constrained optimization into penalized optimization, with the penalty term being, precisely, the sum of absolute values of the coefficients:
\[
\begin{eqnarray*}
\tilde{\beta} & = & \argmin_{b \in \mathbb{R}^p}{\frac{1}{n}\sum_{i=1}^{n}{\left( y_i - b\cdot x_i\right)^2} - \lambda\left(c-\sum_{j=1}^{p}{|b_j|}\right)}\\
& = & \argmin_{b \in \mathbb{R}^p}{\frac{1}{n}\sum_{i=1}^{n}{\left( y_i - b\cdot x_i\right)^2} + \lambda\sum_{j=1}^{p}{|b_j|}}
\end{eqnarray*}
\]
The Lagrange multiplier \( \lambda \) is the "shadow price" of the constraint, the rate at which the squared error would decrease if the constraint were relaxed. A high penalty \( \lambda \) thus corresponds to a tight constraint level \( c \), and when the constraint becomes so weak that the unconstrained optimum is recovered, \( \lambda \) hits zero. The new, penalized objective function is not quite so trivial to minimize as least squares, but it is still a very well-behaved ("convex") function, for which there are fast and stable numerical optimization procedures. To see how this works, consider the next figure, where the grey contours are now those of the penalized (least-squares-plus-\( L_1 \)-norm) function, with \( \lambda \) set, arbitrarily, equal to 1. (This corresponds, in this case, to a constraint level \( c \) of about 1.5.) Simply going downhill from these contours will quickly get us to the minimum, though there are also cleverer optimization methods.
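The penalized form lends itself to very simple algorithms. Here is a minimal coordinate-descent sketch of my own (in Python for convenience; it is a toy, not the book's algorithms or the polished R implementations): each coordinate update is a "soft-thresholding" of the one-variable least-squares solution, and the thresholding is where the exact zeros come from.

```python
# Minimal lasso by cyclic coordinate descent (toy sketch):
# minimize (1/n) sum_i (y_i - b.x_i)^2 + lam * sum_j |b_j|.

def soft_threshold(z, t):
    """Shrink z toward zero by t, clipping to exactly zero on [-t, t]."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(x, y, lam, sweeps=500):
    """x is a list of n rows of length p; returns the coefficient vector."""
    n, p = len(x), len(x[0])
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Residuals leaving coordinate j out:
            r = [y[i] - sum(b[k] * x[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            a = (2.0 / n) * sum(x[i][j] ** 2 for i in range(n))
            c = (2.0 / n) * sum(x[i][j] * r[i] for i in range(n))
            b[j] = soft_threshold(c, lam) / a   # exact zeros arise here
    return b

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 1.0, 3.0, 5.0]           # constructed so y = 2*x1 + 1*x2 exactly
print(lasso_cd(x, y, lam=0.0))     # no penalty: recovers OLS, about (2, 1)
print(lasso_cd(x, y, lam=100.0))   # heavy penalty: both coefficients exactly 0
```

Raising \( \lambda \) shrinks the \( L_1 \) norm of the solution, zeroing coefficients out one by one, in line with the path behavior described above.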

Trying to use the same tricks with a penalty proportional to the number of non-zero coefficients, on the other hand, gets us nowhere.
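The combinatorial count quoted above is easy to check directly (a one-liner, in Python here for convenience):

```python
import math

# Number of ways to choose 10 variables out of 500,000 --- the count behind
# the "about 2.6 x 10^50" figure above (more precisely, about 2.69 x 10^50).
n_models = math.comb(500_000, 10)
print(f"{n_models:.2e}")
```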

So: using the lasso gives us regularized regression estimates that are stable, sparse, and efficiently computable. But are they any good?

There are (at least) three things we could want from our regression function. The first is that it should give us good, low-error predictions on new data. The second is that its estimated coefficients should be close to the true coefficients, in either an absolute-value or a mean-square sense. (If the truth isn't linear, then the estimates should be close to the best linear approximation to the truth, what the econometricians wonderfully call the "pseudo-truth".) The third is that when the truth (or pseudo-truth) really is sparse, the estimate should be sparse too, and should select the right variables to have non-zero coefficients.

These three tasks --- prediction, consistent estimates, and variable selection --- are increasingly difficult. The lasso can, under the right conditions, do all three, but those conditions become more and more restrictive.

The starting point is to decide how much to regularize, i.e., how tight to
make the constraint level \( c \) (or, equivalently, the penalty term \( \lambda \)).
This is usually done
through cross-validation:
randomly divide the data into two parts, fit multiple versions of the model,
with different values of the constraint, to one part, and then see how well
each of those models predicts the other part of the data^{8}. Too much constraint and the model "underfits", ignoring
predictable patterns in the data; too little constraint and the model
"overfits", basically using its coefficients to memorize the details of the
noise in the data. At some intermediate level between under-fitting and
over-fitting we get useful generalization. Cross-validation is intuitively
compelling because it directly emulates what the model needs to do in order to
predict. When the goal is prediction, it works very well in practice, and even
(though only recently) in theory.
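In outline, the procedure looks like the following sketch (mine, not the book's). The "model" here is a deliberately trivial stand-in, a training mean shrunk toward zero by a penalty, rather than the lasso itself; the fold-splitting and held-out scoring logic is the point.

```python
# Schematic k-fold cross-validation over a grid of penalty levels.

def k_fold_splits(n, k):
    """Yield (test, train) index lists; round-robin assignment for simplicity
    (in practice one shuffles the indices first)."""
    for f in range(k):
        yield ([i for i in range(n) if i % k == f],
               [i for i in range(n) if i % k != f])

def cv_score(lam, ys, k=4):
    """Average held-out squared error for penalty level lam."""
    n, total = len(ys), 0.0
    for test, train in k_fold_splits(n, k):
        train_mean = sum(ys[i] for i in train) / len(train)
        model = train_mean / (1.0 + lam)   # toy shrinkage "fit"
        total += sum((ys[i] - model) ** 2 for i in test) / len(test)
    return total / k

ys = [2.1, 1.9, 2.0, 2.2, 1.8, 2.05, 1.95, 2.0]
best_lam = min([0.0, 0.1, 1.0, 10.0], key=lambda lam: cv_score(lam, ys))
print(best_lam)   # this particular data rewards no shrinkage at all
```

With the lasso, `cv_score` would instead refit the penalized regression on each training portion and score its predictions on the held-out fold.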

While there are certainly situations where prediction is all we want out of
a model, the second level of inference is to demand that our estimates of the
coefficients converge on the truth (or at least the pseudo-truth). The
immediate difficulty is that cross-validation decides on constraints, and so
parameter estimates, to get good *predictions*, not
good *estimates*. There are a range of tricks here, such as first using
the lasso to decide which variables should have non-zero coefficients, and then
doing ordinary least-squares estimation (without regularization or shrinkage)
for their coefficients. Under some not-too-onerous conditions, the lasso, and
this kind of variant on the lasso, will in fact converge on the correct
coefficients as \( n \) grows to infinity. To be more precise, the distance
between the estimated coefficient vector \( \tilde{\beta} \) and the true
coefficient vector \( \beta \) will shrink to zero.

The third level of inference (at least here!) is the even more delicate
problem of variable selection. Suppose that there are only a few non-zero
entries in the true coefficient vector. How well do they match up with the
coefficients the lasso fails to zero out? Presumably, there will always
be *some* risk of zeroing-out a coefficient which should have been
included, or of failing to shrink away the coefficient of a truly irrelevant
variable. But the hope is that when the number of truly non-zero, relevant
coefficients is very small, so that the solution is sparse, it will in some
sense be easy for the lasso to discover those variables. Including a modest
number of variables in the estimate which are really irrelevant might be a
small price to pay — if only 10 of the student's 500,000 genetic
measurements truly mattered, it would be quite good to select those 10 and
another 10 irrelevant ones, but disturbing to only select 9 in total.

Formally, we say that the lasso is "sparsistent" when the set of predictors
it selects converges on the non-zero entries of the optimal linear predictor \(
\beta \). The lasso is definitely sparsistent in the low-dimensional, \( p < n
\) setting, say where \( p \) is fixed and \( n \rightarrow \infty \). What is
interesting is that in the high-dimensional setting, if \( p \) grows
not-too-much-faster than \( n \), the lasso can still be sparsistent. This
requires some additional conditions on the predictor variables, which are
highly technical but all amount to "the predictors are not *too* correlated
with each other". It is not at all clear whether it is possible to do
better than these awkward and hard-to-check conditions.

The reader may have noticed that I have said nothing so far about the book supposedly under review. This is because what I have said is just a simplified summary of chapters 2--9. (The first chapter is mostly an introductory tour.) Chapters 2--5 and 9 are largely methodological; they lay out the methods, they illustrate them with case studies, and they give imprecise arguments for why the methods are reasonable. The precise theory comes in chapters 6--8, which are concerned with predictive accuracy, parametric consistency, and variable selection.

Chapters 10 and 11 turn away from the lasso and its kin, to consider issues of stability and statistical significance in variable selection, closely following recent work by Bühlmann and collaborators. Chapter 12 is a very nice treatment of boosting, where one uses an ensemble of highly-biased and low-capacity, but very stable, models to compensate for each other's faults. The treatment of boosting here has remarkably little overlap with that of Freund and Schapire's monograph.

Chapter 13, finally, turns to graphical models, especially Gaussian graphical models, looking at ways of inferring the graph based on the lasso principle, on local regression, and, even more closely, the PC algorithm of P. Spirtes and C. Glymour. (This chapter draws on work by Kalisch and Bühlmann on how the PC algorithm works in the high-dimensional regime.) Causal inference is an important application of graphical models, but it is, perhaps wisely, not discussed.

The core theoretical chapters (6--8) are much rougher going than the more method-oriented ones, but that's just the nature of the material. (Incidentally, the stark contrast between the tools and concepts used in this book and what one finds in, say, Casella and Berger is a good illustration of how theoretical statistics has been shaped by intuitions about low-dimensional problems which serve us poorly in the high-dimensional regime.) These chapters use empirical process techniques extensively, which is not surprising considering that van de Geer wrote the book on empirical process theory in estimation, but it also makes considerable sense. Chapter 14, really a kind of appendix, collects the necessary concepts and results from empirical process theory proper; it is formally self-contained, but probably some prior exposure would be helpful.

I know of no better, more up-to-date summary of current theoretical
knowledge about high-dimensional regression (whether linear,
generalized
linear, additive, ... ), and
how it connects to practical methods. It *could* be used as a textbook,
but for very advanced students; it's really better suited to self-study. For
that, however, I can recommend it highly to anyone with a serious interest in
the area.

*Disclaimer*: both authors are the kind of people who might easily
get asked to review my tenure application; Bühlmann is also the editor
of Annals of Statistics, where I'm supposed to publish. It would
be reckless of me to say negative things about their book; but if I thought it
was bad, I could keep silent.

1: Specifically, if we stack the predictor vectors into an \( n \times p \) matrix \( \mathbf{x} \), and the responses into an \( n \times 1 \) matrix \( \mathbf{y} \), then \[ \hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y} \] I will leave deriving this formula as an exercise for readers who care. I will notice, however, that if all the variables have been "centered", so that they have mean zero, then \( \frac{1}{n} \mathbf{x}^T \mathbf{x} \) is the sample covariance matrix for the predictor variables, and \( \frac{1}{n} \mathbf{x}^T \mathbf{y} \) is the sample covariances between the predictors and the response. ^
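For concreteness, here is that formula worked out by hand for \( p = 2 \) (a Python sketch of my own, writing out the explicit inverse of the \( 2 \times 2 \) matrix \( \mathbf{x}^T \mathbf{x} \), on a made-up exact-fit dataset):

```python
# Ordinary least squares via the normal equations, p = 2 case.

def ols_2d(x, y):
    s11 = sum(r[0] * r[0] for r in x)   # entries of x^T x
    s12 = sum(r[0] * r[1] for r in x)
    s22 = sum(r[1] * r[1] for r in x)
    t1 = sum(r[0] * yi for r, yi in zip(x, y))   # entries of x^T y
    t2 = sum(r[1] * yi for r, yi in zip(x, y))
    det = s11 * s22 - s12 * s12          # must be non-zero to invert
    return ((s22 * t1 - s12 * t2) / det,
            (s11 * t2 - s12 * t1) / det)

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, 1.0, 3.0, 5.0]   # constructed so that y = 2*x1 + 1*x2 exactly
print(ols_2d(x, y))        # (2.0, 1.0)
```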

2: The optimal linear predictor would minimize the *expected* squared error, so
\[
\beta = \argmin_{b \in \mathbb{R}^p}{\Expect{(Y- b\cdot X)^2}}
\]
Again (see previous note), a bit of calculus and linear algebra shows that
\[
\beta = \left(\mathrm{cov}[X,X]\right)^{-1} \mathrm{cov}[X,Y]
\]
where, to be explicit, \( \mathrm{cov}[X,X] \) is the \( p \times p \)
covariance matrix of the predictor variables, and \( \mathrm{cov}[X,Y] \) is
similarly the \( p \times 1 \) matrix of covariances between the predictors and
the response. Since \( \hat{\beta} \) just uses the sample covariances instead
of the true covariances, \( \hat{\beta} \rightarrow \beta \) so long as sample
covariance converges on the true covariance.

Notice that while \( \beta \) is the optimal *linear* predictor,
even the best linear predictor may be horrible. ^

3: To go into a little more detail, the problem is that one must be able to invert the matrix of correlations among the components of \( x \). (See previous notes.) If there are fewer observations than dimensions of \( x \), our estimate of that correlation matrix will only have \( n \) linearly-independent rows, not \( p \) of them, so it will be "rank deficient", and consequently it has no inverse. The same problem can arise even if \( n > p \), when there are exact correlations among two (or more) of the components of \( x \) ("collinearity"). ^
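A quick numerical check of the collinearity case (again a made-up example of mine): when one column of \( \mathbf{x} \) is an exact multiple of another, \( \mathbf{x}^T \mathbf{x} \) is rank-deficient, its determinant is zero, and no inverse exists.

```python
# Collinear design: the second column is exactly twice the first.
x = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
s11 = sum(r[0] * r[0] for r in x)   # 14
s12 = sum(r[0] * r[1] for r in x)   # 28
s22 = sum(r[1] * r[1] for r in x)   # 56
det = s11 * s22 - s12 * s12
print(det)  # 0.0: the normal equations have no unique solution
```

The same degeneracy is automatic whenever \( n < p \), since the Gram matrix then has at most \( n \) linearly independent rows.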

4: Explicit modeling of the relationships between the different components in \( x \), and how they relate to the response \( y \), is in this sense a sort of regularization scheme. The problem is that "science is hard", and a sound investigation of (for instance) how each of those 500,000 genetic variants does or does not causally contribute to the risk of tissue transplant rejection would require approximately 500,000 graduate students in experimental molecular biology, not one statistics student. ^

5: Supposedly an acronym for "Least Absolute Shrinkage and Selection Operator". If you believe that phrase came first, perhaps you could help me get some money out of Afghanistan? ^

6: Actually, strictly speaking the lasso can *still* have multiple solutions when \( p > n \). This is however vanishingly unlikely when the variables are continuous; see Osborne *et al.* (2000) and Tibshirani (2012). ^

7: The R for the first figure:

x <- matrix(rnorm(200),nrow=100)
y <- (x %*% c(2,1)) + rnorm(100,sd=0.05)
mse <- function(b1,b2) {mean((y - x %*% c(b1,b2))^2)}
coef.seq <- seq(from=-1,to=5,length.out=200)
m <- outer(coef.seq,coef.seq,Vectorize(mse))
l1 <- function(b1,b2) {abs(b1)+abs(b2)}
l1.levels <- outer(coef.seq,coef.seq,l1)
ols.coefs <- coefficients(lm(y~0+x))
contour(x=coef.seq,y=coef.seq,z=m,drawlabels=FALSE,nlevels=30,col="grey")
contour(x=coef.seq,y=coef.seq,z=l1.levels,nlevels=20,add=TRUE)
points(x=ols.coefs[1],y=ols.coefs[2],pch="+")
points(0,0)

To make the second figure, I did a contour plot of the penalized objective, i.e., of the squared error plus the \( L_1 \) norm with \( \lambda = 1 \).

8: Doing just one random split is noisy, so the usual procedure is "\( k \)-fold cross-validation": divide the data at random into \( k \) equally-sized parts, where each of the \( k \) "folds" is in turn predicted with models fit to the other \( k-1 \) folds; the selected value of the constraint is the one which does best when averaged over all \( k \) folds. Typical values of \( k \) are 5 or 10; if the exact \( k \) makes a big difference, you've got bigger troubles than picking the number of folds. The older "leave-one-out" cross-validation is the limiting case where \( k=n \). (The famous Akaike information criterion is best understood as a fast, large-\( n \) approximation to leave-one-out CV.) ^

Hardback, ISBN 9783642201912

Finished 2 August 2012; updated 14 September 2012 (thanks to Georg M. Goerg for spotting an error in the code!) and 15 February 2013 (thanks to Darren Homrighausen for pointing out a think-o).