The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi   161 \[ \DeclareMathOperator*{\argmin}{argmin} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \]

Statistics for High-Dimensional Data

Methods, Theory and Applications

by Peter Bühlmann and Sara van de Geer

New York and Berlin: Springer-Verlag, 2011

The Plethora of Predictors

The workhorse problem of applied statistics is regression. You have one variable, \( y \), whose value you want to guess from another variable \( x \), the predictor. Importantly, \( x \) is generally a vector, say of dimension \( p \), so \( x = (x^{(1)}, \ldots x^{(p)}) \). You have \( n \) observations of predictor-response pairs, \( (x_1, y_1), \ldots (x_n, y_n) \). From these, you want to guess a function \( f \) such that \( f(x_i) \) will be close to \( y_i \). Generally, in fact, you want to guess an \( f \) which won't just approximate the data but will also predict well, i.e., where \( f(X) \) will tend to be close to \( Y \) for new, random values of \( X \) and \( Y \), at least when those are produced by the same process as the training data.

The classical solution, which in various forms and guises now goes back about two centuries, runs as follows. Define "close" as the mean of squared errors, \( n^{-1}\sum_{i=1}^{n}{{\left(f(x_i) - y_i\right)}^2} \). Restrict yourself to linear functions of \( x \), ones where \( f(x) = b \cdot x \) for some \( p \)-dimensional vector of coefficients \( b \). Then the optimal, least-squares value of \( b \), say \( \hat{\beta} \), is the solution to an optimization problem: \[ \hat{\beta} = \argmin_{b \in \mathbb{R}^{p}}{\frac{1}{n}\sum_{i=1}^{n}{{\left(y_i - b \cdot x_i\right)}^2}} \] \( \hat{\beta} \) can be found through a few steps of linear algebra, and calculated efficiently, by multiplying and inverting some matrices1. Under fairly mild assumptions about the data-generating process (much weaker than positing Gaussian distributions), the least-squares estimate \( \hat{\beta} \) converges on \( \beta \), the coefficients of the optimal linear predictor, the one which will perform best out of sample2.
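To make the \( \argmin \) formulation concrete, here is a minimal R sketch (with made-up data and variable names of my own choosing, in the spirit of the figure code in note 7), checking that direct numerical minimization of the mean squared error agrees with the least-squares fit from lm():

# Sketch: least squares as an explicit minimization problem
set.seed(1)
x <- matrix(rnorm(200), nrow=100)               # n = 100 observations, p = 2 predictors
y <- x %*% c(2,1) + rnorm(100, sd=0.5)          # a linear signal plus noise
mse <- function(b) { mean((y - x %*% b)^2) }    # the objective to be minimized
b.numerical <- optim(c(0,0), mse)$par           # generic numerical minimization
b.leastsq <- coefficients(lm(y ~ 0 + x))        # the usual least-squares fit
rbind(b.numerical, b.leastsq)                   # should agree to a few decimal places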

This can be extended in many ways: the response variable \( y \) can itself be made into a vector; the response can be the next observation of the predictor \( x \); \( y \) can be made into a categorical variable, and the linear function of \( x \) can be used to predict the probabilities of different categories. One can even incorporate some kinds of nonlinearity by replacing \( x \) with its image after applying a library of transformations, so that one regresses \( y \) on \( r = (r_1(x), r_2(x), \ldots, r_p(x)) \). (One then has to pick the transformations \( r_j \) somehow, which is an art in itself.) It is an altogether tidy and attractive scheme, and it is not surprising that it has by some people (cough econometricians cough) been confused with the whole of statistics or even the whole of data analysis.
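As a tiny, made-up illustration of the transformation trick (my own example, not the book's), one can hand-pick a few transformations of a scalar predictor and regress on them:

# Sketch: regressing y on a small library of transformations of x
set.seed(2)
x <- runif(100, -3, 3)
y <- sin(x) + rnorm(100, sd=0.1)        # a nonlinear truth
r <- cbind(r1=x, r2=x^2, r3=x^3)        # hand-picked transformations r_j(x)
fit <- lm(y ~ r)                        # still an ordinary linear regression, now in r(x)
coefficients(fit)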

One key limitation to the scheme, however, has to do with the relation between the number of training observations \( n \) and the dimension of the predictor variable \( p \). If there are more observations than predictors, \( n > p \), then there is (in general) a well-defined least-squares solution for the coefficients. If, however, there are more coefficients to be found than there are data points, \( p > n \), then the least-squares problem is ill-posed, and there is no unique solution3. Rather, there are infinitely many solutions, all of which fit the training data equally well, leaving no clear way to choose between them. This is the fundamental challenge of high-dimensional statistics.
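A few lines of R show the ill-posedness directly (made-up data again): when \( p > n \), lm() cannot pin down all of the coefficients:

# Sketch of the p > n problem
set.seed(3)
x <- matrix(rnorm(5*10), nrow=5)     # n = 5 observations, p = 10 predictors
y <- rnorm(5)
coefficients(lm(y ~ 0 + x))          # several coefficients come back NA: no unique solution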

I imagine that if any of the Ancestors who invented linear regression had been asked for advice about the \( p > n \) situation, their reply would have been on the order of "You should have such troubles". But the advances of instrumentation, and, as usual, computation, have got us to the point where it is now routine. Last year, for instance, a graduate student in my department had to work on a data set, about genetic risks for organ-transplant rejection, where \( n \), the number of patients, was several hundred, but \( p \), the number of genetic measurements ("single-nucleotide polymorphisms") per patient, was about 500,000. There was, of course, no prospect of re-running the same study only with a thousand or ten thousand times as many patients. Similar situations arise not just with genetics but with image data (and so astronomy, neuroscience, climatology...), the large databases our electronic lives generate, and so forth. Yet it seems absurd to throw up one's hands and say that we can learn nothing because our data is too rich and too detailed.

Statisticians have not, of course, thrown up their hands. At the level of mathematical theory, it's clear that any approach which has a hope of working will have to impose additional structure on the regression estimates, going beyond what's dictated by the data, the linearity assumption and the least-squares principle --- the estimates will have to be "regularized" somehow. Our estimates will then be some compromise between the data and the biases of the regularization scheme, but we can hope to design regularization schemes which can accommodate lots of different data-generating processes, and which will adaptively decide how much regularization is needed to get good predictions4.

This book is largely devoted to studying one of the best and most-widely used of the regularization schemes, the "lasso"5, introduced by Robert Tibshirani in 1996. The idea here is to look for the coefficient vector \( b \) which minimizes the squared errors, but subject to a constraint on how big \( b \) can get. Specifically, the "\( L_1 \) norm", which is just the sum of the absolute values \( \left|b_1\right| + \cdots + \left|b_p\right| \), is constrained to be no larger than some pre-set value, say \( c \). Symbolically, \[ \begin{eqnarray*} \tilde{\beta} & = & \argmin_{b \in \mathbb{R}^p}{\frac{1}{n}\sum_{i=1}^{n}{\left( y_i - b \cdot x_i\right)^2}}\\ & & \mathrm{such\ that} ~ \sum_{j=1}^{p}{\left|b_j\right|} \leq c \end{eqnarray*} \] This does three things for us:

  1. It successfully regularizes the problem: we get a unique solution to the constrained least squares problem, not the infinity of unconstrained pseudo-solutions6.
  2. It induces "sparsity": the constrained solutions tend to be ones where some, maybe lots, of the entries in the coefficient vector go exactly to zero. This is usually illustrated through a figure like the following. Here I have made up some data7 where \( p = 2 \) (and \( n = 100 \), so there is not much regularization that needs doing, but it shows the point). The grey curves are contour lines for the squared error, descending to the unconstrained optimum marked by the plus sign. The black diamonds are contours of the \( L_1 \) constraint, centered on the origin (marked by a small black circle).
    As one can see here for the case where the constraint level \( c \) is small, the constrained optimum tends to be at a corner of the diamond, which means that one of the coefficients is set to exactly zero. Even the non-zero coefficients are "shrunk" towards zero by the constraint, and this continues for larger values of the constraint (here say \( c = 2 \)) which do not shrink either coefficient all the way to zero. Eventually, the constraint becomes so weak that the unconstrained optimum is recovered (here, \( c = 2.5 \)). In a really high-dimensional situation, with \( p > n\), we would never get more than \( n \) non-zero coefficients, no matter how much we relaxed the constraint.

    There are lots of situations where sparsity is going to be a good feature of a statistical model, because while we have lots of variables, most of them are actually irrelevant for predicting the response. (Some of those 500,000 genetic variants have to do with, say, the ability to digest milk as an adult, with no effect on transplant rejection.) If we knew which ones those were, we'd just ignore them, so having our method zero out lots of coefficients automatically seems like a step towards letting the data sort it out. (More cautiously, many predictor variables will be irrelevant in a linear approximation; or, yet more cautiously, make such a small and hard-to-discern contribution to the prediction that, in a noisy world, one is better off ignoring them.)

  3. The constrained solution can be computed efficiently; this is crucial. After all, if sparsity is so good, why not directly constrain the number of non-zero coefficients? That is, decide that one is willing to work with, say, 10 non-zero coefficients, and then minimize the sum of squares among all such models? There are several reasons, but the decisive one is the computational complexity. There are about \( 2.6\times 10^{50} \) ways to pick 10 variables from 500,000, and no algorithm for finding the best is much faster than going through them one after the other. With the \( L_1 \) constraint, on the other hand, there are very fast solutions. The trick is to use Lagrange multipliers to turn constrained optimization into penalized optimization, with the penalty term being, precisely, the sum of absolute values of the coefficients. \[ \begin{eqnarray*} \tilde{\beta} & = & \argmin_{b \in \mathbb{R}^p}{\frac{1}{n}\sum_{i=1}^{n}{\left( y_i - b\cdot x_i\right)^2} - \lambda\left(c-\sum_{j=1}^{p}{|b_j|}\right)}\\ & = & \argmin_{b \in \mathbb{R}^p}{\frac{1}{n}\sum_{i=1}^{n}{\left( y_i - b\cdot x_i\right)^2} + \lambda\sum_{j=1}^{p}{|b_j|}} \end{eqnarray*} \] The Lagrange multiplier \( \lambda \) is the "shadow price" of the constraint, the rate at which the squared error would decrease if the constraint were relaxed. A high penalty \( \lambda \) thus corresponds to a tight constraint level \( c \), and when the constraint becomes so weak that the unconstrained optimum is recovered, \( \lambda \) hits zero. The new, penalized objective function is not quite so trivial to minimize as least squares, but it is still a very well-behaved ("convex") function, for which there are fast and stable numerical optimization procedures; a short code sketch of putting this into practice follows the list below.

    To see how this works, consider the next figure, where the grey contours are now those of the penalized function, least squares plus the \( L_1 \) norm, with \( \lambda \) set, arbitrarily, equal to 1. (This corresponds, in this case, to a constraint level \( c \) of about 1.5.) Simply going downhill across these contours will quickly get us to the minimum, though there are also cleverer optimization methods.

    Trying to use the same tricks with a penalty proportional to the number of non-zero coefficients, on the other hand, gets us nowhere: counting non-zero coefficients gives a penalty which is neither convex nor even continuous, so the Lagrange-multiplier trick buys us nothing, and the minimization remains a combinatorial search.
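For the record, here is a minimal sketch of running the lasso in R with the glmnet package, one standard implementation of the penalized problem above (the simulated data, parameter values, and variable names here are my own, purely for illustration):

# Sketch: the lasso via glmnet, on made-up data with p > n
library(glmnet)
set.seed(4)
n <- 100; p <- 200
x <- matrix(rnorm(n*p), nrow=n)
y <- drop(x %*% c(3, -2, rep(0, p-2)) + rnorm(n))   # sparse truth: only 2 relevant predictors
fit <- glmnet(x, y, alpha=1)        # alpha = 1 gives the pure L1 (lasso) penalty
coef(fit, s=0.5)                    # coefficients at lambda = 0.5: mostly exact zeros

glmnet returns the whole path of solutions as \( \lambda \) varies, so one can watch coefficients enter and leave the model as the penalty is relaxed.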

So: using the lasso gives us regularized regression estimates that are stable, sparse, and efficiently computable. But are they any good?

There are (at least) three things we could want from our regression function. The first is that it should give us good, low-error predictions on new data. The second is that its estimated coefficients should be close to the true coefficients, in either an absolute-value or a mean-square sense. (If the truth isn't linear, then the estimates should be close to the best linear approximation to the truth, what the econometricians wonderfully call the "pseudo-truth".) The third is that when the truth (or pseudo-truth) really is sparse, the estimate should be sparse too, and should select the right variables to have non-zero coefficients.

These three tasks --- prediction, consistent estimates, and variable selection --- are increasingly difficult. The lasso can, under the right conditions, do all three, but those conditions become more and more restrictive.

The starting point is to decide how much to regularize, i.e., how tight to make the constraint level \( c \) (or, equivalently, the penalty term \( \lambda \)). This is usually done through cross-validation: randomly divide the data into two parts, fit multiple versions of the model, with different values of the constraint, to one part, and then see how well each of those models predicts the other part of the data8. Too much constraint and the model "underfits", ignoring predictable patterns in the data; too little constraint and the model "overfits", basically using its coefficients to memorize the details of the noise in the data. At some intermediate level between under-fitting and over-fitting we get useful generalization. Cross-validation is intuitively compelling because it directly emulates what the model needs to do in order to predict. When the goal is prediction, it works very well in practice, and even (though only recently) in theory.

While there are certainly situations where prediction is all we want out of a model, the second level of inference is to demand that our estimates of the coefficients converge on the truth (or at least the pseudo-truth). The immediate difficulty is that cross-validation picks the constraint, and hence the parameter estimates, to give good predictions, not good estimates. There are a range of tricks here, such as first using the lasso to decide which variables should have non-zero coefficients, and then doing ordinary least-squares estimation (without regularization or shrinkage) for their coefficients. Under some not-too-onerous conditions, the lasso, and this kind of variant on the lasso, will in fact converge on the correct coefficients as \( n \) grows to infinity. To be more precise, the distance between the estimated coefficient vector \( \tilde{\beta} \) and the true coefficient vector \( \beta \) will shrink to zero.
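Here is a hedged sketch of the "lasso for selection, then ordinary least squares for estimation" trick just described, again leaning on the glmnet package (the data and variable names are mine, and this is only one simple variant of the idea):

# Sketch: select variables with the lasso, then re-fit them without shrinkage
library(glmnet)
set.seed(5)
n <- 100; p <- 200
x <- matrix(rnorm(n*p), nrow=n)
y <- drop(x %*% c(3, -2, rep(0, p-2)) + rnorm(n))
cv <- cv.glmnet(x, y)                              # pick lambda by cross-validation
b <- as.matrix(coef(cv, s="lambda.min"))[-1]       # lasso coefficients, intercept dropped
selected <- which(b != 0)                          # the variables the lasso keeps
refit <- lm(y ~ x[, selected])                     # unpenalized least squares on those alone
coefficients(refit)                                # no shrinkage on the surviving coefficients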

The third level of inference (at least here!) is the even more delicate problem of variable selection. Suppose that there are only a few non-zero entries in the true coefficient vector. How well do they match up with the coefficients the lasso fails to zero out? Presumably, there will always be some risk of zeroing-out a coefficient which should have been included, or of failing to shrink away the coefficient of a truly irrelevant variable. But the hope is that when the number of truly non-zero, relevant coefficients is very small, so that the solution is sparse, it will in some sense be easy for the lasso to discover those variables. Including a modest number of variables in the estimate which are really irrelevant might be a small price to pay — if only 10 of the student's 500,000 genetic measurements truly mattered, it would be quite good to select those 10 and another 10 irrelevant ones, but disturbing to only select 9 in total.

Formally, we say that the lasso is "sparsistent" when the set of predictors it selects converges on the non-zero entries of the optimal linear predictor \( \beta \). The lasso is definitely sparsistent in the low-dimensional, \( p < n \) setting, say where \( p \) is fixed and \( n \rightarrow \infty \). What is interesting is that in the high-dimensional setting, if \( p \) grows not too much faster than \( n \), the lasso can still be sparsistent. This requires some additional conditions on the predictor variables, which are highly technical but all amount to "the predictors are not too correlated with each other". It is not at all clear whether it is possible to do better than these awkward and hard-to-check conditions.
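One can get a crude empirical feel for this by simulation (my own quick check, not anything from the book): make the true coefficient vector sparse, keep the predictors independent (the easy case), and see which variables the lasso picks:

# Sketch: does the lasso recover the true support?
library(glmnet)
set.seed(6)
n <- 200; p <- 1000
beta <- c(rep(2, 5), rep(0, p-5))              # only the first 5 predictors matter
x <- matrix(rnorm(n*p), nrow=n)                # independent predictors: the easy case
y <- drop(x %*% beta + rnorm(n))
cv <- cv.glmnet(x, y)
b <- as.matrix(coef(cv, s="lambda.1se"))[-1]   # a deliberately sparse choice of penalty
which(b != 0)                                  # ideally 1 through 5, perhaps plus a few extras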


The reader may have noticed that I have said nothing so far about the book supposedly under review. This is because what I have said is just a simplified summary of chapters 2--9. (The first chapter is mostly an introductory tour.) Chapters 2--5 and 9 are largely methodological; they lay out the methods, they illustrate them with case studies, and they give imprecise arguments for why the methods are reasonable. The precise theory comes in chapters 6--8, which are concerned with predictive accuracy, parametric consistency, and variable selection.

Chapters 10 and 11 turn away from the lasso and its kin, to consider issues of stability and statistical significance in variable selection, closely following recent work by Bühlmann and collaborators. Chapter 12 is a very nice treatment of boosting, where one uses an ensemble of highly-biased and low-capacity, but very stable, models to compensate for each other's faults. The treatment of boosting here has remarkably little overlap with that of Freund and Schapire's monograph.

Chapter 13, finally, turns to graphical models, especially Gaussian graphical models, looking at ways of inferring the graph based on the lasso principle, on local regression, and on the PC algorithm of P. Spirtes and C. Glymour. (This chapter draws on work by Kalisch and Bühlmann on how the PC algorithm behaves in the high-dimensional regime.) Causal inference is an important application of graphical models, but it is, perhaps wisely, not discussed.

The core theoretical chapters (6--8) are much rougher going than the more method-oriented ones, but that's just the nature of the material. (Incidentally, the stark contrast between the tools and concepts used in this book and what one finds in, say, Casella and Berger is a good illustration of how theoretical statistics has been shaped by intuitions about low-dimensional problems which serve us poorly in the high-dimensional regime.) These chapters use empirical process techniques extensively, which is not surprising considering that van de Geer wrote the book on empirical process theory in estimation, but it also makes considerable sense. Chapter 14, really a kind of appendix, collects the necessary concepts and results from empirical process theory proper; it is formally self-contained, but probably some prior exposure would be helpful.

I know of no better, more up-to-date summary of current theoretical knowledge about high-dimensional regression (whether linear, generalized linear, additive, ... ), and how it connects to practical methods. It could be used as a textbook, but for very advanced students; it's really better suited to self-study. For that, however, I can recommend it highly to anyone with a serious interest in the area.


Disclaimer: both authors are the kind of people who might easily get asked to review my tenure application; Bühlmann is also the editor of Annals of Statistics, where I'm supposed to publish. It would be reckless of me to say negative things about their book; but if I thought it was bad, I could keep silent.


1: Specifically, if we stack the predictor vectors into an \( n \times p \) matrix \( \mathbf{x} \), and the responses into an \( n \times 1 \) matrix \( \mathbf{y} \), then \[ \hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y} \] I will leave deriving this formula as an exercise for readers who care. I will note, however, that if all the variables have been "centered", so that they have mean zero, then \( \frac{1}{n} \mathbf{x}^T \mathbf{x} \) is the sample covariance matrix for the predictor variables, and \( \frac{1}{n} \mathbf{x}^T \mathbf{y} \) is the vector of sample covariances between the predictors and the response. ^
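For readers who would rather check than derive, a quick verification in R (with made-up data, in the style of the figure code in note 7):

# Sketch: the closed-form least-squares solution versus lm()
set.seed(7)
x <- matrix(rnorm(200), nrow=100)
y <- x %*% c(2,1) + rnorm(100, sd=0.5)
beta.hat <- solve(t(x) %*% x, t(x) %*% y)     # (x^T x)^{-1} x^T y, without forming the inverse
cbind(beta.hat, coefficients(lm(y ~ 0 + x)))  # the two columns should match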

2: The optimal linear predictor would minimize the expected squared error, so \[ \beta = \argmin_{b \in \mathbb{R}^p}{\Expect{(Y- b\cdot X)^2}} \] Again (see previous note), a bit of calculus and linear algebra shows that \[ \beta = \left(\mathrm{cov}[X,X]\right)^{-1} \mathrm{cov}[X,Y] \] where, to be explicit, \( \mathrm{cov}[X,X] \) is the \( p \times p \) covariance matrix of the predictor variables, and \( \mathrm{cov}[X,Y] \) is similarly the \( p \times 1 \) matrix of covariances between the predictors and the response. Since \( \hat{\beta} \) just uses the sample covariances instead of the true covariances, \( \hat{\beta} \rightarrow \beta \) so long as sample covariance converges on the true covariance.

Notice that while \( \beta \) is the optimal linear predictor, even the best linear predictor may be horrible. ^
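Again, a quick numerical check (mine, not the book's), which just restates the remark about sample covariances: plugging sample covariances into this formula reproduces the least-squares slopes from a regression with an intercept:

# Sketch: beta from (sample) covariances versus lm() slopes
set.seed(8)
x <- matrix(rnorm(200), nrow=100)
y <- drop(x %*% c(2,1) + rnorm(100, sd=0.5))
cbind(solve(cov(x), cov(x, y)),       # cov[X,X]^{-1} cov[X,Y], with sample covariances
      coefficients(lm(y ~ x))[-1])    # slopes from least squares with an intercept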

3: To go into a little more detail, the problem is that one must be able to invert the matrix of correlations among the components of \( x \). (See previous notes.) If there are fewer observations than dimensions of \( x \), our estimate of that correlation matrix will have at most \( n \) linearly-independent rows, not \( p \) of them, so it will be "rank deficient", and consequently has no inverse. The same problem can arise even if \( n > p \), when there are exact correlations among two (or more) of the components of \( x \) ("collinearity"). ^
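A quick way to see the rank deficiency in R, using the cross-product matrix \( \mathbf{x}^T \mathbf{x} \) from note 1 (illustration only):

# Sketch: with n < p, the p-by-p matrix x^T x cannot have full rank
set.seed(9)
x <- matrix(rnorm(5*10), nrow=5)      # n = 5, p = 10
qr(t(x) %*% x)$rank                   # at most n = 5, well short of p = 10
# solve(t(x) %*% x)                   # would throw an error: the matrix is singular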

4: Explicit modeling of the relationships between the different components in \( x \), and how they relate to the response \( y \), is in this sense a sort of regularization scheme. The problem is that "science is hard", and a sound investigation of (for instance) how each of those 500,000 genetic variants does or does not causally contribute to the risk of tissue transplant rejection would require approximately 500,000 graduate students in experimental molecular biology, not one statistics student. ^

5: Supposedly an acronym for "Least Absolute Shrinkage and Selection Operator". If you believe that phrase came first, perhaps you could help me get some money out of Afghanistan? ^

6: Actually, strictly speaking the lasso can still have multiple solutions when \( p > n \). This is however vanishingly unlikely when the variables are continuous; see Osborne et al. (2000) [ungated] and Tibshirani (2012). ^

7: The R code:

x <- matrix(rnorm(200), nrow=100)                  # n = 100 observations of p = 2 predictors
y <- (x %*% c(2,1)) + rnorm(100, sd=0.05)          # linear response plus a little noise
mse <- function(b1,b2) { mean((y - x %*% c(b1,b2))^2) }   # squared-error surface
coef.seq <- seq(from=-1, to=5, length.out=200)     # grid of candidate coefficient values
m <- outer(coef.seq, coef.seq, Vectorize(mse))     # squared error over the grid
l1 <- function(b1,b2) { abs(b1) + abs(b2) }        # the L1 norm of the coefficients
l1.levels <- outer(coef.seq, coef.seq, l1)         # L1 norm over the same grid
ols.coefs <- coefficients(lm(y~0+x))               # the unconstrained least-squares solution
contour(x=coef.seq, y=coef.seq, z=m, drawlabels=FALSE, nlevels=30, col="grey")   # error contours
contour(x=coef.seq, y=coef.seq, z=l1.levels, nlevels=20, add=TRUE)               # constraint contours
points(x=ols.coefs[1], y=ols.coefs[2], pch="+")    # mark the unconstrained optimum
points(0, 0)                                       # mark the origin
To make the second figure, I did a contour plot of m+l1.levels. ^

8: Doing just one random split is noisy, so the usual procedure is "\( k \)-fold cross-validation": divide the data at random into \( k \) equally-sized parts, where each of the \( k \) "folds" is in turn predicted with models fit to the other \( k-1 \) folds; the selected value of the constraint is the one which does best when averaged over all \( k \) folds. Typical values of \( k \) are 5 or 10; if the exact \( k \) makes a big difference, you've got bigger troubles than picking the number of folds. The older "leave-one-out" cross-validation is the limiting case where \( k=n \). (The famous Akaike information criterion is best understood as a fast, large-\( n \) approximation to leave-one-out CV.) ^
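For concreteness, here is a hand-rolled version of \( k \)-fold cross-validation over a grid of lasso penalties (my own sketch, leaning on the glmnet package; in practice one would just call cv.glmnet, which does the same bookkeeping):

# Sketch: k-fold cross-validation over a grid of penalty levels, by hand
library(glmnet)
set.seed(10)
n <- 100; p <- 200
x <- matrix(rnorm(n*p), nrow=n)
y <- drop(x %*% c(3, -2, rep(0, p-2)) + rnorm(n))
k <- 10
fold <- sample(rep(1:k, length.out=n))        # random assignment of rows to folds
lambdas <- 10^seq(1, -2, length.out=25)       # candidate penalty levels, largest first
cv.err <- rowMeans(sapply(1:k, function(j) {
    fit <- glmnet(x[fold != j, ], y[fold != j], lambda=lambdas)      # fit on the other k-1 folds
    colMeans((y[fold == j] - predict(fit, newx=x[fold == j, ]))^2)   # error on the held-out fold
}))
lambdas[which.min(cv.err)]                    # the penalty with the best average out-of-fold error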


Hardback, ISBN 9783642201912

Probability and Statistics


Finished 2 August 2012; updated 14 September 2012 (thanks to Georg M. Goerg for spotting an error in the code!) and 15 February 2013 (thanks to Darren Homrighausen for pointing out a think-o).