Attention conservation notice: 1500 word pedagogical-statistical rant, with sarcasm, mathematical symbols, computer code, and a morally dubious affectation of detachment from the human suffering behind the numbers. Plus the pictures are boring.

Does anyone know when the correlation coefficient is useful, as opposed to when it is used? If so, why not tell us?

— Tukey (1954: 721)

If you have taken any sort of statistics class at all, you have probably
been exposed to the idea of the "proportion of variance explained" by a
regression, conventionally written \( R^2 \). This has two definitions, which
happen to coincide for linear models fit by least squares. The first is to
take the correlation between the model's predictions and the actual values (\(
R \)) and square it (\( R^2 \)), getting a number which is guaranteed to be
between 0 and 1. You get 1 only when the predictions are perfectly correlated
with reality, and 0 when there is no *linear* relationship between them.
The other definition is the ratio of the variance of the predictions to the
variance of the actual values. It is this latter which leads to the notion
that \( R^2 \) is the proportion of variance *explained* by the model.
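
To see the two definitions agree for a least-squares line, here is a minimal sketch in Python. The data are synthetic and invented purely for illustration, and the helper names (`r2_squared_correlation`, `r2_variance_ratio`) are my own, not taken from any library.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def r2_squared_correlation(predicted, actual):
    """Definition 1: squared correlation between predictions and actual values."""
    return covariance(predicted, actual) ** 2 / (
        variance(predicted) * variance(actual)
    )


def r2_variance_ratio(predicted, actual):
    """Definition 2: variance of the predictions over variance of the actual values."""
    return variance(predicted) / variance(actual)


# Synthetic data (an assumption for illustration): Y = 1 + 0.5 X + noise.
rng = random.Random(1)
x = [rng.gauss(0.0, 2.0) for _ in range(500)]
y = [1.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

# Ordinary least squares with a single predictor.
slope = covariance(x, y) / variance(x)
intercept = mean(y) - slope * mean(x)
predicted = [intercept + slope * xi for xi in x]

r2_corr = r2_squared_correlation(predicted, y)
r2_ratio = r2_variance_ratio(predicted, y)
print(f"squared correlation: {r2_corr:.6f}")
print(f"variance ratio:      {r2_ratio:.6f}")
```

The two printed numbers should match up to floating-point rounding, because least squares makes the residuals uncorrelated with the fitted values. Nothing guarantees that for predictions produced some other way.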

The use of the word "explained" here is quite unsupported and often actively misleading. Let me go over some examples to indicate why.

Start by supposing that a linear model is true:
\[
Y = a + bX + \epsilon
\]
where the noise \( \epsilon \) has constant variance \( s \), and is
uncorrelated with \( X \). Suppose that we know this is the model to use, and
suppose further that, as a reward for our scrupulous peer-review of anonymous
manuscripts, the Good Fairy of Statistical Modeling
*tells* us the correct values of the parameters \( a \) and \( b \).
Surely, with the right parameters in the right model, our \( R^2 \)
must be very high?

Well, no. The answer depends on the variance of \( X \), which it will be convenient to call \( v \). The variance of the predictions is \( b^2 v \), but the variance of \( Y \) is larger, \( b^2 v + s \). The ratio is
\[
R^2 = \frac{b^2 v}{b^2 v + s}
\]
(You can check that this is also the squared correlation between the predictions and \( Y \).) As \( v \) shrinks, this tends to \( 0/s = 0 \). As \( v \) grows, this ratio tends to 1. The relationship between \( X \) and \( Y \) doesn't change, and the accuracy and precision with which \( Y \) can be predicted from \( X \) do not change, but \( R^2 \) can wander all through its range, just depending on how dispersed \( X \) is.
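
A quick numerical check of this, as a sketch: the values of \( a \), \( b \), \( s \) and the three choices of \( v \) below are arbitrary assumptions, and the Gaussian distributions are chosen only for convenience. The predictions use the true parameters, so nothing is estimated.

```python
import random


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def r2_formula(b, v, s):
    """R^2 = b^2 v / (b^2 v + s), with v the variance of X and s the noise variance."""
    return b * b * v / (b * b * v + s)


def r2_simulated(a, b, v, s, n=20000, seed=0):
    """Variance of the true-parameter predictions over the variance of simulated Y."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, v ** 0.5) for _ in range(n)]
    y = [a + b * xi + rng.gauss(0.0, s ** 0.5) for xi in x]
    predicted = [a + b * xi for xi in x]  # the Good Fairy's parameters
    return variance(predicted) / variance(y)


# Hypothetical parameter values; only v changes between runs.
a, b, s = 1.0, 2.0, 1.0
results = {}
for v in (0.01, 1.0, 100.0):
    results[v] = (r2_formula(b, v, s), r2_simulated(a, b, v, s))
    print(f"v = {v:6.2f}  formula = {results[v][0]:.4f}  simulated = {results[v][1]:.4f}")
```

The model and the noise are identical in all three runs. Only the spread of \( X \) differs, and that alone moves \( R^2 \) from near 0 to near 1.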

Now, you say, this is a silly algebraic curiosity. Never mind the Good Fairy of Statistical Modeling handing us the correct parameters, let's talk about something gritty and real, like death in Chicago.

Number of deaths each day in Chicago, 1 January 1987--31 December 2000, from all causes except accidents. (Click this and all later figures for larger PDF versions. See below for link to code.)

I can relate deaths to time in any number of ways; the next figure shows what I get when I use a smoothing spline (and use cross-validation to pick how much smoothing to do). The statistical model is \[ \mathrm{death} = f_0(\mathrm{date}) + \epsilon \] with \( f_0 \) being a function learned from the data.

As before, but with the addition of a smoothing spline.

The root-mean-square error of the smoothing spline is just above 12
deaths/day. The \( R^2 \) of the fit is either 0.35 (squared
correlation between predicted and actual deaths) or 0.33 (variance of predicted
deaths over variance of actual deaths).
It seems absurd, however, to say that *the date* explains how many
people died in Chicago on a given day, or even the variation from day to day.
The closest I can come up with to an example of someone making such a claim
would be an astrologer, and even one of them would work in some patter about
the planets and their influences. (Numerologists, maybe? I dunno.)
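
The two definitions of \( R^2 \) give different numbers here because the fit is not a least-squares linear model. A minimal sketch of the same effect follows, with loud caveats: the series is synthetic (a made-up seasonal cycle plus noise, *not* the Chicago data), and the smoother is a plain moving average standing in for the smoothing spline, with a window width I picked arbitrarily.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def moving_average(values, half_width):
    """Centered moving average; the window is truncated at the ends of the series."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Synthetic daily counts (an assumption): yearly cycle plus Gaussian noise.
rng = random.Random(42)
deaths = [
    115.0 + 15.0 * math.cos(2.0 * math.pi * t / 365.0) + rng.gauss(0.0, 12.0)
    for t in range(4 * 365)
]
fitted = moving_average(deaths, half_width=30)

r2_corr = covariance(fitted, deaths) ** 2 / (variance(fitted) * variance(deaths))
r2_ratio = variance(fitted) / variance(deaths)
print(f"squared correlation: {r2_corr:.3f}")
print(f"variance ratio:      {r2_ratio:.3f}")
```

Least squares forces the residuals to be uncorrelated with the fitted values, which is what makes the two definitions coincide. A general smoother carries no such guarantee, so the two numbers drift apart.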

Worse is to follow. The same data set which gives me these values for Chicago includes other variables, such as the concentration of various atmospheric pollutants and temperature. I can fit an additive model, which tries to tease out the separate relationships between each of those variables and deaths in Chicago, without presuming a particular functional form for each relationship. In particular I can try the model \[ \mathrm{death} = f_1(\mathrm{sulfur\ dioxide}) + f_2(\mathrm{particulates}) + f_3(\mathrm{temperature},\mathrm{ozone}) + \epsilon \] where the functions \( f_1 \), \( f_2 \), and \( f_3 \) are all learned from data. (Exercise: why do I do a joint smoothing against temperature and ozone?) When I do that, I get functions which look like the following.

The \( R^2 \) of this model is 0.27. Is this "variance explained"? Well, it's at least not incomprehensible to talk about changes in temperature or pollution explaining changes in mortality. In fact, overlaying this model's predictions on the simple spline's, we see that most of what the spline predicted from the date is predictable from pollution and temperature:

Black dots: actual death counts. Red curve: spline smoothing on the date alone. Blue lines: predictions from the temperature-and-pollution model.

We could, in fact, try to include the date in this larger model: \[ \mathrm{death} = f_0(\mathrm{date}) + f_1(\mathrm{sulfur\ dioxide}) + f_2(\mathrm{particulates}) + f_3(\mathrm{temperature}, \mathrm{ozone}) + \epsilon \] Of course, we have to re-estimate all the functions, but as it turns out they don't change very much. (I'd show you the plot of the fitted values over time as well, but visually it's almost indistinguishable from the last one.)

Despite the lack of visual drama, putting a smooth function of time back into the model increases \( R^2 \), from 0.27 to 0.30. Formally, the date enters into the model in exactly the same way as particulate pollution. But, again, only a fortune teller (an unusually numerate fortune teller, perhaps a subscriber to the Journal of Evidence-Based Haruspicy) would say that the date explains, or helps explain, 3% of the variance.

I hope that by this point you will at least *hesitate* to think or
talk about \( R^2 \) as "the proportion of variance explained". (I
will not *insist* on your never talking that way, because you might need
to speak to the deluded in terms they understand.) How then should you think
about it? I would suggest: the proportion of variance *retained*, or
just *kept*, by the predictions. Linear regression
is a smoothing method. (It just smoothes everything on to a line, or more
generally a hyperplane.) It's hard for any smoother to give fitted values
which have *more* variance than the variable it is
smoothing. \( R^2 \) is merely the fraction of the target's variance
which is not smoothed away.

This of course raises the question of why you'd care about this number at
all. If prediction is your goal, then it would seem much more
natural to look at mean squared error. (Or really root mean squared error, so
it's in the same units as the variable predicted.) Or mean absolute error. Or
median absolute error. Or a genuine loss function. If on the other hand you
want to get some function right, then your question is really about
mis-specification, and/or confidence sets of functions, and not about whether
your smoother is following every last wiggle of the data at all. If you want
an *explanation*, the fact that there is a peak in deaths every year of
about the same height, but the *predictions* fall short of it, suggests
that this model is missing something. The fact that the data shows something
awful happened in 1995 and the model has nothing adequate to say about it
suggests that whatever's missing is very important.
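
For concreteness, here is a sketch of the prediction-error summaries mentioned above, computed on a handful of invented numbers (not taken from the Chicago data). The last day is a deliberate outlier, to show how differently the three summaries react to one bad miss.

```python
import math
import statistics


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error; same units as the predicted variable."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_absolute_error(actual, predicted):
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


def median_absolute_error(actual, predicted):
    return statistics.median(abs(a - p) for a, p in zip(actual, predicted))


# Invented daily counts and predictions; the final day is badly under-predicted.
actual = [110, 120, 130, 125, 180]
predicted = [112, 118, 127, 126, 140]

rmse_value = root_mean_squared_error(actual, predicted)
mae_value = mean_absolute_error(actual, predicted)
medae_value = median_absolute_error(actual, predicted)
print(f"RMSE:                  {rmse_value:.2f}")
print(f"mean absolute error:   {mae_value:.2f}")
print(f"median absolute error: {medae_value:.2f}")
```

The root mean squared error is dominated by the one large miss, the mean absolute error less so, and the median absolute error ignores it. Which of these is appropriate depends on what a bad prediction actually costs you, which is the point of using a genuine loss function.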

Code for
reproducing the figures and analyses in R. (I make this public, despite the
similarity of this exercise to the last problem-set in
advanced data analysis, because (i) it's not exactly the same, (ii) the
homework is due in ten hours, (iii) none of my students would *dream* of
copying this and turning it in as their own, and (iv) I borrowed the example
from Simon Wood's Generalized Additive Models.)

*Manual trackback*: Bob O'Hara;
Siris

Posted at February 13, 2012 23:54