
What Is the Right Null Model for Linear Regression?

11 Apr 2022 12:22

Draft from c. 2000, with typos and mathematical type-setting corrected

When social scientists do linear regressions, they commonly take as their null hypothesis the model in which all the independent variables have zero regression coefficients. An independent variable is usually taken to have a systematic influence on the dependent outcome, or, more cautiously, some systematic predictive power, if the regression coefficient is significantly different from zero, generally as assessed by a t-test.

There are a number of things wrong with this picture --- the easy slide from regression to causation, the assumption of linearity, the usual assumption of Gaussian noise, etc. --- but what I want to focus on here is taking the zero-coefficient model as the right null. The point of the null model, after all, is that it embodies a deflating explanation of an apparent pattern: that the pattern is due to some boring, uninteresting mechanism, and any appearance to the contrary is just due to chance. In other words, testing a null hypothesis should check for a particular kind of error. (More on that theme elsewhere.) A good example of this comes from evolutionary biology, where the way one detects adaptation is by first setting up a "neutral" model of how evolutionary dynamics would change gene frequencies in the absence of adaptation, and then looking for deviations from neutrality. This only works if the neutral model is a good picture of all the non-adaptive mechanisms shaping gene frequencies. Setting up a trivial null hypothesis which you are pretty sure is wrong is not actually very informative.

So, the question here is what the right null model would be in the kinds of situations where economists, sociologists, etc., generally use linear regression. Typically, it seems to me, these are situations where we know there are an immense number of factors, interacting in complicated ways, but we get to measure only a very small number of highly aggregated variables, few if any of which would appear in an actual causal model of the process. In this kind of situation, should we expect variables which are substantively unrelated to have zero regression coefficients? (I emphasize "substantively" because it's a tautology that an independent variable's regression coefficient will be zero if and only if it contributes nothing to the optimal linear prediction of the dependent variable beyond what the other independent variables already provide.) My intuition is rather that, unless one is spectacularly lucky in picking the observables, the regression coefficient should only rarely be zero.

Of course I don't really trust my intuition about things like this. (If it was intuitively clear then there wouldn't be an issue!) I think however that one could make some headway by tackling the following sort of model. Suppose that the system we are really concerned with is described by a very large system of linear equations, with let us say \( N \) variables. These may be either difference equations (for time evolution) or simultaneous equations (for equilibria); I'm not sure which would be easier to pursue. Exogenous terms can also be taken to be noise. So let's say the system is \( X_{t+1} = \mathbf{a} X_t + \epsilon_{t} \), where \( X_t \) and \( \epsilon_t \) are \( N \)-dimensional vectors, and \( \mathbf{a} \) is an \( N \times N \) matrix. Most entries in the system matrix are zero, say only \( O(N) \) non-zero entries; those we take to be drawn at random from some distribution jiggered so that solutions are sensible. In other words, imagine the distribution of coefficients to be random, with a big spike of probability mass at zero. If we could measure these variables, then linear regression would give us sensible estimates of the parameters of the system (modulo all the qualifiers filling econometrics textbooks).
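As a concrete illustration, here is a minimal simulation sketch of such a latent system in Python/NumPy. The particular numbers (the size \( N \), the number of time steps, the sparsity level) and the rescaling of \( \mathbf{a} \) to keep its spectral radius below one (one way of "jiggering" the coefficients so the dynamics are stable) are arbitrary illustrative choices, not part of the argument.

    import numpy as np

    rng = np.random.default_rng(0)
    N, T = 500, 200        # number of latent variables, number of time steps (illustrative)
    density = 3.0 / N      # gives O(N) non-zero entries in the N x N system matrix

    # Sparse random system matrix: most entries exactly zero, the rest Gaussian.
    a = rng.normal(size=(N, N)) * (rng.random(size=(N, N)) < density)
    a *= 0.9 / np.max(np.abs(np.linalg.eigvals(a)))   # rescale so the dynamics are stable

    # Simulate X_{t+1} = a X_t + epsilon_t with IID standard Gaussian noise.
    X = np.zeros((T, N))
    for t in range(T - 1):
        X[t + 1] = a @ X[t] + rng.normal(size=N)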

The trick, however, is that we don't get to observe them. Suppose then that we get to observe \( K \) variables, each of which is a linear combination of the \( N \) underlying variables, with \( K \ll N \). We form these aggregate observables in some totally dumb way, say uniform random sampling over variables, together with IID standard Gaussian weights, so observable \( Y_{it} = \sum_{j} C_{ij} X_{jt} \), where the distribution of the coefficients \( C_{ij} \) also has a spike at zero (but is conditionally Gaussian). We designate one of the \( Y_i \) as the dependent variable, again at random.
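Continuing the sketch (and assuming the array X and the generator rng from the block above), the aggregation and the regression might look like this; the value of \( K \) and the density of non-zero aggregation weights are again arbitrary choices.

    # K aggregate observables, each a sparse random linear combination of the
    # N latent variables: weights are zero with high probability, Gaussian otherwise.
    K = 10
    C = rng.normal(size=(K, N)) * (rng.random(size=(K, N)) < 0.1)
    Y = X @ C.T            # Y[t, i] = sum_j C[i, j] * X[t, j]

    # Take the last aggregate as the dependent variable, regress it on the rest.
    y, Z = Y[:, -1], np.column_stack([np.ones(T), Y[:, :-1]])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]        # OLS estimates
    print(beta[1:])        # how many of these come out anywhere near zero?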

Questions:

  1. What is the distribution of regression weights for aggregate quantities, holding the low-level parameters fixed but randomizing over aggregations? (A simulation sketch follows the list.)
  2. What is the distribution if we also randomize over the low-level parameters?
  3. Does either distribution have a mass spike at zero?
  4. In what limits do the distributions converge to a point mass at zero?
  5. What changes in the special case where the \( K \) observables are averages (or sums) of \( M \) low-level variables for each of \( K \) units (so that \( N = MK \))?
  6. What changes in the special case where the \( K \) observables are a sample of the low-level variables, rather than being linear combinations of them?
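
The first question, at least, invites a brute-force Monte Carlo attack: hold the low-level system and its realized trajectory fixed, draw many random aggregations, and look at the distribution of the resulting coefficients. A sketch, again assuming the objects defined in the blocks above and with all the tuning constants picked arbitrarily:

    # Monte Carlo sketch for question 1: fix the low-level system and its
    # trajectory X, randomize over aggregations, and pool the OLS coefficients.
    def random_regression_weights(X, K=10, agg_density=0.1, rng=rng):
        T, N = X.shape
        C = rng.normal(size=(K, N)) * (rng.random(size=(K, N)) < agg_density)
        Y = X @ C.T
        y, Z = Y[:, -1], np.column_stack([np.ones(T), Y[:, :-1]])
        return np.linalg.lstsq(Z, y, rcond=None)[0][1:]   # drop the intercept

    draws = np.concatenate([random_regression_weights(X) for _ in range(1000)])
    print(np.mean(np.abs(draws) < 0.01))   # how much mass sits near zero?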

--- Except in the last case, I don't think this is the same as omitted variable bias. (That is what happens when your specification doesn't include variables it should, but does include ones correlated with them, which induces a bias in the estimates of the coefficients for the included variables.) It may, however, be possible to treat all this using the same machinery.

Addendum, 2009: This idea, it occurs to me, is probably related to the Thomson sampling model (which I've discussed elsewhere).

Addendum, 2022: I'm still interested in this, and think it'd make a good student project.

