## What Is the Right Null Model for Linear Regression?

*16 Aug 2011 21:44*

When social scientists do linear regressions, they commonly take as their
null hypothesis the model in which all the independent variables have zero
regression coefficients. An independent variable
is usually taken to have a systematic influence on the dependent outcome, or,
more cautiously, some systematic predictive power, if the regression
coefficient is significantly different from zero, generally as assessed by
a *t*-test.

There are a number of things wrong with this picture --- the easy slide from
regression to causation, the assumption of
linearity, the usual assumption of Gaussian noise, etc. --- but what I want to
focus on here is taking the zero-coefficient model as the right null. The
point of the null model, after all, is that it embodies a deflating explanation
of an apparent pattern, that it's somehow due to a boring, uninteresting
mechanism, and any appearance to the contrary is just due to chance. In other
words, testing a null hypothesis should check for a particular kind
of *error*. (See further to that
theme.) A good example of this comes
from evolutionary biology, where the way one
detects adaptation is by first setting up a
"neutral" model, of how evolutionary dynamics would change gene frequencies in
the absence of adaptation, and then looking for deviations from neutrality.
This only works if the neutral model is a good picture of all the non-adaptive
mechanisms shaping gene frequencies. Setting up a trivial null hypothesis
which you are pretty sure is wrong is not actually very informative.

So the question here is what the right null model would be in the kinds
of situations where economists, sociologists, etc., generally use linear
regression. Typically, it seems to me, these are situations where we know
there are an immense number of factors, interacting in complicated ways, but we
get to measure only a very small number of highly aggregated variables, few if any of
which would appear in an actual causal model of the process. In this kind of
situation, should we expect variables which are *substantively*
unrelated to have zero regression coefficients? (I emphasize "substantively"
because it's a tautology that an independent variable's regression coefficient
will be zero if and only if it contributes nothing to the optimal linear
predictor formed from the other independent variables.) My intuition is rather
that, unless one is spectacularly lucky in picking the observables, the
regression coefficient should only rarely be zero.

Of course I don't really trust my intuition about things like this. (If it
was intuitively clear then there wouldn't be an issue!) I *think*
however that one could make some headway by tackling the following sort of
model. Suppose that the system we are really concerned with is described by a
very large system of linear equations, with let us say N variables. These may
be either difference equations (for time evolution) or simultaneous equations
(for equilibria); I'm not sure which would be easier to pursue. Exogenous
terms can also be taken to be noise. Most entries in the N*N matrix describing
the system are zero, say only O(N) entries are non-zero; those we take to be
drawn at random from some distribution jiggered so that solutions are sensible. In
other words, imagine the distribution of coefficients to be random, with a big
spike of probability mass at zero. If we could measure these variables, then
linear regression would give us sensible estimates of the parameters of the
system (modulo all the qualifiers filling econometrics textbooks).

The trick, however, is that we don't get to observe them. Suppose then that we get to observe K variables, each of which is a linear combination of the N underlying variables; K is much smaller than N. We form these aggregate observables in some totally dumb way, say uniform random sampling over variables, together with IID standard Gaussian weights. We designate one of them as the dependent variable, again at random, and regress it on the other K-1.
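In symbols, the simultaneous-equations version of this setup can be sketched as follows. The notation ($A$, $\varepsilon$, $W$, $\Sigma$, $\beta$) is introduced here purely for illustration, and the sketch assumes zero-mean noise and an invertible $I - A$:

$$
\begin{aligned}
x &= A x + \varepsilon, \qquad x = (I - A)^{-1}\varepsilon, \qquad \Sigma_x = (I - A)^{-1}\,\Sigma_\varepsilon\,(I - A)^{-\top},\\
y &= W x, \qquad \Sigma_y = W\,\Sigma_x\,W^{\top},\\
\beta &= \left(\Sigma_y\right)_{-j,-j}^{-1}\left(\Sigma_y\right)_{-j,\,j}.
\end{aligned}
$$

Here $x$ collects the N underlying variables, $A$ is the sparse N×N coefficient matrix, $W$ is the K×N aggregation matrix, $y_j$ is the designated observable, and the subscript $-j$ means "all observables except $y_j$". $\beta$ is then the vector of population coefficients from regressing $y_j$ on the others. Written this way, a coefficient is exactly zero precisely when the corresponding off-diagonal entry of $\Sigma_y^{-1}$ is zero, so asking about a spike at zero amounts to asking how often random aggregation produces zeros in that inverse.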

Questions:

- What is the distribution of regression weights for aggregate quantities, holding the low-level parameters fixed but randomizing over aggregations?
- What is the distribution if we also randomize over the low-level parameters?
- Does either distribution have a mass spike at zero?
- In what limits do the distributions converge to zero?
- What changes in the special case where the K observables are averages (or sums), over M units, of K variables measured on each unit (so that N=MK)?
- What changes in the special case where the K observables are a sample of the low-level variables, rather than being linear combinations of them?
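
A minimal numerical sketch of the first question, under assumptions that go beyond the description above: the simultaneous-equations version of the system, unit-variance Gaussian exogenous noise, a fixed number of non-zero coefficients per row, and a rescaling of the coefficient matrix to spectral radius 0.9 as one arbitrary way of keeping solutions sensible. All function names and default values are hypothetical. It computes population (infinite-sample) regression coefficients directly from covariances, so no sampling noise enters.

```python
import numpy as np

def sparse_system(n, per_row, rng, radius=0.9):
    """Random n x n coefficient matrix with `per_row` non-zero entries per row."""
    a = np.zeros((n, n))
    for i in range(n):
        cols = rng.choice(n, size=per_row, replace=False)
        a[i, cols] = rng.standard_normal(per_row)
    # Assumption: rescale to spectral radius `radius` < 1 so that I - A is invertible.
    rho = np.max(np.abs(np.linalg.eigvals(a)))
    return a * (radius / rho)

def latent_covariance(a):
    """Covariance of x solving x = A x + e, with e ~ N(0, I) (assumed noise)."""
    b = np.linalg.inv(np.eye(a.shape[0]) - a)
    return b @ b.T

def random_aggregation(n, k, group_size, rng):
    """K x N weights: each observable mixes `group_size` randomly chosen
    variables with IID standard Gaussian weights."""
    w = np.zeros((k, n))
    for j in range(k):
        members = rng.choice(n, size=group_size, replace=False)
        w[j, members] = rng.standard_normal(group_size)
    return w

def population_coefficients(sigma_x, w, target):
    """Population coefficients from regressing observable `target` on the rest."""
    s = w @ sigma_x @ w.T  # covariance of the K observables
    others = [j for j in range(s.shape[0]) if j != target]
    return np.linalg.solve(s[np.ix_(others, others)], s[others, target])

def simulate(n=200, k=5, per_row=3, group_size=20, draws=500, seed=0):
    """Hold the low-level system fixed; randomize over aggregations."""
    rng = np.random.default_rng(seed)
    sigma_x = latent_covariance(sparse_system(n, per_row, rng))
    betas = np.empty((draws, k - 1))
    for d in range(draws):
        w = random_aggregation(n, k, group_size, rng)
        target = int(rng.integers(k))  # designated observable, chosen at random
        betas[d] = population_coefficients(sigma_x, w, target)
    return betas

betas = simulate()
print("share of exactly-zero coefficients:", np.mean(betas == 0.0))
print("quantiles of |beta|:", np.quantile(np.abs(betas), [0.1, 0.5, 0.9]))
```

Moving the call to `sparse_system` inside the loop would randomize over the low-level parameters as well, which is the second question. Varying `n`, `k`, and `group_size` gives a rough look at the limiting behavior, though a simulation like this can only suggest answers, not establish them.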

Except in the last case, I don't think this is the same as omitted variable bias. (That is what happens when your specification leaves out variables it should include, but does include ones correlated with them, which biases the estimated coefficients of the included variables.) It may, however, be possible to treat all this using the same machinery.

— This idea, it occurs to me, is probably related to the Thomson sampling model (which I've discussed elsewhere).