## What Is the Right Null Model for Linear Regression?

*11 Apr 2022 12:22*

#### Draft from c. 2000, with typos and mathematical type-setting corrected

When social scientists do linear regressions, they commonly take as their
null hypothesis the model in which all the independent variables have zero
regression coefficients. An independent variable
is usually taken to have a systematic influence on the dependent outcome, or,
more cautiously, some systematic predictive power, if the regression
coefficient is significantly different from zero, generally as assessed by
a *t*-test.

There are a number of things wrong with this picture --- the easy slide from
regression to causation, the assumption of
linearity, the usual assumption of Gaussian noise, etc. --- but what I want to
focus on here is taking the zero-coefficient model as the right null. The
point of the null model, after all, is that it embodies a deflating explanation
of an apparent pattern, that it's somehow due to a boring, uninteresting
mechanism, and any appearance to the contrary is just due to chance. In other
words, testing a null hypothesis should check for a particular kind
of *error*. (See further to that
theme.) A good example of this comes
from evolutionary biology, where the way one
detects adaptation is by first setting up a
"neutral" model, of how evolutionary dynamics would change gene frequencies in
the absence of adaptation, and then looking for deviations from neutrality.
This only works if the neutral model is a good picture of all the non-adaptive
mechanisms shaping gene frequencies. Setting up a trivial null hypothesis
which you are pretty sure is wrong is not actually very informative.

So, the question here is what the right null model would be in the kinds
of situations where economists, sociologists, etc., generally use linear
regression. Typically, it seems to me, these are situations where we know
there are an immense number of factors, interacting in complicated ways, but we
get to measure only a very small number of highly aggregated variables, few if any of
which would appear in an actual causal model of the process. In this kind of
situation, should we expect variables which are *substantively*
unrelated to have zero regression coefficients? (I emphasize "substantively"
because it's a tautology that an independent variable's regression coefficient
will be zero if and only if it contributes nothing to the optimal linear
predictor formed from the other independent variables.) My intuition is rather
that, unless one is spectacularly lucky in picking the observables, the
regression coefficient should only rarely be zero.

Of course I don't really trust my intuition about things like this. (If it
was intuitively clear then there wouldn't be an issue!) I *think*
however that one could make some headway by tackling the following sort of
model. Suppose that the system we are really concerned with is described by a
very large system of linear equations, with let us say \( N \) variables.
These may be either difference equations (for time evolution) or simultaneous
equations (for equilibria); I'm not sure which would be easier to pursue.
Exogenous terms can also be taken to be noise. So let's say the system is \(
X_{t+1} = \mathbf{a} X_t + \epsilon_{t} \), where \( X_t \) and \( \epsilon_t
\) are \( N \)-dimensional vectors, and \( \mathbf{a} \) is an \( N \times N \)
matrix. Most entries in the system matrix are zero, say only \( O(N) \)
non-zero entries; those we take to be random from some distribution jiggered so
that solutions are sensible. In other words, imagine the distribution of
coefficients to be random, with a big spike of probability mass at zero. If we
could measure these variables, then linear regression would give us sensible
estimates of the parameters of the system (modulo all the qualifiers filling
econometrics textbooks).
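
A minimal sketch of this fully observed set-up, in Python with NumPy, may make it concrete. Everything specific here is an assumption for illustration: the function names, a fixed number of non-zero entries per row (one way to get \( O(N) \) non-zeros), standard Gaussian non-zero coefficients, rescaling to a spectral radius below one as the "jiggering" that keeps solutions sensible, and standard Gaussian noise.

```python
import numpy as np

def sparse_system_matrix(n, nonzeros_per_row=3, spectral_radius=0.9, rng=None):
    """Random n x n matrix with O(n) non-zero entries, rescaled to be stable.

    Assumption: each row gets `nonzeros_per_row` standard Gaussian entries
    in randomly chosen columns; everything else is exactly zero.
    """
    rng = np.random.default_rng(rng)
    a = np.zeros((n, n))
    for i in range(n):
        cols = rng.choice(n, size=nonzeros_per_row, replace=False)
        a[i, cols] = rng.standard_normal(nonzeros_per_row)
    # Rescale so the largest eigenvalue modulus equals spectral_radius (< 1),
    # which keeps X_t from blowing up.
    radius = np.max(np.abs(np.linalg.eigvals(a)))
    return a * (spectral_radius / radius)

def simulate(a, t_steps, burn_in=200, rng=None):
    """Simulate X_{t+1} = a X_t + eps_t with IID standard Gaussian eps_t."""
    rng = np.random.default_rng(rng)
    n = a.shape[0]
    x = np.zeros(n)
    out = np.empty((t_steps, n))
    for t in range(burn_in + t_steps):
        if t >= burn_in:
            out[t - burn_in] = x  # record only after the burn-in period
        x = a @ x + rng.standard_normal(n)
    return out

def ols_transition(x):
    """Least-squares estimate of `a` from a fully observed trajectory.

    Solves x[:-1] @ B = x[1:] for B, so B is the transpose of `a`.
    """
    b, *_ = np.linalg.lstsq(x[:-1], x[1:], rcond=None)
    return b.T
```

With a long enough trajectory, `ols_transition(simulate(a, T))` should land close to `a`, zeros included, which is the sense in which regression works when all \( N \) variables are measured.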

The trick, however, is that we don't get to observe them. Suppose then that we get to observe \( K \) variables, each of which is a linear combination of the \( N \) underlying variables, with \( K \ll N \). We form these aggregate observables in some totally dumb way, say uniform random sampling over variables, together with IID standard Gaussian weights, so observable \( Y_{it} = \sum_{j} C_{ij} X_{jt} \), where the distribution of the coefficients \( C_{ij} \) also has a spike at zero (but is conditionally Gaussian). We designate one of the \( Y_i \) as the independent variable, again at random.
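
A rough sketch of the aggregation step follows, again in Python with NumPy. The text leaves several things open, so these are labeled assumptions: the regression is contemporaneous, one observable (index `target`) is regressed on the remaining \( K - 1 \) (the text does not pin down which role the designated variable plays), the noise covariance is the identity, and each weight is non-zero with a hypothetical probability `inclusion_prob`. Instead of simulating, it computes population coefficients from the stationary covariance \( \Sigma = \mathbf{a} \Sigma \mathbf{a}^T + I \) of \( X_t \). All function names are made up for illustration.

```python
import numpy as np

def aggregation_matrix(k, n, inclusion_prob=0.1, rng=None):
    """K x N weights C: zero with prob. 1 - inclusion_prob, else N(0, 1)."""
    rng = np.random.default_rng(rng)
    mask = rng.random((k, n)) < inclusion_prob
    return mask * rng.standard_normal((k, n))

def stationary_covariance(a, tol=1e-12, max_iter=10_000):
    """Covariance of X_t for X_{t+1} = a X_t + eps, eps ~ N(0, I).

    Fixed-point iteration Sigma <- a Sigma a' + I; this converges only
    when the spectral radius of `a` is below one.
    """
    n = a.shape[0]
    sigma = np.eye(n)
    for _ in range(max_iter):
        new = a @ sigma @ a.T + np.eye(n)
        if np.max(np.abs(new - sigma)) < tol:
            return new
        sigma = new
    raise RuntimeError("no convergence; is the spectral radius of a below 1?")

def population_coefficients(c, sigma, target=0):
    """Population least-squares coefficients of Y_target on the other Y_i."""
    cov_y = c @ sigma @ c.T  # covariance of the aggregates Y = C X
    others = [i for i in range(c.shape[0]) if i != target]
    return np.linalg.solve(cov_y[np.ix_(others, others)], cov_y[others, target])

def coefficient_draws(a, k, n_draws=500, inclusion_prob=0.1, rng=None):
    """Hold `a` fixed, redraw the aggregation C, and collect the coefficients."""
    rng = np.random.default_rng(rng)
    sigma = stationary_covariance(a)
    n = a.shape[0]
    return np.array([
        population_coefficients(aggregation_matrix(k, n, inclusion_prob, rng), sigma)
        for _ in range(n_draws)
    ])
```

A histogram of `coefficient_draws(a, k)` is one crude way to look at how the regression weights are distributed over random aggregations for a fixed low-level system; it says nothing by itself about what the right limits or scalings are.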

Questions:

- What is the distribution of regression weights for aggregate quantities, holding the low-level parameters fixed but randomizing over aggregations?
- What is the distribution if we also randomize over the low-level parameters?
- Does either distribution have a mass spike at zero?
- In what limits do the distributions converge to zero?
- What changes in the special case where the \( K \) observables are averages (or sums) of \( M \) observables for each of \( K \) units (so that \( N = MK \))?
- What changes in the special case where the K observables are a sample of the low-level variables, rather than being linear combinations of them?

--- Except in the last case, I don't think this is the same as omitted variable bias. (That is what happens when your specification doesn't include variables it should, but does include ones correlated with them, which induces a bias in the estimate of the coefficients for the included variables.) It may, however, be possible to treat all this using the same machinery.
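
The two special cases in the list of questions amount to particular choices of the aggregation matrix \( C \). A small sketch in Python with NumPy, with hypothetical function names and the assumption (not in the text) that the low-level variables are ordered unit by unit:

```python
import numpy as np

def block_average_matrix(k, m):
    """K x (M*K) matrix: observable i averages the M variables of unit i.

    Assumes variables i*m, ..., (i+1)*m - 1 belong to unit i.
    """
    return np.kron(np.eye(k), np.full((1, m), 1.0 / m))

def subsample_matrix(k, n, rng=None):
    """K x N selection matrix: each observable is one distinct low-level variable."""
    rng = np.random.default_rng(rng)
    picked = rng.choice(n, size=k, replace=False)
    c = np.zeros((k, n))
    c[np.arange(k), picked] = 1.0
    return c
```

For sums rather than averages, multiply the block-average matrix by `m`. Either matrix can stand in for a randomly drawn \( C \) when forming \( Y_t = C X_t \).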

**Addendum, 2009**: This idea, it occurs to me, is probably related to the Thomson
sampling model (which I've
discussed elsewhere).

**Addendum, 2022**: I'm still interested in this, and think
it'd make a good student project.

- Recommended and relevant:
  - Richard A. Berk, Regression Analysis: A Constructive Critique
  - Brad DeLong and Kevin Lang, "Are All Economic Hypotheses False?" [PDF]
  - Trygve Haavelmo, "The Probability Approach in Econometrics", Econometrica **12** (1944): iii--115 [JSTOR]
  - Paul E. Meehl [What I discuss above is precisely Meehl's "crud factor", which I didn't know about when I wrote it.]
    - "Why Summaries of Research on Psychological Theories Are Often Uninterpretable", Psychological Reports **66** (1990): 195--244 [PDF reprint]
    - "Theory-Testing in Psychology and Physics: A Methodological Paradox", Philosophy of Science **34** (1967): 103--115 [PDF reprint]
  - Christopher Tosh, Philip Greengard, Ben Goodrich, Andrew Gelman, Aki Vehtari, and Daniel Hsu, "The piranha problem: Large effects swimming in a small pond", arxiv:2105.13445 [On the impossibility of finding lots of variables having large effects on the same target variable, *without* those variables also having large effects on each other]

- To read:
  - Peter Spirtes, "Variable Definition and Causal Inference", Proceedings of the 13th International Congress of Logic Methodology and Philosophy of Science, pp. 514--53 [PDF reprint via Prof. Spirtes]