Instrumental Variables

Last update: 13 Dec 2024 15:35
First version: 28 May 2021

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \]

(I'll just talk graphical causal models here, because they make more sense to me than alternatives.)

This is a technique for causal inference. The basic logic is as follows. We want to estimate (or test, etc.) the effect of one observable variable \( X \) on another, \( Y \). That is, we want to find \( \Expect{Y|do(X)} \). Unfortunately, we are pretty sure that this effect is confounded; there is some third variable \( U \) which is a causal ancestor of both \( X \) and \( Y \). The "instrument" is a fourth, observable variable, say \( W \), which is (i) an ancestor of \( X \) and (ii) has no (unblocked) paths to \( Y \) except through \( X \). The no-unblocked-paths bit, together with \( W \) itself being exogenous, means that \( \Expect{Y|do(W)} \) and \( \Expect{X|do(W)} \) are just the ordinary regressions \( \Expect{Y|W} \) and \( \Expect{X|W} \), so both are easy to estimate. The trick is then to "back out" or "factor out" \( \Expect{Y|do(X)} \) from these two observationally-identified functions.
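In code, the assumed causal graph is just this parent map (a trivial sketch, with the variable names as above):

    # Parent sets in the assumed DAG.  The exclusion restriction is that W is a
    # parent (ancestor) of X but not of Y, and has no other route to Y.
    parents = {
        "W": [],          # instrument: exogenous
        "U": [],          # unobserved confounder: exogenous
        "X": ["W", "U"],  # treatment
        "Y": ["X", "U"],  # outcome; W is deliberately absent here
    }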

If everything's linear, this is pretty straightforward in principle. First, write out the "structural" equations showing how each variable depends on its parents (taking the exogenous terms \( W \), \( U \), \( \eta \) and \( \epsilon \) to be mutually uncorrelated): \[ \begin{eqnarray} X & \leftarrow & \alpha_1 W + \alpha_2 U + \eta\\ Y & \leftarrow & \gamma_1 X + \gamma_2 U + \epsilon \end{eqnarray} \] Substituting the first into the second, we get that \[ Y = \alpha_1 \gamma_1 W + (\alpha_2 \gamma_1 + \gamma_2) U + \gamma_1 \eta + \epsilon \] so the true regression coefficient of \( Y \) on \( W \) will be \( \alpha_1 \gamma_1 \). But the true regression coefficient of \( X \) on \( W \) will be \( \alpha_1 \). So just taking the ratio is one way to back out the coefficient we want, which is \( \gamma_1 \). Notice by the way that \[ \Cov{X, Y} = \gamma_1 \Var{X} + \alpha_2 \gamma_2 \Var{U} \] so just regressing \( Y \) on \( X \) will yield a coefficient which we might call \[ \beta = \gamma_1 + \alpha_2 \gamma_2 \frac{\Var{U}}{\Var{X}} ~, \] which can be arbitrarily different from \( \gamma_1 \). (Remember that the optimal linear coefficient for predicting any \( Z \) from any \( V \) is \( \frac{\Cov{Z,V}}{\Var{V}} \), regardless of whether the true regression function is linear, of the direction [if any] of the causal relation, etc.)
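Here is a minimal simulation of this linear set-up (the parameter values are arbitrary, purely for illustration): the naive regression slope of \( Y \) on \( X \) comes out badly biased, while the ratio of the two \( W \)-coefficients recovers \( \gamma_1 \).

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    alpha1, alpha2 = 2.0, 1.5    # effects of W and U on X
    gamma1, gamma2 = 0.7, -3.0   # effects of X and U on Y; gamma1 is the target

    W = rng.normal(size=n)       # instrument, exogenous
    U = rng.normal(size=n)       # unobserved confounder
    X = alpha1 * W + alpha2 * U + rng.normal(size=n)
    Y = gamma1 * X + gamma2 * U + rng.normal(size=n)

    def slope(z, v):
        """Least-squares slope of z on v, i.e. Cov(z, v) / Var(v)."""
        return np.cov(z, v)[0, 1] / np.var(v, ddof=1)

    beta_naive = slope(Y, X)               # = gamma1 + alpha2*gamma2*Var(U)/Var(X), about 0.08 here
    gamma1_iv = slope(Y, W) / slope(X, W)  # ratio of the two W-coefficients, about 0.7
    print(beta_naive, gamma1_iv)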

Alternately, we can do "two-stage least squares". This is where we regress \( Y \) not on \( X \), but on what we'd predict \( X \) to be based on \( W \), namely \( \alpha_1 W \) (in practice, the fitted values from a first-stage regression of \( X \) on \( W \)). This, again, will yield the coefficient \( \gamma_1 \), since \( \Cov{Y, \alpha_1 W} = \alpha_1 \Cov{Y, W} = \alpha_1^2 \gamma_1 \Var{W} \), while \( \Var{\alpha_1 W} = \alpha_1^2 \Var{W} \).
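And a correspondingly minimal two-stage-least-squares sketch, on the same sort of simulated data as above (again, all parameter values are invented):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100_000
    W, U = rng.normal(size=n), rng.normal(size=n)
    X = 2.0 * W + 1.5 * U + rng.normal(size=n)   # alpha1 = 2.0, alpha2 = 1.5
    Y = 0.7 * X - 3.0 * U + rng.normal(size=n)   # gamma1 = 0.7, gamma2 = -3.0

    def slope(z, v):
        return np.cov(z, v)[0, 1] / np.var(v, ddof=1)

    X_hat = slope(X, W) * W    # stage 1: predict X from the instrument alone
    print(slope(Y, X_hat))     # stage 2: regress Y on that prediction; about 0.7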

The ratio estimator and two-stage least squares both assume everything is linear, but the basic logic doesn't. That logic is: we know \( W \) only affects \( Y \) by first affecting \( X \); we can identify how \( W \) affects \( X \) and how \( W \) affects \( Y \); and this has to tell us how the impulse is transmitted through \( X \). What I am particularly interested in are nonparametric methods for instrumental-variable inference, which do not assume linearity.

There is a classic derivation here, which ends up expressing what we want as the solution to an integral equation. (I believe this formulation is due to Darolles et al. but I am writing from memory so I might be off.) Let's abbreviate \( \Expect{Y|do(X=x)} \) as \( f(x) \). The trick is to show that a certain integral transformation of \( f \) can be expressed in terms of observably-identified quantities.

Say that \( p(x,w) \) is the joint pdf of \( X \) and \( W \). (Similarly for the related conditional and marginal pdfs, hopefully kept clear by their arguments.) This is an observationally identified quantity. We can thus define \[ t(x,z) \equiv \int{p_{XW}(x, w) p_{XW}(z, w) dw} = \int{p(x|w) p(z|w) p^2(w) dw} \] as a sort of kernel (in the machine-learning sense), expressing something like "how similar are the events \( X=x \) and \( X=z \), as potential consequences of \( W \)?" We can in fact make this into the kernel of an integral operator on functions of \( x \), \[ (T\psi)(x) = \int{t(z,x) \psi(z) dz} \] Now the claim is that \[ \Expect{\Expect{Y|W} p_{XW}(x, W)} = (Tf)(x) \] This helps us if the operator \( T \) has an inverse, \( T^{-1} \), because then \[ f(x) = \Expect{\Expect{Y|W} (T^{-1} p_{XW})(x, W)} \] (To see this, apply \( T \) to both sides of the last equation above, and remember that \( T \) is by construction a linear operator.)

To verify the claim, start by noticing that we can write \[ Y = f(X) + U + \epsilon \] where \( U \) now stands for the confounder's contribution to \( Y \) (so we are assuming the confounding enters additively); without loss of generality \( \Expect{U} = 0 \), but \( \Expect{U|X} \neq 0 \). On the other hand, \( \Expect{U|W} = 0 \), because (in the graphical model we're assuming) \( U \) and \( W \) are both exogenous, hence independent. So \[ \begin{eqnarray} \Expect{Y|W=w} & = & \Expect{f(X) + U+\epsilon|W=w}\\ & = & \Expect{f(X)|W=w}\\ & = & \int{p(x|w) f(x) dx}\\ & = & \frac{\int{p(x, w) f(x) dx}}{p(w)} \end{eqnarray} \] Thus \[ \begin{eqnarray} \Expect{\Expect{Y|W} p(x,W)} & = & \int{p(w) \Expect{Y|W=w} p(x,w) dw}\\ & = & \int{p(w) p(x,w) \frac{\int{p(z,w) f(z) dz}}{p(w)} dw}\\ & = & \int{\int{f(z) p(z,w) p(x,w) dw dz}}\\ & = & \int{dz f(z) \int{p(z,w) p(x,w) dw}}\\ & = & \int{dz f(z) t(x,z)} \end{eqnarray} \] as desired.
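One way to make this concrete: if \( X \) and \( W \) are discretized onto finite grids, the integrals become sums, \( p(x,w) \) becomes a joint probability table \( P \), the kernel \( t \) becomes the matrix \( P P^{T} \), and the claim turns into a finite linear system. Here is a small numerical check along those lines; the grids, the structural \( f \), and the confounding mechanism are all invented for illustration.

    import numpy as np

    x_grid = np.array([-1.5, -0.5, 0.5, 1.5])       # support of (discretized) X
    w_grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # support of the instrument W
    p_w = np.full(len(w_grid), 1 / len(w_grid))     # W uniform, independent of U
    u_vals, p_u = np.array([-1.0, 1.0]), np.array([0.5, 0.5])  # confounder U, mean zero

    def f_true(x):              # the structural function f(x) = E[Y | do(X=x)]
        return x ** 2

    def p_x_given(w, u):        # pmf of X given W = w and U = u (arbitrary choice)
        e = np.exp(-(x_grid - (0.8 * w + 0.9 * u)) ** 2)
        return e / e.sum()

    # Joint pmf P[i, j] = P(X = x_i, W = w_j), and E[Y | W = w_j] for Y = f(X) + U.
    P = np.zeros((len(x_grid), len(w_grid)))
    EY_w = np.zeros(len(w_grid))
    for j, w in enumerate(w_grid):
        for u, pu in zip(u_vals, p_u):
            cond = p_x_given(w, u)
            P[:, j] += p_w[j] * pu * cond
            EY_w[j] += pu * np.sum(cond * (f_true(x_grid) + u))

    T = P @ P.T                    # discrete analogue of the kernel t(x, z)
    h = P @ (p_w * EY_w)           # h(x_i) = E[ E[Y|W] p(x_i, W) ], observationally identified
    f_hat = np.linalg.solve(T, h)  # invert the operator equation T f = h

    print(f_hat)                   # agrees with f_true(x_grid) = [2.25, 0.25, 0.25, 2.25]

With densities estimated from data, rather than an exactly-known probability table, \( T \) is badly ill-conditioned and the inversion has to be regularized; as I understand it, that is where much of the real work in this literature goes.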

This is one of the places where I follow the math and can use it, but there is something missing from my grasp of it, because it would never occur to me on my own to go through this set of manipulations. In fact I have to look at my notes to remember it right now. (In fact, when I wrote the section of ADAfaEPoV about instrumental variables and integral equations, I worked from memory / trying to derive everything from first principles, and came up with a much simpler approach --- which was quite wrong.) So one thing I would like to do is find some story which makes all this natural. If nothing else, it would help me to teach it!

