Robins and Wasserman Respond to a Nobel Prize Winner

Attention conservation notice: 2500 words of statisticians quarreling with econometricians about arcane points of statistical theory.

Long, long ago, I tried to inveigle Larry into writing this, by promising I would make it a guest post. Larry now has his own blog, but a promise is a promise. More to the point, while I can't claim any credit for it, I'm happy to endorse it, and to pushing it along by reproducing it. Everything between the horizontal lines is by Jamie and Larry, though I tweaked the formatting trivially.

Robins and Wasserman Respond to a Nobel Prize Winner

James Robins and Larry Wasserman

Note: This blog post is written by two people and it is cross-posted at Normal Deviate and Three Toed Sloth.

Chris Sims is a Nobel prize winning economist who is well known for his work on macroeconomics, Bayesian statistics, vector autoregressions among other things. One of us (LW) had the good fortune to meet Chris at a conference and can attest that he is also a very nice guy.

Chris has a paper called On an An Example of Larry Wasserman. This post is a response to Chris' paper.

The example in question is actually due to Robins and Ritov (1997). A simplified version appeared in Wasserman (2004) and Robins and Wasserman (2000). The example is related to ideas from the foundations of survey sampling (Basu 1969, Godambe and Thompson 1976) and also to ancillarity paradoxes (Brown 1990, Foster and George 1996).

1. The Model

Here is (a version of) the example. Consider iid random variables \[ (X_1,Y_1,R_1),\ldots, (X_n,Y_n,R_n). \] The random variables take values as follows: \[ X_i \in [0,1]^d,\ \ \ Y_i \in\{0,1\},\ \ \ R_i \in\{0,1\}. \] Think of \( d \) as being very, very large. For example, \( d=100,000 \) and \( n=1,000 \).

The idea is this: we observe \( X_i \). Then we flip a biased coin \( R_i \). If \( R_i=1 \) then you get to see \( Y_i \). If \( R_i=0 \) then you don't get to see \( Y_i \). The goal is to estimate \[ \psi = P(Y_i=1). \] Here are the details. The distribution takes the form \[ p(x,y,r) = p_X(x) p_{Y|X}(y|x)p_{R|X}(r|x). \] Note that \( Y \) and \( R \) are independent, given \( X \). For simplicity, we will take \( p(x) \) to be uniform on \( [0,1]^d \). Next, let \[ \theta(x) \equiv p_{Y|X}(1|x) = P(Y=1|X=x) \] where \( \theta(x) \) is a function. That is, \( \theta:[0,1]^d \rightarrow [0,1] \). Of course, \[ p_{Y|X}(0|x)= P(Y=0|X=x) = 1-\theta(x). \] Similarly, let \[ \pi(x)\equiv p_{R|X}(1|x) = P(R=1|X=x) \] where \( \pi(x) \) is a function. That is, \( \pi:[0,1]^d \rightarrow [0,1] \). Of course, \[ p_{R|X}(0|x)= P(R=0|X=x) = 1-\pi(x). \]

The function \( \pi \) is known. We construct it. Remember that \( \pi(x) = P(R=1|X=x) \) is the probability that we get to observe \( Y \) given that \( X=x \). Think of \( Y \) as something that is expensive to measure. We don't always want to measure it. So we make a random decision about whether to measure it. And we let the probability of measuring \( Y \) be a function \( \pi(x) \) of \( x \). And we get to construct this function.

Let \( \delta>0 \) be a known, small, positive number. We will assume that \[ \pi(x)\geq \delta \] for all \( x \).

The only thing in the the model we don't know is the function \( \theta(x) \). Again, we will assume that \[ \delta \leq \theta(x) \leq 1-\delta. \] Let \( \Theta \) denote all measurable functions on \( [0,1]^d \) that satisfy the above conditions. The parameter space is the set of functions \( \Theta \).

Let \( {\cal P} \) be the set of joint distributions of the form \[ p(x) \, \pi(x)^r (1-\pi(x))^{1-r}\, \theta(x)^y (1-\theta(x))^{1-y} \] where \( p(x)=1 \), and \( \pi(\cdot) \) and \( \theta(\cdot) \) satisfy the conditions above. So far, we are considering the sub-model \( {\cal P}_\pi \) in which \( \pi \) is known.

The parameter of interest is \( \psi = P(Y=1) \). We can write this as \[ \psi = P(Y=1)= \int_{[0,1]^d} P(Y=1|X=x) p(x) dx = \int_{[0,1]^d} \theta(x) dx. \] Hence, \( \psi \) is a function of \( \theta \). If we know \( \theta(\cdot ) \) then we can compute \( \psi \).

2. Frequentist Analysis

The usual frequentist estimator is the Horwitz-Thompson estimator \[ \hat\psi = \frac{1}{n}\sum_{i=1}^n \frac{ Y_i R_i}{\pi(X_i)}. \] It is easy to verify that \( \hat\psi \) is unbiased and consistent. Furthermore, \( \hat\psi - \psi = O_P(n^{-\frac{1}{2}}) \). In fact, let us define \[ I_n = [\hat\psi - \epsilon_n,\ \hat\psi + \epsilon_n] \] where \[ \epsilon_n = \sqrt{\frac{1}{2n\delta^2}\log\left(\frac{2}{\alpha}\right)}. \] It follows from Hoeffding's inequality that \[ \sup_{P\in{\cal P}_\pi} P(\psi \in I_n)\geq 1-\alpha \] Thus we have a finite sample, \( 1-\alpha \) confidence interval with length \( O(1/\sqrt{n}) \).

Remark: We are mentioning the Horwitz-Thompson estimator because it is simple. In practice, it has three deficiencies:

It may exceed 1.
It ignores data on the multivariate vector \( X \) except for the one dimensional summary \( \pi(X) \).
It can be very inefficient.

These problems are remedied by using an improved version of the Horwitz-Thompson estimator. One choice is the so-called locally semiparametric efficient regression estimator (Scharfstein et al., 1999): \[ \hat\psi = \int {\rm expit}\left(\sum_{m=1}^k \hat\eta_m \phi_m(x) + \frac{\hat\omega}{\pi(x)}\right)dx \] where \( {\rm expit}(a) = e^a/(1+e^a) \), \( \phi_{m}\left( x\right) \) are basis functions, and \( \hat\eta_1,\ldots,\hat\eta_k,\hat\omega \) are the mle's (among subjects with \( R_i=1 \)) in the model \[ \log\left( \frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right) = \sum_{m=1}^k \eta_m \phi_m(x) + \frac{\omega}{\pi(x)}. \] Here \( k \) can increase slowly with \( n. \) Recently even more efficient estimators have been derived. Rotnitzky et al (2012) provides a review. In the rest of this post, when we refer to the Horwitz-Thompson estimator, the reader should think ``improved Horwitz-Thompson estimator.'' End Remark.

3. Bayesian Analysis

To do a Bayesian analysis, we put some prior \( W \) on \( \Theta \). Next we compute the likelihood function. The likelihood for one observation takes the form \( p(x) p(r|x) p(y|x)^r \). The reason for having \( r \) in the exponent is that, if \( r=0 \), then \( y \) is not observed so the \( p(y|x) \) gets left out. The likelihood for \( n \) observations is \[ \prod_{i=1}^n p(X_i) p(R_i|X_i) p(Y_i|X_i)^{R_i} = \prod_i \pi(X_i)^{R_i} (1-\pi(X_i))^{1-R_i}\, \theta(X_i)^{Y_i R_i} (1-\theta(X_i))^{(1-Y_i)R_i}. \] where we used the fact that \( p(x)=1 \). But remember, \( \pi(x) \) is known. In other words, \( \pi(X_i)^{R_i} (1-\pi(X_i))^{1-R_i} \) is known. So, the likelihood is \[ {\cal L} (\theta) \propto \prod_i \theta(X_i)^{Y_i R_i} (1-\theta(X_i))^{(1-Y_i)R_i}. \]

Combining this likelihood with the prior \( W \) creates a posterior distribution on \( \Theta \) which we will denote by \( W_n \). Since the parameter of interest \( \psi \) is a function of \( \theta \), the posterior \( W_n \) for \( \theta \) defines a posterior distribution for \( \psi \).

Now comes the interesting part. The likelihood has essentially no information in it.

To see that the likelihood has no information, consider a simpler case where \( \theta(x) \) is a function on \( [0,1] \). Now discretize the interval into many small bins. Let \( B \) be the number of bins. We can then replace the function \( \theta \) with a high-dimensional vector \( \theta = (\theta_1,\ldots, \theta_B) \). With \( n < B \), most bins are empty. The data contain no information for most of the \( \theta_j \)'s. (You might wonder about the effect of putting a smoothness assumption on \( \theta(\cdot ) \). We'll discuss this in Section 4.)

We should point out that if \( \pi(x) = 1/2 \) for all \( x \), then Ericson (1969) showed that a certain exchangeable prior gives a posterior that, like the Horwitz-Thompson estimator, converges at rate \( O(n^{-1/2}) \). However we are interested in the case where \( \pi(x) \) is a complex function of \( x \); then the posterior will fail to concentrate around the true value of \( \psi \). On the other hand, a flexible nonparametric prior will have a posterior essentially equal to the prior and, thus, not concentrate around \( \psi \), whenever the prior \( W \) does not depend on the the known function \( \pi(\cdot) \). Indeed, we have the following theorem from Robins and Ritov (1997):

Theorem (Robins and Ritov 1997). Any estimator that is not a function of \( \pi(\cdot) \) cannot be uniformly consistent.

This means that, at no finite sample size, will an estimator \( \hat\psi \) that is not a function of \( \pi \) be close to \( \psi \) for all distributions in \( {\cal P} \). In fact, the theorem holds for a neighborhood around every pair \( (\pi,\theta) \). Uniformity is important because it links asymptotic behavior to finite sample behavior. But when \( \pi \) is known and is used in the estimator (as in the Horwitz-Thompson estimator and its improved versions) we can have uniform consistency.

Note that a Bayesian will ignore \( \pi \) since the \( \pi(X_i)'s \) are just constants in the likelihood. There is an exception: the Bayesian can make the posterior be a function of \( \pi \) by choosing a prior \( W \) that makes \( \theta(\cdot) \) depend on \( \pi(\cdot) \). But this seems very forced. Indeed, Robins and Ritov showed that, under certain conditions, any true subjective Bayesian prior \( W \) must be independent of \( \pi(\cdot) \). Specifically, they showed that once a subjective Bayesian queries the randomizer (who selected \( \pi \)) about the randomizer's reasoned opinions concerning \( \theta (\cdot) \) (but not \( \pi(\cdot) \)) the Bayesian will have independent priors. We note that a Bayesian can have independent priors even when he believes with probabilty 1 that \( \pi \left( \cdot \right) \) and \( \theta \left( \cdot \right) \) are positively correlated as functions of \( x \) i.e. \( \int \theta \left( x\right) \pi \left( x\right) dx>\int \theta \left(x\right) dx \) \( \int \pi \left( x\right) dx. \) Having independent priors only means that learning \( \pi \left(\cdot \right) \) will not change one's beliefs about \( \theta \left( \cdot \right) \). So far, so good. As far as we know, Chris agrees with everything up to this point.

4. Some Bayesian Responses

Chris goes on to raise alternative Bayesian approaches.

The first is to define \[ Z_i = \frac{R_i Y_i}{\pi(X_i)}. \] Note that \( Z_i \in \{0\} \cup [1,\infty) \). Now we ignore (throw away) the original data. Chris shows that we can then construct a model for \( Z_i \) which results in a posterior for \( \psi \) that mimics the Horwitz-Thompson estimator. We'll comment on this below, but note two strange things. First, it is odd for a Bayesian to throw away data. Second, the new data are a function of \( \pi(X_i) \) which forces the posterior to be a function of \( \pi \). But as we noted earlier, when \( \theta \) and \( \pi \) are a priori independent, the \( \pi(X_i)'s \) do not appear in the posterior since they are known constants that drop out of the likelihood.

A second approach (not mentioned explicitly by Chris) which is related to the above idea, is to construct a prior \( W \) that depends on the known function \( \pi \). It can be shown that if the prior is chosen just right then again the posterior for \( \psi \) mimics the (improved) Horwitz-Thompson estimator.

Lastly, Chris notes that the posterior contains no information because we have not enforced any smoothness on \( \theta(x) \). Without smoothness, knowing \( \theta(x) \) does not tell you anything about \( \theta(x+\epsilon) \) (assuming the prior \( W \) does not depend on \( \pi \)).

This is true and better inferences would obtain if we used a prior that enforced smoothness. But this argument falls apart when \( d \) is large. (In fairness to Chris, he was referring to the version from Wasserman (2004) which did not invoke high dimensions.) When \( d \) is large, forcing \( \theta(x) \) to be smooth does not help unless you make it very, very, very smooth. The larger \( d \) is, the more smoothness you need to get borrowing of information across different values of \( \theta(x) \). But this introduces a huge bias which precludes uniform consistency.

5. Response to the Response

We have seen that response 3 (add smoothness conditions in the prior) doesn't work. What about response 1 and response 2? We agree that these work, in the sense that the Bayes answer has good frequentist behavior by mimicking the (improved) Horwitz-Thompson estimator.

But this is a Pyrrhic victory. If we manipulate the data to get a posterior that mimics the frequentist answer, is this really a success for Bayesian inference? Is it really Bayesian inference at all? Similarly, if we choose a carefully constructed prior just to mimic a frequentist answer, is it really Bayesian inference?

We call Bayesian inference which is carefully manipulated to force an answer with good frequentist behavior, frequentist pursuit. There is nothing wrong with it, but why bother?

If you want good frequentist properties just use the frequentist estimator. If you want to be a Bayesian, be a Bayesian but accept the fact that, in this example, your posterior will fail to concentrate around the true value.

6. Summary

In summary, we agree with Chris' analysis. But his fix is just frequentist pursuit; it is Bayesian analysis with unnatural manipulations aimed only at forcing the Bayesian answer to be the frequentist answer. This seems to us to be an admission that Bayes fails in this example.

7. References

Basu, D. (1969). Role of the Sufficiency and Likelihood Principles in Sample Survey Theory. Sankya, 31, 441--454.

Brown, L. D. (1990). An ancillarity paradox which appears in multiple linear regression. The Annals of Statistics, 18, 471-493.

Ericson, W. A. (1969). Subjective Bayesian models in sampling finite populations. Journal of the Royal Statistical Society. Series B, 195-233.

Foster, D. P. and George, E. I. (1996). A simple ancillarity paradox. Scandinavian journal of statistics, 233-242.

Godambe, V. P., and Thompson, M. E. (1976), Philosophy of Survey-Sampling Practice. In Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science, eds. W. L.Harper and A. Hooker, Dordrecht: Reidel.

Robins, J. M. and Ritov, Y. (1997). Toward a Curse of Dimensionality Appropriate (CODA) Asymptotic Theory for Semi-parametric Models. Statistics in Medicine, 16, 285--319.

Robins, J. and Wasserman, L. (2000). Conditioning, likelihood, and coherence: a review of some foundational concepts. Journal of the American Statistical Association, 95, 1340--1346

Rotnitzky, A., Lei, Q., Sued, M. and Robins, J. M. (2012). Improved double-robust estimation in missing data and causal inference models. Biometrika, 99, 439-456.

Scharfstein, D. O., Rotnitzky, A. and Robins, J.M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 1096-1120.

Sims, Christopher. On An Example of Larry Wasserman. Available at: http://www.princeton.edu/~sims/.

Wasserman, L. (2004). All of Statistics: a Concise Course in Statistical Inference. Springer Verlag.

Update, 29 August: Sims responds in the comments over on Larry's blog (multiple times), plus a counter-response from Jamie, a follow-up post from Larry and Jamie, counter-counter-responses from Sims, and more.

Manual trackback: Brad DeLong

Enigmas of Chance; Bayes, Anti-Bayes; Kith and Kin

Posted at August 28, 2012 08:42 | permanent link

Three-Toed Sloth

August 28, 2012