February 21, 2016

On the Uncertainty of the Bayesian Estimator

Attention conservation notice: A failed attempt at a dialogue, combining the philosophical sophistication and easy approachability of statistical theory with the mathematical precision and practical application of epistemology, dragged out for 2500+ words (and equations). You have better things to do than read me vent about manuscripts I volunteered to referee.

Scene: We ascend, by a dubious staircase, to the garret loft space of Confectioner-Stevedore Hall, at Robberbaron-Bloodmoney University, where we find two fragments of the author's consciousness, temporarily incarnated as perpetual post-docs from the Department of Statistical Data Science, sharing an unheated office.

Q: Are you unhappy with the manuscript you're reviewing?

A: Yes, but I don't see why you care.

Q: The stabbing motions of your pen are both ostentatious and distracting. If I listen to you rant about it, will you go back to working without the semaphore?

A: I think that just means you find it easier to ignore my words than anything else, but I'm willing to try.

Q: So, what is getting you worked up about the manuscript?

A: They take a perfectly reasonable — though not obviously appropriate-to-the-problem — regularized estimator, and then go through immense effort to Bayesify it. They end up with about seven levels of hierarchical priors. Simple Metropolis-Hastings Monte Carlo would move as slowly as a continental plate, so they put vast efforts into speeding it up, and in a real technical triumph they get something which moves like a glacier.

Q: Isn't that rather fast these days?

A: If they try to scale up, my back-of-the-envelope calculation suggests they really will enter the regime where each data set will take a single Ph.D. thesis to analyze.

Q: So do you think that they're just masochists who're into frequentist pursuit, or do they have some reason for doing all these things that annoy you?

A: Their fondness for tables over figures does give me pause, but no, they claim to have a point. If they do all this work, they say, they can use their posterior distributions to quantify uncertainty in their estimates.

Q: That sounds like something statisticians should want to do. Haven't you been very pious about just that, about how handling uncertainty is what really sets statistics apart from other traditions of data analysis? Haven't I heard you say to students that they don't know anything until they know where the error bars go?

A: I suppose I have, though I don't recall that exact phrase. It's not the goal I object to, it's the way quantification of uncertainty is supposed to follow automatically from using Bayesian updating.

Q: You have to admit, the whole "posterior probability distribution over parameter values" thing certainly looks like a way of expressing uncertainty in quantitative form. In fact, last time we went around about this, didn't you admit that Bayesian agents are uncertain about parameters, though not about the probabilities of observable events?

A: I did, and they are, though that's very different from agreeing that they quantify uncertainty in any useful way — that they handle uncertainty well.

Q: Fine, I'll play the straight man and offer a concrete proposal for you to poke holes in. Shall we keep it simple and just consider parametric inference?

A: By all means.

Q: Alright, then, I start with some prior probability distribution over a finite-dimensional vector-valued parameter \( \theta \), say with density \( \pi(\theta) \). I observe \( x \) and have a model which gives me the likelihood \( L(\theta) = p(x;\theta) \), and then my posterior distribution is fixed by \[ \pi(\theta|X=x) \propto L(\theta) \pi(\theta) \] This is my measure-valued estimate. If I want a set-valued estimate of \( \theta \), I can fix a level \( \alpha \) and choose a region \( C_{\alpha} \) with \[ \int_{C_{\alpha}}{\pi(\theta|X=x) d\theta} = \alpha \] Perhaps I even preferentially grow \( C_{\alpha} \) around the posterior mode, or something like that, so it looks pretty. How is \( C_{\alpha} \) not a reasonable way of quantifying my uncertainty about \( \theta \)?
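
[Aside, for concreteness: here is a minimal sketch of Q's recipe in a conjugate normal-normal model, where the posterior is available in closed form. The numbers are made up for illustration and have nothing to do with the manuscript A is refereeing.]

```python
import numpy as np
from scipy import stats

# Minimal sketch of Q's recipe in a conjugate normal-normal model
# (illustrative choices only).
# Prior: theta ~ N(mu0, tau0^2); likelihood: X_i ~ N(theta, sigma^2), sigma known.
mu0, tau0 = 0.0, 1.0                 # prior mean and standard deviation
sigma = 1.0                          # known noise standard deviation
x = np.array([0.8, 1.3, 0.5, 1.1])   # observed data (made up)
n = len(x)

# Posterior is again normal, by the usual precision-weighted combination.
post_prec = 1.0 / tau0**2 + n / sigma**2
post_var = 1.0 / post_prec
post_mean = post_var * (mu0 / tau0**2 + x.sum() / sigma**2)

# A central credible region C_alpha: the interval with posterior probability alpha,
# grown symmetrically around the posterior mean (here also the posterior mode).
alpha = 0.95
lo, hi = stats.norm.ppf([(1 - alpha) / 2, (1 + alpha) / 2],
                        loc=post_mean, scale=np.sqrt(post_var))
print(f"posterior N({post_mean:.3f}, {post_var:.3f}); "
      f"{alpha:.0%} credible interval = ({lo:.3f}, {hi:.3f})")
```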

A: To begin with, I don't know the probability that the true \( \theta \in C_{\alpha} \).

Q: How is it not \( \alpha \), like it says right there on the label?

A: Again, I don't understand what that means.

Q: Are you attacking subjective probability? Is that where this is going? OK: sometimes, when a Bayesian agent and a bookmaker love each other very much, the bookie will offer the Bayesian bets on whether \( \theta \in C_{\alpha} \), and the agent will be indifferent so long as the odds are \( \alpha : 1-\alpha \). And even if the bookie is really a damn dirty Dutch gold-digger, the agent can't be pumped dry of money. What part of this do you not understand?

A: I hardly know where to begin. I will leave aside the color commentary. I will leave aside the internal issues with Dutch book arguments for conditionalization. I will not pursue the fascinating, even revealing idea that something which is supposedly a universal requirement of rationality needs such very historically-specific institutions and ideas as money and making book and betting odds for its expression. The important thing is that you're telling me that \( \alpha \), the level of credibility or confidence, is really about your betting odds.

Q: Yes, and?

A: I do not see why I should care about the odds at which you might bet. It's even worse than that, actually: I do not see why I should care about the odds at which a machine you programmed with the saddle-blanket prior (or, if we were doing nonparametrics, an Afghan jirga process prior) would bet. I fail to see how those odds help me learn anything about the world, or even reasonably-warranted uncertainties in inferences about the world.

Q: May I indulge in mythology for a moment?

A: Keep it clean, students may come by.

Q: That leaves out all the best myths, but very well. Each morning, when woken by rosy-fingered Dawn, the goddess Tyche picks \( \theta \) from (what else?) an urn, according to \( \pi(\theta) \). Tyche then draws \( x \) from \( p(X;\theta) \), and \( x \) is revealed to us by the Sibyl or the whisper of oak leaves or sheep's livers. Then we calculate \( \pi(\theta|X=x) \) and \( C_{\alpha} \). In consequence, the fraction of days on which \( \theta \in C_{\alpha} \) is about \( \alpha \). \( \alpha \) is how often the credible set is right, and \( 1-\alpha \) is one of those error rates you like to go on about. Does this myth satisfy you?

A: Not really. I get that "Bayesian analysis treats the parameters as random". In fact, that myth suggests a very simple yet universal Monte Carlo scheme for sampling from any posterior distribution whatsoever, without any Markov chains or burn-in.

Q: Can you say more?

A: I should actually write it up. But now let's try to de-mythologize. I want to know what happens if we get rid of Tyche, or at least demote her from resetting \( \theta \) every day to just picking \( x \) from \( p(x;\theta) \), with \( \theta \) fixed by Zeus.
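
[Aside: A never spells the scheme out, but presumably (this is a guess on my part) it is plain rejection sampling driven by the myth itself: simulate Tyche's whole morning, drawing \( \theta \) from the prior and then a synthetic data set from the model, and keep \( \theta \) whenever the synthetic data exactly match the observed \( x \). For discrete data this gives exact posterior draws with no Markov chain and no burn-in, at the price of ruinously many rejections whenever the data are at all informative. A sketch, under that assumption:]

```python
import numpy as np

rng = np.random.default_rng(0)

def tyche_sampler(x_obs, draw_prior, draw_data, n_draws=1000, max_tries=10_000_000):
    """Exact posterior sampling by re-enacting the myth: draw theta from the prior,
    draw synthetic data from the model, and keep theta only when the synthetic data
    reproduce the observed data exactly.  Only practical for small discrete data sets."""
    kept, tries = [], 0
    while len(kept) < n_draws and tries < max_tries:
        theta = draw_prior()
        if np.array_equal(draw_data(theta), x_obs):
            kept.append(theta)
        tries += 1
    return np.array(kept)

# Toy example: Beta(2, 2) prior on a coin's bias, 10 flips observed, 7 heads.
x_obs = 7
posterior_draws = tyche_sampler(
    x_obs,
    draw_prior=lambda: rng.beta(2, 2),
    draw_data=lambda theta: rng.binomial(10, theta),
)
# These should agree with the closed-form Beta(2 + 7, 2 + 3) posterior.
print(posterior_draws.mean(), (2 + 7) / (2 + 7 + 2 + 3))
```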

Q: I think you mean Ananke; Zeus would meddle with the parameters to cheat on Hera. Anyway, what do you think happens?

A: Well, \( C_{\alpha} \) depends on the data, it's really \( C_{\alpha}(x) \). Since \( x \) is random, \( X \sim p(\cdot;\theta) \), so is \( C_{\alpha} \). It follows a distribution of its own, and we can ask about \( Pr_{\theta}(\theta \in C_{\alpha}(X) ) \).

Q: Haven't we just agreed that that probability is just \( \alpha \) ?

A: No, we've seen that \[ \int{Pr_{\theta}(\theta \in C_{\alpha}(X) ) \pi(\theta) d\theta} = \alpha \] but that is a very different thing.

Q: How different could it possibly be?

A: As different as we like, at any particular \( \theta \).

Q: Could the 99% credible sets contain \( \theta \) only, say, 1% of the time?

A: Absolutely. This is the scenario of Larry's playlet, but he wrote that up because it actually happened in a project he was involved in.
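
[Aside: a quick simulation in the spirit of that scenario, though it is my own toy construction and not Larry's actual example. Fix \( \theta \), hand the analyst a prior concentrated well away from it, and watch the frequentist coverage of the nominal 99% credible intervals collapse.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy construction: a prior concentrated far from the fixed truth makes
# credible intervals miss badly, however well they average over the prior.
theta_true = 3.0            # fixed by Zeus (or Ananke)
mu0, tau0 = 0.0, 0.1        # prior N(0, 0.1^2): confidently wrong
sigma, n = 1.0, 10          # X_i ~ N(theta, 1), ten observations per "day"
alpha = 0.99

covered, n_days = 0, 5000
for _ in range(n_days):
    x = rng.normal(theta_true, sigma, size=n)
    post_prec = 1 / tau0**2 + n / sigma**2
    post_var = 1 / post_prec
    post_mean = post_var * (mu0 / tau0**2 + x.sum() / sigma**2)
    lo, hi = stats.norm.ppf([(1 - alpha) / 2, (1 + alpha) / 2],
                            loc=post_mean, scale=np.sqrt(post_var))
    covered += (lo <= theta_true <= hi)

print(f"nominal level {alpha:.0%}, actual coverage at theta={theta_true}: "
      f"{covered / n_days:.1%}")
```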

Q: Isn't it a bit artificial to worry about the long-run proportion of the time you're right about parameters?

A: The same argument works if you estimate many parameters at once. When the brain-imaging people do fMRI experiments, they estimate how tens of thousands of little regions in the brain ("voxels") respond to stimuli. That means estimating tens of thousands of parameters. I don't think they'd be happy if their 99% intervals turned out to contain the right answer for only 1% of the voxels. But posterior betting odds don't have to have anything to do with how often bets are right, and usually they don't.

Q: Isn't "usually" very strong there?

A: No, I don't think so. D. A. S. Fraser has a wonderful paper, which should be better known, called "Is Bayes Posterior Just Quick and Dirty Confidence", and his answer to his own question is basically "Yes. Yes it is." More formally, he shows that the conditions for Bayesian credible sets to have correct coverage, to be confidence sets, are incredibly restrictive.

Q: But what about the Bernstein-von Mises theorem? Doesn't it say we don't have to worry for big samples, that credible sets are asymptotically confidence sets?

A: Not really. It says that if you have a fixed-dimensional model, and the usual regularity conditions for maximum likelihood estimation hold, so that \( \hat{\theta}_{MLE} \rightsquigarrow \mathcal{N}(\theta, n^{-1}I^{-1}(\theta)) \), and some more regularity conditions hold, then the posterior distribution is also asymptotically \( \mathcal{N}(\theta, n^{-1}I^{-1}(\theta)) \).
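
[Aside: a numerical illustration of what that convergence looks like in the friendliest possible case, a one-parameter Bernoulli model with a Beta prior; the counts are made up. For large \( n \), the exact Beta posterior and the normal approximation centered at the MLE with variance \( \hat{\theta}(1-\hat{\theta})/n \) give essentially the same quantiles.]

```python
import numpy as np
from scipy import stats

# Bernstein-von Mises in the friendliest case: Bernoulli likelihood, Beta(2, 2) prior.
# (Toy illustration, not one of the regimes where the theorem breaks down.)
n, k = 10_000, 6_130          # n coin flips, k heads (made-up numbers)
theta_hat = k / n             # MLE
se = np.sqrt(theta_hat * (1 - theta_hat) / n)   # (n I(theta_hat))^{-1/2} for Bernoulli

exact = stats.beta(2 + k, 2 + n - k)            # exact posterior
approx = stats.norm(theta_hat, se)              # BvM normal approximation

for q in (0.005, 0.025, 0.5, 0.975, 0.995):
    print(f"q={q:5.3f}  exact={exact.ppf(q):.5f}  BvM approx={approx.ppf(q):.5f}")
```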

Q: Wait, so the theorem says that when it applies, if I want to be Bayesian I might as well just skip all the MCMC and maximize the likelihood?

A: You might well think that. You might very well think that. I couldn't possibly comment.

Q: !?!

A: Except to add that the theorem breaks down in the high-dimensional regime where the number of parameters grows with the number of samples, and goes to hell in the non-parametric regime of infinite-dimensional parameters. (In fact, Fraser gives one-dimensional examples where the mis-match between Bayesian credible levels and actual coverage is asymptotically \( O(1) \).) As Freedman said, if you want a confidence set, you need to build a confidence set, not mess around with credible sets.

Q: But surely coverage — "confidence" — isn't all that's needed? Suppose I have only a discrete parameter space, and for each point I flip a coin which comes up heads with probability \( \alpha \). Now my \( C_{\alpha} \) is all the parameter points where the coin came up heads. Its expected coverage is \( \alpha \), as claimed. In fact, if I can come up with a Gygax test, say using the low-significance digits of \( x \), I could invert that to get my confidence set, and get coverage of \( \alpha \) exactly. What then?
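
[Aside: to make Q's construction concrete, here is a toy sketch on a discrete parameter grid; "Gygax set" is of course Q's joke, not standard terminology. Each candidate value of \( \theta \) is included with probability \( \alpha \), independently of the data, so coverage is \( \alpha \) while the set never narrows, no matter how much data arrive.]

```python
import numpy as np

rng = np.random.default_rng(2)

# Q's "Gygax set" on a discrete parameter grid: include each candidate value of
# theta with probability alpha, independently of the data.
theta_grid = np.arange(0, 101)      # candidate parameter values
theta_true = 37                     # whichever one Ananke fixed
alpha = 0.95

def gygax_set(grid, alpha):
    return grid[rng.random(len(grid)) < alpha]

# Coverage is alpha by construction...
hits = np.mean([theta_true in gygax_set(theta_grid, alpha) for _ in range(10_000)])
# ...but the set's size never shrinks, however much data we collect:
sizes = [len(gygax_set(theta_grid, alpha)) for _ in range(10_000)]
print(f"coverage ~ {hits:.3f}, mean set size ~ {np.mean(sizes):.1f} "
      f"of {len(theta_grid)} points")
```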

A: I never said that coverage was all we needed from a set-valued estimator. It should also be consistent: as we get more data, the set should narrow in on \( \theta \), no matter what \( \theta \) happens to be. Your Gygax sets won't do that. My point is that if you're going to use probabilities, they ought to mean something, not just refer to some imaginary gambling racket going on in your head.

Q: I am not going to let this point go so easily. It seems like you're insisting on calibration for Bayesian credible sets, that the fraction of them covering the truth be (about) the stated probability, right?

A: That seems like a pretty minimal requirement for treating supposed probabilities seriously. If (as on Discworld) "million to one chances turn up nine times out of ten", they're not really million to one.

Q: Fine — but isn't the Bayesian agent calibrated with probability 1?

A: With subjective probability 1. But failure of calibration is actually typical or generic, in the topological sense.

Q: But maybe the world we live in isn't "typical" in that weird sense of the topologists?

A: Maybe! The fact that Bayesian agents put probability 1 on the "meager" set of sample paths where they are calibrated implies that lots of stochastic processes are supported on topologically-atypical sets of paths. But now we're leaning a lot on a pre-established harmony between the world and our prior-and-model.

Q: Let me take another tack. What if calibration typically fails, but typically fails just a little — say probabilities are really \( p(1 \pm \epsilon ) \) when we think they're \( p \). Would you be very concerned, if \( \epsilon \) were small enough?

A: Honestly, no, but I have no good reason to think that, in general, approximate calibration or coverage is much more common than exact calibration. Anyway, we know that credible probabilities can be radically, dismally off as coverage probabilities, so it seems like a moot point.

Q: So what sense do you make of the uncertainties which come out of Bayesian procedures?

A: "If we started with a population of guesses distributed like this, and then selectively bred them to match the data, here's the dispersion of the final guesses."

Q: You don't think that sounds both thin and complicated?

A: Of course it's both. (And it only gets more complicated if I explain "selective breeding" and "matching the data".) But it's the best sense I can make, these days, of Bayesian uncertainty quantification as she is computed.

Q: And what's your alternative?

A: I want to know about how differently the experiment, the estimate, could have turned out, even if the underlying reality were the same. Standard errors — or median absolute errors, etc. — and confidence sets are about that sort of uncertainty, about re-running the experiment. You might mess up, because your model is wrong, but at least there's a sensible notion of probability in there, referring to things happening in the world. The Bayesian alternative is some sort of sub-genetic-algorithm evolutionary optimization routine you are supposedly running in your mind, while I run a different one in my mind, etc.

Q: But what about all the criticisms of p-values and null hypothesis significance tests and so forth?

A: They all have Bayesian counterparts, as people like Andy Gelman and Christian Robert know very well. The difficulties aren't about not being Bayesian, but about things like testing stupid hypotheses, not accounting for multiple testing or model search, selective reporting, insufficient communication, etc. But now we're in danger of drifting really far from our starting point about uncertainty in estimation.

Q: Would you sum that up then?

A: I don't believe the uncertainties you get from just slapping a prior on something, even if you've chosen your prior so the MAP or the posterior mean matches some reasonable penalized estimator. Give me some reason to think that your posterior probabilities have some contact with reality, or I'll just see them as "quick and dirty confidence" — only often not so quick and very dirty.

Q: Is that what you're going to put in your referee report?

A: I'll be more polite.

Disclaimer: Not a commentary on any specific talk, paper, or statistician. One reason this is a failed attempt at a dialogue is that there is more Q could have said in defense of the Bayesian approach, or at least in objection to A. (I take some comfort in the fact that it's traditional for characters in dialogues to engage in high-class trolling.) Also, the non-existent Robberbaron-Bloodmoney University is not to be confused with the very real Carnegie Mellon University; for instance, the latter lacks a hyphen.

Manual trackback: Source-Filter.

Bayes, anti-Bayes; Enigmas of Chance; Dialogues
