Frequentist Consistency of Bayesian Procedures

27 Feb 2017 16:30

"Bayesian consistency" is usually taken to mean showing that, under Bayesian updating, the posterior probability concentrates on the true model. That is, for every (measurable) set of hypotheses containing the truth, the posterior probability goes to 1. (In practice one shows that the posterior probability of any set not containing the truth goes to zero.) There is a basic result here, due to Doob, which essentially says that the Bayesian learner is consistent, except on a set of data of prior probability zero. That is, the Bayesian is subjectively certain they will converge on the truth. This is not as reassuring as one might wish, and showing Bayesian consistency under the true distribution is harder. In fact, it usually involves assumptions under which non-Bayes procedures will also converge. These are things like the existence of very powerful consistent hypothesis tests (an approach favored by Ghosal, van der Vaart, et al., supposedly going back to Le Cam), or, inspired by learning theory, constraints on the effective size of the hypothesis space which are gradually relaxed as the sample size grows (as in Barron et al.). If these assumptions do not hold, one can construct situations in which Bayesian procedures are inconsistent.

Concentration of the posterior around the truth is only a preliminary. One would also want to know that, say, the posterior mean converges, or even better that the predictive distribution converges. For many finite-dimensional problems, what's called the "Bernstein-von Mises theorem" basically says that the posterior distribution becomes approximately Gaussian and centered on the maximum likelihood estimate, so the posterior mean and the maximum likelihood estimate converge on each other, and if one works the other will too. This can break down for infinite-dimensional problems.
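For reference, one common way of stating the Bernstein-von Mises result is sketched below. The notation is mine, not taken from any particular paper: \Pi(\cdot \mid X_1, \dots, X_n) is the posterior, \hat{\theta}_n the maximum likelihood estimate, \theta_0 the true parameter and I(\theta_0) the Fisher information. The usual regularity conditions (a smooth finite-dimensional model, a prior with positive continuous density at \theta_0, and so on) are assumed rather than spelled out.

```latex
\[
  \left\| \Pi\left(\cdot \mid X_1, \dots, X_n\right)
    - \mathcal{N}\!\left(\hat{\theta}_n, \tfrac{1}{n} I(\theta_0)^{-1}\right)
  \right\|_{\mathrm{TV}}
  \xrightarrow{\; P_{\theta_0} \;} 0 .
\]
% Worked special case: Bernoulli data with k successes in n trials
% and a Beta(a, b) prior. The posterior is Beta(a + k, b + n - k), so
\[
  \underbrace{\frac{a + k}{a + b + n}}_{\text{posterior mean}}
  - \underbrace{\frac{k}{n}}_{\text{MLE}}
  = \frac{n a - k (a + b)}{n (a + b + n)},
  \qquad
  \left| \frac{a + k}{a + b + n} - \frac{k}{n} \right|
  \le \frac{\max(a, b)}{a + b + n} .
\]
```

In the Beta-Bernoulli case the gap between the posterior mean and the MLE is thus O(1/n) whatever the data, which is smaller than the O(1/\sqrt{n}) sampling error of either one.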

(PAC-Bayesian results don't fit into this picture particularly neatly. Essentially, they say that if you find a set of classifiers which all classify correctly in-sample, and ask about the average out-of-sample performance, the bounds on the latter are tighter for big sets than for small ones. This is for the unmysterious reason that it takes a bigger coincidence for many bad classification rules to happen to all work on the training data than for a few bad rules to get lucky. The actual Bayesian machinery of posterior updating doesn't really come into play, at least not in the papers I've seen.)

I believe I have contributed a Result to this area, on what happens when the data are dependent and all the models are mis-specified, but some are more mis-specified than others. This turns on realizing that Bayesian updating is just a special case of evolutionary search, i.e., an infinite-dimensional stochastic replicator equation.
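The analogy with the replicator equation can be written out in a couple of lines. What follows is a sketch in my own notation, for a countable hypothesis space to keep the sums simple: \pi_t(\theta) is the posterior weight of hypothesis \theta after t observations, and L_\theta(x_{t+1} \mid x_1, \dots, x_t) is the conditional likelihood \theta assigns to the next observation given the past (so dependent data are allowed).

```latex
% Bayesian updating, one observation at a time:
\[
  \pi_{t+1}(\theta)
  = \pi_t(\theta)\,
    \frac{L_\theta(x_{t+1} \mid x_1, \dots, x_t)}
         {\sum_{\theta'} \pi_t(\theta')\, L_{\theta'}(x_{t+1} \mid x_1, \dots, x_t)} .
\]
% Discrete-time replicator dynamics for types i with frequencies p_t(i)
% and fitnesses f_t(i):
\[
  p_{t+1}(i) = p_t(i)\, \frac{f_t(i)}{\bar{f}_t},
  \qquad
  \bar{f}_t = \sum_{j} p_t(j)\, f_t(j) .
\]
```

Reading hypotheses as types, posterior weights as population frequencies, and conditional likelihoods as fitnesses turns the first line into the second. The dynamics are stochastic because the fitnesses depend on the random data, and there is no mutation term: a hypothesis with zero prior weight can never gain any.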

Query: are there any situations where Bayesian methods are consistent but no non-Bayesian method is? (My recollection is that John Earman, in Bayes or Bust, provides a negative answer, but I forget how.)