
## Frequentist Consistency of Bayesian Procedures

15 Jan 2016 12:58

"Bayesian consistency" is usually taken to mean showing that, under Bayesian updating, the posterior probability concentrates on the true model. That is, for every (measurable) set of hypotheses containing the truth, the posterior probability goes to 1. (In practice one shows that the posterior probability of any set not containing the truth goes to zero.) There is a basic result here, due to Doob, which essentially says that the Bayesian learner is consistent, except on a set of data of prior probability zero. That is, the Bayesian is subjectively certain they will converge on the truth. This is not as reassuring as one might wish, and showing Bayesian consistency under the true distribution is harder. In fact, it usually involves assumptions under which non-Bayes procedures will also converge. These are things like the existence of very powerful consistent hypothesis tests (an approach favored by Ghosal, van der Vaart, et al., supposedly going back to Le Cam), or, inspired by learning theory, constraints on the effective size of the hypothesis space which are gradually relaxed as the sample size grows (as in Barron et al.). If these assumptions do not hold, one can construct situations in which Bayesian procedures are inconsistent.

Concentration of the posterior around the truth is only a preliminary. One would also want to know that, say, the posterior mean converges, or even better that the predictive distribution converges. For many finite-dimensional problems, what's called the "Bernstein-von Mises theorem" basically says that the posterior mean and the maximum likelihood estimate converge, so if one works the other will too. This breaks down for infinite-dimensional problems.
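The simplest finite-dimensional illustration of the posterior mean and the maximum likelihood estimate converging is the conjugate Beta-Bernoulli model, where both have closed forms. The sketch below assumes a Beta(a, b) prior and a hypothetical data set with 30% successes; the function name is made up for the example. It only compares point estimates, which is much less than the full Bernstein-von Mises theorem (that concerns the shape of the whole posterior).

```python
from fractions import Fraction

def posterior_mean_and_mle(k, n, a=1, b=1):
    """Beta(a, b) prior with k successes in n Bernoulli trials.

    Returns (posterior mean, maximum likelihood estimate) as exact fractions.
    """
    post_mean = Fraction(k + a, n + a + b)   # mean of Beta(a + k, b + n - k)
    mle = Fraction(k, n)                     # sample proportion
    return post_mean, mle

for n in (10, 100, 1000, 10000):
    k = (3 * n) // 10                        # hypothetical data: 30% successes
    pm, mle = posterior_mean_and_mle(k, n)
    print(n, float(pm), float(mle), float(abs(pm - mle)))
```

Under the uniform prior (a = b = 1) the gap is exactly |n - 2k| / (n(n + 2)), which is at most 1/(n + 2) whatever the data, so the two estimates are forced together at rate 1/n.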

(PAC-Bayesian results don't fit into this picture particularly neatly. Essentially, they say that if you find a set of classifiers which all classify correctly in-sample, and ask about the average out-of-sample performance, the bounds on the latter are tighter for big sets than for small ones. This is for the unmysterious reason that it takes a bigger coincidence for many bad classification rules to happen to all work on the training data than for a few bad rules to get lucky. The actual Bayesian machinery of posterior updating doesn't really come into play, at least not in the papers I've seen.)
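The "bigger coincidence" argument can be made quantitative with nothing more than Markov's inequality. The following is a sketch under assumptions that are not in the text above (IID training examples, a fixed prior P over classifiers, an error threshold ε); it is not McAllester's actual theorem, whose bound has a different and sharper form, but it has the same qualitative feature that the guarantee improves as the prior mass P(U) of the consistent set grows.

```latex
% Assumptions (for illustration only): m IID training examples S, a prior P on
% classifiers h, true error rate err(h), thresholds \epsilon, \delta \in (0,1).
\begin{align*}
B_\epsilon &= \{ h : \mathrm{err}(h) \ge \epsilon \}, \qquad
C_S = \{ h : h \text{ classifies all of } S \text{ correctly} \}, \\
\mathbb{E}_S\!\left[ P(B_\epsilon \cap C_S) \right]
  &= \int_{B_\epsilon} \bigl(1 - \mathrm{err}(h)\bigr)^m \, dP(h) \;\le\; (1-\epsilon)^m, \\
\Pr_S\!\left( P(B_\epsilon \cap C_S) \ge \frac{(1-\epsilon)^m}{\delta} \right)
  &\le \delta \quad \text{(Markov's inequality)}, \\
\frac{1}{P(U)} \int_U \mathrm{err}(h)\, dP(h)
  &\le \epsilon + \frac{(1-\epsilon)^m}{\delta\, P(U)}
  \quad \text{for every measurable } U \subseteq C_S \text{ with } P(U) > 0 .
\end{align*}
```

The last line holds simultaneously for all such U on an event of probability at least 1 - δ: the bad rules that survive the training data have small total prior mass, so they can only make up a small fraction of a large consistent set.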

I believe I have contributed a Result to this area, on what happens when the data are dependent and all the models are mis-specified, but some are more mis-specified than others. This turns on realizing that Bayesian updating is just a special case of evolutionary search, i.e., an infinite-dimensional stochastic replicator equation.
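The link to replicator dynamics is easiest to see in discrete time with finitely many models: Bayes's rule multiplies each model's weight by its likelihood for the new observation (its "fitness") and divides by the weighted average likelihood (the "mean fitness"). The sketch below is a toy version under strong assumptions that the result described above does not need: IID coin flips, three hypothetical Bernoulli models, none of which is the truth. The names `bayes_as_replicator` and `kl_bernoulli` are invented for the example.

```python
import numpy as np

def bayes_as_replicator(weights, fitness):
    """One Bayesian update written as a discrete-time replicator step.

    weights: current posterior weights over the models
    fitness: likelihood each model assigns to the newest observation
    """
    mean_fitness = np.dot(weights, fitness)   # predictive probability of the observation
    return weights * fitness / mean_fitness

def kl_bernoulli(p, q):
    """Kullback-Leibler divergence from Bernoulli(p) to Bernoulli(q), in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rng = np.random.default_rng(1)
true_p = 0.4                                  # the truth is not among the models
models = np.array([0.2, 0.5, 0.9])            # all mis-specified, to different degrees
w = np.full(len(models), 1.0 / len(models))   # uniform prior

for x in rng.random(5000) < true_p:
    fitness = np.where(x, models, 1 - models)
    w = bayes_as_replicator(w, fitness)

print("KL from truth:", np.round(kl_bernoulli(true_p, models), 4))
print("final weights:", np.round(w, 4))
```

In this toy run the weight should pile up on the model with bias 0.5, the one with the smallest divergence from the truth: the least mis-specified model has the highest long-run growth rate and so takes over the population.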

Query: are there any situations where Bayesian methods are consistent but no non-Bayesian method is? (My recollection is that John Earman, in Bayes or Bust, provides a negative answer, but I forget how.)

Recommended:
• Andrew Barron, Mark J. Schervish and Larry Wasserman, "The Consistency of Posterior Distributions in Nonparametric Problems", Annals of Statistics 27 (1999): 536--561 [While I am biased — Mark and Larry are senior faculty here — I think this is definitely one of the best-written papers on the topic.]
• Gordon Belot
• Robert H. Berk [Old but quite nice papers on the effect of mis-specification, though with IID data assumed, and stronger assumptions about the models than modern writers are comfortable with.]
• David Blackwell and Lester Dubins, "Merging of Opinions with Increasing Information", Annals of Mathematical Statistics 33 (1962): 882--886
• Taeryon Choi, R. V. Ramamoorthi, "Remarks on consistency of posterior distributions", arxiv:0805.3248
• Ronald Christensen, "Inconsistent Bayesian Estimation", Bayesian Analysis 4 (2009): 413--416 [An extremely simple example of how inconsistency can be generated]
• Dennis D. Cox, "An Analysis of Bayesian Inference for Nonparametric Regression", Annals of Statistics 21 (1993): 903--923
• Persi Diaconis and David Freedman, "On the Consistency of Bayes Estimates", The Annals of Statistics 14 (1986): 1--26 [With accompanying discussion; the latter is worth reading if only to fully savor the academic snark in Diaconis and Freedman's reply.]
• David Freedman, "On the Bernstein-von Mises Theorem with Infinite-Dimensional Parameters", Annals of Statistics 27 (1999): 1119--1140 [As you know, Bob, the Bernstein-von Mises theorem asserts that, "under the usual conditions", in the large sample limit the distribution of the maximum likelihood estimate is basically the same as the Bayesian posterior distribution, so you can take credible intervals as approximate confidence intervals and vice versa. It turns out that the usual conditions can fail drastically even for very simple infinite-dimensional problems.]
• Subhashis Ghosal, "A review of consistency and convergence rates of posterior distribution" [PDF]
• Subhashis Ghosal, Jayanta K. Ghosh and R. V. Ramamoorthi, "Consistency Issues in Bayesian Nonparametrics" [Review of the IID case, on Ghosal's website someplace]
• Subhashis Ghosal, Jayanta K. Ghosh and Aad W. van der Vaart, "Convergence Rates of Posterior Distributions", Annals of Statistics 28 (2000): 500--531
• Subhashis Ghosal and Yongqiang Tang, "Bayesian Consistency for Markov Processes", Sankhya 68 (2006): 227--239 [This is slick, but I think the cuteness of the proof of the main theorem is achieved at the cost of the ugliness of verifying the main conditions, as in their example. (That may just be jealousy speaking.)] [PDF]
• Subhashis Ghosal and Aad van der Vaart, "Convergence Rates of Posterior Distributions for Non-IID Observations", Annals of Statistics 35 (2007): 192--223
• J. K. Ghosh and R. V. Ramamoorthi, Bayesian Nonparametrics [Mini-review]
• Peter Grünwald, "Bayesian Inconsistency under Misspecification" [PDF preprint of talk given at the Valencia 8 meeting in 2006]
• Peter Grünwald and John Langford, "Suboptimal behavior of Bayes and MDL in classification under misspecification", Machine Learning 66 (2007): 119--149 [PDF reprint via Prof. Grünwald]
• B. J. K. Kleijn and A. W. van der Vaart, "Misspecification in infinite-dimensional Bayesian statistics", Annals of Statistics 34 (2006): 837--877
• Antonio Lijoi, Igor Prünster and Stephen G. Walker, "Bayesian Consistency for Stationary Models", Econometric Theory 23 (2007): 749--759 [Gives a Doob-style result, that the prior probability of failing to converge is zero.]
• David A. McAllester, "Some PAC-Bayesian Theorems", Machine Learning 37 (1999): 355--363
• Jeffrey W. Miller, Matthew T. Harrison, "A simple example of Dirichlet process mixture inconsistency for the number of components", arxiv:1301.2708
• Richard Nickl, "Discussion of 'Frequentist coverage of adaptive nonparametric Bayesian credible sets'", Annals of Statistics 43 (2015): 1429--1436, arxiv:1410.7600
• Lorraine Schwartz, "On Bayes Procedures", Z. Wahrsch. Verw. Gebiete 4 (1965): 10--26 [The journal now known as Probability Theory and Related Fields]
• X. Shen and Larry Wasserman, "Rates of convergence of posterior distributions", Annals of Statistics 29 (2001): 687--714
• Vladimir Spokoiny, "Bernstein - von Mises Theorem for growing parameter dimension", arxiv:1302.3430
• Stephen Walker, "New Approaches to Bayesian Consistency", Annals of Statistics 32 (2004): 2028--2043 = math.ST/0503672 [Clever martingale tricks.]
• Yang Xing, "Convergence rates of posterior distributions for observations without the iid structure", arxiv:0811.4677
• Yang Xing and Bo Ranneby, "Both necessary and sufficient conditions for Bayesian exponential consistency", arxiv:0812.1084 [Essentially, a unifying presentation of several existing conditions for IID samples.]
• Tong Zhang, "From $\epsilon$-entropy to KL-entropy: Analysis of minimum information complexity density estimation", Annals of Statistics 34 (2006): 2180--2210 = arxiv:math.ST/0702653

To write:
• CRS, "Bayesian Learning, Information Theory, and Evolutionary Search"