## Frequentist Consistency of Bayesian Procedures

*10 Sep 2023 10:35*

"Bayesian consistency" is usually taken to mean showing that, under Bayesian
updating, the posterior probability concentrates on the true model. That is,
for every (measurable) neighborhood of the truth in hypothesis space, the posterior
probability goes to 1. (In practice one shows that the posterior probability
of any set bounded away from the truth goes to zero.) There is a basic result
here, due to Doob, which essentially says that the Bayesian learner is
consistent, except on a set of data of *prior* probability zero. That
is, the Bayesian is *subjectively* certain they will converge on the
truth. This is not as reassuring as one might wish, and showing Bayesian
consistency *under the true distribution* is harder. In fact, it
usually involves assumptions under which *non-Bayes* procedures will
also converge. These are things like the existence of very powerful consistent
hypothesis tests (an approach favored by Ghosal, van der Vaart, et al.,
supposedly going back to Le Cam), or, inspired
by learning theory, constraints on the
effective size of the hypothesis space which are gradually relaxed as the
sample size grows (as in Barron et al.). If these assumptions do not hold, one
can construct situations in which Bayesian procedures are inconsistent.
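
To make the definition concrete, here is a minimal sketch of posterior concentration in the easiest possible setting: IID Bernoulli data, a finite grid of hypotheses, and a prior that puts positive weight on the truth. The grid, the uniform prior, the true value 0.3, and the names `grid_posterior` and `mass_outside` are all assumptions made for illustration; none of the difficulties with nonparametric or mis-specified models shows up here.

```python
import math
import random

def grid_posterior(successes, n, grid, prior):
    """Posterior over a finite grid of Bernoulli parameters, given IID data.

    Computed in log space so that long samples do not underflow.
    """
    log_post = [
        math.log(p) + successes * math.log(theta) + (n - successes) * math.log(1.0 - theta)
        for theta, p in zip(grid, prior)
    ]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

def mass_outside(posterior, grid, truth, radius):
    """Posterior probability of the hypotheses farther than `radius` from the truth."""
    return sum(w for theta, w in zip(grid, posterior) if abs(theta - truth) > radius)

# Illustrative assumptions: 19 hypotheses, uniform prior, true parameter 0.3.
GRID = [i / 20 for i in range(1, 20)]
PRIOR = [1.0 / len(GRID)] * len(GRID)
TRUTH = 0.3

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [rng.random() < TRUTH for _ in range(20000)]
    for n in (10, 100, 1000, 20000):
        s = sum(draws[:n])
        post = grid_posterior(s, n, GRID, PRIOR)
        # Posterior mass on every hypothesis other than the true one.
        print(n, mass_outside(post, GRID, TRUTH, radius=0.025))
```

The printed mass on the wrong hypotheses should shrink toward zero as `n` grows, which is the "set not containing the truth" formulation of consistency in this toy case.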

Concentration of the posterior around the truth is only a preliminary. One would also want to know that, say, the posterior mean converges, or even better that the predictive distribution converges. For many finite-dimensional problems, what's called the "Bernstein-von Mises theorem" basically says that the posterior distribution becomes approximately Gaussian and centered on the maximum likelihood estimate, so the posterior mean and the MLE converge on each other, and if one works the other will too. This breaks down for infinite-dimensional problems.
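
The point-estimate half of that statement can be checked by hand in the conjugate Beta-Bernoulli model, which is about as finite-dimensional as it gets. The sketch below is only an illustration under that assumption (the function names and the Beta(2, 5) prior are made up for the example); it shows the gap between posterior mean and MLE shrinking like 1/n, and says nothing about the Gaussian shape of the posterior.

```python
def posterior_mean_and_mle(successes, n, a=1.0, b=1.0):
    """Posterior mean under a Beta(a, b) prior, next to the MLE, for Bernoulli data."""
    posterior_mean = (a + successes) / (a + b + n)
    mle = successes / n
    return posterior_mean, mle

def gap_bound(n, a=1.0, b=1.0):
    """Worst-case |posterior mean - MLE| over all possible success counts.

    |(a+s)/(a+b+n) - s/n| = |a*n - s*(a+b)| / (n*(a+b+n)) <= max(a, b) / (a+b+n).
    """
    return max(a, b) / (a + b + n)

for n in (10, 100, 1000, 10000):
    s = round(0.3 * n)  # a sample with 30% successes, chosen for illustration
    pm, mle = posterior_mean_and_mle(s, n, a=2.0, b=5.0)
    print(n, pm, mle, abs(pm - mle), gap_bound(n, a=2.0, b=5.0))
```

Whatever the data, the two estimates differ by at most `max(a, b) / (a + b + n)`, so in this model one is consistent exactly when the other is.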

(PAC-Bayesian results don't fit into this picture particularly neatly.
Essentially, they say that if you find a set of classifiers which all classify
correctly in-sample, and ask about the average out-of-sample performance, the
bounds on the latter are tighter for big sets than for small ones. This is for
the unmysterious reason that it takes a bigger coincidence for *many*
bad classification rules to happen to *all* work on the training data
than for a *few* bad rules to get lucky. The actual Bayesian machinery
of posterior updating doesn't really come into play, at least not in the papers
I've seen.)

I believe I have contributed a Result to this area, on what happens when the data are dependent and all the models are mis-specified, but some are more mis-specified than others. This turns on realizing that Bayesian updating is just a special case of evolutionary search, i.e., an infinite-dimensional stochastic replicator equation.
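
A finite, discrete-time caricature of the replicator connection may help. In the sketch below, each hypothesis is a "type", its prior weight is its population share, and its "fitness" is the likelihood it assigns to the newest observation; one replicator step (rescale each share by fitness over mean fitness) is then exactly one application of Bayes's rule. The three Bernoulli hypotheses, the data, and the function names are assumptions for illustration only; this is not the infinite-dimensional stochastic setting of the paper.

```python
def replicator_step(weights, fitness):
    """Discrete-time replicator: rescale each share by fitness / mean fitness."""
    mean_fitness = sum(w * f for w, f in zip(weights, fitness))
    return [w * f / mean_fitness for w, f in zip(weights, fitness)]

def bernoulli_likelihood(theta, x):
    """Probability that a Bernoulli(theta) hypothesis assigns to outcome x in {0, 1}."""
    return theta if x == 1 else 1.0 - theta

def bayes_by_replication(prior, thetas, data):
    """Update sequentially, treating each observation's likelihood as fitness."""
    weights = list(prior)
    for x in data:
        fitness = [bernoulli_likelihood(t, x) for t in thetas]
        weights = replicator_step(weights, fitness)
    return weights

def bayes_direct(prior, thetas, data):
    """Posterior computed in one shot: prior times full likelihood, normalized."""
    joint = []
    for p, t in zip(prior, thetas):
        like = 1.0
        for x in data:
            like *= bernoulli_likelihood(t, x)
        joint.append(p * like)
    z = sum(joint)
    return [j / z for j in joint]

thetas = [0.2, 0.5, 0.8]          # hypothetical hypotheses
prior = [0.5, 0.3, 0.2]           # hypothetical prior weights
data = [1, 0, 0, 1, 0, 0, 0, 1]   # hypothetical observations
print(bayes_by_replication(prior, thetas, data))
print(bayes_direct(prior, thetas, data))
```

The two printed vectors should agree up to floating-point error, since the mean fitness at each step is just the predictive probability of the new observation, i.e., the normalizing constant in Bayes's rule.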

*Query*: are there any situations where Bayesian methods are
consistent but no non-Bayesian method is? (My recollection is that John
Earman, in Bayes or Bust, provides a negative answer, but I forget
how.)

- Recommended:
- Andrew Barron, Mark J. Schervish and Larry Wasserman, "The Consistency of Posterior Distributions in Nonparametric Problems", Annals of Statistics **27**(1999): 536--561 [While I am biased (Mark and Larry are senior faculty here), I think this is definitely one of the best-written papers on the topic.]
- Gordon Belot
- "Failure of Calibration is Typical", arxiv:1306.4943
- "Bayesian Orgulity", Philosophy of Science **80**(2013): 483--503 [An exposition of the limitations of Bayesian consistency for philosophers, and reflections on its implications]

- Robert H. Berk [Old but quite nice papers on the effect of mis-specification, though with IID data assumed, and stronger assumptions about the models than modern writers are comfortable with.]
- "Limiting Behavior of Posterior Distributions when the Model is Incorrect", Annals of Mathematical Statistics **37**(1966): 51--58 [see also the correction]
- "Consistency a Posteriori", Annals of Mathematical Statistics **41**(1970): 894--906

- David Blackwell and Lester Dubins, "Merging of Opinions with Increasing Information", Annals of Mathematical Statistics **33**(1962): 882--886
- Taeryon Choi, R. V. Ramamoorthi, "Remarks on consistency of posterior distributions", arxiv:0805.3248
- Ronald Christensen, "Inconsistent Bayesian Estimation", Bayesian Analysis **4**(2009): 413--416 [An extremely simple example of how inconsistency can be generated]
- Dennis D. Cox, "An Analysis of Bayesian Inference for Nonparametric Regression", Annals of Statistics **21**(1993): 903--923
- Persi Diaconis and David Freedman, "On the Consistency of Bayes Estimates", The Annals of Statistics **14**(1986): 1--26 [With accompanying discussion; the latter is worth reading if only to fully savor the academic snark in Diaconis and Freedman's reply.]
- Ignacio Esponda, Demian Pouzo, Yuichi Yamamoto, "Asymptotic Behavior of Bayesian Learners with Misspecified Models", arxiv:1904.08551
- David Freedman, "On the Bernstein-von Mises Theorem with Infinite-Dimensional Parameters", Annals of Statistics **27**(1999): 1119--1140 [As you know, Bob, the Bernstein-von Mises theorem asserts that, "under the usual conditions", in the large sample limit the distribution of the maximum likelihood estimate is basically the same as the Bayesian posterior distribution, so you can take credible intervals as approximate confidence intervals and vice versa. It turns out that the usual conditions can fail drastically even for very simple infinite-dimensional problems.]
- Subhashis Ghosal, "A review of consistency and convergence rates of posterior distribution" [PDF]
- Subhashis Ghosal, Jayanta K. Ghosh and R. V. Ramamoorthi, "Consistency Issues in Bayesian Nonparametrics" [Review of the IID case, on Ghosal's website someplace]
- Subhashis Ghosal, Jayanta K. Ghosh and Aad W. van der Vaart, "Convergence Rates of Posterior Distributions", Annals of Statistics **28**(2000): 500--531
- Subhashis Ghosal and Yongqiang Tang, "Bayesian Consistency for Markov Processes", Sankhya **68**(2006): 227--239 [This is slick, but I think the cuteness of the proof of the main theorem is achieved at the cost of the ugliness of verifying the main conditions, as in their example. (That may just be jealousy speaking.) PDF]
- Subhashis Ghosal and Aad van der Vaart, "Convergence Rates of Posterior Distributions for Non-IID Observations", Annals of Statistics **35**(2007): 192--223
- J. K. Ghosh and R. V. Ramamoorthi, Bayesian Nonparametrics [Mini-review]
- Peter Grünwald, "Bayesian Inconsistency under Misspecification" [PDF preprint of talk given at the Valencia 8 meeting in 2006]
- Peter Grünwald and John Langford, "Suboptimal behavior of Bayes and MDL in classification under misspecification", Machine Learning **66**(2007): 119--149 [PDF reprint via Prof. Grünwald]
- B. J. K. Kleijn and A. W. van der Vaart, "Misspecification in infinite-dimensional Bayesian statistics", Annals of Statistics **34**(2006): 837--877
- Antonio Lijoi, Igor Prunster and Stephen G. Walker, "Bayesian Consistency for Stationary Models", Econometric Theory **23**(2007): 749--759 [Gives a Doob-style result, that the *prior* probability of failing to converge is zero.]
- David A. McAllester, "Some PAC-Bayesian Theorems", Machine Learning **37**(1999): 355--363
- Jeffrey W. Miller and Matthew T. Harrison [Moral: generally, the posterior distribution on the number of clusters in Bayesian clustering does not converge on the true number of clusters]
- "A simple example of Dirichlet process mixture inconsistency for the number of components", pp. 199--206 of Burges et al. (eds), NIPS 2013, arxiv:1301.2708
- "Inconsistency of Pitman-Yor Process Mixtures for the Number of Components", Journal of Machine Learning Research
**15**(2014): 3333--3370

- Richard Nickl, "Discussion of 'Frequentist coverage of adaptive nonparametric Bayesian credible sets'", Annals of Statistics **43**(2015): 1429--1436, arxiv:1410.7600
- Lorraine Schwartz, "On Bayes Procedures", Z. Wahrsch. Verw. Gebiete **4**(1965): 10--26 [The journal now known as Probability Theory and Related Fields]
- X. Shen and Larry Wasserman, "Rates of convergence of posterior distributions", Annals of Statistics **29**(2001): 687--714
- Vladimir Spokoiny, "Bernstein - von Mises Theorem for growing parameter dimension", arxiv:1302.3430
- Tom F. Sterkenburg, "Solomonoff Prediction and Occam's Razor", Philosophy of Science **83**(2016): 459--479, phil-sci/12429
- Stephen Walker, "New Approaches to Bayesian Consistency", Annals of Statistics **32**(2004): 2028--2043 = math.ST/0503672 [Clever martingale tricks.]
- Yang Xing, "Convergence rates of posterior distributions for observations without the iid structure", arxiv:0811.4677
- Yang Xing and Bo Ranneby, "Both necessary and sufficient conditions for Bayesian exponential consistency", arxiv:0812.1084 [Essentially, a unifying presentation of several existing conditions for IID samples.]
- Tong Zhang,
"From $\epsilon$-entropy to KL-entropy: Analysis of minimum information complexity density estimation", Annals of Statistics
**34**(2006): 2180--2210 = arxiv:math.ST/0702653

- Modesty forbids me to recommend:
- CRS, "Dynamics of Bayesian Updating with Dependent Data and Mis-specified Models", arxiv:0901.1342 = Electronic Journal of Statistics **3**(2009): 1039--1074 [Less-technical explanation of the paper]
- Sabina J. Sloman, Daniel M. Oppenheimer, Stephen B. Broomell, and CRS, "Characterizing the robustness of Bayesian adaptive experimental designs to active learning bias", arxiv:2205.13698

- To read:
- Arash A. Amini, XuanLong Nguyen, "Bayesian inference as iterated random functions with applications to sequential inference in graphical models", arxiv:1311.0072
- Julyan Arbel, Ghislaine Gayraud, Judith Rousseau, "Bayesian Optimal Adaptive Estimation Using a Sieve Prior", Scandinavian Journal of Statistics **40**(2013): 549--570
- P. J. Bickel and B. J. K. Kleijn, "The semiparametric Bernstein-von Mises theorem", Annals of Statistics **40**(2012): 206--237
- Natalia A. Bochkina, Peter J. Green
- "Consistency and efficiency of Bayesian estimators in generalised linear inverse problems", arxiv:1110.3015
- "The Bernstein-von Mises theorem for non-regular generalised linear inverse problems", arxiv:1211.3434
- "The Bernstein–von Mises theorem and nonregular models", Annals of Statistics **42**(2014): 1850--1878 (the same as the above?)

- Antonio Canale, Pierpaolo De Blasi, "Posterior consistency of nonparametric location-scale mixtures for multivariate density estimation", arxiv:1306.2671
- Ismaël Castillo, "On Bayesian supremum norm contraction rates", Annals of Statistics **42**(2014): 2058--2091, arxiv:1304.1761
- Ismaël Castillo, Gerard Kerkyacharian, Dominique Picard, "Thomas Bayes' walk on manifolds", arxiv:1206.0459
- Ismaël Castillo and Richard Nickl
- "Nonparametric Bernstein-von Mises Theorems in Gaussian White Noise", Annals of Statistics **41**(2013): 1999--2028, arxiv:1208.3862
- "On the Bernstein-von Mises phenomenon for nonparametric Bayes procedures", Annals of Statistics **42**(2014): 1941--1969, arxiv:1310.2484

- Ismaël Castillo and Judith Rousseau, "A General Bernstein--von Mises Theorem in semiparametric models", arxiv:1305.4482
- Ismaël Castillo and Aad van der Vaart, "Needles and Straw in a Haystack: Posterior concentration for possibly sparse sequences", Annals of Statistics **40**(2012): 2069--2101
- Marta Catalano, Pierpaolo De Blasi, Antonio Lijoi, Igor Pruenster, "Posterior Asymptotics for Boosted Hierarchical Dirichlet Process Mixtures", Journal of Machine Learning Research **23**(2022): 80
- Masoumeh Dashti, Kody J. H. Law, Andrew M. Stuart, Jochen Voss, "MAP Estimators and Their Consistency in Bayesian Nonparametric Inverse Problems", arxiv:1303.4795
- René de Jonge and Harry van Zanten, "Semiparametric Bernstein-von Mises for the error standard deviation", Electronic Journal of Statistics **7**(2013): 217--243
- Pierpaolo De Blasi and Stephen G. Walker, "Bayesian Estimation of the Discrepancy with Misspecified Parametric Models", Bayesian Analysis **8**(2013): 781--800
- J. L. Doob, "Application of the theory of martingales", pp. 23--27 in Colloques Internationaux du Centre National de la Recherche Scientifique, no. 13, Centre National de la Recherche Scientifique, Paris, 1949 [Summary in Mathematical Reviews by William Feller]
- Bradley Efron, "Bayesian inference and the parametric bootstrap", Annals of Applied Statistics **6**(2012): 1971--1997
- Stefano Favaro, Alessandra Guglielmi, and Stephen G. Walker, "A class of measure-valued Markov chains and Bayesian nonparametrics", Bernoulli **18**(2012): 1002--1030
- Subhashis Ghosal, Jüri Lember and Aad van der Vaart, "Nonparametric Bayesian model selection and averaging", Electronic Journal of Statistics **2**(2008): 63--89
- Evarist Giné and Richard Nickl, "Rates of contraction for posterior distributions in $L^r$-metrics, $1 \leq r \leq \infty$", Annals of Statistics **39**(2011): 2883--2911
- Peter Grünwald, "The Safe Bayesian: Learning the Learning Rate via the Mixability Gap" [PDF preprint]
- Peter Grünwald, Thijs van Ommen, "Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It", arxiv:1412.3730
- Shota Gugushvili, Peter Spreij, "A note on non-parametric Bayesian estimation for Poisson point processes", arxiv:1304.7353
- Marc Hoffmann, Judith Rousseau, Johannes Schmidt-Hieber, "On adaptive posterior concentration rates", arxiv:1305.5270
- Marcus Hutter, "Exact Non-Parametric Bayesian Inference on Infinite Trees", arxiv:0903.5342
- Bas Kleijn, Bartek Knapik, "Semiparametric posterior limits under local asymptotic exponentiality", arxiv:1210.6204
- B. J. K. Kleijn and A. W. van der Vaart, "The Bernstein-Von-Mises theorem under misspecification", Electronic Journal of Statistics **6**(2012): 354--381
- Willem Kruijer, Aad van der Vaart, "Analyzing posteriors by the information inequality", pp. 227--240 in Banerjee et al., From Probability to Statistics and Back: High-Dimensional Models and Processes
- John Langford, "Tutorial on Practical Prediction Theory for Classification", Journal of Machine Learning Research **6**(2005): 273--306 [For the PAC-Bayesian result]
- Lucien LeCam, "On the Speed of Convergence of Posterior Distributions" [PDF]
- Ryan Martin, "A martingale law of large numbers and convergence rates of Bayesian posterior distributions", arxiv:1201.3102
- Ryan Martin, Liang Hong, "On convergence rates of Bayesian predictive densities and posterior distributions", arxiv:1210.0103
- David A. McAllester, "PAC-Bayesian Stochastic Model Selection", Machine Learning **51**(2003): 5--21
- Kevin McGoff, Sayan Mukherjee, Andrew Nobel, "Gibbs posterior convergence and the thermodynamic formalism", arxiv:1901.08641
- XuanLong Nguyen
- "Borrowing strength in hierarchical Bayes: convergence of the Dirichlet base measure", arxiv:1301.0802
- "Convergence of latent mixing measures in finite and infinite mixture models", Annals of Statistics
**41**(2013): 370--400, arxiv:1109.3250

- Houman Owhadi, Clint Scovel
- "Brittleness of Bayesian inference and new Selberg formulas", arxiv:1304.7046
- "Qualitative Robustness in Bayesian Inference", arxiv:1411.3984

- Houman Owhadi, Clint Scovel, Tim Sullivan, "Bayesian Brittleness: Why no Bayesian model is "good enough"", arxiv:1304.6772
- Debdeep Pati, Anirban Bhattacharya, Natesh S. Pillai, David B. Dunson, "Posterior contraction in sparse Bayesian factor models for massive covariance matrices", arxiv:1206.3627
- R. V. Ramamoorthi, Karthik Sriram, Ryan Martin, "On posterior concentration in misspecified models", arxiv:1312.4620
- Y. Ritov, P. J. Bickel, A. Gamst, B. J. K. Kleijn, "The Bayesian Analysis of Complex, High-Dimensional Models: Can it be CODA?", arxiv:1203.5471
- Vincent Rivoirard, Judith Rousseau
- "Bernstein Von Mises Theorem for linear functionals of the density", arxiv:0908.4167
- "Posterior Concentration Rates for Infinite
Dimensional Exponential Families", Bayesian Analysis
**7**(2012): 311--334

- Jean-Bernard Salomond, "Concentration rate and consistency of the posterior under monotonicity constraints", arxiv:1301.1898
- Alessio Sancetta, "Universality of Bayesian Predictions", Bayesian Analysis **7**(2012): 1--36
- Karthik Sriram, R.V. Ramamoorthi, "Posterior consistency in misspecified models for i.n.i.d response", arxiv:1408.6015
- Karthik Sriram, R.V. Ramamoorthi, and Pulak Ghosh, "Posterior Consistency of Bayesian Quantile Regression Based on the Misspecified Asymmetric Laplace Density", Bayesian Analysis **8**(2013): 479--504
- Yan Sun, Qifan Song and Faming Liang, "Consistent Sparse Deep Learning: Theory and Computation", Journal of the American Statistical Association **117**(2022): 1981--1995
- Botond Szabo, Aad van der Vaart, Harry van Zanten, "Frequentist coverage of adaptive nonparametric Bayesian credible sets", arxiv:1310.4489
- Frank van der Meulen and Harry van Zanten, "Consistent nonparametric Bayesian inference for discretely observed scalar diffusions", Bernoulli **19**(2013): 44--63
- A. W. van der Vaart, J. H. van Zanten, "Rates of contraction of posterior distributions based on Gaussian process priors", Annals of Statistics **36**(2008): 1435--1463, arxiv:0806.3024
- Elodie Vernet, "Posterior consistency for nonparametric Hidden Markov Models with finite state space", arxiv:1311.3092
- Yuefeng Wu, Subhashis Ghosal, "Kullback Leibler property of kernel mixture priors in Bayesian density estimation", Electronic Journal
of Statistics
**2**(2008): 298--331, arxiv:0710.2746

- To write:
- CRS, "Bayesian Learning, Information Theory, and Evolutionary Search"