Frequentist Consistency of Bayesian Procedures
23 Jul 2024 13:37
"Bayesian consistency" is usually taken to mean showing that, under Bayesian updating, the posterior probability concentrates on the true model. That is, for every (measurable) set of hypotheses containing the truth, the posterior probability goes to 1. (In practice one shows that the posterior probability of any set not containing the truth goes to zero.) There is a basic result here, due to Doob, which essentially says that the Bayesian learner is consistent, except on a set of data of prior probability zero. That is, the Bayesian is subjectively certain they will converge on the truth. This is not as reassuring as one might wish, and showing Bayesian consistency under the true distribution is harder. In fact, it usually involves assumptions under which non-Bayes procedures will also converge. These are things like the existence of very powerful consistent hypothesis tests (an approach favored by Ghosal, van der Vaart, et al., supposedly going back to Le Cam), or, inspired by learning theory, constraints on the effective size of the hypothesis space which are gradually relaxed as the sample size grows (as in Barron et al.). If these assumptions do not hold, one can construct situations in which Bayesian procedures are inconsistent.
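In the nicest cases concentration is easy to watch numerically. A minimal sketch, assuming a well-specified Bernoulli model with a Beta(1,1) prior (a setting where consistency certainly holds); the neighborhood half-width of 0.05 is an arbitrary illustrative choice:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
theta_true = 0.3
for n in [10, 100, 1000, 10000]:
    x = rng.binomial(1, theta_true, size=n)
    # Beta(1,1) prior + Bernoulli likelihood => Beta(1 + successes, 1 + failures) posterior
    a, b = 1 + x.sum(), 1 + n - x.sum()
    # posterior probability of a fixed neighborhood of the truth
    mass = beta.cdf(theta_true + 0.05, a, b) - beta.cdf(theta_true - 0.05, a, b)
    print(n, round(mass, 4))
```

The posterior mass on the neighborhood climbs towards 1 as n grows; the point of the results below is to say when this sort of behavior can, and cannot, be guaranteed.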
Concentration of the posterior around the truth is only a preliminary. One would also want to know that, say, the posterior mean converges, or even better that the predictive distribution converges. For many finite-dimensional problems, what's called the "Bernstein-von Mises theorem" basically says that the posterior mean and the maximum likelihood estimate converge to each other, so if one is consistent the other will be too. This breaks down for infinite-dimensional problems.
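In the same conjugate Bernoulli toy setting (my example, not one from the papers below), the Bernstein-von Mises phenomenon is visible directly: the posterior mean and the MLE differ by O(1/n), which is negligible next to the O(1/sqrt(n)) posterior spread:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 0.3
n = 5000
x = rng.binomial(1, theta_true, size=n)
mle = x.mean()
post_mean = (1 + x.sum()) / (n + 2)   # posterior mean under a Beta(1,1) prior
post_sd = np.sqrt(post_mean * (1 - post_mean) / n)
# |posterior mean - MLE| = |n - 2*sum(x)| / (n*(n+2)), an O(1/n) quantity,
# while the posterior standard deviation is O(1/sqrt(n))
print(abs(post_mean - mle), post_sd)
```

So credible intervals and Wald confidence intervals nearly coincide here; the infinite-dimensional failures (Freedman, below) are failures of exactly this agreement.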
(PAC-Bayesian results don't fit into this picture particularly neatly. Essentially, they say that if you find a set of classifiers which all classify correctly in-sample, and ask about the average out-of-sample performance, the bounds on the latter are tighter for big sets than for small ones. This is for the unmysterious reason that it takes a bigger coincidence for many bad classification rules to happen to all work on the training data than for a few bad rules to get lucky. The actual Bayesian machinery of posterior updating doesn't really come into play, at least not in the papers I've seen.)
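The "bigger coincidence" point can be checked by simulation. A sketch under an assumption of my own, namely completely useless classifiers that guess each of m binary training labels independently with probability 1/2: one such rule fits the training set perfectly with probability 2^-m, but a fixed set of five of them all fit with probability 2^-5m:

```python
import numpy as np

rng = np.random.default_rng(2)
m, trials = 8, 100_000
# each entry < 0.5 counts as "this rule got this training label right"
one_fits = (rng.random((trials, m)) < 0.5).all(axis=1)           # one bad rule fits all m points
five_fit = (rng.random((trials, 5, m)) < 0.5).all(axis=(1, 2))   # five bad rules all fit, jointly
print(one_fits.mean(), five_fit.mean())
```

The empirical frequency for one rule hovers around 2^-8, while five rules jointly fitting essentially never happens, which is why in-sample agreement of a large set of rules is stronger evidence.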
I believe I have contributed a Result to this area, on what happens when the data are dependent and all the models are mis-specified, but some are more mis-specified than others. This turns on realizing that Bayesian updating is just a special case of evolutionary search, i.e., an infinite-dimensional stochastic replicator equation.
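To make the replicator analogy concrete (a toy sketch, not the construction in the paper): over a finite set of models, Bayes's rule applied one observation at a time is exactly a discrete-time replicator step in which a model's "fitness" is its likelihood for the current data point:

```python
import numpy as np

rng = np.random.default_rng(3)
thetas = np.array([0.2, 0.5, 0.8])   # three candidate Bernoulli models
w = np.ones(3) / 3                   # prior = initial "population shares"
x = rng.binomial(1, 0.5, size=200)   # data generated by the middle model
for xi in x:
    fit = thetas**xi * (1 - thetas)**(1 - xi)   # per-datum likelihood = fitness
    w = w * fit / (w * fit).sum()               # replicator step = Bayes's rule
print(w.round(4))   # mass piles up on the model closest to the truth
```

Iterating this update gives the same weights as conditioning once on all 200 observations; the replicator view just makes the selection dynamics explicit.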
Query: are there any situations where Bayesian methods are consistent but no non-Bayesian method is? (My recollection is that John Earman, in Bayes or Bust, provides a negative answer, but I forget how.)
- Recommended:
- Andrew Barron, Mark J. Schervish and Larry Wasserman, "The Consistency of Posterior Distributions in Nonparametric Problems", Annals of Statistics 27 (1999): 536--561 [While I am biased — Mark and Larry are senior faculty here — I think this is definitely one of the best-written papers on the topic.]
- Gordon Belot
- "Failure of Calibration is Typical", arxiv:1306.4943
- "Bayesian Orgulity", Philosophy of Science 80 (2013): 483--503 [An exposition of the limitations of Bayesian consistency for philosophers, and reflections on its implications]
- Robert H. Berk [Old but quite nice papers on the effect of mis-specification, though with IID data assumed, and stronger assumptions about the models than modern writers are comfortable with.]
- "Limiting Behavior of Posterior Distributions when the Model is Incorrect", Annals of Mathematical Statistics 37 (1966): 51--58 [see also the correction]
- "Consistency a Posteriori", Annals of Mathematical Statistics 41 (1970): 894--906
- David Blackwell and Lester Dubins, "Merging of Opinions with Increasing Information", Annals of Mathematical Statistics 33 (1962): 882--886
- Taeryon Choi, R. V. Ramamoorthi, "Remarks on consistency of posterior distributions", arxiv:0805.3248
- Ronald Christensen, "Inconsistent Bayesian Estimation", Bayesian Analysis 4 (2009): 413--416 [An extremely simple example of how inconsistency can be generated]
- Dennis D. Cox, "An Analysis of Bayesian Inference for Nonparametric Regression", Annals of Statistics 21 (1993): 903--923
- Persi Diaconis and David Freedman, "On the Consistency of Bayes Estimates", The Annals of Statistics 14 (1986): 1--26 [With accompanying discussion; the latter is worth reading if only to fully savor the academic snark in Diaconis and Freedman's reply.]
- Ignacio Esponda, Demian Pouzo, Yuichi Yamamoto, "Asymptotic Behavior of Bayesian Learners with Misspecified Models", arxiv:1904.08551
- David Freedman, "On the Bernstein-von Mises Theorem with Infinite-Dimensional Parameters", Annals of Statistics 27 (1999): 1119--1140 [As you know, Bob, the Bernstein-von Mises theorem asserts that, "under the usual conditions", in the large sample limit the distribution of the maximum likelihood estimate is basically the same as the Bayesian posterior distribution, so you can take credible intervals as approximate confidence intervals and vice versa. It turns out that the usual conditions can fail drastically even for very simple infinite-dimensional problems.]
- Subhashis Ghosal, "A review of consistency and convergence rates of posterior distribution" [PDF]
- Subhashis Ghosal, Jayanta K. Ghosh and R. V. Ramamoorthi, "Consistency Issues in Bayesian Nonparametrics" [Review of the IID case, on Ghosal's website someplace]
- Subhashis Ghosal, Jayanta K. Ghosh and Aad W. van der Vaart, "Convergence Rates of Posterior Distributions", Annals of Statistics 28 (2000): 500--531
- Subhashis Ghosal and Yongqiang Tang, "Bayesian Consistency for Markov Processes", Sankhya 68 (2006): 227--239 [This is slick, but I think the cuteness of the proof of the main theorem is achieved at the cost of the ugliness of verifying the main conditions, as in their example. (That may just be jealousy speaking.) PDF]
- Subhashis Ghosal and Aad van der Vaart, "Convergence Rates of Posterior Distributions for Non-IID Observations", Annals of Statistics 35 (2007): 192--223
- J. K. Ghosh and R. V. Ramamoorthi, Bayesian Nonparametrics [Mini-review]
- Peter Grünwald, "Bayesian Inconsistency under Misspecification" [PDF preprint of talk given at the Valencia 8 meeting in 2006]
- Peter Grünwald and John Langford, "Suboptimal behavior of Bayes and MDL in classification under misspecification", Machine Learning 66 (2007): 119--149 [PDF reprint via Prof. Grünwald]
- B. J. K. Kleijn and A. W. van der Vaart, "Misspecification in infinite-dimensional Bayesian statistics", Annals of Statistics 34 (2006): 837--877
- Antonio Lijoi, Igor Prunster and Stephen G. Walker, "Bayesian Consistency for Stationary Models", Econometric Theory 23 (2007): 749--759 [Gives a Doob-style result, that the prior probability of failing to converge is zero.]
- David A. McAllester, "Some PAC-Bayesian Theorems", Machine Learning 37 (1999): 355--363
- Jeffrey W. Miller and Matthew T. Harrison [Moral: generally, the posterior distribution on the number of clusters in Bayesian clustering does not converge on the true number of clusters]
- "A simple example of Dirichlet process mixture inconsistency for the number of components", pp. 199--206 of Burges et al. (eds), NIPS 2013, arxiv:1301.2708
- "Inconsistency of Pitman-Yor Process Mixtures for the Number of Components", Journal of Machine Learning Research 15 (2014): 3333--3370
- Richard Nickl, "Discussion of 'Frequentist coverage of adaptive nonparametric Bayesian credible sets'", Annals of Statistics 43 (2015): 1429--1436, arxiv:1410.7600
- Lorraine Schwartz, "On Bayes Procedures", Z. Wahrsch. Verw. Gebiete 4 (1965): 10--26 [The journal now known as Probability Theory and Related Fields]
- X. Shen and Larry Wasserman, "Rates of convergence of posterior distributions", Annals of Statistics 29 (2001): 687--714
- Vladimir Spokoiny, "Bernstein-von Mises Theorem for growing parameter dimension", arxiv:1302.3430
- Tom F. Sterkenburg, "Solomonoff Prediction and Occam's Razor", Philosophy of Science 83 (2016): 459--479, phil-sci/12429
- Stephen Walker, "New Approaches to Bayesian Consistency", Annals of Statistics 32 (2004): 2028--2043 = math.ST/0503672 [Clever martingale tricks.]
- Yang Xing, "Convergence rates of posterior distributions for observations without the iid structure", arxiv:0811.4677
- Yang Xing and Bo Ranneby, "Both necessary and sufficient conditions for Bayesian exponential consistency", arxiv:0812.1084 [Essentially, a unifying presentation of several existing conditions for IID samples.]
- Tong Zhang, "From $\epsilon$-entropy to KL-entropy: Analysis of minimum information complexity density estimation", Annals of Statistics 34 (2006): 2180--2210 = arxiv:math.ST/0702653
- Modesty forbids me to recommend:
- CRS, "Dynamics of Bayesian Updating with Dependent Data and Mis-specified Models", arxiv:0901.1342 = Electronic Journal of Statistics 3 (2009): 1039--1074 [Less-technical explanation of the paper]
- Sabina J. Sloman, Daniel M. Oppenheimer, Stephen B. Broomell, and CRS, "Characterizing the robustness of Bayesian adaptive experimental designs to active learning bias", arxiv:2205.13698
- To read:
- Arash A. Amini, XuanLong Nguyen, "Bayesian inference as iterated random functions with applications to sequential inference in graphical models", arxiv:1311.0072
- Julyan Arbel, Ghislaine Gayraud, Judith Rousseau, "Bayesian Optimal Adaptive Estimation Using a Sieve Prior", Scandinavian Journal of Statistics 40 (2013): 549--570
- P. J. Bickel and B. J. K. Kleijn, "The semiparametric Bernstein-von Mises theorem", Annals of Statistics 40 (2012): 206--237
- Natalia A. Bochkina, Peter J. Green
- "Consistency and efficiency of Bayesian estimators in generalised linear inverse problems", arxiv:1110.3015
- "The Bernstein-von Mises theorem for non-regular generalised linear inverse problems", arxiv:1211.3434
- "The Bernstein–von Mises theorem and nonregular models", Annals of Statistics 42 (2014): 1850--1878 (the same as the above?)
- Antonio Canale, Pierpaolo De Blasi, "Posterior consistency of nonparametric location-scale mixtures for multivariate density estimation", arxiv:1306.2671
- Ismaël Castillo, "On Bayesian supremum norm contraction rates", Annals of Statistics 42 (2014): 2058--2091, arxiv:1304.1761
- Ismaël Castillo, Gerard Kerkyacharian, Dominique Picard, "Thomas Bayes' walk on manifolds", arxiv:1206.0459
- Ismaël Castillo and Richard Nickl
- "Nonparametric Bernstein-von Mises Theorems in Gaussian White Noise", Annals of Statistics 41 (2013): 1999--2028, arxiv:1208.3862
- "On the Bernstein-von Mises phenomenon for nonparametric Bayes procedures", Annals of Statistics 42 (2014): 1941--1969, arxiv:1310.2484
- Ismaël Castillo and Judith Rousseau, "A General Bernstein--von Mises Theorem in semiparametric models", arxiv:1305.4482
- Ismaël Castillo and Aad van der Vaart, "Needles and Straw in a Haystack: Posterior concentration for possibly sparse sequences", Annals of Statistics 40 (2012): 2069--2101
- Marta Catalano, Pierpaolo De Blasi, Antonio Lijoi, Igor Pruenster, "Posterior Asymptotics for Boosted Hierarchical Dirichlet Process Mixtures", Journal of Machine Learning Research 23 (2022): 80
- Masoumeh Dashti, Kody J. H. Law, Andrew M. Stuart, Jochen Voss, "MAP Estimators and Their Consistency in Bayesian Nonparametric Inverse Problems", arxiv:1303.4795
- René de Jonge and Harry van Zanten, "Semiparametric Bernstein-von Mises for the error standard deviation", Electronic Journal of Statistics 7 (2013): 217--243
- Pierpaolo De Blasi and Stephen G. Walker, "Bayesian Estimation of the Discrepancy with Misspecified Parametric Models", Bayesian Analysis 8 (2013): 781--800
- J. L. Doob, "Application of the theory of martingales", pp. 23--27 in Colloques Internationaux du Centre National de la Recherche Scientifique, no. 13, Centre National de la Recherche Scientifique, Paris, 1949 [Summary in Mathematical Reviews by William Feller]
- Bradley Efron, "Bayesian inference and the parametric bootstrap", Annals of Applied Statistics 6 (2012): 1971--1997
- Stefano Favaro, Alessandra Guglielmi, and Stephen G. Walker, "A class of measure-valued Markov chains and Bayesian nonparametrics", Bernoulli 18 (2012): 1002--1030
- Subhashis Ghosal, Jüri Lember and Aad van der Vaart, "Nonparametric Bayesian model selection and averaging", Electronic Journal of Statistics 2 (2008): 63--89
- Evarist Giné and Richard Nickl, "Rates of contraction for posterior distributions in $L^r$-metrics, $1 \leq r \leq \infty$", Annals of Statistics 39 (2011): 2883--2911
- Peter Grünwald, "The Safe Bayesian: Learning the Learning Rate via the Mixability Gap" [PDF preprint]
- Peter Grünwald, Thijs van Ommen, "Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It", arxiv:1412.3730
- Shota Gugushvili, Peter Spreij, "A note on non-parametric Bayesian estimation for Poisson point processes", arxiv:1304.7353
- Marc Hoffmann, Judith Rousseau, Johannes Schmidt-Hieber, "On adaptive posterior concentration rates", arxiv:1305.5270
- Marcus Hutter, "Exact Non-Parametric Bayesian Inference on Infinite Trees", arxiv:0903.5342
- Bas Kleijn, Bartek Knapik, "Semiparametric posterior limits under local asymptotic exponentiality", arxiv:1210.6204
- B. J. K. Kleijn and A. W. van der Vaart, "The Bernstein-Von-Mises theorem under misspecification", Electronic Journal of Statistics 6 (2012): 354--381
- Willem Kruijer, Aad van der Vaart, "Analyzing posteriors by the information inequality", pp. 227--240 in Banerjee et al., From Probability to Statistics and Back: High-Dimensional Models and Processes
- John Langford, "Tutorial on Practical Prediction Theory for Classification", Journal of Machine Learning Research 6 (2005): 273--306 [For the PAC-Bayesian result]
- Lucien Le Cam, "On the Speed of Convergence of Posterior Distributions" [PDF]
- Claudio Macci, Mauro Piccioni, "An inverse Sanov theorem for exponential families", arxiv:2111.14152
- Ryan Martin, "A martingale law of large numbers and convergence rates of Bayesian posterior distributions", arxiv:1201.3102
- Ryan Martin, Liang Hong, "On convergence rates of Bayesian predictive densities and posterior distributions", arxiv:1210.0103
- David A. McAllester, "PAC-Bayesian Stochastic Model Selection", Machine Learning 51 (2003): 5--21
- Kevin McGoff, Sayan Mukherjee, Andrew Nobel, "Gibbs posterior convergence and the thermodynamic formalism", arxiv:1901.08641
- XuanLong Nguyen
- "Borrowing strength in hierarchical Bayes: convergence of the Dirichlet base measure", arxiv:1301.0802
- "Convergence of latent mixing measures in finite and infinite mixture models", Annals of Statistics 41 (2013): 370--400, arxiv:1109.3250
- Houman Owhadi, Clint Scovel
- "Brittleness of Bayesian inference and new Selberg formulas", arxiv:1304.7046
- "Qualitative Robustness in Bayesian Inference", arxiv:1411.3984
- Houman Owhadi, Clint Scovel, Tim Sullivan, "Bayesian Brittleness: Why no Bayesian model is 'good enough'", arxiv:1304.6772
- Debdeep Pati, Anirban Bhattacharya, Natesh S. Pillai, David B. Dunson, "Posterior contraction in sparse Bayesian factor models for massive covariance matrices", arxiv:1206.3627
- R. V. Ramamoorthi, Karthik Sriram, Ryan Martin, "On posterior concentration in misspecified models", arxiv:1312.4620
- Y. Ritov, P. J. Bickel, A. Gamst, B. J. K. Kleijn, "The Bayesian Analysis of Complex, High-Dimensional Models: Can it be CODA?", arxiv:1203.5471
- Vincent Rivoirard, Judith Rousseau
- "Bernstein Von Mises Theorem for linear functionals of the density", arxiv:0908.4167
- "Posterior Concentration Rates for Infinite Dimensional Exponential Families", Bayesian Analysis 7 (2012): 311--334
- Jean-Bernard Salomond, "Concentration rate and consistency of the posterior under monotonicity constraints", arxiv:1301.1898
- Alessio Sancetta, "Universality of Bayesian Predictions", Bayesian Analysis 7 (2012): 1--36
- Karthik Sriram, R.V. Ramamoorthi, "Posterior consistency in misspecified models for i.n.i.d response", arxiv:1408.6015
- Karthik Sriram, R.V. Ramamoorthi, and Pulak Ghosh, "Posterior Consistency of Bayesian Quantile Regression Based on the Misspecified Asymmetric Laplace Density", Bayesian Analysis 8 (2013): 479--504
- Yan Sun, Qifan Song and Faming Liang, "Consistent Sparse Deep Learning: Theory and Computation", Journal of the American Statistical Association 117 (2022): 1981--1995
- Botond Szabo, Aad van der Vaart, Harry van Zanten, "Frequentist coverage of adaptive nonparametric Bayesian credible sets", arxiv:1310.4489
- Frank van der Meulen and Harry van Zanten, "Consistent nonparametric Bayesian inference for discretely observed scalar diffusions", Bernoulli 19 (2013): 44--63
- A. W. van der Vaart, J. H. van Zanten, "Rates of contraction of posterior distributions based on Gaussian process priors", Annals of Statistics 36 (2008): 1435--1463, arxiv:0806.3024
- Elodie Vernet, "Posterior consistency for nonparametric Hidden Markov Models with finite state space", arxiv:1311.3092
- Yuefeng Wu, Subhashis Ghosal, "Kullback Leibler property of kernel mixture priors in Bayesian density estimation", Electronic Journal of Statistics 2 (2008): 298--331, arxiv:0710.2746
- To write:
- CRS, "Bayesian Learning, Information Theory, and Evolutionary Search"