Multiple Testing, Multiple Comparisons
Last update: 21 Apr 2025 21:17. First version: 23 November 2024.
Bonferroni
As every school-child knows, statistical hypothesis tests have error rates. Say that the false positive rate of your test is \( \alpha \): when the null hypothesis is true, the test will nonetheless reject it with probability \( \alpha \). Now suppose we run \( m \) such tests, and all of the null hypotheses are true. The probability of getting at least one false positive somewhere among the \( m \) tests can be as high as \( m\alpha \), which grows uncomfortably large as \( m \) does. The Bonferroni correction is to run each individual test with a false positive rate of \( \alpha/m \), so that the probability of any false positives at all, across all \( m \) tests, is at most \( \alpha \).
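To make the arithmetic concrete, here is a minimal sketch in Python; the function name, the 5% level, and the example p-values are all mine, purely for illustration.

```python
# Bonferroni correction: to keep the family-wise false positive rate at alpha
# across m tests, run each individual test at level alpha / m.

def bonferroni_reject(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    threshold = alpha / m  # per-test level under the Bonferroni correction
    return [p <= threshold for p in p_values]

# Ten made-up p-values; at the corrected level 0.05/10 = 0.005,
# only the first two nulls get rejected.
p_vals = [0.001, 0.004, 0.008, 0.012, 0.02, 0.2, 0.5, 0.6, 0.7, 0.9]
print(bonferroni_reject(p_vals))
```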
There are lots of reasons you might find Bonferroni correction uncomfortable. In the first place, as every school-child knows, reducing the false positive rate of a test comes at the expense of increasing the false negative rate --- doing Bonferroni correction implies you lose (some) "power" to detect true positives. To some extent that's just the math's way of saying that you can't have every nice thing at once, but there are two further, more technical, issues with Bonferroni correction: it makes an extreme, implausible assumption about the probabilistic dependence between tests, and it guards against a really extreme notion of error. Both of these make it "conservative", i.e., contribute to its demand for very small false positive rates and so low power.
- Extreme dependence across tests: Recall that for any two events \( A \) and \( B \), \( \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B) \leq \Pr(A) + \Pr(B) \). (This is just avoiding double-counting.) So \( \Pr(A \cup B) = \Pr(A) + \Pr(B) \) if, and only if, \( \Pr(A \cap B) = 0 \): \( A \) and \( B \) are either actually mutually exclusive, or at least they happen together only with probability 0. For simplicity, I will call this "being mutually exclusive", and take the except-on-a-set-of-probability-zero caveats as read. Bonferroni works by adding up the probabilities of false positives across tests, so it assumes positive outcomes across tests are mutually exclusive. This is a very strong, and often very implausible, sort of probabilistic dependence (*).
- An extreme notion of error: Recall that Bonferroni is worrying about the probability of making any errors at all, when all the null hypotheses are true. What if we can live with some false positives, so long as there aren't too many?
Let me try to bring the last point alive, by drawing a caricature of a common modern application of multiple hypothesis testing, namely functional magnetic resonance imaging (fMRI) in neuroscience. We rig up some big magnets in just such a way that they let us measure how much oxygen is being delivered by blood flow to someone's brain. Each measurement takes a certain amount of time, and is localized to a particular point in space. Our computer divides someone's brain up into voxels (="volume pixels", i.e., little blocky chunks), and we get measurements of the blood oxygen levels in each voxel, first as the person does one task (like counting backwards by sevens while not thinking of white elephants), and then in some other "condition" (like counting white elephants which appear in groups of eight). We take the contrast between conditions in each voxel, to see which parts of the brain are more active when, and test whether it's significantly different from zero. Those with significant differences tell us (hopefully; supposedly) something about what parts of the brain are involved in the task. This means we're doing (at least) as many hypothesis tests as there are voxels. For the first such data set I worked with, that was about sixteen thousand tests. Having one or two extra voxels included as responsive is probably fine! Yes, it'd be embarrassing to have false positives when there was really nothing going on (the deservedly-classic citation is Bennett et al., 2010), but if that's really a concern, you should be doing a better experiment and not wasting those very expensive magnets.
Benjamini-Hochberg
What I was calling "the probability of any false positives whatsoever" is known in the literature, for historical reasons, as the "family-wise error rate" (FWER). The classic paper by Benjamini and Hochberg (1995) introduced a different notion of error for multiple testing, the "false discovery rate" (FDR), "the expected proportion of errors among the rejected [null] hypotheses" (p. 292). That is, it's the proportion of all positive results, all "discoveries", which are, in fact, false. If we pick one of the things we think is a result, what's the probability that it's actually just noise? (In terms of my fMRI caricature: what's the probability that a randomly-selected active voxel is noise?)
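In symbols (using the notation of the Benjamini-Hochberg paper, where \( V \) is the number of true null hypotheses rejected and \( R \) is the total number of rejections, with \( V/R \) taken to be 0 when \( R = 0 \)),
\[
\mathrm{FDR} = \mathbb{E}\left[\frac{V}{\max(R,1)}\right] = \mathbb{E}\left[\left.\frac{V}{R}\right| R > 0\right] \Pr(R > 0) .
\]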
If, in fact, all of the null hypotheses are true, we haven't really gained anything: in that setting, FDR=FWER. But otherwise, FDR \( \leq \) FWER, so controlling the false discovery rate is a less demanding requirement than controlling the family-wise error rate, and procedures which control it can afford to have more power.
Benjamini and Hochberg also gave a beautifully simple procedure to control the FDR. For each hypothesis test, calculate the p-value, and sort these into increasing order, \( p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)} \). To control the FDR at level \( q \), find the largest \( k \) such that \( p_{(k)} \leq \frac{k}{m} q \). We now reject all and only the hypotheses whose p-values are among the \( k \) smallest.
(TODO: sketch an argument that's not as complicated as the original B-H proof, because expanding their two pages of algebra to do all the steps will make undergrad eyes glaze over.)
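In code, a minimal sketch of the procedure (the function name and the example p-values are mine; they are the same made-up p-values as in the Bonferroni sketch above):

```python
# Benjamini-Hochberg step-up procedure: control the false discovery rate at q.

def benjamini_hochberg_reject(p_values, q=0.05):
    """Return a list of booleans, aligned with p_values, True = rejected."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest k (1-indexed) with p_(k) <= (k/m) * q; k = 0 if there is none.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k = rank
    # Reject the hypotheses with the k smallest p-values.
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject

# Same ten made-up p-values as before: Bonferroni rejected two nulls,
# but Benjamini-Hochberg at q = 0.05 rejects five.
p_vals = [0.001, 0.004, 0.008, 0.012, 0.02, 0.2, 0.5, 0.6, 0.7, 0.9]
print(benjamini_hochberg_reject(p_vals))
```

The comparison on the same inputs shows where the extra power comes from: the step-up thresholds \( (k/m)q \) grow with \( k \), instead of staying fixed at the Bonferroni level \( q/m \).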
Now at this point I should really describe the Devlin-Roeder "genomic control" idea, but that will have to wait for another time when my toddler is napping.
--- The further wrinkle I'm interested in, professionally, is situations where which hypotheses we test is contingent on the outcome of earlier tests...
--- Bonferroni correction is easy to transfer from multiple hypothesis testing to multiple confidence sets. It's less clear to me that transferring FDR control to confidence sets really makes sense.
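For the first half of that remark, a minimal sketch of the transfer, assuming normal-theory intervals (the numbers below are made up):

```python
# Bonferroni for simultaneous confidence intervals: if each of m intervals has
# coverage 1 - alpha/m, then all m of them cover their targets simultaneously
# with probability at least 1 - alpha (again by the union bound).
from scipy.stats import norm

def bonferroni_normal_cis(estimates, std_errors, alpha=0.05):
    """Simultaneous 1 - alpha normal-theory intervals for several parameters."""
    m = len(estimates)
    z = norm.ppf(1 - alpha / (2 * m))  # two-sided critical value at level alpha/m
    return [(est - z * se, est + z * se) for est, se in zip(estimates, std_errors)]

# Three estimated contrasts with their standard errors (made-up numbers).
print(bonferroni_normal_cis([1.2, -0.4, 0.9], [0.5, 0.3, 0.4]))
```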
*: Or is it? Bonferroni says that to achieve an over-all false positive rate of \( \alpha \) across \( m \) tests, each individual test should be run with a false positive rate of \( \alpha/m \). Since \( \Pr(A \cup B) \leq \Pr(A) + \Pr(B) \) no matter how \( A \) and \( B \) are related, this guarantee holds under any pattern of dependence between the tests; mutual exclusivity is just the worst case, where the bound is tight and nothing less conservative would do.
- See also:
- Confidence Sets
- Statistics
- Recommended, big picture:
- Craig M. Bennett, Abigail A. Baird, Michael B. Miller and George L. Wolford, "Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction", Journal of Serendipitous and Unexpected Results 1 (2010): 1--5 [PDF]
- Yoav Benjamini and Yosef Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing", Journal of the Royal Statistical Society B 57 (1995): 289--300 [PDF reprint via Prof. Benjamini]
- Andrew Gelman, Jennifer Hill and Masanao Yajima, "Why we (usually) don't have to worry about multiple comparisons", Journal of Research on Educational Effectiveness 5 (2012): 189--211, arxiv:0907.2478
- Recommended, close-ups (very miscellaneous and CMU-centric):
- Bernie Devlin and Kathryn Roeder, "Genomic Control for Association Studies", Biometrics 55 (1999): 997--1004
- Bernie Devlin, Kathryn Roeder and Larry Wasserman, "Genomic Control, a New Approach to Genetic-Based Association Studies", Theoretical Population Biology 60 (2001): 155--166 [Preprint version]
- Aurore Delaigle and Peter Hall, "Higher Criticism in the Context of Unknown Distribution, Non-Independence and Classification", pp. 109--138 of Sastry, Rao, Delampady and Rajeev (eds.), Platinum Jubilee Proceedings of the Indian Statistical Institute
- David Donoho and Jiashun Jin, "Higher criticism for detecting sparse heterogeneous mixtures", Annals of Statistics 32 (2004): 962--994, arxiv:math.ST/0410072
- Christopher Genovese and Larry Wasserman, "A Stochastic Process Approach to False Discovery Control", Annals of Statistics 32 (2004): 1035--1061
- To read:
- Felix Abramovich, Yoav Benjamini, David L. Donoho and Iain M. Johnstone, "Adapting to Unknown Sparsity by controlling the False Discovery Rate", math.ST/0505374
- Yoav Benjamini, Marina Bogomolov, "Adjusting for selection bias in testing multiple families of hypotheses", arxiv:1106.3670
- Yoav Benjamini, Vered Madar and Phillip B. Stark, "Simultaneous confidence intervals uniformly more likely to determine signs", Biometrika 100 (2013): 283--300
- Lucien Birgé, "A New Lower Bound for Multiple Hypothesis Testing", IEEE Transactions on Information Theory 51 (2005): 1611--1615
- Zhiyi Chi, "Effects of statistical dependence on multiple testing under a hidden Markov model", Annals of Statistics 39 (2011): 439--473
- Sandy Clarke, Peter Hall, "Robustness of multiple testing procedures against dependence", Annals of Statistics 37 (2009): 332--358, arxiv:0903.0464
- Arthur Cohen and Harold B. Sackrowitz, "Decision theory results for one-sided multiple comparison procedures", Annals of Statistics 33 (2005): 126--144, math.ST/0504505
- Arthur Cohen, Harold B. Sackrowitz, Minya Xu, "A new multiple testing method in the dependent case", Annals of Statistics 37 (2009): 1518--1544, arxiv:0906.3082
- Bradley Efron, "Size, power and false discovery rates", Annals of Statistics 35 (2007): 1351--1377, arxiv:0710.2245
- Werner Ehm, Jürgen Kornmeier, and Sven P. Heinrich, "Multiple testing along a tree", Electronic Journal of Statistics 4 (2010): 461--471 =? arxiv:0902.2296
- Janos Galambos and Italo Simonelli, Bonferroni-type Inequalities with Applications [I need to finish one of these decades]
- E. L. Lehmann and Joseph P. Romano, "Generalizations of the Familywise Error Rate", Annals of Statistics 33 (2005): 1138--1154, math.ST/0507420
- Nicolai Meinshausen and John Rice, "Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses", Annals of Statistics 34 (2006): 373--393, math.ST/0501289
- Andrey Novikov, "Sequential multiple hypothesis testing in presence of control variables", Kybernetika 45 (2009): 507--528, arxiv:0812.2712
- Guenther Walther, "The Average Likelihood Ratio for Large-scale Multiple Testing and Detecting Sparse Mixtures", arxiv:1111.0328
- Wei Biao Wu, "On false discovery control under dependence", Annals of Statistics 36 (2008): 364--380, arxiv:0903.1971