
Multiple Testing, Multiple Comparisons

25 Nov 2024 10:40

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \]

Bonferroni

As every school-child knows, statistical hypothesis tests have error rates. Say that the false positive rate of your test is $\alpha$; you have thought a lot about this and worked hard to ensure it is that low. This is fine if you run one test. But now suppose you run $m$ different tests. And further suppose that all your null hypotheses are true. (Maybe you are an evidence-based augur.) What's the probability that at least one of the tests reports a false positive? Well, by elementary probability, it's somewhere between $\alpha$ and $m\alpha$. (Both extremes imply a lot of probabilistic dependence between the test outcomes.) If $m$ is very large, then $m\alpha$ is a very bad error rate! This does however suggest an easy fix: re-jigger your individual tests so that they each have a false positive rate of $\alpha/m$, so that the probability of any false positive at all is at most $m \cdot \alpha/m = \alpha$. (If $m$ is large, $\alpha/m$ is a very small error rate.) This is "Bonferroni correction".
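To make the numbers a bit more concrete, here is a small simulation sketch in Python (the variable names and the use of NumPy are my choices for illustration, and the tests are taken to be independent for simplicity): all $m$ null hypotheses are true, so any rejection is a false positive, and we compare the chance of at least one false positive at the raw threshold $\alpha$ against the Bonferroni-corrected threshold $\alpha/m$.

```python
import numpy as np

rng = np.random.default_rng(42)
m, alpha, n_sims = 100, 0.05, 10_000

# Under a true null hypothesis, a well-calibrated p-value is uniform on [0, 1].
pvals = rng.uniform(size=(n_sims, m))

# Chance of at least one false positive across the m tests:
fwer_raw = np.mean((pvals < alpha).any(axis=1))             # badly inflated for large m
fwer_bonferroni = np.mean((pvals < alpha / m).any(axis=1))  # should come out <= alpha

print(f"raw threshold {alpha}: P(any false positive) ~ {fwer_raw:.3f}")
print(f"Bonferroni threshold {alpha/m}: P(any false positive) ~ {fwer_bonferroni:.3f}")
```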

There are lots of reasons you might find Bonferroni correction uncomfortable. In the first place, as every school-child knows, reducing the false positive rate of a test comes at the expense of increasing the false negative rate --- doing Bonferroni correction implies you lose (some) "power" to detect true positives. To some extent that's just the math's way of saying that you can't have every nice thing at once, but there are two further, more technical, issues with Bonferroni correction: it makes an extreme, implausible assumption about the probabilistic dependence between tests, and it guards against a really extreme notion of error. Both of these make it "conservative", i.e., contribute to its demand for very small false positive rates and so low power.

  1. Extreme dependence across tests: Recall that for any two events $A$ and $B$, $\Prob{A \cup B} = \Prob{A} + \Prob{B} - \Prob{A \cap B}$. (This is just avoiding double-counting.) So $\Prob{A \cup B} = \Prob{A} + \Prob{B}$ if, and only if, $\Prob{A \cap B} = 0$: $A$ and $B$ are either actually mutually exclusive, or at least they happen together only with probability 0. For simplicity, I will call this "being mutually exclusive", and take the except-on-a-set-of-probability-zero caveats as read. Bonferroni works by adding up the probabilities of false positives across tests, so its bound is exact only when positive outcomes across tests are mutually exclusive. This is a very strong, and often very implausible, sort of probabilistic dependence (*).
  2. An extreme notion of error: Recall that Bonferroni is worrying about the probability of making any errors at all, when all the null hypotheses are true. What if we can live with some false positives, so long as there aren't too many?

Let me try to bring the last point alive, by drawing a caricature of a common modern application of multiple hypothesis testing, namely functional magnetic resonance imaging (fMRI) in neuroscience. We rig up some big magnets in just such a way that they let us measure how much oxygen is being delivered by blood flow to someone's brain. Each measurement takes a certain amount of time, and is localized to a particular point in space. Our computer divides someone's brain up into voxels (="volume pixels", i.e., little blocky chunks), and we get measurements of the blood oxygen levels in each voxel, first as the person does one task (like counting backwards by sevens while not thinking of white elephants), and then in some other "condition" (like counting white elephants which appear in groups of eight). We take the contrast between conditions in each voxel, to see which parts of the brain are more active when, and test whether it's significantly different from zero. Those with significant differences tell us (hopefully; supposedly) something about what parts of the brain are involved in the task. This means we're doing (at least) as many hypothesis tests as there are voxels. For the first such data set I worked with, that was about sixteen thousand tests. Having one or two extra voxels included as responsive is probably fine! Yes, it'd be embarrassing to have false positives when there was really nothing going on (the deservedly-classic citation is Bennett et al., 2010), but if that's really a concern, you should be doing a better experiment and not wasting those very expensive magnets.

Benjamini-Hochberg

What I was calling "the probability of any false positives whatsoever" is known in the literature, for historical reasons, as the "family-wise error rate" (FWER). The classic paper by Benjamini and Hochberg (1995) introduced a different notion of error for multiple testing, the "false discovery rate" (FDR), "the expected proportion of errors among the rejected [null] hypotheses" (p. 292). That is, it's the proportion of all positive results, all "discoveries", which are, in fact, false. If we pick one of the things we think is a result, what's the probability that it's actually just noise? (In terms of my fMRI caricature: what's the probability that a randomly-selected active voxel is noise?)
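In symbols (the bookkeeping here is the standard one, not a verbatim quotation from the paper): write $V$ for the number of true null hypotheses that get rejected, and $R$ for the total number of rejections. Then \[ \mathrm{FDR} = \mathbb{E}\left[ \frac{V}{R} \right], \] with the convention that $V/R = 0$ when nothing is rejected ($R = 0$).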

If, in fact, all of the null hypotheses are true, we haven't really gained anything: in that setting, FDR=FWER. But otherwise, FDR $\leq$ FWER, so we can have very substantial gains in power, at the cost of allowing some false positives. (In terms of my fMRI caricature: we have more ability to notice faint genuine activations of voxels.)
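A quick way to see both claims (using the $V$-and-$R$ notation from above; this is my gloss, not the paper's argument): the ratio $V/R$ can never exceed the indicator of "at least one false positive", so taking expectations gives \[ \mathrm{FDR} = \mathbb{E}\left[ \frac{V}{R} \right] \leq \Prob{V \geq 1} = \mathrm{FWER}. \] When all the null hypotheses are true, every rejection is a false positive, so $V = R$, the ratio is exactly that indicator, and the inequality becomes an equality.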

Benjamini and Hochberg also gave a beautifully simple procedure to control the FDR. For each hypothesis test, calculate the $p$-value, and list those values in increasing order, so \( P_1 \leq P_2 \leq \ldots \leq P_n \). Say that the desired FDR is $\gamma$. Then find the largest $c$ where \[ P_{c} \leq \frac{c}{n}\gamma \] Graphically: we draw a plot of index number $i$ versus $p$-value \( P_i \), and then draw a line through the origin of slope $\gamma/n$. The last index at which the $p$-value curve falls on or below the line gives our \( P_c \).

We now reject all and only the hypotheses where \( P_i \leq P_c \). The theorem Benjamini and Hochberg proved (p. 293) is that this guarantees the FDR is $\leq \gamma$, provided that the $p$-values are independent under the null. (They can have arbitrary dependence under the alternatives.)
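Here is a minimal sketch of that procedure in code (Python with NumPy; the function name and the boolean return value are my own choices for illustration, and `gamma` plays the role of the target FDR above):

```python
import numpy as np

def benjamini_hochberg(pvals, gamma=0.05):
    """Return a boolean mask of which hypotheses to reject, controlling FDR at gamma."""
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)                 # indices that sort the p-values
    sorted_p = pvals[order]                   # P_1 <= P_2 <= ... <= P_n
    # Find the largest c with P_c <= (c/n) * gamma (the step-up rule):
    below = sorted_p <= (np.arange(1, n + 1) / n) * gamma
    reject = np.zeros(n, dtype=bool)
    if below.any():
        c = np.nonzero(below)[0].max()        # 0-based index of P_c
        reject[order[:c + 1]] = True          # reject everything with P_i <= P_c
    return reject

# Toy usage (made-up p-values, target FDR of 5%):
print(benjamini_hochberg([0.001, 0.008, 0.012, 0.041, 0.049, 0.2, 0.5, 0.9]))
```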

(TODO: sketch an argument that's not as complicated as the original B-H proof, because expanding their two pages of algebra to do all the steps will make undergrad eyes glaze over.)

Now at this point I should really describe the Devlin-Roeder "genomic control" idea, but that will have to wait for another time when my toddler is napping.

--- The further wrinkle I'm interested in, professionally, is situations where which hypotheses we test is contingent on the outcome of earlier tests...

*: Or is it? Bonferroni says that to achieve an over-all false positive rate of $\alpha$ with $m$ tests, each test's individual false positive rate should be $\alpha/m$. If the tests were themselves probabilistically independent, though, with false positive rate $p$, the probability that none of the individual tests give false positives is $(1-p)^m$, so we want to solve $1-\alpha = (1-p)^m$ for $p$, which yields $p=1-(1-\alpha)^{1/m} \approx \alpha/m$ when $\alpha$ is small or $m$ is large or both. (Said differently, $(1-p)^m \approx 1-pm$ when $p$ is very small.) --- In words, if we've got a large number of rare events, being independent is almost the same as being mutually exclusive, because it's really unlikely more than one of them happens. So if we want to get corrections which are much more optimistic than Bonferroni, mere independence across the tests isn't enough, they'll have to be positively associated. ^
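A quick numerical check of that approximation (a throwaway Python snippet; the last value of $m$ is chosen to match the fMRI example above):

```python
alpha = 0.05
for m in (10, 100, 1000, 16000):
    exact_indep = 1 - (1 - alpha) ** (1 / m)   # per-test level if the tests were independent
    bonferroni = alpha / m                     # the Bonferroni per-test level
    print(f"m = {m:5d}:  1-(1-alpha)^(1/m) = {exact_indep:.3e}   alpha/m = {bonferroni:.3e}")
```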

