
Multiple Testing, Multiple Comparisons

Last update: 11 Dec 2024 10:04
First version: 23 November 2024

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \]

Bonferroni

As every school-child knows, statistical hypothesis tests have error rates. Say that the false positive rate of your test is $\alpha$; you have thought a lot about this and worked hard to ensure it is that low. This is fine if you run one test. But now suppose you run $m$ different tests. And further suppose that all your null hypotheses are true. (Maybe you are an evidence-based augur.) What's the probability that at least one of the tests reports a false positive? Well, by elementary probability, it's somewhere between $\alpha$ and $m\alpha$. (Both extremes imply a lot of probabilistic dependence between the test outcomes.) If $m$ is very large, then $m\alpha$ is a very bad error rate! This does however suggest an easy fix: re-jigger your individual tests so that they each have a false positive rate of $\alpha/m$, which keeps the over-all rate at or below $\alpha$. (If $m$ is large, $\alpha/m$ is a very small error rate.) This is "Bonferroni correction".
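To put numbers on that range, here is a minimal Python sketch (the function names are mine, purely for illustration). It computes the probability of at least one false positive among $m$ level-$\alpha$ tests, all nulls true, under three stylized dependence assumptions, and then applies the Bonferroni level $\alpha/m$ to a list of $p$-values.

```python
def fwer_scenarios(alpha, m):
    """P(at least one false positive) for m level-alpha tests, all nulls true,
    under three stylized (assumed) dependence structures."""
    return {
        # every test rejects or fails to reject together: the lower extreme
        "perfectly_dependent": alpha,
        # test outcomes are independent of one another
        "independent": 1.0 - (1.0 - alpha) ** m,
        # no two tests ever reject together: the upper extreme (capped at 1)
        "mutually_exclusive": min(1.0, m * alpha),
    }


def bonferroni_reject(p_values, alpha):
    """Reject a null only if its p-value is at most alpha / m."""
    level = alpha / len(p_values)
    return [p <= level for p in p_values]


if __name__ == "__main__":
    for name, value in fwer_scenarios(alpha=0.05, m=20).items():
        print(f"{name}: {value:.4f}")
    print(bonferroni_reject([0.001, 0.01, 0.04], alpha=0.05))
```

With $\alpha = 0.05$ and $m = 20$, the independent case already gives a better-than-even chance of at least one false positive, which is the motivation for shrinking the per-test level.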

There are lots of reasons you might find Bonferroni correction uncomfortable. In the first place, as every school-child knows, reducing the false positive rate of a test comes at the expense of increasing the false negative rate --- doing Bonferroni correction implies you lose (some) "power" to detect true positives. To some extent that's just the math's way of saying that you can't have every nice thing at once, but there are two further, more technical, issues with Bonferroni correction: it makes an extreme, implausible assumption about the probabilistic dependence between tests, and it guards against a really extreme notion of error. Both of these make it "conservative", i.e., contribute to its demand for very small false positive rates and so low power.

  1. Extreme dependence across tests: Recall that for any two events $A$ and $B$, $\Prob{A \cup B} = \Prob{A} + \Prob{B} - \Prob{A \cap B}$. (This is just avoiding double-counting.) So $\Prob{A \cup B} = \Prob{A} + \Prob{B}$ if, and only if, $\Prob{A \cap B} = 0$: $A$ and $B$ are either actually mutually exclusive, or at least they happen together only with probability 0. For simplicity, I will call this "being mutually exclusive", and take the except-on-a-set-of-probability-zero caveats as read. Bonferroni works by adding up the probabilities of false positives across tests, so it assumes positive outcomes across tests are mutually exclusive. This is a very strong, and often very implausible, sort of probabilistic dependence (*).
  2. An extreme notion of error: Recall that Bonferroni is worrying about the probability of making any errors at all, when all the null hypotheses are true. What if we can live with some false positives, so long as there aren't too many?

Let me try to bring the last point alive, by drawing a caricature of a common modern application of multiple hypothesis testing, namely functional magnetic resonance imaging (fMRI) in neuroscience. We rig up some big magnets in just such a way that they let us measure how much oxygen is being delivered by blood flow to someone's brain. Each measurement takes a certain amount of time, and is localized to a particular point in space. Our computer divides someone's brain up into voxels (="volume pixels", i.e., little blocky chunks), and we get measurements of the blood oxygen levels in each voxel, first as the person does one task (like counting backwards by sevens while not thinking of white elephants), and then in some other "condition" (like counting white elephants which appear in groups of eight). We take the contrast between conditions in each voxel, to see which parts of the brain are more active when, and test whether it's significantly different from zero. The voxels with significant differences tell us (hopefully; supposedly) something about what parts of the brain are involved in the task. This means we're doing (at least) as many hypothesis tests as there are voxels. For the first such data set I worked with, that was about sixteen thousand tests. Having one or two extra voxels included as responsive is probably fine! Yes, it'd be embarrassing to have false positives when there was really nothing going on (the deservedly-classic citation is Bennett et al., 2010), but if that's really a concern, you should be doing a better experiment and not wasting those very expensive magnets.

Benjamini-Hochberg

What I was calling "the probability of any false positives whatsoever" is known in the literature, for historical reasons, as the "family-wise error rate" (FWER). The classic paper by Benjamini and Hochberg (1995) introduced a different notion of error for multiple testing, the "false discovery rate" (FDR), "the expected proportion of errors among the rejected [null] hypotheses" (p. 292). That is, it's the proportion of all positive results, all "discoveries", which are, in fact, false. If we pick one of the things we think is a result, what's the probability that it's actually just noise? (In terms of my fMRI caricature: what's the probability that a randomly-selected active voxel is noise?)

If, in fact, all of the null hypotheses are true, we haven't really gained anything: in that setting, FDR=FWER. But otherwise, FDR $\leq$ FWER, so we can have very substantial gains in power, at the cost of allowing some false positives. (In terms of my fMRI caricature: we have more ability to notice faint genuine activations of voxels.)
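A short sketch of why both claims hold, in notation that is mine rather than anything defined above: let $V$ be the number of true null hypotheses that get rejected, let $R$ be the total number of rejections, and take the proportion of errors to be 0 when nothing is rejected. Then \[ \mathrm{FWER} = \Prob{V \geq 1}, \qquad \mathrm{FDR} = \mathbb{E}\left[ \frac{V}{\max(R, 1)} \right] \] Since $V \leq R$ always, \[ \frac{V}{\max(R, 1)} \leq \mathbf{1}\{V \geq 1\} \] and taking expectations of both sides gives FDR $\leq$ FWER. If every null hypothesis is true, then every rejection is a false one, so $V = R$, the ratio is exactly the indicator $\mathbf{1}\{V \geq 1\}$, and the inequality becomes an equality.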

Benjamini and Hochberg also gave a beautifully simple procedure to control the FDR. For each hypothesis test, calculate the $p$-value, and list those values in increasing order, so \( P_1 \leq P_2 \leq \ldots \leq P_n \). Say that the desired FDR is $\gamma$. Then find the largest $c$ where \[ P_{c} \leq \frac{c}{n}\gamma \] (If there is no such $c$, reject nothing.) Graphically: we draw a plot of index number $i$ versus $p$-value \( P_i \), and then draw a line through the origin with slope $\gamma/n$. The last point at which the curve of $p$-values lies on or below the line is our \( P_c \).

We now reject all and only the hypotheses where \( P_i \leq P_c \). The theorem Benjamini and Hochberg proved (p. 293) is that this guarantees the FDR is $\leq \gamma$, provided that the $p$-values are independent under the null. (They can have arbitrary dependence under the alternatives.)
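The procedure is short enough to write out. What follows is a minimal Python sketch under my own naming (`benjamini_hochberg`, `gamma`), not code from the paper; it sorts the $p$-values, finds the largest rank $c$ with \( P_c \leq c\gamma/n \), and rejects everything at or below that rank. The example $p$-values are made up for illustration.

```python
def benjamini_hochberg(p_values, gamma):
    """Return a list of booleans (True = reject), in the original order."""
    n = len(p_values)
    # indices of the hypotheses, ordered from smallest to largest p-value
    order = sorted(range(n), key=lambda i: p_values[i])
    c = 0  # largest rank whose p-value is under the line; 0 means "reject nothing"
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * gamma / n:
            c = rank  # keep going: a later rank may also qualify
    reject = [False] * n
    for idx in order[:c]:
        reject[idx] = True  # reject all hypotheses ranked at or below c
    return reject


if __name__ == "__main__":
    p = [0.9, 0.045, 0.03, 0.035]  # made-up p-values
    print(benjamini_hochberg(p, gamma=0.08))
```

In this example the smallest $p$-value, 0.03, is above its own threshold of $0.08/4 = 0.02$, so Bonferroni at level 0.08 would reject nothing; it is still rejected here because the second and third smallest $p$-values fall under the line.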

(TODO: sketch an argument that's not as complicated as the original B-H proof, because expanding their two pages of algebra to do all the steps will make undergrad eyes glaze over.)

Now at this point I should really describe the Devlin-Roeder "genomic control" idea, but that will have to wait for another time when my toddler is napping.

--- The further wrinkle I'm interested in, professionally, is situations where which hypotheses we test is contingent on the outcome of earlier tests...

--- Bonferroni correction is easy to transfer from multiple hypothesis testing to multiple confidence sets. It's less clear to me that transferring FDR control to confidence sets really makes sense.

*: Or is it? Bonferroni says that to achieve an over-all false positive rate of $\alpha$ with $m$ tests, each test's individual false positive rate should be $\alpha/m$. If the tests were themselves probabilistically independent, though, with false positive rate $p$, the probability that none of the individual tests give false positives is $(1-p)^m$, so we want to solve $1-\alpha = (1-p)^m$ for $p$, which yields $p=1-(1-\alpha)^{1/m} \approx \alpha/m$ when $\alpha$ is small. (Said differently, $(1-p)^m \approx 1-pm$ when $pm$ is very small.) --- In words, if we've got a large number of rare events, being independent is almost the same as being mutually exclusive, because it's really unlikely more than one of them happens. So if we want to get corrections which are much more optimistic than Bonferroni, mere independence across the tests isn't enough; they'll have to be positively associated.
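A quick numerical check of how close the two per-test levels are. This is a sketch with function names of my own choosing; the independence-based level $1-(1-\alpha)^{1/m}$ is what is usually called the Šidák correction.

```python
def bonferroni_level(alpha, m):
    """Per-test level from adding up probabilities: alpha / m."""
    return alpha / m


def sidak_level(alpha, m):
    """Per-test level solving 1 - alpha = (1 - p)^m, i.e. assuming independence."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


if __name__ == "__main__":
    alpha = 0.05
    for m in (2, 10, 100, 10_000):
        b, s = bonferroni_level(alpha, m), sidak_level(alpha, m)
        print(f"m={m:>6}  bonferroni={b:.3e}  independent={s:.3e}  ratio={s / b:.4f}")
```

At $\alpha = 0.05$ the independence-based level is never more than a few percent larger than $\alpha/m$, however many tests there are.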

