December 28, 2009

Significance, Power, and the Will to Believe

Attention conservation notice: 2100 words on parallels between statistical hypothesis testing and Jamesian pragmatism; an idea I've been toying with for a decade without producing anything decisive or practical. Contains algebraic symbols and long quotations from ancient academic papers. Also some history-of-ideas speculation by someone who is not a historian.

When last we saw the Neyman-Pearson lemma, we were looking at how to tell whether a data set x was signal or noise, assuming that we know the statistical distribution of noise (call it p) and that of signal (call it q). There are two kinds of mistake we can make here: a false alarm, saying "signal" when x is really noise, and a miss, saying "noise" when x is really signal. What Neyman and Pearson showed is that if we fix on a false alarm rate we can live with (a probability of mistaking noise for signal; the "significance level"), there is a unique optimal test which minimizes the probability of misses --- which maximizes the power to detect signal when it is present. This is the likelihood ratio test, where we say "signal" if and only if q(x)/p(x) exceeds a certain threshold picked to control the false alarm rate.
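
To make this concrete, here is a minimal numerical sketch of the recipe. The particular Gaussian choices (noise p = N(0,1), signal q = N(1,1)) and the 5% false alarm rate are my illustrative assumptions, not anything fixed by Neyman and Pearson:

```python
# A minimal sketch of the Neyman-Pearson likelihood ratio test.
# Assumptions (mine, for illustration): noise p is N(0,1), signal q is
# N(1,1), and we observe a single scalar x.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05                      # tolerated false-alarm rate (significance level)

# For these two Gaussians the likelihood ratio q(x)/p(x) is increasing
# in x, so thresholding the ratio is equivalent to thresholding x itself;
# the cutoff is the upper-alpha quantile of the noise distribution.
threshold = stats.norm(0, 1).ppf(1 - alpha)

noise = rng.normal(0, 1, 100_000)    # draws from p
signal = rng.normal(1, 1, 100_000)   # draws from q
print("false-alarm rate:", (noise > threshold).mean())   # should be ~ alpha
print("power:", (signal > threshold).mean())             # ~0.26 in this setup
```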

The Neyman-Pearson lemma comes from their 1933 paper; but the distinction between the two kinds of errors is clearly more fundamental. Where does it come from?

The first place Neyman and/or Pearson use it, that I can see, is their 1928 paper (in two parts), where it's introduced early and without any fanfare. I'll quote it, but with some violence to their notation, and omitting footnoted asides (from p. 177 of part I; "Hypothesis A" is what I'm calling "noise"):

Setting aside the possibility that the sampling has not been random or that the population has changed during its course, x must either have been drawn randomly from p or from q, where the latter is some other population which may have any one of an infinite variety of forms differing only slightly or very greatly from p. The nature of the problem is such that it is impossible to find criteria which will distinguish exactly between these alternatives, and whatever method we adopt two sources of error must arise:
  1. Sometimes, when Hypothesis A is rejected, x will in fact have been drawn from p.
  2. More often, in accepting Hypothesis A, x will really have been drawn from q.

In the long run of statistical experience the frequency of the first source of error (or in a single instance its probability) can be controlled by choosing as a discriminating contour, one outside which the frequency of occurrence of samples from p is very small — say, 5 in 100 or 5 in 1000. In the density space such a contour will include almost the whole weight of the field. Clearly there will be an infinite variety of systems from which it is possible to choose a contour satisfying such a condition....

The second source of error is more difficult to control, but if wrong judgments cannot be avoided, their seriousness will at any rate be diminished if on the whole Hypothesis A is wrongly accepted only in cases where the true sampled population, q, differs but slightly from p.

The 1928 paper goes on to say that, intuitively, it stands to reason that the likelihood ratio is the right way to accomplish this. The point of the 1933 paper is to more rigorously justify the use of the likelihood ratio (hence the famous "lemma", which is really not set off as a separate lemma...). Before unleashing the calculus of variations, however, they warm up with some more justification (pp. 295--296 of their 1933):
Let us now for a moment consider the form in which judgments are made in practical experience. We may accept or we may reject a hypothesis with varying degrees of confidence; or we may decide to remain in doubt. But whatever conclusion is reached the following position must be recognized. If we reject H0, we may reject it when it is true; if we accept H0, we may be accepting it when it is false, that is to say, when really some alternative Ht is true. These two sources of error can rarely be eliminated completely; in some cases it will be more important to avoid the first, in others the second. We are reminded of the old problem considered by LAPLACE of the number of votes in a court of judges that should be needed to convict a prisoner. Is it more serious to convict an innocent man or to acquit a guilty? That will depend upon the consequences of the error; is the punishment death or fine; what is the danger to the community of released criminals; what are the current ethical views on punishment? From the point of view of mathematical theory all that we can do is to show how the risk of the errors may be controlled and minimised. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator.
(Neither Laplace nor LAPLACE is mentioned in their 1928 paper.)

Let's step back a little bit to consider the broader picture here. We have a question about what the world is like --- which of several conceivable hypotheses is true. Some hypotheses are ruled out on a priori grounds, others because they are incompatible with evidence, but that still leaves more than one admissible hypothesis, and the evidence we have does not conclusively favor any of them. Nonetheless, we must choose one hypothesis for purposes of action; at the very least we will act as though one of them is true. But we may err just as much through rejecting a truth as through accepting a falsehood. The two mistakes are equally errors, but they are not the same error. In this situation, we are advised to pick a hypothesis based, in part, on which error has graver consequences.

This is precisely the set-up of William James's "The Will to Believe". (It's easily accessible online, as are summaries and interpretations; for instance, an application to current controversies by Jessa Crispin.) In particular, James lays great stress on the fact that what statisticians now call Type I and Type II errors are both errors:

There are two ways of looking at our duty in the matter of opinion, — ways entirely different, and yet ways about whose difference the theory of knowledge seems hitherto to have shown very little concern. We must know the truth; and we must avoid error, — these are our first and great commandments as would-be knowers; but they are not two ways of stating an identical commandment, they are two separable laws. Although it may indeed happen that when we believe the truth A, we escape as an incidental consequence from believing the falsehood B, it hardly ever happens that by merely disbelieving B we necessarily believe A. We may in escaping B fall into believing other falsehoods, C or D, just as bad as B; or we may escape B by not believing anything at all, not even A.

Believe truth! Shun error! — these, we see, are two materially different laws; and by choosing between them we may end by coloring differently our whole intellectual life. We may regard the chase for truth as paramount, and the avoidance of error as secondary; or we may, on the other hand, treat the avoidance of error as more imperative, and let truth take its chance. Clifford ... exhorts us to the latter course. Believe nothing, he tells us, keep your mind in suspense forever, rather than by closing it on insufficient evidence incur the awful risk of believing lies. You, on the other hand, may think that the risk of being in error is a very small matter when compared with the blessings of real knowledge, and be ready to be duped many times in your investigation rather than postpone indefinitely the chance of guessing true. I myself find it impossible to go with Clifford. We must remember that these feelings of our duty about either truth or error are in any case only expressions of our passional life. Biologically considered, our minds are as ready to grind out falsehood as veracity, and he who says, "Better go without belief forever than believe a lie!" merely shows his own preponderant private horror of becoming a dupe. He may be critical of many of his desires and fears, but this fear he slavishly obeys. He cannot imagine any one questioning its binding force. For my own part, I have also a horror of being duped; but I can believe that worse things than being duped may happen to a man in this world: so Clifford's exhortation has to my ears a thoroughly fantastic sound. It is like a general informing his soldiers that it is better to keep out of battle forever than to risk a single wound. Not so are victories either over enemies or over nature gained. Our errors are surely not such awfully solemn things. In a world where we are so certain to incur them in spite of all our caution, a certain lightness of heart seems healthier than this excessive nervousness on their behalf. At any rate, it seems the fittest thing for the empiricist philosopher.

From here the path to James's will to believe is pretty clear, at least in the form he advocated it, which is that of picking among hypotheses which are all "live"*, and where some choice must be made among them. What I am interested in, however, is not the use James made of this distinction, but simply the fact that he made it.

So far as I have been able to learn, no one drew this distinction between seeking truth and avoiding error before James, or if they did, they didn't make anything of it. (Even for Pascal in his wager, the idea that believing in Catholicism if it is false might be bad doesn't register.) Yet this is just what Neyman and Pearson were getting at, thirty-odd years later. There is no mention of James in these papers, or indeed of any other source. They present the distinction as though it were obvious, though eight decades of subsequent teaching experience shows it is anything but. Neyman and Pearson were very interested in the foundations of statistics, but seem to have paid no attention to earlier philosophers, except for the arguable case of Pearson's father Karl and his Grammar of Science (which does not seem to mention James). Yet there it is. It really looks like two independent inventions of the whole scheme for judging hypotheses.

My prejudices being what they are, I am much less inclined to think that James illuminates Neyman and Pearson than the other way around. James was, so to speak, arguing that we should trade significance — the risk of mistaking noise for signal — for power, finding some meaningful signal in what he elsewhere called the "blind molecular chaos" of the physical universe. Granting that there is a trade-off here, however, one has to wonder about how stark it really is (cf.), and whether his will-to-believe is really the best way to handle it. Neyman and Pearson suggest we should look for a procedure for resolving metaphysical questions which maximizes the ability to detect larger meanings for a given risk of seeing faces in clouds — and would let James and Clifford set their tolerance for that risk to their own satisfaction. Of course, any such procedure would have to squarely confront the fact that there may be no way of maximizing power against multiple alternatives simultaneously...
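
That last difficulty is easy to exhibit in a toy example of my own devising (not anything in James, or in Neyman and Pearson). Put alternatives on both sides of a N(0,1) null: the one-sided test most powerful against the rightward alternative is nearly powerless against the leftward one, and vice versa, so no single level-0.05 test maximizes power against both at once:

```python
# Toy illustration: with alternatives on both sides of the null N(0,1),
# no single level-0.05 test is most powerful against both of them.
from scipy import stats

alpha = 0.05
hi = stats.norm.ppf(1 - alpha)   # 'x > hi': most powerful against N(+1,1)
lo = stats.norm.ppf(alpha)       # 'x < lo': most powerful against N(-1,1)

for mu in (+1, -1):
    q = stats.norm(mu, 1)
    print(f"vs. N({mu:+d},1):  power of 'x > hi' = {1 - q.cdf(hi):.3f},"
          f"  power of 'x < lo' = {q.cdf(lo):.3f}")
# Each test has power ~0.26 against "its" alternative but only ~0.004
# against the other.
```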

The extension to confidence sets, consisting of all hypotheses not rejected by suitably powerful tests (per Neyman 1937), is left as an exercise for the reader.
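
For the impatient, a quick sketch of that exercise, under assumptions I have invented for illustration: with Gaussian data of known variance 1, collect every candidate mean that a level-0.05 two-sided z-test fails to reject, and the survivors form a 95% confidence set.

```python
# Confidence set by test inversion: keep every candidate mean that the
# level-0.05 two-sided z-test fails to reject. Data are simulated purely
# for illustration; the variance is taken as known and equal to 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.7, 1, size=50)            # sample with "unknown" mean
grid = np.linspace(-1, 3, 2001)            # candidate means to test
z = np.sqrt(len(x)) * (x.mean() - grid)    # z-statistic under each candidate
keep = np.abs(z) <= stats.norm.ppf(0.975)  # candidates not rejected
print("95% confidence set: [{:.3f}, {:.3f}]".format(
    grid[keep].min(), grid[keep].max()))
```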

*: As an example of a "dead" hypothesis, James gives believing in "the Mahdi", presumably Muhammad Ahmad ibn as-Sayyid Abd Allah. I'm not a Muslim, and those of my ancestors who were certainly weren't Mahdists, but this was still a "What do you mean 'we', white man?" moment in my first reading of the essay. To be fair, James gives me many fewer such moments than most of his contemporaries.

Manual trackback: Brad DeLong; Robo; paperpools (I am not worthy!)

Enigmas of Chance; Philosophy; Modest Proposals

Posted at December 28, 2009 00:08 | permanent link
