Attention conservation notice: 900 words of wondering what the scientific literature would look like if it were entirely a product of publication bias. Veils the hard-won discoveries of actual empirical scientists in vague, abstract, hyper-theoretical doubts, without alleging any concrete errors. A pile of skeptical nihilism, best refuted by going back to the lab.

I have been musing about the following scenario for several years now, without ever getting around to doing anything with it. Since it came up in conversation last month between talks in New York, now seems like as good a time as any to get it out of my system.

Imagine an epistemic community that seeks to discover which of a large
set of postulated phenomena actually happen. (The example I originally had in
mind was specific foods causing or preventing specific diseases, but it really
has nothing to do with causality, or observational versus experimental
studies.) Let's build a stochastic model of this. At each time step, an
investigator will draw a random candidate phenomenon from the pool, and conduct
an appropriately-designed study. The investigator will test the hypothesis
that the phenomenon exists, and calculate a *p*-value. Let's suppose
that this is all done properly (no dead fish here), so that
the *p*-value is uniformly distributed between 0 and 1 when the
hypothesis is false and the phenomenon does not exist. The investigator writes
up the report and submits it for publication.

What happens next depends on whether the phenomenon has entered the
published literature already or not. If it has, the new *p*-value is
allowed to be published. If it has not, the report is published if, and only
if, the *p*-value is < 0.05. This is the "file-drawer problem": finding
a lack of evidence for a phenomenon is publication-worthy only if people
thought it existed.
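A minimal simulation sketch of this publication rule may help fix ideas. The function names, the candidate pool, and the explicit 0.05 threshold are my illustrative choices, and the *p*-values are drawn uniformly on [0, 1] in anticipation of the assumption, made below, that none of the phenomena exist:

```python
import random

ALPHA = 0.05            # threshold for publishing a *first* report
literature = {}         # phenomenon -> list of published p-values

def submit(phenomenon, p_value):
    """File-drawer rule: a first report is published only if p < ALPHA;
    once a phenomenon is already in the literature, any p-value on it
    gets published."""
    if phenomenon in literature:
        literature[phenomenon].append(p_value)
        return True
    if p_value < ALPHA:
        literature[phenomenon] = [p_value]
        return True
    return False        # into the file drawer it goes

def one_step(candidate_pool):
    """One time step: an investigator draws a random candidate phenomenon,
    runs a properly-conducted study (so p is uniform on [0, 1] when the
    phenomenon does not exist), and submits the result."""
    phenomenon = random.choice(candidate_pool)
    p_value = random.random()
    return submit(phenomenon, p_value)
```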

The community combines the published *p*-values in some fashion
— reasonably exact solutions to this problem were devised by
R. A. Fisher and Karl Pearson in the 1930s, leading to Neyman's smooth test of
goodness of fit, but I have been told by a psychologist that "of course" one
should just use the median of the published *p*-values. Different rules
of combination will lead to slightly different forms of this model.
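For concreteness, one textbook combination rule is Fisher's: under the global null, minus twice the sum of the log *p*-values is chi-squared with 2*m* degrees of freedom, and the aggregate *p*-value is the upper tail of that distribution. A sketch, using scipy; nothing in the model requires this particular rule:

```python
from math import log
from scipy.stats import chi2

def fisher_combined_p(p_values):
    """Fisher's method: under the global null, -2 * sum(log p_i) is
    chi-squared with 2m degrees of freedom; the aggregate p-value is
    the upper-tail probability of that distribution."""
    statistic = -2.0 * sum(log(p) for p in p_values)
    return chi2.sf(statistic, df=2 * len(p_values))
```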

The last assumption of the model is that, sadly, *none* of the
phenomena the community is interested in exist. *All* of their null
hypotheses are, strictly speaking, true. Just as neutral models of evolution
are ones which have all sorts of evolutionary mechanisms except selection, this
is a model of the scientific process without discovery. Since, by assumption,
everyone does their calculations correctly and honestly, if we could look at
all the published and unpublished *p*-values they'd be uniformly
distributed between 0 and 1. But the first *published* *p*-value
for any phenomenon is uniformly distributed between 0 and 0.05. A full 2% of
initial announcements will have an impressive-seeming (nominal) significance
level of 10^{-3} or better.
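To spell out the arithmetic: since the first published *p*-value is uniform on [0, 0.05],

Pr(p ≤ 10^{-3} | p ≤ 0.05) = 10^{-3} / 0.05 = 0.02.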

Of course, when people try to replicate those initial findings,
*their* *p*-values will be distributed between 0 and 1. The
joint distribution of *p*-values from the initial study and *m*
attempts at replication will be a product of independent uniforms, one on [0,
0.05] and *m* of them on [0,1]. What follows from this will depend on
the exact rule used to aggregate individual studies, and on doing some
calculations I have never pushed through, so I will structure it as a series of
"exercises for the reader".

- Pick your favorite meta-analytic rule for aggregating *p*-values. (If you do not have a favorite rule, one will be issued to you.) What is the distribution of the aggregate *p*-value after *m* replications?
- Say that a phenomenon is dropped from the literature when its aggregate *p*-value climbs above 0.05. Find the probability of being dropped as a function of *m*.
- Say that the lifespan of a phenomenon is the number of replications it receives before being dropped from the literature. (Under any sensible aggregation rule, the probability of being dropped will tend towards 1 as *m* increases, so lifespans will be finite.) Find the distribution of lifespans. (A simulation sketch follows this list.)
- Let us take any field of inquiry; say, to be diplomatic, haruspicy. Surveying all the published claims of phenomena in its literature, how many replications have they survived? Does this look at all different from the distribution of lifespans under the neutral model? How much nudging of marginal results below the 5% threshold would be needed to account for the discrepancy? (After all, "the difference between 'significant' and 'not significant' is not itself statistically significant".) Does the literature, in other words, provide any evidence that the discipline knows anything at all?
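I have not pushed these calculations through either, but a Monte Carlo sketch of the lifespan exercise is easy to set up. Here I assume, purely for concreteness, that the community aggregates by Fisher's method and drops a phenomenon as soon as the aggregate *p*-value climbs above 0.05; the function names, the shared 0.05 threshold, and the convention of counting the fatal replication in the lifespan are all my illustrative choices, not part of the model:

```python
import random
from math import log
from scipy.stats import chi2

ALPHA = 0.05   # used both as the publication and the dropping threshold

def fisher_combined_p(p_values):
    """Aggregate p-values by Fisher's method (one rule among many)."""
    statistic = -2.0 * sum(log(p) for p in p_values)
    return chi2.sf(statistic, df=2 * len(p_values))

def lifespan():
    """Replications a nonexistent phenomenon survives, given that it got
    into the literature at all (so its first p-value is uniform on
    [0, ALPHA]); each replication's p-value is uniform on [0, 1]."""
    p_values = [random.uniform(0.0, ALPHA)]
    replications = 0
    while fisher_combined_p(p_values) <= ALPHA:
        p_values.append(random.random())
        replications += 1
    return replications

# Monte Carlo estimate of the lifespan distribution under the neutral model.
samples = [lifespan() for _ in range(10_000)]
print("mean lifespan:", sum(samples) / len(samples))
print("fraction dropped within 3 replications:",
      sum(s <= 3 for s in samples) / len(samples))
```

Whatever aggregation rule one substitutes here, the interesting comparison is between the simulated lifespan distribution and how long published claims actually survive in the field being scrutinized.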

Let me draw the moral. Even if the community of inquiry is both too clueless
to make any contact with reality and too honest to nudge borderline findings
into significance, so long as they can keep coming up with new phenomena to
look for, the mechanism of the file-drawer problem *alone* will
guarantee a steady stream of new results. There is, so far as I know,
no Journal of Evidence-Based Haruspicy filled, issue after issue,
with methodologically-faultless papers reporting the ability of sheep's livers
to predict the winners of sumo championships, the outcome of speed dates, or
real estate trends in selected suburbs of Chicago. But the difficulty can only
be that the evidence-based haruspices aren't trying hard enough, and some
friendly rivalry with
the plastromancers is
called for. It's true that none of these findings will last forever, but this
constant overturning of old ideas by new discoveries is just part of what makes
this such a dynamic time in the field of haruspicy. Many scholars will even
tell you that their favorite part of being a haruspex is the frequency with
which a new sacrifice overturns everything they thought they knew about
reading the future from a sheep's liver! We are very excited about the renewed
interest on the part of policy-makers in the recommendations of the mantic
arts...

**Update**, later that same day: I *meant* to mention
this classic paper on
the file-drawer problem, but forgot because I was writing at one in the
morning.

**Update**, yet later: sense-negating typo fixed, thanks
to Gustavo Lacerda.

*Manual trackback*: Wolfgang Beirl; Matt McIrvin's Steam-Operated World of Yesteryear; Idiolect; Cognition and Culture; Brad DeLong; Dynamic Ecology

Modest Proposals; Learned Folly; The Collective Use and Evolution of Concepts; Enigmas of Chance

Posted at November 16, 2010 01:30 | permanent link