So You Think You Have a Power Law — Well Isn't That Special?
Regular readers who care about such things — I think there are about
three of you — will recall that I
have long had a
thing about just how unsound many of the claims for the presence
of power law
distributions in real data are, especially those made by theoretical
physicists, who, with some honorable exceptions, learn nothing about data
analysis. (I certainly didn't.) I have even whined about how I should really
be working on a paper about how to do all this right, rather than merely
snarking in a weblog. As evidence that the age of wonders is not past
— and, more relevantly, that I have productive collaborators — this
paper is now loosed upon the world:
- Aaron Clauset, CRS, and M. E. J. Newman, "Power-law distributions in
empirical data", arxiv:0706.1062, with code available in Matlab and R;
forthcoming (2009) in SIAM Review
- Abstract: Power-law distributions occur in many situations of
scientific interest and have significant consequences for our understanding of
natural and man-made phenomena. Unfortunately, the empirical detection and
characterization of power laws is made difficult by the large fluctuations that
occur in the tail of the distribution. In particular, standard methods such as
least-squares fitting are known to produce systematically biased estimates of
parameters for power-law distributions and should not be used in most
circumstances. Here we describe statistical techniques for making accurate
parameter estimates for power-law data, based on maximum likelihood methods and
the Kolmogorov-Smirnov statistic. We also show how to tell whether the data
follow a power-law distribution at all, defining quantitative measures that
indicate when the power law is a reasonable fit to the data and when it is
not. We demonstrate these methods by applying them to twenty-four real-world
data sets from a range of different disciplines. Each of the data sets has been
conjectured previously to follow a power-law distribution. In some cases we
find these conjectures to be consistent with the data while in others the power
law is ruled out.
The paper is deliberately aimed at physicists, so we assume some things that
they know (like some of the mechanisms, e.g. critical fluctuations, which can
lead to power laws), and devote extra detail to things they don't but which
e.g. statisticians do know (such as how to find the cumulative distribution
function of a standard Gaussian). In particular, we refrained from making a
big deal about the need for
an error-statistical approach to
problems like this, but it
definitely shaped
our
thinking.
Aaron has
already posted
about the paper, but I'll do so myself anyway. Partly this is to help hammer
the message home, and partly this is because I am a basically negative and
critical man, and this sort of work gives me an excuse to vent my feelings of
spite under the pretense of advancing truth (unlike Aaron and Mark, who are
basically nice guys and constructive scholars).
Here are the take-home points, none of which ought to be news, but
which, taken together, would lead to a real change in the literature. (For
example, half or more of each issue of Physica A would disappear.)
- Lots of distributions give you straight-ish lines on a log-log
plot. True, a Gaussian or a Poisson won't, but lots of other things
will. Don't even begin to talk to me about log-log plots which you claim are
"piecewise linear".
- Abusing linear regression makes the baby Gauss
cry. Fitting a line to your log-log plot by least squares is a bad
idea. It generally doesn't even give you a probability distribution, and even
if your data do follow a power-law distribution, it gives you a bad estimate of
the parameters. You cannot use the error estimates your regression software
gives you, because those formulas incorporate assumptions which directly
contradict the idea that you are seeing samples from a power law. And no, you
cannot claim that because the line "explains" (really, describes) a lot of the
variance that you must have a power law, because you can get a very high R^2
from other distributions (that test has no "power"). And this is without
getting into the additional errors caused by trying to fit a line to binned
histograms.
It's true that fitting lines on log-log graphs is what Pareto did back in
the day when he started this whole power-law business, but "the day" was
the 1890s. There's a time and a place for being old school; this
isn't it. (The second sketch after this list shows how badly the regression
does, even on genuine power-law data.)
- Use maximum likelihood to estimate the scaling exponent.
It's fast! The formula is easy! Best of all, it works! The method
of maximum likelihood was invented in 1922
[parts 1
and 2],
by someone
who studied statistical mechanics, no less. The maximum likelihood estimators
for the discrete (Zipf/zeta) and continuous (Pareto) power laws were worked out
in 1952
and 1957 (respectively). They converge on the correct value of the scaling
exponent with probability 1, and they do so efficiently. You can even work out
their sampling distribution (it's
an inverse
gamma) and so get exact confidence intervals. Use the MLEs! (A sketch of the
estimator and its standard error follows the list.)
- Use goodness of fit to estimate where the scaling region
begins. Few people pretend that the whole of their data-set
follows a power law distribution; usually the claim is about the right or upper
tail, the large values over some given threshold. This ought to raise the
question of where the tail begins. Usually people find it by squinting at
their log-log plot. Mark Handcock and James Jones,
in one of the few
worthwhile efforts here, suggested using Schwarz's
information criterion. This isn't bad, but has trouble with
continuous data. Aaron devised an ingenious method which finds the
empirically-best scaling region, by optimizing the Kolmogorov-Smirnov
goodness-of-fit statistic; it performs slightly better than the information
criterion. (This recipe is sketched after the list.)
(Yes, one could imagine more elaborate semi-parametric approaches to this
problem. Feel free to go ahead and implement them.)
- Use a goodness-of-fit test to check goodness of fit. In
particular, if you're looking at the goodness of fit of
a distribution, use a statistic meant for distributions, not
one for regression curves. This means forgetting about R^2, the fraction of
variance accounted for by the curve, and using the Kolmogorov-Smirnov
statistic, the maximum discrepancy between the empirical distribution and the
theoretical one. If you've got the right theoretical distribution, the KS
statistic will converge to zero as you get more data (that's the
Glivenko-Cantelli
theorem). The one hitch in this case is that you can't use the usual
tables/formulas for significance levels, because you're estimating the
parameters of the power law from the data. (If you really want to see where
the problem comes from,
see Pollard,
starting on p. 99.) This is why God, in Her wisdom and mercy, gave
us the
bootstrap. (A stripped-down version of the bootstrap test also appears
after the list.)
If the chance of getting data which fits the estimated distribution as
badly as your data fits your power law is, oh, one in a thousand or less, you
had better have some other, very compelling reason to think that
you're looking at a power law.
- Use Vuong's test to check alternatives, and be prepared for
disappointment. Even if you've estimated the parameters of your
power law properly, and the fit is decent, you're not done yet. You also need
to see whether other, non-power-law distributions could have produced
the data. This is a
model selection
problem, with the complication that possibly neither the power law nor the
alternative you're looking at is exactly right; in that case you'd at least
like to know which one is closer to the truth. There is a brilliantly simple
solution to this problem (at least for cases like this) which was first devised
by Quang Vuong in a 1989 Econometrica
paper: use the log-likelihood ratio, normalized by an estimate of the
magnitude of the fluctuations in that ratio. Vuong showed that this test
statistic asymptotically has a standard Gaussian distribution when the
competing models are equally good; otherwise it will almost surely converge on
picking out the better model. This is extremely clever and deserves to be much
better known. And, unlike things like the fit to a log-log regression line, it
actually has the power to discriminate among the alternatives. (It, too, is
sketched after the list.)
If you use sensible, heavy-tailed alternative distributions, like the
log-normal or the Weibull (stretched exponential), you will find that it is
often very, very hard to rule them out. In the two dozen data sets we looked
at, all chosen because people had claimed they followed power laws, the
log-normal's fit was almost always competitive with the power law, usually
insignificantly better and sometimes substantially better.
(To repeat a joke: Gauss is not
mocked.)
For about half the data sets, the fit is substantially improved by adding
an exponential cut-off to the power law. (I'm too lazy to produce the
necessary equations; read the paper.) This means that there is a
characteristic scale after all, and that super-mega-enormous, as opposed to
merely enormous, events are, indeed, exponentially rare. Strictly speaking, a
cut-off power law should always fit the data better than a pure one
(just let the cut-off scale go to infinity, if need be), so you need to be a
little careful in seeing whether the improvement is real or just noise; but
often it's real.
- Ask yourself whether you really care. Maybe you
don't. A lot of the time, we
think, all that's genuinely important is that the tail is heavy, and it
doesn't really matter whether it decays linearly in the log of the variable
(power law) or quadratically (log-normal) or something else. If that's all
that matters, then you should really consider doing some kind of non-parametric
density estimation
(e.g. Markovitch and Krieger's
[preprint]).
Sometimes, though, you do care. Maybe you want to make a claim which
depends heavily on just how common hugely large observations are. Or maybe you
have a particular model in mind for the data-generating process, and that model
predicts some particular distribution for the tail. Then knowing whether it
really is a power law, or closer to a power law than (say) a stretched
exponential, actually matters to you. In that case, you owe it to yourself to
do the data analysis right. You also owe it to yourself to think carefully
about whether there are other ways of checking your model. If the
only testable prediction it makes is about the shape of the tail,
it doesn't sound like a very good model, and it will be intrinsically
hard to check it.
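Since several of these points are easier to show than to tell, here are some
minimal sketches in Python. They are mine, cooked up for this post; they are
not the Matlab/R code that accompanies the paper, and they cut corners, which
I'll flag as we go. First, the log-log-plot point: samples from a log-normal,
which has no power-law regime anywhere, still trace out a straight-ish line
with an embarrassingly high R^2.

```python
# A hypothetical demo (not the paper's code): log-normal data, which is not
# a power law anywhere, still looks "straight" on log-log axes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = np.sort(rng.lognormal(mean=0.0, sigma=2.0, size=50_000))

# Empirical CCDF, the curve people eyeball on log-log plots.
ccdf = 1.0 - np.arange(1, x.size + 1) / x.size
tail = (x > np.median(x)) & (ccdf > 0)   # upper tail only, logs defined

fit = stats.linregress(np.log(x[tail]), np.log(ccdf[tail]))
print(f"least-squares 'exponent': {-fit.slope:.2f}, R^2: {fit.rvalue**2:.3f}")
# R^2 comes out very high (typically above 0.9), despite no power law at all.
```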
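Second, the regression abuse. Even when the data really do come from a power
law, least squares on a binned, zero-stripped log-log histogram (the naive
choices below are depressingly common) returns a distorted exponent, while
the one-line MLE does not.

```python
# My sketch, not the paper's code: regression vs. MLE on true power-law data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha_true, xmin = 2.5, 1.0
# Draw from a continuous power law by inverting the CDF; 1-u lies in (0, 1].
u = rng.random(5_000)
x = xmin * (1.0 - u) ** (-1.0 / (alpha_true - 1.0))

# The classic abuse: linear bins, drop the empty ones, regress the logs.
counts, edges = np.histogram(x, bins=50)
mids = 0.5 * (edges[:-1] + edges[1:])
nonzero = counts > 0
bad = stats.linregress(np.log(mids[nonzero]), np.log(counts[nonzero]))
print(f"regression estimate of alpha: {-bad.slope:.2f}")  # usually way off

# Maximum likelihood, one line (see the next sketch):
alpha_hat = 1.0 + x.size / np.sum(np.log(x / xmin))
print(f"MLE of alpha: {alpha_hat:.2f}   (truth: {alpha_true})")
```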
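Third, the MLE itself, continuous (Pareto) case, together with its asymptotic
standard error. The discrete (Zipf/zeta) case needs a slightly different
formula; see the paper.

```python
# Minimal sketch of the continuous-case MLE; function and variable names
# are mine, not the paper's.
import numpy as np

def powerlaw_mle(x, xmin):
    """MLE of alpha for a continuous power law on the tail x >= xmin."""
    tail = x[x >= xmin]
    n = tail.size
    alpha = 1.0 + n / np.sum(np.log(tail / xmin))
    stderr = (alpha - 1.0) / np.sqrt(n)   # large-n standard error
    return alpha, stderr

# Usage, with `data` a 1-D numpy array and x_min already chosen:
#   alpha, se = powerlaw_mle(data, xmin)
#   print(f"alpha = {alpha:.3f} +/- {1.96 * se:.3f}  (rough 95% interval)")
```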
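Fourth, my paraphrase of Aaron's recipe for the start of the scaling region:
scan candidate thresholds, fit the exponent above each one by maximum
likelihood, and keep the threshold whose fitted power law comes closest, in
KS distance, to the empirical distribution of its tail. (The brute-force
loop below is quadratic in the sample size; fine for a sketch, not for your
Web crawl.)

```python
# Sketch of the KS-minimization recipe; details of the scan are mine.
import numpy as np

def ks_distance(tail, xmin, alpha):
    """KS distance between the tail's empirical CDF and the fitted CDF."""
    tail = np.sort(tail)
    n = tail.size
    fitted = 1.0 - (tail / xmin) ** (-(alpha - 1.0))  # power-law CDF
    hi = np.arange(1, n + 1) / n    # ECDF just after each data point
    lo = np.arange(0, n) / n        # ECDF just before each data point
    return max(np.max(hi - fitted), np.max(fitted - lo))

def find_xmin(x):
    """Pick x_min minimizing the KS distance; returns (distance, xmin, alpha)."""
    x = np.asarray(x, dtype=float)
    best = (np.inf, None, None)
    for xmin in np.unique(x)[:-10]:   # insist on at least ~10 tail points
        tail = x[x >= xmin]
        alpha = 1.0 + tail.size / np.sum(np.log(tail / xmin))
        d = ks_distance(tail, xmin, alpha)
        if d < best[0]:
            best = (d, xmin, alpha)
    return best
```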
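Fifth, a stripped-down version of the bootstrap goodness-of-fit test:
simulate from the fitted power law, refit each simulated sample, and count
how often the simulations fit worse than the data. The paper's full recipe
also re-estimates x_min on every replicate and resamples the body of the
data below the threshold; holding x_min fixed, as here, makes the test
somewhat too easy on the power law when the threshold was itself tuned to
the data.

```python
# Simplified parametric bootstrap for the KS goodness-of-fit p-value (mine).
import numpy as np

def ks_distance(tail, xmin, alpha):   # redefined so this sketch runs alone
    tail = np.sort(tail)
    n = tail.size
    fitted = 1.0 - (tail / xmin) ** (-(alpha - 1.0))
    return max(np.max(np.arange(1, n + 1) / n - fitted),
               np.max(fitted - np.arange(0, n) / n))

def gof_pvalue(tail, xmin, alpha, n_boot=1000, seed=0):
    """Fraction of simulated power-law samples fitting worse than the data."""
    rng = np.random.default_rng(seed)
    n = tail.size
    d_obs = ks_distance(tail, xmin, alpha)
    worse = 0
    for _ in range(n_boot):
        u = rng.random(n)
        sim = xmin * (1.0 - u) ** (-1.0 / (alpha - 1.0))  # inverse-CDF draw
        a_sim = 1.0 + n / np.sum(np.log(sim / xmin))      # refit each draw
        worse += ks_distance(sim, xmin, a_sim) >= d_obs
    return worse / n_boot   # p of 1/1000 or less: find another story
```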
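Finally, Vuong's test against a log-normal fitted to the same tail. One
honest caveat: in the paper the alternatives are fitted truncated at x_min,
which takes numerical optimization; the plain scipy fit below is a stand-in,
so treat its numbers as illustrative. (And, for the curious, the cut-off
power law mentioned above just multiplies the power-law density by a factor
of exp(-lambda x); the paper has the normalization.)

```python
# Sketch of Vuong's normalized log-likelihood-ratio test (my simplification).
import numpy as np
from scipy import stats

def vuong_vs_lognormal(tail, xmin, alpha):
    """Vuong's test statistic and p-value, power law vs. log-normal."""
    n = tail.size
    # Pointwise log-likelihoods under the fitted power law...
    ll_pl = np.log(alpha - 1.0) - np.log(xmin) - alpha * np.log(tail / xmin)
    # ...and under a log-normal fitted by maximum likelihood (loc pinned at 0).
    shape, loc, scale = stats.lognorm.fit(tail, floc=0)
    ll_ln = stats.lognorm.logpdf(tail, shape, loc, scale)
    d = ll_pl - ll_ln
    v = d.sum() / (np.sqrt(n) * d.std(ddof=1))  # ~ N(0,1) if the models tie
    p = 2.0 * stats.norm.sf(abs(v))             # two-sided p-value
    return v, p   # v > 0 leans power law, v < 0 leans log-normal;
                  # a big p means the data can't tell them apart
```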
Because this is, of course, what everyone ought to do with a computational
paper, we've put our code online, so you can
check our calculations, or use these methods on your own data, without having to
implement them from scratch. I trust that I will no longer have to referee
papers where people use GnuPlot to draw
lines on log-log graphs, as though that meant something, and that in five to
ten years even science journalists and editors
of Wired will begin to get the message.
Manual trackbacks: The Statistical Mechanic; Uncertain Principles; zs;
LanguageLog; Science After Sunclipse; Philosophia Naturalis 11 (at Highly
Allochthonous); Langreiter; blogs for industry ... blogs for the dead;
Infectious Greed; No Free Lunch; Look Here First; TPMCafe; Cosmic Variance;
Messy Matters; AI and Social Science

Power Laws; Enigmas of Chance; Complexity
Posted at June 15, 2007 13:00 | permanent link