## Complexity, Entropy and the Physics of `gzip`

*02 Dec 2003 13:38*

Recently, a lot of papers have been published in the physics literature
where people try to use standard file-compression algorithms, particularly the
`gzip` implementation of the Lempel-Ziv algorithm, to estimate the
algorithmic information content
(a.k.a. Kolmogorov complexity) of data-sources, and/or their entropy rates.
These are both very bad ideas. [I will add references to the offending papers
later.] My apologies, before we go any further, to the actual "complexity,
entropy and physics of information" group, which knows better than this!

I wrote a dead-simple program which uses an additive pseudo-random number
generator to produce a sequence of ASCII characters, either 0 or 1. The size
of the program, compiled on my desktop Linux box, was 15574 bytes. This
includes a particular *fixed* seed for the random number generator. The
program took one input, which was the number of characters N to output, which
can be encoded in (approximately) log(N) bits. So the algorithmic information
content of the output is at most 15574 + (1/8)log(N) bytes. (All logarithms
are of course to base 2.)
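The post doesn't reproduce the program itself, so here is a minimal Python sketch of the same idea: an additive (lagged-Fibonacci) generator with a hard-coded seed, emitting N ASCII characters that are each '0' or '1'. The lags, modulus, and seeding procedure below are my own choices for the sketch, not those of the original program.

```python
# Illustrative stand-in for the generator described above: an additive
# pseudo-random number generator with a fixed, built-in seed, producing
# n ASCII characters, each '0' or '1'.

def generate(n, lags=(24, 55), modulus=2**32):
    """Return n pseudo-random '0'/'1' characters from an additive generator."""
    j, k = lags
    # Fill the fixed seed state using a small linear congruential generator.
    x, state = 123456789, []
    for _ in range(k):
        x = (1103515245 * x + 12345) % modulus
        state.append(x)
    out = []
    for _ in range(n):
        x = (state[-j] + state[-k]) % modulus  # additive (lagged-Fibonacci) step
        state.append(x)
        state.pop(0)
        out.append("1" if x >> 31 else "0")    # take the high-order bit
    return "".join(out)

print(generate(40))
```

Compiled C is of course a different size accounting than Python source; the point is only that such a program is a short, fixed-size description of an arbitrarily long output.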

Having generated pseudo-random strings from this program, I then ran
`gzip` on them. One would hope for a compression ratio of approximately
1/8, at least for large strings, because ASCII uses eight bits per character,
whereas there are really only two possibilities in the output. (A better
comparison would've been for me to write directly to a binary file, without
using ASCII, but I'm lazy; I'll get to that later.) The original data can then
be regenerated from the combination of the compressed file and the
`gunzip` program. On my system, that program is 53716 bytes in size.
So the "zipper" estimate of the algorithmic information content of the output
of my pseudo-random number generator is as follows. (I have rounded all
fractions up to the next whole byte, to try to reduce the advantage of the
actual generating program.)

| N | gzip (bytes) | generator (bytes) |
|---:|---:|---:|
| 10^1 | 53,760 | 15,575 |
| 10^2 | 53,793 | 15,575 |
| 10^3 | 53,959 | 15,576 |
| 10^4 | 55,409 | 15,576 |
| 10^5 | 69,686 | 15,577 |
| 10^6 | 212,641 | 15,577 |
| 10^7 | 1,643,233 | 15,577 |
| 10^8 | 15,949,691 | 15,578 |

I didn't run it beyond that size, because my sysadmins would not be especially happy with my producing a gigabyte of garbage to prove this point; some other time.
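The measurement above can be sketched as follows, assuming Python's `gzip` module as a stand-in for the command-line tool and its Mersenne Twister `random.Random` as a stand-in for the additive generator; exact byte counts will differ from the table.

```python
# Compress N pseudo-random ASCII '0'/'1' characters and report the "zipper"
# estimate: compressed length plus the size of the gunzip binary (53,716
# bytes on the author's system), which is needed to regenerate the data.
import gzip
import random

GUNZIP_SIZE = 53_716  # decompressor size, counted as part of the description

rng = random.Random(0)  # fixed seed, in the spirit of the original experiment
for exponent in range(1, 6):
    n = 10 ** exponent
    data = "".join(rng.choice("01") for _ in range(n)).encode("ascii")
    estimate = GUNZIP_SIZE + len(gzip.compress(data))
    print(f"N = 10^{exponent}: zipper estimate = {estimate:,} bytes")
```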

What this shows is that, even with a hundred million data points,
`gzip` is not very good *at all* at finding a short compression
of what is, after all, a linear dynamical system with some fairly simple
nonlinearities added on top. As an estimator of the Kolmogorov complexity, it
sucks rocks.

Another purported use of `gzip` is to estimate the entropy rate of
the source. This is done either by dividing the length of the `gzip`ped
output by N (call this the division method), or by taking the incremental
increase in the length of the compressed file when N increases by 1 (call this
the marginal method). To be fair to the marginal method, one should really look at
the increase for an additional m characters, divided by m, so I do this for m =
10, m = 100 and m = 1000. My pseudo-random number generator is periodic, so
its entropy rate is exactly 0. Here is how both of these methods ("division"
and "marginal", respectively) did at estimating 0 bytes per character (to three
decimal places, rounded):

| N | division | marginal (m=10) | marginal (m=100) | marginal (m=1000) |
|---:|---:|---:|---:|---:|
| 10^1 | 4.400 | 1.000 | 0.400 | 0.206 |
| 10^2 | 0.770 | 0.600 | 0.250 | 0.188 |
| 10^3 | 0.243 | 0.500 | 0.210 | 0.175 |
| 10^4 | 0.169 | 0.500 | 0.210 | 0.163 |
| 10^5 | 0.160 | 0.400 | 0.200 | 0.166 |
| 10^6 | 0.159 | 0.100 | 0.200 | 0.165 |
| 10^7 | 0.159 | 0.600 | 0.180 | 0.163 |
| 10^8 | 0.159 | 0.200 | 0.160 | 0.161 |
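The two estimators can be sketched as follows, again using Python's `gzip` and a stand-in pseudo-random source; seeding the generator makes the length-(N+m) string an extension of the length-N string, as the marginal method requires.

```python
# "Division" estimate: compressed length over N. "Marginal" estimate:
# extra compressed length for m more characters, divided by m.
# Both are in bytes per character; the true entropy rate here is 0.
import gzip
import random

def bits_string(n, seed=0):
    rng = random.Random(seed)  # stand-in for the periodic generator
    return "".join(rng.choice("01") for _ in range(n))

def gz_len(s):
    return len(gzip.compress(s.encode("ascii")))

def division_estimate(n):
    return gz_len(bits_string(n)) / n

def marginal_estimate(n, m):
    return (gz_len(bits_string(n + m)) - gz_len(bits_string(n))) / m

print(division_estimate(10**5))        # compare the ~0.159 in the table
print(marginal_estimate(10**5, 1000))
```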

Note that even if one pretended this was a genuine source of independent,
identically-distributed coin flips, the entropy rate should be 0.125 bytes per
character, not approximately 0.16 bytes. It is exceedingly improbable that
I've generated a sample which happens to look like it has an entropy rate that
high. (Back-of-the-envelope large
deviations calculation: the entropy of the empirical distribution relative
to the true one would be on the order of 0.159-0.125 = 0.034 bytes per symbol,
or 0.272 bits, so the probability would be approximately 2^(-0.272 N) =
10^(-0.082 N) for large N. This is a little dodgy because I should really
search for the relative-entropy-minimizing distribution, but, again, this is a
first cut.) Now, Lempel-Ziv did prove that, asymptotically, the compressed
length divided by the number of characters will be no larger than the entropy
rate. So the division method, at least, should asymptote to the entropy rate,
and you can persuade yourself that this means the marginal method then should
converge too. But the key word here, again and again,
is *asymptotically*: we have here a hundred million data points, but are
evidently very, very far from the asymptote. I for one do not find it
plausible that vastly smaller empirical data sets reach the asymptotic
regime.
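For what it's worth, the honest Sanov-style version of that bound is easy to set up. Note first that 0.159 bytes is about 1.27 bits per character, more than the 1 bit a binary alphabet can deliver, so no empirical distribution can actually look that random; the gzip figure must include coder overhead. For a hypothetical measured rate h below 1 bit, the bound is roughly 2^(-N D), where D is the smallest relative entropy D(q || 1/2) over Bernoulli distributions q with H(q) >= h, which works out to exactly 1 - h bits:

```python
# Sanov-style bound on the chance that fair coin flips look like they have
# entropy rate at least h bits/char: P ~ 2^(-N*D), with D the minimum of
# D(q || 1/2) = 1 - H2(q) over Bernoulli q with H2(q) >= h, i.e. D = 1 - h.
from math import log2

def H2(q):
    """Binary entropy in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * log2(q) - (1 - q) * log2(1 - q)

def bernoulli_with_entropy(h, tol=1e-12):
    """Bisect for the q in [0, 1/2] with H2(q) = h (H2 is increasing there)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if H2(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

h = 0.9                       # hypothetical measured entropy, bits per character
q = bernoulli_with_entropy(h)
D = 1 - H2(q)                 # exponent of the large-deviations bound, in bits
print(f"q = {q:.4f}, D = {D:.4f} bits, so P ~ 2^(-{D:.4f} N)")
```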

There are, of course, other ways of measuring quantities like the entropy rate. As it happens, I was involved in writing a program (CSSR) for building hidden Markov models from sequential data, and part of the output of the program is a calculation of the entropy rate. Here is what CSSR gives when it's configured to just fit a simple Markov chain to the data (Lmax = 1); results aren't notably different if I allow it to seek higher-order models.

| N | CSSR |
|---:|---:|
| 10^1 | 0.118682 |
| 10^2 | 0.123229 |
| 10^3 | 0.124990 |
| 10^4 | 0.124999 |
| 10^5 | 0.125000 |
| 10^6 | 0.125000 |
| 10^7 | 0.125000 |
| 10^8 | 0.125000 |

The point, of course, is not that Lempel-Ziv is a bad compression algorithm --- it's not! --- but that it was never designed to be used to estimate entropy rates (which can be done better using other tools) or algorithmic complexities (which in general cannot be done). The best screwdriver in the world makes a very bad butter-knife.

A possible objection is that a pseudorandom number generator is, of course,
a deliberately hard case, and gives a much more convincing appearance of having
no structure than one might expect from typical data sets. So, at some convenient
later time, I'm going to try to supplement these (elementary) results with
ones from some non-trivial dynamical system, and show that `gzip`
doesn't do any better there, either.

**Update, 2 December 2003:** There has been a surprisingly
large (and positive) response to this. Matthew Berryman points out that
`gzip` includes some metadata in its compressed files, which presumably
should be discounted; very true, but it won't affect the results substantially.
More seriously, he reminds me that the Lempel-Ziv convergence result requires
not just that the length of the data set grow without bound, but that the
length of the phrases into which it is parsed must be unbounded, and that
`gzip` is implemented with a fixed maximum phrase length, considerably
smaller than 10^8. I hadn't realized there was a limit on the phrase length,
and suspect this is why the large-N entropy-rate estimate is so bad. A different
implementation of Lempel-Ziv, without that limit, would do better at that, but
that doesn't help the published papers using `gzip`, and it's still a
dumb thing to do. Derek Abbott suggests that the whole issue could be avoided
by replacing Shannon entropy with gz entropy, defined as the result of applying
`gzip`. This elegant suggestion "has all the advantages of theft
over honest toil", and accordingly I urge Derek to submit it to
Physical Review Letters before somebody beats him to this prize.
Kirk Stork wants to know what kind of results I'd get on the random bytes
available from random.org, produced by
sampling actual atmospheric noise; I'll post those here once I do that.
Several people suggested using the logistic map at parameter settings where the
true entropy rate is known (to some accuracy); this will be easy. Finally, a
number of people, whose names I elide out of courtesy, suggested replicating
this analysis on the output of the postmodernism
generator.
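As a preview of the logistic-map experiment suggested above, here is one sketch: at r = 4 the map x → 4x(1-x) has Kolmogorov-Sinai entropy of exactly 1 bit per iteration, and thresholding at x = 1/2 is a generating partition, so the binary symbol sequence has a known entropy rate of 0.125 bytes per symbol. (The starting point below is arbitrary, and long double-precision orbits of this map deserve a grain of salt.)

```python
# Binary symbols from the fully chaotic (r = 4) logistic map, whose true
# entropy rate is exactly 1 bit = 0.125 bytes per symbol, compared with
# gzip's bytes-per-symbol estimate of it.
import gzip

def logistic_symbols(n, x=0.3141592653589793):
    """Threshold n iterates of x -> 4x(1-x) at 1/2 to get a '0'/'1' string."""
    out = []
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
        out.append("1" if x >= 0.5 else "0")
    return "".join(out)

n = 10**5
s = logistic_symbols(n)
ratio = len(gzip.compress(s.encode("ascii"))) / n
print(f"gzip estimate: {ratio:.3f} bytes/char; true rate: 0.125")
```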

**Update, 16 April 2015**: I should of course have given
references to actual papers making these points. The first which come to mind
are Khmelev and Teahan's "On an Application of Relative Entropy"
(arxiv:cond-mat/0205521)
and Goodman's "Extended Comment on 'Language Trees and Zipping'"
(arxiv:cond-mat/0202383).

*See also*: Estimating Entropies
and Informations