Notebooks

Algorithmic Information Theory

Last update: 08 Dec 2024 01:16
First version: 5 May 2015

Fix your favorite universal computer \( U \). Now consider any string of symbols \( x \), or anything which we can describe by means of a string of symbols. Among all the programs which can be run on \( U \), there will be some which will produce \( x \) and then stop. There will therefore be a minimum length for such programs. Call this length \( K_U(x) \). This is the "algorithmic information content" of \( x \) (relative to \( U \)), or its Kolmgorov (-Chaitin-Solomonoff) complexity.

Note that one program which could produce \( x \) would be
print(\( x \)); stop;
(or its equivalent in the language of \( U \)), so \( K_U(x) \) can't be much bigger than \( |x| \), the length of \( x \). Sometimes, clearly, \( K_U(x) \) can be much smaller than \( x \): if \( x \) just consists of the same symbol repeated over and over, say 0, we could use the program for (i in 1:n) { print("0"); }; stop;
and give it n=\( |x| \), for a total length of a constant (the loop-and-print bit) plus \( \log{|x|} \) (the number of symbols needed to write the length of \( x \)). While precise values will, irritatingly, be relative to the universal computer \( U \), because the computer is universal, it can emulate any other computer \( V \) with a finite-length program which we might call \( K_U(V) \). Hence \[ K_U(x) \leq K_V(x) + K_U(V), \] and algorithmic information content differs by at most an additive, string-independent constant across computers.

It turns out that the quantity \( K_U(x) \) shares many formal properties with the (Shannon) entropy of a random variable, typically written \( H[X] \), so that one can define analogs of all the usual information-theoretic quantities in purely algorithmic terms. Along these lines, we say that \( x \) is "incompressible" if \( K_U(x) \approx |x| \). What makes this mathematically interesting is that incompressible sequences turn out to provide a model of sequences of independent and identically distributed random variables, and, vice versa, the typical sample path of an IID stochastic process is incompressible. This generalizes: almost every trajectory of an ergodic stochastic process has a Kolmogorov complexity whose growth rate equals its entropy rate (Brudno's theorem). This, by the way, is why I don't think Kolmogorov complexity makes a very useful complexity measure: it's maximal for totally random things! But Kolmogorov complexity would be a good measure of (effective) stochasticity, if we could actually calculate it.

Unfortunately, we cannot calculate it. There is a lovely little proof of this, due to Nohre, which I learned of from a paper by Rissanen, and can't resist rehearsing. (There are older proofs, but they're not so pretty.) Suppose there was a program \( P \) which could read in an object \( x \) and return its algorithmic information content relative to some universe computer, so \( P(X) = K_U(x) \). We now use \( P \) to construct a new program \( V \) which compresses incompressible objects.

  1. Sort all sequences by length, and then alphabetically.
  2. For the \( i^{\mathrm{th}} \) sequence \( x^i \), use \( P \) to find \( K_U(x^i) \).
  3. If \( K_U(x^i) \leq |V| \), then keep going.
  4. Otherwise, set \( z \) to \( x^i \), return \( z \), and stop.
\( V \) outputs \( z \) and stops, so \( K_U(z) \leq |V|\), but, by construction, \( K_U(z) > |V| \), a contradiction. The only way out of this contradiction is to deny the premise that \( P \) exists: there is no algorithm which calculates the Kolmogorov complexity of an arbitrary string. You can in fact strengthen this to say that there is no algorithm which approximates \( K_U \). (In particular, you can't approximate it with gzip.) \[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \]

Expected Kolmogorov complexity vs. Shannon entropy Up to (unimportant) additive constants, \( \Expect{K_U(X)} = H[X] \). To prove this, start with the fact that \( H[X] \) minimizes the expected code length for any coding scheme for \( X \). That is, pick any invertible mapping \( c \) from the values of \( X \), \( \mathcal{X} \), to binary sequences. Say \( \ell(c(x)) \) is the length of the binary encoding of \( x \). Then \( \Expect{\ell(c(X))} \) is the expected code length, and (i) \( \Expect{\ell(c(X))} \geq H[X] \), and (ii) there are coding schemes which come arbitrarily close to the lower bound (while still being uniquely decode-able). Taking this as given, I make two claims:

  1. \( \Expect{K_U(X)} \geq H[X] \): The mapping from \( x \) to the program on \( U \) which generates \( x \) (and then stops) is a binary encoding of \( x \). \( K_U(x) \) is the length of that encoding. Since \( H[X] \) is a lower bound on the expected code length, it must be a lower bound on expected program length.
  2. \( \Expect{K_U(X)} \leq H[X] + \text{constant} \): Given a coding/decoding scheme which achieves the Shannon lower bound, there is a finite program \( D \) on \( U \) which implements the decoding half of the scheme, and the same program \( D \) can be used for all \( x \). Therefore, a program on \( U \) which will generate \( x \) and then stop is \( D \) plus the Shannon coding of \( x \). The length of such a program is \( |D| + \) length of the encoding of \( x \), and the expected length, over random choices of \( X \), is therefore \( |D| + H[X] \).
Combining the two claims, \( \Expect{K_U(X)} = H[X] \) up to an additive constant.

See also: Complexity Measures; Ergodic Theory; Information Theory; the Minimum Description Length Principle; Probability; "Occam"-style Bounds for Long Programs


Notebooks: