Algorithmic Information Theory
http://bactra.org/notebooks/2024/04/24#algorithmic-information-theory
<P>Fix your favorite <a href="computation.html">universal computer</a> \( U \).
Now consider any string of symbols \( x \), or anything which we can describe by
means of a string of symbols. Among all the programs which can be run on \( U \),
there will be some which will produce \( x \) and then stop. There will therefore
be a minimum length for such programs. Call this length \( K_U(x) \). This is
the "algorithmic information content" of \( x \) (relative to \( U \)), or its
Kolmogorov (-Chaitin-Solomonoff) complexity.
<P>Note that one program which could produce \( x \) would
be
<br> <tt>print(</tt>\( x \)<tt>); stop;</tt>
<br> (or its equivalent in the language of
\( U \)), so \( K_U(x) \) can't be much bigger than \( |x| \), the length of \( x \).
Sometimes, clearly, \( K_U(x) \) can be much smaller than \( |x| \): if \( x \) just consists
of the same symbol repeated over and over, say 0, we could use the
program
<br> <tt>for (i in 1:n) { print("0"); }; stop;</tt>
<br>and give
it <tt>n=</tt>\( |x| \), for a total length of a constant (the loop-and-print bit)
plus \( \log{|x|} \) (the number of symbols needed to write the length of \( x \)).
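To make the two bounds concrete, here is a toy sketch in Python (standing in for programs on the abstract machine \( U \)); neither program string is claimed to be minimal, each merely witnesses that a description of the stated length exists:

```python
# Toy illustration of the two upper bounds on K_U(x): build actual
# Python source strings that would print x, and compare their lengths.

def literal_program(x):
    # "Print it verbatim": length grows linearly in |x|.
    return "print(%r)" % x

def loop_program(n):
    # "Repeat one symbol n times": length grows like log |x|,
    # since only the digits of n must be written down.
    return 'for i in range(%d): print("0", end="")' % n

x = "0" * 1000
print(len(literal_program(x)))    # on the order of |x| = 1000
print(len(loop_program(len(x))))  # a constant plus the digits of 1000
```
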
While precise values will, irritatingly, be relative to the universal computer
\( U \), because the computer is universal, it can emulate any other computer \( V \)
with a finite-length program which we might call \( K_U(V) \). Hence
\[
K_U(x) \leq K_V(x) + K_U(V),
\]
and algorithmic information content differs by at most an
additive, string-independent constant across computers.
<P>It turns out that the quantity \( K_U(x) \) shares many formal properties
with the (Shannon) entropy of a random variable, typically written \( H[X] \), so
that one can define analogs of all the
usual <a href="information-theory.html">information-theoretic</a> quantities in
purely algorithmic terms. Along these lines, we say that \( x \) is
"incompressible" if \( K_U(x) \approx |x| \). What makes this mathematically
interesting is that incompressible sequences turn out to provide a model of
sequences of independent and identically distributed random variables, and,
vice versa, the typical sample path of an IID stochastic process is
incompressible. This generalizes: almost every trajectory of
an <a href="ergodic-theory.html">ergodic</a> stochastic process has a
Kolmogorov complexity whose growth rate equals its entropy rate (Brudno's
theorem). This, by the way, is why I don't think Kolmogorov complexity makes a
very useful complexity measure: it's maximal for totally random things!
But Kolmogorov complexity would be a good measure of (effective) stochasticity, if we could
actually calculate it.
<P>Unfortunately, we cannot calculate it. There is a lovely little proof of
this, due to Nohre, which I learned of from a paper by Rissanen, and can't
resist rehearsing. (There are older proofs, but they're not so pretty.)
Suppose there was a program \( P \) which could read in an object \( x \) and
return its algorithmic information content relative to some universal computer,
so \( P(x) = K_U(x) \). We now use \( P \) to construct a new program \( V \)
which compresses incompressible objects.
<ol>
<li> Sort all sequences by length, and then alphabetically.
<li> For the \( i^{\mathrm{th}} \) sequence \( x^i \), use \( P \) to find \( K_U(x^i) \).
<li> If \( K_U(x^i) \leq |V| \), move on to the next sequence.
<li> Otherwise, set \( z \) to \( x^i \), return \( z \), and stop.
</ol>
\( V \) outputs \( z \) and stops, so \( K_U(z) \leq |V|\), but, by construction, \(
K_U(z) > |V| \), a contradiction. The only way out of this contradiction
is to deny the premise that \( P \) exists: there is no algorithm which calculates
the Kolmogorov complexity of an <em>arbitrary</em> string. You can in
fact strengthen this to say that there is no algorithm which <em>approximates</em> \( K_U \). (In particular,
you <a href="cep-gzip.html">can't approximate it with <tt>gzip</tt></a>.)
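For concreteness, here is the construction of \( V \) as a Python sketch, written against a purely hypothetical oracle \( P \). Since the proof shows no such \( P \) exists, the demonstration call below plugs in a stand-in (plain string length, which is of course not \( K_U \)) just to exercise the control flow:

```python
# Sketch of the program V from the proof. P is assumed to compute
# P(x) = K_U(x); no such P exists, which is the point of the proof.

from itertools import count, product

def sequences():
    # All binary strings, sorted by length and then alphabetically.
    for n in count(1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def V(P, length_of_V):
    # Scan for the first string that P declares incompressible even
    # relative to |V|, output it, and stop. But then V is a program of
    # length |V| producing that string: contradiction.
    for x in sequences():
        if P(x) > length_of_V:
            return x

# Stand-in oracle: plain string length (NOT the real K_U).
print(V(P=len, length_of_V=3))  # first string of length 4: "0000"
```
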
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\]
<P><strong>Expected Kolmogorov complexity vs. Shannon entropy</strong>
Up to (unimportant) additive constants, \( \Expect{K_U(X)} = H[X] \). To prove
this, start with the fact that \( H[X] \) is (up to one bit) the minimum expected
code length over all uniquely decodable coding schemes for \( X \). That is, pick any invertible mapping \( c \) from the values of \( X \), \( \mathcal{X} \), to binary sequences, and let \( \ell(c(x)) \) be the length of the binary encoding of \( x \). Then \( \Expect{\ell(c(X))} \) is the expected code length, and (i) \( \Expect{\ell(c(X))} \geq H[X] \), and (ii) there are coding schemes which come arbitrarily close to this lower
bound (while still being uniquely decodable). Taking this as given, I make two claims:
<ol>
<li> \( \Expect{K_U(X)} \geq H[X] \): The mapping from \( x \) to the program on \( U \) which generates \( x \) (and then stops) is a binary encoding of \( x \). \( K_U(x) \) is the length of that encoding. Since \( H[X] \) is a lower bound on the expected code length, it must be a lower bound on expected program length.
<li> \( \Expect{K_U(X)} \leq H[X] + \text{constant} \): Given a coding/decoding
scheme which achieves the Shannon lower bound, there is a <em>finite</em>
program \( D \) on \( U \) which implements the decoding half of the scheme,
and the same program \( D \) can be used for all \( x \). Therefore, a program
on \( U \) which will generate \( x \) and then stop is \( D \) plus the Shannon coding of \( x \). The length of such a program is \( |D| + \) length of the encoding of \( x \), and the expected length, over random choices of \( X \), is therefore \( |D| + H[X] \).
</ol>
Combining the two claims, \( \Expect{K_U(X)} = H[X] \) up to an additive constant.
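The Shannon bound invoked in claim (ii) can be checked numerically: a Huffman code's expected length always lies between \( H[X] \) and \( H[X] + 1 \). A small sketch (the <tt>huffman_lengths</tt> helper is my own illustration, not part of the argument above):

```python
# Numerical check that a Huffman code's expected length satisfies
# H[X] <= E[length] <= H[X] + 1, the bound used in claim (ii).

import heapq
from math import log2

def huffman_lengths(probs):
    # Return the codeword length for each symbol of a Huffman code.
    # Heap entries are (probability, unique index, member symbols).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, m1 = heapq.heappop(heap)
        p2, i2, m2 = heapq.heappop(heap)
        for j in m1 + m2:      # every symbol under the merge gets one bit deeper
            lengths[j] += 1
        heapq.heappush(heap, (p1 + p2, i2, m1 + m2))
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]
H = -sum(p * log2(p) for p in probs)
mean_len = sum(p * l for p, l in zip(probs, huffman_lengths(probs)))
print(H, mean_len)  # both exactly 1.75 for this dyadic distribution
```

For dyadic probabilities the bound is met with equality; for other distributions the expected length sits strictly inside the interval \( [H[X], H[X]+1) \).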
<P>See also:
<a href="complexity-measures.html">Complexity Measures</a>;
<a href="ergodic-theory.html">Ergodic Theory</a>;
<a href="information-theory.html">Information Theory</a>;
the <a href="mdl.html">Minimum Description Length Principle</a>;
<a href="probability.html">Probability</a>;
<a href="occam-bounds-for-long-programs.html">"Occam"-style Bounds for Long Programs</a>
<ul>Recommended, big picture:
<li>Cover and Thomas, <cite>Elements of Information Theory</cite> [specifically the chapter on Kolmogorov complexity]
<li>Ming Li and Paul M. B. Vitanyi, <cite>An Introduction to Kolmogorov Complexity and Its Applications</cite>
</ul>
<ul>Recommended, close-ups:
<li>Peter Gacs, John T. Tromp and Paul M. B. Vitanyi, "Algorithmic Statistics", <cite>IEEE Transactions on Information Theory</cite> <strong>47</strong> (2001): 2443--2463, <a href="http://arxiv.org/abs/math.PR/0006233">arxiv:math.PR/0006233</a>
<li>Stefano Galatolo, Mathieu Hoyrup, and Cristóbal Rojas,
"Effective symbolic dynamics, random points, statistical behavior, complexity
and entropy", <a href="http://arxiv.org/abs/0801.0209">arxiv:0801.0209</a>
[<em>All</em>, not almost all, Martin-Lof points are statistically
typical.]
<li>Jan Lemeire, Dominik Janzing, "Replacing Causal Faithfulness with Algorithmic Independence of Conditionals", <a href="http://dx.doi.org/10.1007/s11023-012-9283-1"><cite>Minds and Machines</cite> <strong>23</strong> (2013): 227--249</a>
<li>G. W. Müller, "Randomness and extrapolation",
<a href="http://projecteuclid.org/euclid.bsmsp/1200514209"><cite>Proceedings of
the Sixth Berkeley Symposium on Mathematical Statistics and Probability</cite>,
Vol. 2 (Univ. of Calif. Press, 1972), 1--31</a> [On a notion of randomness
supposedly related to, but stronger than, that of Martin-Löf.]
<li>Tom F. Sterkenburg, "Solomonoff Prediction and Occam's Razor",
<a href="https://doi.org/10.1086/687257"><cite>Philosophy of Science</cite> <strong>83</strong> (2016): 459--479</a>, <a href="http://philsci-archive.pitt.edu/12429/">phil-sci/12429</a>
<li>Bastian Steudel, Nihat Ay, "Information-theoretic inference of common ancestors", <a href="http://arxiv.org/abs/1010.5720">arxiv:1010.5720</a>
<li>Paul M. B. Vitanyi and Ming Li, "Minimum Description Length
Induction, Bayesianism, and Kolmogorov Complexity", <cite>IEEE Transactions on
Information Theory</cite> <strong>46</strong> (2000): 446--464, <a
href="http://arxiv.org/abs/cs.LG/9901014">cs.LG/9901014</a>
<li>Vladimir Vovk, "Superefficiency from the Vantage Point of Computability", <a href="http://dx.doi.org/10.1214/09-STS279"><cite>Statistical Science</cite> <strong>24</strong> (2009): 73--86</a>
</ul>
<ul>Modesty forbids me to recommend:
<li>The notes and slides from lecture 8 in my <a href="http://www.stat.cmu.edu/~cshalizi/462/syllabus.html">"Chaos, Complexity and Inference" class</a>
</ul>
<ul>To read:
<li>Luis Antunes, Bruno Bauwens, Andre Souto, Andreia Teixeira, "Sophistication vs Logical Depth", <a href="http://arxiv.org/abs/1304.8046">arxiv:1304.8046</a>
<li>John C. Baez, Mike Stay, "Algorithmic Thermodynamics", <a href="http://arxiv.org/abs/1010.2067">arxiv:1010.2067</a>
<li>George Barmpalias and Andrew Lewis-Pye, "Compression of Data Streams Down to Their Information Content", <a href="http://dx.doi.org/10.1109/TIT.2019.2896638"><cite>IEEE Transactions on Information Theory</cite> <strong>65</strong> (2019): 4471--4485</a>
<li>Fabio Benatti, Tyll Krueger, Markus Mueller, Rainer
Siegmund-Schultze and Arleta Szkola, "Entropy and Algorithmic Complexity in
Quantum Information Theory: a Quantum Brudno's Theorem", <a
href="http://arxiv.org/abs/quant-ph/0506080">quant-ph/0506080</a>
<li>Laurent Bienvenu, Adam Day, Mathieu Hoyrup, Ilya Mezhirov, Alexander Shen, "A constructive version of Birkhoff's ergodic theorem for Martin-Löf random points", <a href="http://arxiv.org/abs/1007.5249">arxiv:1007.5249</a>
<li>Laurent Bienvenu, Peter Gacs, Mathieu Hoyrup, Cristobal Rojas, Alexander Shen, "Algorithmic tests and randomness with respect to a class of measures", <a href="http://arxiv.org/abs/1103.1529">arxiv:1103.1529</a>
<li>Laurent Bienvenu, Rod Downey, "Kolmogorov Complexity and Solovay Functions", <a href="http://arxiv.org/abs/0902.1041">arxiv:0902.1041</a>
<li>Laurent Bienvenu, Alexander Shen, "Algorithmic information theory and martingales", <a href="http://arxiv.org/abs/0906.2614">arxiv:0906.2614</a>
<li>Claudio Bonanno, "The Manneville map: topological, metric and
algorithmic entropy,"
<a href="http://arxiv.org/abs/math.DS/0107195">math.DS/0107195</a>
<li>Claudio Bonanno and Pierre Collet, "Complexity for Extended Dynamical Systems", <a href="http://dx.doi.org/10.1007/s00220-007-0313-4"><cite>Communications in Mathematical Physics</cite> <strong>275</strong> (2007): 721--748</a>, <a href="http://arxiv.org/abs/math/0609681">math/0609681</a>
<li><cite>The Computer Journal</cite>, <strong>42:4</strong> (1999)
[Special issue on Kolmogorov complexity and inference]
<li>Boris Darkhovsky, Alexandra Pyriatinska, "Epsilon-complexity of continuous functions", <a href="http://arxiv.org/abs/1303.1777">arxiv:1303.1777</a>
<li>Łukasz Dębowski, "Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited", <a href="https://doi.org/10.3390/e20020085"><cite>Entropy</cite> <strong>20</strong> (2018): 85</a>
<li>David Doty, "Every sequence is compressible to a random one",
<a href="http://arxiv.org/abs/cs.IT/0511074">cs.IT/0511074</a> ["Kucera and
Gacs independently showed that every infinite sequence is Turing reducible to a
Martin-Lof random sequence. We extend this result to show that every infinite
sequence S is Turing reducible to a Martin-Lof random sequence R such that the
asymptotic number of bits of R needed to compute n bits of S, divided by n, is
precisely the constructive dimension of S."]
<li>David Doty and Jared Nichols, "Pushdown Dimension",
<a href="http://arxiv.org/abs/cs.IT/0504047">cs.IT/0504047</a>
<li>Peter Gács, "Uniform test of algorithmic randomness over a
general space", <a
href="http://dx.doi.org/10.1016/j.tcs.2005.03.054"><cite>Theoretical Computer
Science</cite> <strong>341</strong> (2005): 91--137</a> ["The algorithmic
theory of randomness is well developed when the underlying space is the set of
finite or infinite sequences and the underlying probability distribution is the
uniform distribution or a computable distribution. These restrictions seem
artificial. Some progress has been made to extend the theory to arbitrary
Bernoulli distributions (by Martin-Lof) and to arbitrary distributions (by
Levin). We recall the main ideas and problems of Levin's theory, and report
further progress in the same framework...."]
<li>Travis Gagie, "Compressing Probability Distributions", <a
href="http://arxiv.org/abs/cs.IT/0506016">cs.IT/0506016</a> [<em>Abstract</em>
(in full): "We show how to store good approximations of probability
distributions in small space."]
<li>Stefano Galatolo, Mathieu Hoyrup, Cristóbal Rojas
<ul>
<li>"Dynamical systems, simulation, abstract computation", <a href="http://arxiv.org/abs/1101.0833">arxiv:1101.0833</a>
<li>"A constructive Borel-Cantelli Lemma. Constructing orbits with required statistical properties", <a href="http://arxiv.org/abs/0711.1478">arxiv:0711.1478</a>
</ul>
<li>Peter Grünwald and Paul Vitányi, "Shannon Information
and Kolmogorov Complexity", <a
href="http://arxiv.org/abs/cs.IT/0410002">cs.IT/0410002</a>
<li>Mrinalkanti Ghosh, Satyadev Nandakumar, Atanu Pal, "Ornstein Isomorphism and Algorithmic Randomness", <a href="http://arxiv.org/abs/1404.0766">arxiv:1404.0766</a>
<li>Michael Hochman, "Upcrossing Inequalities for Stationary Sequences and Applications to Entropy and Complexity", <a href="http://arxiv.org/abs/math.DS/0608311">arxiv:math.DS/0608311</a> [where "complexity" = algorithmic
information content]
<li>Mathieu Hoyrup, Cristobal Rojas, "Computability of probability measures and Martin-Lof randomness over metric spaces", <a href="http://arxiv.org/abs/0709.0907">arxiv:0709.0907</a>
<li>S. Jalalai, A. Maleki and R. G. Baraniuk, "Minimum Complexity Pursuit for Universal Compressed Sensing", <a href="http://dx.doi.org/10.1109/TIT.2014.2302005"><cite>IEEE Transactions on Information Theory</cite> <strong>60</strong> (2014): 2253--2268</a>, <a href="http://arxiv.org/abs/1208.5814">arxiv:1208.5814</a>
<li>Takakazu Mori, Yoshiki Tsujii, Mariko Yasugi, "Computability of Probability Distributions and Characteristic Functions", <a href="http://arxiv.org/abs/1307.6357">arxiv:1307.6357</a>
<li>Andrej Muchnik, "Algorithmic randomness and splitting of supermartingales", <a href="http://arxiv.org/abs/0807.3156">arxiv:0807.3156</a>
<li>Markus Mueller, "Stationary Algorithmic Probability", <a href="http://arxiv.org/abs/cs/0608095">arxiv:cs/0608095</a>
<li>Sven Neth, "A Dilemma for Solomonoff Prediction", <a href="http://arxiv.org/abs/2206.06473">arxiv:2206.06473</a>
<li>Andrew Nies, <cite><a href="https://global.oup.com/academic/product/9780199230761">Computability and Randomness</a></cite>
<li>E. Rivals and J.-P. Delahaye, "Optimal Representation in Average
Using Kolmogorov Complexity," <cite>Theoretical Computer Science</cite>
<strong>200</strong> (1998): 261--287
<li>Jason Rute, <cite><a href="http://repository.cmu.edu/dissertations/260/">Topics in Algorithmic Randomness and Computable Analysis</a></cite>
<li>Andrei N. Soklakov, "Complexity Analysis for Algorithmically Simple
Strings," <a href="http://arxiv.org/abs/cs.LG/0009001/">cs.LG/0009001</a>
<li>H. Takahashi, "Redundancy of Universal Coding, Kolmogorov
Complexity, and Hausdorff Dimension", <a href="http://dx.doi.org/10.1109/TIT.2004.836663"><cite>IEEE Transactions on Information
Theory</cite> <strong>50</strong> (2004): 2727--2736</a>
<li>Nikolai Vereshchagin and Paul Vitanyi, "Kolmogorov's Structure
Functions with an Application to the Foundations of Model Selection,"
<a href="http://arxiv.org/abs/cs.CC/0204037">cs.CC/0204037</a>
<li>Paul Vitanyi, "Randomness," <a
href="http://arxiv.org/abs/math.PR/0110086">math.PR/0110086</a>
<li>Vladimir V'yugin, "On Instability of the Ergodic Limit Theorems with Respect to Small Violations of Algorithmic Randomness", <a href="http://arxiv.org/abs/1105.4274">arxiv:1105.4274</a>
<li>C. S. Wallace and David L. Dowe, "Minimum Message Length and Kolmogorov Complexity", <cite>The Computer Journal</cite> <strong>42</strong> (1999): 270--283
</ul>