The Minimum Description Length Principle (MDL)
24 Jan 2018 15:37
MDL is an information-theoretic approach to machine learning, or statistical model selection, which basically says you should pick the model which gives you the most compact description of the data, including the description of the model itself. More precisely, given a probabilistic model, Shannon's coding theorems tell you the minimal number of bits needed to encode your data, i.e., the maximum extent to which it can be compressed. Really, however, to complete the description, you need to specify the model as well, from among some set of alternatives, and this will also require a certain number of bits. Hence you really want to minimize the combined length of the description of the model, plus the description of the data under that model. This works out to being a kind of penalized maximum likelihood — the data-given-model bit is the negative log likelihood, and the model-description term is the penalty.
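To make the penalized-likelihood reading concrete, here is a minimal sketch of crude two-part MDL model selection for polynomial regression. This is my own illustration, not taken from any of the references below: it assumes Gaussian noise, and charges the usual asymptotic (k/2) log2 n bits to describe the k fitted parameters (each encoded to precision about 1/sqrt(n)), rather than using any of Rissanen's exact codes; the function and parameter names are likewise just illustrative.

```python
import numpy as np

def two_part_mdl(x, y, max_degree=6):
    """Pick a polynomial degree by (crude) two-part description length.

    Total bits = -log2 likelihood of the data under the fitted model
                 + (k/2) * log2(n) bits to describe the k fitted parameters.
    """
    n = len(y)
    best = None
    for degree in range(max_degree + 1):
        coeffs = np.polyfit(x, y, degree)            # maximum-likelihood fit
        resid = y - np.polyval(coeffs, x)
        sigma2 = np.mean(resid ** 2)                  # ML noise variance
        # Negative log-likelihood (in bits) under Gaussian noise at the MLE
        nll_bits = 0.5 * n * np.log2(2 * np.pi * np.e * sigma2)
        k = degree + 2                                # coefficients + noise variance
        model_bits = 0.5 * k * np.log2(n)             # cost of describing the model
        total = nll_bits + model_bits
        if best is None or total < best[1]:
            best = (degree, total)
    return best

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(scale=0.1, size=x.size)
print(two_part_mdl(x, y))   # typically selects degree 2
```

With this crude parameter code, minimizing the total codelength is (up to a constant factor) the same as minimizing BIC, which is part of why two-part MDL looks so much like conventional penalized maximum likelihood.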
It's a very appealing idea, and a lot of work has (rightly) been done under this heading, though I have to say I'm not altogether convinced, both because of the issues involved in choosing a coding scheme for models, and because it's not clear that, in practice, it actually does that much better than other penalization schemes — sometimes, even, than straightforward likelihood maximization.
I should also say that what I've described above is the old-fashioned "two-part" MDL, and there are now "one-part" schemes, where (so to speak) the model coding scheme is supposed to be fixed by the data as well, in unambiguous and nearly-optimal ways. These are remarkably ingenious constructions, intimately linked to low-regret learning, as well as minimax statistics more generally, but I will not attempt to describe them here; see rather Grünwald.
See also: Algorithmic Information Theory; Occam's Razor; Universal Prediction Algorithms
- Recommended, the big story:
- Jorma Rissanen, Stochastic Complexity in Statistical Inquiry [Review: Less Is More, or Ecce data!]
- Peter Grünwald
- "A Tutorial Introduction to the Minimum Description Length Principle", math.ST/0406077
- The Minimum Description Length Principle [Blurb, sample chapter]
- MDL on the Web [Centralized website for MDL research]
- Recommended, details and applications:
- Pedro Domingos, "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery, 3 (1999) [Online]
- Peter Grünwald and John Langford, "Suboptimal behaviour of Bayes and MDL in classification under misspecification", math.ST/0406221 = Machine Learning 66 (2007): 119--149
- Peter T. Hraber, Bette T. Korber, Steven Wolinsky, Henry Erlich and Elizabeth Trachtenberg, "HLA and HIV Infection Progression: Application of the Minimum Description Length Principle to Statistical Genetics", SFI Working Paper 03-04-23
- Shane Legg, "Is There an Elegant Universal Theory of Prediction?", cs.AI/0606070 [A nice set of diagonalization arguments against the hope of a universal prediction scheme which has the nice features of Solomonoff-style induction, but is actually computable.]
- Beong Soo So, "Maximized log-likelihood updating and model selection", Statistics and Probability Letters 64 (2003): 293--303 [Shows how to relate some of Rissanen's ideas on predictive MDL to more conventionally-statistical notions, e.g., connecting Rissanen's "stochastic complexity" to something that looks like, but isn't quite, a Fisher information.]
- Paul M. B. Vitanyi and Ming Li, "Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity", IEEE Transactions on Information Theory 46 (2000): 446--464 = cs.LG/9901014
- Not recommended:
- Dana Ballard, An Introduction to Natural Computation [Review: Not Natural Enough]
- To read (with thanks to David Dowe for suggestions):
- Pieter Adriaans and Paul Vitanyi, "The Power and Perils of MDL", cs.LG/0612095
- Joshua W. Comley and David L. Dowe, "Minimum Message Length and Generalized Bayesian Nets with Asymmetric Languages", in Grünwald et al.
- The Computer Journal, 42:4 (1999) [Special issue on Kolmogorov complexity and inference]
- Steven de Rooij and Peter Grünwald, "An Empirical Study of MDL Model Selection with Infinite Parametric Complexity", cs.LG/0501028
- Dowe, Korb and Oliver (eds.), Information, Statistics and Induction in Science
- M. Drmota and W. Szpankowski, "Precise minimax redundancy and regret", IEEE Transactions on Information Theory 50 (2004): 2686--2707
- Dean P. Foster and Robert A. Stine, "Local Asymptotic Coding and the Minimum Description Length", IEEE Transactions on Information Theory 45 (1999): 1289--1293 [PDF preprint]
- Ciprian Doru Giurcăneanu and Jorma Rissanen, "Estimation of AR and ARMA models by stochastic complexity", math.ST/0702765
- Peter D. Grünwald, In Jae Myung and Mark A. Pitt (eds.), Advances in Minimum Description Length: Theory and Applications
- Marcus Hutter
- "General Loss Bounds for Universal Sequence Prediction," cs.AI/0101019
- "On Generalized Computable Universal Priors and their Convergence", cs.LG/0503026
- "Optimal Sequential Decisions based on Algorithmic Probability," cs/0306091
- "Towards a Universal Theory of Artificial Intelligence based on Algorithmic Probability and Sequential Decision Theory," cs.AI/0012011
- F. Liang and A. Barron, "Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory 50 (2004): 2708--2726
- Daniel J. Navarro, "A Note on the Applied Use of MDL Approximations", Neural Computation 16 (2004): 1763--1768
- S. L. Needham and D. L. Dowe, "Message Length as an Effective Ockham's Razor in Decision Tree Induction", pp. 253--260 in AI+STATS 2001 [Available from Prof. Dowe in several formats]
- Jan Poland and Marcus Hutter, "Asymptotics of Discrete MDL for Online Prediction", IEEE Transactions on Information Theory 51 (2005): 3780--3795, arxiv:cs.IT/0506022
- Guoqi Qian and Hans R. Künsch, "Some Notes on Rissanen's Stochastic Complexity", Tech. Report 79, Seminar für Statistik, ETH Zürich (1996) [Online but I've mislaid the URL at the moment]
- Jorma Rissanen, Lectures on Statistical Modeling Theory [PDF]
- Teemu Roos, Petri Myllymäki and Jorma Rissanen, "MDL Denoising Revisited", cs.IT/0609138
- Michael Small, "Optimal time delay embedding for nonlinear time series modeling", nlin.CD/0312011
- Michael Small and C. K. Tse, "Optimal embedding parameters: A modeling paradigm", physics/0308114
- Nikolai Vereshchagin and Paul Vitanyi, "Kolmogorov's Structure Functions with an Application to the Foundations of Model Selection," cs.CC/0204037
- C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length
- C. S. Wallace and David L. Dowe, "Minimum Message Length and Kolmogorov Complexity", The Computer Journal 42 (1999): 270--283
- Bin Yu, Minimum Description Length Principle [A page collecting references to Prof. Yu's work on MDL, largely viewing it as a rather conventional form of model selection]