<?xml version="1.0"?>
<!-- name="generator" content="blosxom/2.0" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
  <channel>
    <title>Notebooks   </title>
    <link>http://bactra.org/notebooks</link>
    <description>Cosma's Notebooks</description>
    <language>en</language>

  <item>
    <title>The Minimum Description Length Principle (MDL)</title>
    <link>http://bactra.org/notebooks/2008/07/22#mdl</link>
    <description>
&lt;P&gt;MDL is an &lt;a href=&quot;information-theory.html&quot;&gt;information-theoretic&lt;/a&gt;
approach to &lt;a href=&quot;learning-inference-induction.html&quot;&gt;machine learning&lt;/a&gt;,
or statistical &lt;a href=&quot;model-selection.html&quot;&gt;model selection&lt;/a&gt;, which
basically says you should pick the model which gives you the most compact
description of the data, including the description of the model itself.  More
precisely, given a probabilistic model, Shannon's coding theorems tell you the
minimal number of bits needed to encode your data, i.e., the maximum extent to
which it can be compressed.  To complete the description, however, you also
need to specify the model, from among some set of alternatives, and this
requires a certain number of bits too.  Hence what you want to minimize is the
combined length of the description of the model, plus the description of the
data under that model.  This works out to being a kind of
penalized maximum likelihood --- the data-given-model bit is the negative log
likelihood, and the model-description term is the penalty.
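
&lt;P&gt;To make the penalized-likelihood reading concrete, here is a minimal sketch
of two-part coding for a sequence of coin flips, in Python.  Everything specific
in it is an assumption made for illustration, not part of any standard MDL
recipe: the function names (data_bits, two_part_length, select_precision) are
hypothetical, and so is the coding scheme for models, which states the Bernoulli
parameter p to a chosen number of binary digits and charges exactly that many
bits for it.  A different model code would give a different penalty.
&lt;pre&gt;
```python
import math


def data_bits(heads, tails, p):
    """Code length of the data under Bernoulli(p): the negative log2-likelihood."""
    return -(heads * math.log2(p) + tails * math.log2(1 - p))


def two_part_length(heads, tails, precision_bits):
    """Model bits plus data bits when p is stated to `precision_bits` binary digits.

    Assumed (hypothetical) model code: the allowed values of p are the midpoints
    of 2**precision_bits equal sub-intervals of (0, 1), and naming one of them
    costs precision_bits bits.  Zero bits leaves only p = 0.5, the fair coin.
    """
    cells = 2 ** precision_bits
    grid = [(i + 0.5) / cells for i in range(cells)]
    # Data part: the best negative log-likelihood attainable on this grid.
    best_data_bits = min(data_bits(heads, tails, p) for p in grid)
    # Penalty part: the bits spent naming the chosen grid point.
    return precision_bits + best_data_bits


def select_precision(heads, tails, max_bits=8):
    """Pick the precision whose total two-part description length is smallest."""
    return min(range(max_bits + 1),
               key=lambda bits: two_part_length(heads, tails, bits))


if __name__ == "__main__":
    for heads, tails in [(50, 50), (90, 10)]:
        bits = select_precision(heads, tails)
        total = two_part_length(heads, tails, bits)
        print(heads, tails, bits, round(total, 2))
```
&lt;/pre&gt;

&lt;P&gt;With 50 heads and 50 tails, the zero-bit model (the fair coin) wins, since no
other grid point fits better and every one of them costs bits to name.  With 90
heads and 10 tails it pays to spend some bits on p, but only a few, because each
extra digit of precision costs a full bit while improving the fit less and less.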

&lt;P&gt;It's a very appealing idea, and a lot of work has (rightly) been done under
this heading, though I have to say I'm not altogether convinced, both because
of the issues involved in choosing a coding scheme for models, and because it's
not clear that, in &lt;em&gt;practice&lt;/em&gt;, it actually does that much better than
straightforward likelihood maximization.  (See the paper by Domingos, listed below.)

&lt;P&gt;I should also say that what I've described above is the old-fashioned
&quot;two-part&quot; MDL, and there are now &quot;one-part&quot; schemes, where (so to speak) the
model coding scheme is supposed to be fixed by the data as well, in some
unambiguous and nearly-optimal manner.  However, I honestly don't understand
those yet, so I'm not competent to talk about them.

&lt;P&gt;See also:
	&lt;a href=&quot;universal-prediction.html&quot;&gt;Universal Prediction Algorithms&lt;/a&gt;

&lt;ul&gt;Recommended, the big story:
	&lt;li&gt;Jorma Rissanen, &lt;cite&gt;Stochastic Complexity in Statistical
Inquiry&lt;/cite&gt; [&lt;a
href=&quot;../reviews/stochastic-complexity-in-statistical-inquiry/&quot;&gt;Review:
Less Is More, or &lt;em&gt;Ecce data!&lt;/em&gt;&lt;/a&gt;]
	&lt;li&gt;&lt;a href=&quot;http://www.mdl-research.org/&quot;&gt;MDL on the Web&lt;/a&gt;
[Centralized website for MDL research]
	&lt;/ul&gt; 

&lt;ul&gt;Recommended, details and applications:
	&lt;li&gt;&lt;a href=&quot;http://www.cs.washington.edu/homes/pedrod/&quot;&gt;Pedro
Domingos&lt;/a&gt;, &quot;The Role of Occam's Razor in Knowledge Discovery,&quot; &lt;cite&gt;Data
Mining and Knowledge Discovery,&lt;/cite&gt; &lt;strong&gt;3&lt;/strong&gt; (1999) [&lt;a
href=&quot;http://www.cs.washington.edu/homes/pedrod/dmkd99.ps.gz&quot;&gt;Online&lt;/a&gt;]
	&lt;li&gt;Peter T. Hraber, Bette T. Korber, Steven Wolinsky, Henry Erlich and
Elizabeth Trachtenberg, &quot;HLA and HIV Infection Progression: Application of the
Minimum Description Length Principle to Statistical Genetics&quot;, &lt;a
href=&quot;http://www.santafe.edu/research/publications/wpabstract/200304023&quot;&gt;SFI
Working Paper 03-04-23&lt;/a&gt;
	&lt;li&gt;Shane Legg, &quot;Is There an Elegant Universal Theory of
Prediction?&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.AI/0606070&quot;&gt;cs.AI/0606070&lt;/a&gt; [A
nice set of diagonalization arguments against the hope of a universal
prediction scheme which has the nice features of Solomonoff-style induction,
but is actually computable.]
	&lt;li&gt;Beong Soo So, &quot;Maximized log-likelihood updating and model
selection&quot;, &lt;cite&gt;Statistics and Probability Letters&lt;/cite&gt; &lt;strong&gt;64&lt;/strong&gt;
(2003): 293--303 [Shows how to relate some of Rissanen's ideas on predictive
MDL to more conventionally-statistical notions, e.g., connecting Rissanen's
&quot;stochastic complexity&quot; to something that looks like, but isn't quite, a Fisher
information.]
	&lt;li&gt;Paul M. B. Vitanyi and Ming Li, &quot;Minimum Description Length
Induction, Bayesianism, and Kolmogorov Complexity&quot;, &lt;cite&gt;IEEE Transactions on
Information Theory&lt;/cite&gt; &lt;strong&gt;46&lt;/strong&gt; (2000): 446--464 = &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/9901014&quot;&gt;cs.LG/9901014&lt;/a&gt;
	&lt;/ul&gt;

&lt;ul&gt;Not recommended:
	&lt;li&gt;Dana Ballard, &lt;cite&gt;An Introduction to Natural Computation&lt;/cite&gt;
[&lt;a href=&quot;../reviews/ballard-natural/&quot;&gt;Review: Not Natural Enough&lt;/a&gt;]
	&lt;/ul&gt; 

&lt;ul&gt;To read (with thanks to &lt;a href=&quot;http://www.csse.monash.edu.au/~dld/&quot;&gt;David
Dowe&lt;/a&gt; for suggestions):
	&lt;li&gt;Pieter Adriaans and Paul Vitanyi, &quot;The Power and Perils of MDL&quot;,
&lt;a href=&quot;http://arxiv.org/abs/cs.LG/0612095&quot;&gt;cs.LG/0612095&lt;/a&gt;
	&lt;li&gt;Kenneth P. Burnham and David R. Anderson, &lt;cite&gt;Model Selection and
Inference: A Practical Information-Theoretic Approach&lt;/cite&gt;
	&lt;li&gt;Joshua W. Comley and David L. Dowe, &quot;Minimum Message Length and Generalized Bayesian Nets with Asymmetric Languages&quot;, in Gr&amp;uuml;nwald et al.
	&lt;li&gt;&lt;cite&gt;The Computer Journal&lt;/cite&gt;, &lt;strong&gt;42:4&lt;/strong&gt; (1999)
[Special issue on Kolmogorov complexity and inference]
	&lt;li&gt;Steven de Rooij and Peter Gr&amp;uuml;nwald, &quot;An Empirical Study of MDL
Model Selection with Infinite Parametric Complexity&quot;, &lt;a
href=&quot;http://arxiv.org/abs/cs.LG/0501028&quot;&gt;cs.LG/0501028&lt;/a&gt;
	&lt;li&gt;Dowe, Korb and Oliver (eds.), &lt;cite&gt;Information, Statistics and 
Induction in Science&lt;/cite&gt; 
	&lt;li&gt;M. Drmota and W. Szpankowski, &quot;Precise minimax redundancy and
regret&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1109/TIT.2004.836702&quot;&gt;&lt;cite&gt;IEEE Transactions on Information Theory&lt;/cite&gt; &lt;strong&gt;50&lt;/strong&gt; (2004): 2686--2707&lt;/a&gt;
	&lt;li&gt;&lt;a href=&quot;http://gosset.wharton.upenn.edu/~foster/&quot;&gt;Dean
P. Foster&lt;/a&gt; and &lt;a href=&quot;http://www-stat.wharton.upenn.edu/~stine/&quot;&gt;Robert
A. Stine&lt;/a&gt;, &quot;Local Asymptotic Coding and the Minimum Description Length&quot;, &lt;a
href=&quot;http://dx.doi.org/10.1109/18.761287&quot;&gt;&lt;cite&gt;IEEE Transactions on
Information Theory&lt;/cite&gt; &lt;strong&gt;45&lt;/strong&gt; (1999): 1289--1293&lt;/a&gt; [&lt;a
href=&quot;http://gosset.wharton.upenn.edu/~foster/research/lac.pdf&quot;&gt;PDF
preprint&lt;/a&gt;]
	&lt;li&gt;Ciprian Doru Giurcaneanu and Jorma Rissanen, &quot;Estimation of AR and
ARMA models by stochastic
complexity&quot;, &lt;a href=&quot;http://arxiv.org/abs/math.ST/0702765&quot;&gt;math.ST/0702765&lt;/a&gt;
	&lt;li&gt;Peter Gr&amp;uuml;nwald
		&lt;ul&gt;
		&lt;li&gt;&quot;A Tutorial Introduction to the Minimum Description
Length Principle&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0406077&quot;&gt;math.ST/0406077&lt;/a&gt;
		&lt;li&gt;&lt;cite&gt;The Minimum Description Length
Principle&lt;/cite&gt; [&lt;a href=&quot;http://mitpress.mit.edu/978-0-262-07281-6&quot;&gt;Blurb,
sample chapter&lt;/a&gt;]
		&lt;/ul&gt;
	&lt;li&gt;Peter Gr&amp;uuml;nwald and John Langford, &quot;Suboptimal behaviour of Bayes
and MDL in classification under misspecification&quot;, &lt;a
href=&quot;http://arxiv.org/abs/math.ST/0406221&quot;&gt;math.ST/0406221&lt;/a&gt;
= &lt;a href=&quot;http://dx.doi.org/10.1007/s10994-007-0716-7&quot;&gt;&lt;cite&gt;Machine
Learning&lt;/cite&gt; &lt;strong&gt;66&lt;/strong&gt; (2007): 119--149&lt;/a&gt;
	&lt;li&gt;Peter D. Gr&amp;uuml;nwald, In Jae Myung and Mark A. Pitt (eds.),
&lt;cite&gt;Advances in Minimum Description Length: Theory and Applications&lt;/cite&gt;
[&lt;a href=&quot;http://mitpress.mit.edu/0-262-07262-9&quot;&gt;blurb&lt;/a&gt;]
	&lt;li&gt;Marcus Hutter
		&lt;ul&gt;
		&lt;li&gt;&quot;General Loss Bounds for Universal Sequence Prediction,&quot;
&lt;a href=&quot;http://arxiv.org/abs/cs.AI/0101019&quot;&gt;cs.AI/0101019&lt;/a&gt;
		&lt;li&gt;&quot;On Generalized Computable Universal Priors and their
Convergence&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.LG/0503026&quot;&gt;cs.LG/0503026&lt;/a&gt;
		&lt;li&gt;&quot;Optimal Sequential Decisions based on Algorithmic
Probability,&quot; &lt;a href=&quot;http://arxiv.org/abs/cs/0306091&quot;&gt;cs/0306091&lt;/a&gt;
		&lt;li&gt;&quot;Towards a Universal Theory of Artificial
Intelligence based on Algorithmic Probability and Sequential Decision Theory,&quot;
&lt;a href=&quot;http://arxiv.org/abs/cs.AI/0012011&quot;&gt;cs.AI/0012011&lt;/a&gt;
		&lt;/ul&gt;
	&lt;li&gt;F. Liang and A. Barron, &quot;Exact Minimax Strategies for Predictive 
Density Estimation, Data Compression, and Model Selection&quot;, &lt;a 
href=&quot;http://dx.doi.org/10.1109/TIT.2004.836922&quot;&gt;&lt;cite&gt;IEEE Transactions on Information Theory&lt;/cite&gt;
&lt;strong&gt;50&lt;/strong&gt; (2004): 2708--2726&lt;/a&gt;
	&lt;li&gt;Daniel J. Navarro, &quot;A Note on the Applied Use of MDL Approximations&quot;,
&lt;a href=&quot;http://neco.mitpress.org/cgi/content/abstract/16/9/1763&quot;&gt;&lt;cite&gt;Neural
Computation&lt;/cite&gt; &lt;strong&gt;16&lt;/strong&gt; (2004): 1763--1768&lt;/a&gt;
	&lt;li&gt;S. L. Needham and D. L. Dowe, &quot;Message Length as an Effective
Ockham's Razor in Decision Tree Induction&quot;, pp. 253--260 in &lt;cite&gt;AI+STATS
2001&lt;/cite&gt; [Available from Prof. Dowe in &lt;a href=&quot;http://www.csse.monash.edu.au/~dld/David.Dowe.publications.html#NeedhamDowe2001&quot;&gt;several formats&lt;/a&gt;]
	&lt;li&gt;Jan Poland and Marcus Hutter, &quot;Asymptotics of Discrete MDL for
Online Prediction&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.IT/0506022&quot;&gt;cs.IT/0506022&lt;/a&gt;
= &lt;a href=&quot;http://dx.doi.org/10.1109/TIT.2005.856956&quot;&gt;&lt;cite&gt;IEEE Transactions
on Information Theory&lt;/cite&gt; &lt;strong&gt;51&lt;/strong&gt; (2005): 3780--3795&lt;/a&gt;
	&lt;li&gt;Guoqi Qian and Hans R. K&amp;uuml;nsch, &quot;Some Notes on Rissanen's
Stochastic Complexity&quot;, Tech. Report 79, Seminar f&amp;uuml;r Statistik, ETH Z&amp;uuml;rich
(1996) [Online but I've mislaid the URL at the moment]
	&lt;li&gt;Jorma Rissanen, &lt;cite&gt;Lectures on Statistical Modeling
Theory&lt;/cite&gt; [&lt;a href=&quot;http://www.mdl-research.org/pub/lectures.pdf&quot;&gt;PDF&lt;/a&gt;]
	&lt;li&gt;E. Rivals and J.-P. Delahaye, &quot;Optimal Representation in Average
Using Kolmogorov Complexity,&quot; &lt;cite&gt;Theoretical Computer Science&lt;/cite&gt;
&lt;strong&gt;200&lt;/strong&gt; (1998): 261--287
	&lt;li&gt;Teemu Roos, Petri Myllym&amp;auml;ki and Jorma Rissanen, &quot;MDL Denoising
Revisited&quot;, &lt;a href=&quot;http://arxiv.org/abs/cs.IT/0609138&quot;&gt;cs.IT/0609138&lt;/a&gt;
	&lt;li&gt;Michael Small, &quot;Optimal time delay embedding for nonlinear time
series modeling&quot;, &lt;a
href=&quot;http://arxiv.org/abs/nlin.CD/0312011&quot;&gt;nlin.CD/0312011&lt;/a&gt;
	&lt;li&gt;Michael Small and C. K. Tse, &quot;Optimal embedding parameters: A
modeling paradigm&quot;, &lt;a
href=&quot;http://arxiv.org/abs/physics/0308114&quot;&gt;physics/0308114&lt;/a&gt;
	&lt;li&gt;Nikolai Vereshchagin and Paul Vitanyi, &quot;Kolmogorov's Structure 
Functions with an Application to the Foundations of Model Selection,&quot; 
&lt;a href=&quot;http://arxiv.org/abs/cs.CC/0204037&quot;&gt;cs.CC/0204037&lt;/a&gt; 
	&lt;li&gt;C. S. Wallace, &lt;cite&gt;Statistical and Inductive Inference by Minimum
Message Length&lt;/cite&gt; [&lt;a
href=&quot;http://www.springeronline.com/sgw/cda/frontpage/0,11855,5-0-22-35893962-0,00.html&quot;&gt;Blurb&lt;/a&gt;]
	&lt;li&gt;C. S. Wallace and David L. Dowe, &quot;Minimum Message Length and Kolmogorov Complexity&quot;, &lt;cite&gt;The Computer Journal&lt;/cite&gt; &lt;strong&gt;42&lt;/strong&gt; (1999): 270--283
	&lt;/ul&gt;
</description>
  </item>
  </channel>
</rss>