The Minimum Description Length Principle (MDL)

24 Jan 2018 15:37

MDL is an information-theoretic approach to machine learning, or statistical model selection, which basically says you should pick the model which gives you the most compact description of the data, including the description of the model itself. More precisely, given a probabilistic model, Shannon's coding theorems tell you the minimal number of bits needed to encode your data, i.e., the maximum extent to which it can be compressed. Really, however, to complete the description, you need to specify the model as well, from among some set of alternatives, and this will also require a certain number of bits. Hence you really want to minimize the combined length of the description of the model, plus the description of the data under that model. This works out to being a kind of penalized maximum likelihood — the data-given-model bit is the negative log likelihood, and the model-description term is the penalty.

It's a very appealing idea, and a lot of work has (rightly) been done under this heading, though I have to say I'm not altogether convinced, both because of the issues involved in chosing a coding scheme for models, and because it's not clear that, in practice, it actually does that much better than other penalization schemes — sometimes, even, than straightforward likelihood maximization.

I should also say that what I've described above is the old-fashioned "two-part" MDL, and there are now "one-part" schemes, where (so to speak) the model coding scheme is supposed to be fixed by the data as well, in unambiguous and nearly-optimal ways. These are remarkably ingenious constructions, intimately linked to low-regret learning, as well as minimax statistics more generally, but I will not attempt to describe them here; see rather Grünwald.

See also: Algorithmic Information Theory; Occam's Razor; Universal Prediction Algorithms