"Occam"-style Bounds for Long Programs

23 Sep 2021 13:40

What follows is essentially the first draft I wrote sometime on or before 17 June 2010, with a few old links fixed and nicer mathematical type-setting. I still think it's right.

A common idea in machine learning is that one can explain, or even prove, Occam's Razor in terms of algorithmic information content. This rests on results which say that if one finds a simple classification rule which works well in-sample, then it is highly probable that it will continue to work well out-of-sample. (The clearest and most elementary explanation of such results I've seen is in Kearns and Vazirani's Introduction to Computational Learning Theory.)

I think there are several things wrong with this --- not with the theorem, which is perfectly valid, but with the argument.

  1. Occam's razor is about truth, not about generalization performance. Generalization can be enhanced by picking a model which you know is wrong, if for instance fitting the correct model would subject one to the curse of dimensionality.
  2. More generally, this doesn't seem like a very productive way to think about Occam's Razor. This should, IMHO, be seen as a property of methods rather than of theories. That is, the Razor is the rule which says "among the theories compatible with the evidence, chose the simplest", and the question is why this leads to the truth better than a rule like "among the theories compatible with the evidence, chose the one whose statement involves the fewest occurrences of the letter 'e'", or even "the one which most glorifies Divine Providence".
  3. The result seems to hinge on a special way of defining "simple rules", when it actually doesn't.
Point (1) is something I've covered in section 2.1 of lecture 23 in my data-mining course chapter 3 of Advanced Data Analysis from an Elementary Point of View. Point (2) is something I'm copying from my friend Kevin Kelly, so I'll just refer you to him, and/or my notebook on the Razor itself. Point (3) is what I'll elaborate on here.

The technical results say that a classification rule is simple if it has a short description, measured in bits. (That is, we are in minimum description length land, or very close to it.) The shorter the description, the tighter the bound on the generalization error. I am happy to agree that this is a reasonable (if language-dependent) way of defining "simplicity" for classifier rules. However, so far as I can tell, this really isn't what makes the proofs work.

What actually makes the proofs go is the fact that there are not very many binary strings of length \( m \) for small \( m \). Finding a classifier with a short description means that you have searched over a small space. It's the restriction of the search space which really does all the work. It's not the simplicity of the rules which matters, but the fact that simple rules are scarce in the space of all possible rules. If confines oneself to elaborate, baroque rules, but sticks to a particular style of baroque elaboration, one obtains the same effect.

For example, suppose we have \( n \) data points for a binary classification problem. Enumerate the rules in some order, say lexicographic. Now consider the set of rules whose serial numbers are exactly \( m^{m^m} \), with \( m \) being any prime number less than or equal to \( n \). By the prime number theorem, there are approximately \( \frac{n}{\log{n}} \) such prime numbers. Suppose one of these rules correctly classifies all the data. The likelihood of it doing so if its actual error rate were \( q \) is \( (1-q)^n \). By a union bound, the probability that the actual error rate is \( q \) or more is \( \leq n (1-q)^n / \log{n} \). Notice that no matter how low an error rate I demand, I can achieve it with arbitrarily high confidence by taking \( n \) sufficiently large. But, notice, most of the rules involves here have enormous serial numbers, and so must be very long and complicated programs. That doesn't matter, because there still just aren't many of them. In fact, this argument would still work if I replaced the prime less than or equal to \( n \) with a random set, provided the density of the random set was \( 1/\log{n} \).

There may be something I'm missing here — that's why this is a note and not a paper — but I honestly don't see what.

See also: Algorithmic Information Theory