## "Occam"-style Bounds for Long Programs

*23 Sep 2021 11:13*

What follows is essentially the first draft I wrote sometime on or before 17 June 2010, with a few old links fixed and nicer mathematical type-setting. I still think it's right.

A common idea in machine learning is that one can explain, or even prove, Occam's Razor in terms of algorithmic information content. This rests on results which say that if one finds a simple classification rule which works well in-sample, then it is highly probable that it will continue to work well out-of-sample. (The clearest and most elementary explanation of such results I've seen is in Kearns and Vazirani's Introduction to Computational Learning Theory.)

I think there are several things wrong with this --- not with the theorem, which is perfectly valid, but with the argument.

- Occam's razor is about
*truth*, not about*generalization performance*. Generalization can be enhanced by picking a model which you*know*is wrong, if for instance fitting the correct model would subject one to the curse of dimensionality. - More generally, this doesn't seem like a very productive way to think
about Occam's Razor. This should, IMHO, be seen as a property
of
*methods*rather than of*theories*. That is, the Razor is the rule which says "among the theories compatible with the evidence, chose the simplest", and the question is why this leads to the truth*better*than a rule like "among the theories compatible with the evidence, chose the one whose statement involves the fewest occurrences of the letter 'e'", or even "the one which most glorifies Divine Providence". - The result seems to hinge on a special way of defining "simple rules", when it actually doesn't.

The technical results say that a classification rule is simple if it has a short description, measured in bits. (That is, we are in minimum description length land, or very close to it.) The shorter the description, the tighter the bound on the generalization error. I am happy to agree that this is a reasonable (if language-dependent) way of defining "simplicity" for classifier rules. However, so far as I can tell, this really isn't what makes the proofs work.

What actually makes the proofs go is the fact that there are not very many
binary strings of length \( m \) for small \( m \). Finding a classifier
with a short description means that you have searched over a small space. It's
the restriction of the search space which really does all the work. It's not
the *simplicity* of the rules which matters, but the fact that simple
rules are scarce in the space of all possible rules. If confines oneself to
elaborate, baroque rules, but sticks to a *particular style* of baroque
elaboration, one obtains the same effect.

For example, suppose we have \( n \) data points for a binary
classification problem. Enumerate the rules in some order, say lexicographic.
Now consider the set of rules whose serial numbers are
exactly \( m^{m^m} \), with \( m \)
being any prime number less than or equal to \( n \). By the prime number
theorem, there are approximately \( \frac{n}{\log{n}} \) such prime numbers.
Suppose one of these rules correctly classifies all the data. The likelihood
of it doing so if its actual error rate were \( q \) is
\( (1-q)^n \). By a union bound, the probability that
the actual error rate is \( q \) or more
is \( \leq n (1-q)^n / \log{n} \). Notice that
no matter how low an error rate I demand, I can achieve it with arbitrarily
high confidence by taking \( n \) sufficiently large. But, notice, most of
the rules involves here have enormous serial numbers, and so must be very long
and complicated programs. That doesn't matter, because there still just aren't
many of them. In fact, this argument would still work if I replaced the prime
less than or equal to \( n \) with a *random* set, provided the
density of the random set was \( 1/\log{n} \).

There may be something I'm missing here — that's why this is a note and not a paper — but I honestly don't see what.

See also: Algorithmic Information Theory