"Occam"-style Bounds for Long Programs
http://bactra.org/notebooks/2021/09/23#occam-bounds-for-long-programs
<blockquote>What follows is essentially the first draft I wrote sometime on or
before 17 June 2010, with a few old links fixed and nicer mathematical
type-setting. I still think it's right.</blockquote>
<P>A common idea in <a href="learning-inference-induction.html">machine
learning</a> is that one can explain, or even
prove, <a href="occams-razor.html">Occam's Razor</a> in terms of algorithmic
information content. This rests on results which say that if one finds a
simple classification rule which works well in-sample, then it is highly
probable that it will continue to work well out-of-sample. (The clearest and
most elementary explanation of such results I've seen is in Kearns and
Vazirani's <cite><a href="../reviews/kearns-vazirani/">Introduction to
Computational Learning Theory</a></cite>.)
<P>I think there are several things wrong with this --- not with the theorem,
which is perfectly valid, but with the argument.
<ol>
<li> Occam's razor is about <em>truth</em>, not about <em>generalization
performance</em>. Generalization can be enhanced by picking a model which you
<em>know</em> is wrong, if for instance fitting the correct model would subject
one to the curse of dimensionality.
<li> More generally, this doesn't seem like a very productive way to think
about Occam's Razor. This should, IMHO, be seen as a property
of <em>methods</em> rather than of <em>theories</em>. That is, the Razor is
the rule which says "among the theories compatible with the evidence, choose the
simplest", and the question is why this leads to the truth <em>better</em> than
a rule like "among the theories compatible with the evidence, choose the one
whose statement involves the fewest occurrences of the letter 'e'", or even
"the one which most glorifies Divine Providence".
<li> The result seems to hinge on a special way of defining "simple rules",
when it actually doesn't.
</ol>
Point (1) is something I've covered in
<strike>section 2.1 of lecture 23 in
my <a href="http://www.stat.cmu.edu/~cshalizi/350/">data-mining course</a></strike> chapter 3 of <cite><a href="https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/">Advanced Data Analysis from an Elementary Point of View</a></cite>. Point (2) is something I'm copying from my
friend Kevin Kelly, so I'll
just <a href="http://www.andrew.cmu.edu/user/kk3n/ockham/Ockham.htm">refer you
to him</a>, and/or my <a href="occams-razor.html">notebook on the Razor itself</a>. Point (3) is what I'll elaborate on here.
<P>The technical results say that a classification rule is simple if it has a
short description, measured in bits. (That is, we are
in <a href="mdl.html">minimum description length land</a>, or very close to
it.) The shorter the description, the tighter the bound on the generalization
error. I am happy to agree that this is a reasonable (if language-dependent)
way of defining "simplicity" for classifier rules. However, so far as I can
tell, this really isn't what makes the proofs work.
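<P>(To make the shape of these results concrete, here is my gloss, in the usual PAC style; it is not a quotation of any one theorem, and the symbol \( \delta \), the allowed failure probability, is my notation. If each rule is described by a binary string, there are fewer than \( 2^{m+1} \) rules of description length at most \( m \) bits. So, for \( n \) independent samples, with probability at least \( 1-\delta \), every rule of description length at most \( m \) which classifies all the samples correctly has true error rate
\[ \mathrm{err} \leq \frac{(m+1)\ln{2} + \ln{(1/\delta)}}{n} . \]
Shorter descriptions give smaller \( m \), hence tighter bounds.)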
<P>What actually makes the proofs go is the fact that there are not very many
binary strings of length \( m \) for small \( m \). Finding a classifier
with a short description means that you have searched over a small space. It's
the restriction of the search space which really does all the work. It's not
the <em>simplicity</em> of the rules which matters, but the fact that simple
rules are scarce in the space of all possible rules. If one confines oneself to
elaborate, baroque rules, but sticks to a <em>particular style</em> of baroque
elaboration, one obtains the same effect.
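<P>Stated baldly (again my gloss, just an explicit union bound): for <em>any</em> set \( S \) of rules fixed before seeing the data,
\[ \Pr\left( \text{some } h \in S \text{ with } \mathrm{err}(h) \geq q \text{ fits all } n \text{ samples} \right) \leq |S| (1-q)^n . \]
The description-length bound above is this with \( |S| < 2^{m+1} \), the number of binary strings of at most \( m \) bits; nothing else about the descriptions is used.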
<P>For example, suppose we have \( n \) data points for a binary
classification problem. Enumerate the rules in some order, say lexicographic.
Now consider the set of rules whose serial numbers are
exactly \( m^{m^m} \), with \( m \)
being any prime number less than or equal to \( n \). By the <a href="https://en.wikipedia.org/wiki/Prime_number_theorem">prime number
theorem</a>, there are approximately \( \frac{n}{\log{n}} \) such prime numbers.
Suppose one of these rules correctly classifies all the data. The likelihood
of it doing so, if its actual error rate were \( q \) or more, is at
most \( (1-q)^n \). By a union bound, the probability that
<em>any</em> rule in the set with actual error rate \( q \) or more fits all
the data is \( \leq n (1-q)^n / \log{n} \). Notice that
no matter how low an error rate I demand, I can achieve it with arbitrarily
high confidence by taking \( n \) sufficiently large. But, notice, most of
the rules involved here have enormous serial numbers, and so must be very long
and complicated programs. That doesn't matter, because there still just aren't
many of them. In fact, this argument would still work if I replaced the primes
less than or equal to \( n \) with a <em>random</em> set, provided the
density of the random set was \( 1/\log{n} \).
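<P>A minimal numerical sketch of this (plain Python; the function names are mine, and I count the primes up to \( n \) exactly rather than relying on the \( \frac{n}{\log{n}} \) approximation, working on a log scale to dodge floating-point underflow):
<pre>
from math import log10

def primes_up_to(n):
    """Sieve of Eratosthenes: all primes up to and including n."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [p for p in range(2, n + 1) if sieve[p]]

def log10_bound(n, q):
    """log10 of the union bound pi(n) * (1-q)^n, where pi(n) is the
    number of rules in the family (one per prime up to n)."""
    return log10(len(primes_up_to(n))) + n * log10(1 - q)

for n in (100, 1000, 10000):
    print(n, log10_bound(n, q=0.05))
</pre>
At \( q = 0.05 \), the bound is about \( 0.15 \) at \( n = 100 \), but around \( 10^{-20} \) by \( n = 1000 \): the family grows only like \( n/\log{n} \), while the chance of any fixed high-error rule fitting the data by luck shrinks geometrically in \( n \).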
<P>There may be something I'm missing here — that's why this is a note
and not a paper — but I honestly don't see what.
<P>See also:
<a href="algorithmic-information-theory.html">Algorithmic Information Theory</a>