Notebooks
http://bactra.org/notebooks
Cosma's NotebooksenProbably Algorithmically True
http://bactra.org/notebooks/2003/11/17#probably-algorithmically-true
<P>The following is merely an amusement; it contains (at least) one fatal
error, one conclusion which is valid but phrased so as to mislead the unwary,
and one unwarranted technical assumption (which may actually be OK but I
haven't checked). It may help if you've read the <a
href="godels-theorem.html">notebook on Gödel's Theorem</a>, and/or some
actual math on that subject; likewise for <a
href="http://bactra.org/reviews/kearns-vazirani/">computational</a> and <a
href="http://bactra.org/reviews/vapnik-nature/">statistical</a> learning
theory. [With thanks to Ben Wieland for correcting an embarrassing mix-up
between <em>truths</em> and <em>theorems</em> in my initial statement of the
theorem. Sheesh.]</P>
<P>Let us suppose we have a formal language <i>L</i> and a set of axioms
<i>A</i> in <i>L</i>. There are some sentences of <i>L</i> which will be true
under any interpretation in which <i>A</i> is true; these are the theorems
following from <i>A</i>, and we'll call them, collectively, <i>T</i>. Let us
suppose further that <i>T</i> is sufficiently powerful that it can model
ordinary (Peano) arithmetic, i.e., there is some interpretation of arithmetic
in terms of <i>L</i> such that all the theorems of arithmetic are included in
<i>T</i>. Then Gödel's Theorem tells us that the system must be
inconsistent, or incomplete, or both. If it is <em>inconsistent</em>, then
there are sentences <i>s</i> of <i>L</i> such that both <i>s</i> and its
negation, <i>~s</i>, are in <i>T</i>. If it is <em>incomplete</em>, then there
are sentences <i>s</i> such that neither <i>s</i> nor <i>~s</i> is in <i>T</i>.
In any given interpretation of <i>L</i>, any such <i>s</i> is either true or
false, but <i>s</i> is true in some interpretations in which the axioms are
true, and false in others. In other words, the truth of these sentences
depends on the particular interpretation, and not (just) on the truth of the
axioms. <a href="#footnote" name="returnfromnote">*</a> Since inconsistent
formal systems are uninteresting, let's ignore them.</P>
<P>The real substance of incompleteness can be restated in several ways. One
is that there is no algorithmic, computable <em>decision procedure</em> for
<i>T</i> --- no algorithm which, given a sentence <i>s</i> in <i>L</i>, can
determine whether or not <i>s</i> belongs in <i>T</i>, and so is necessarily
true if the axioms hold. Another is that there is no algorithmic, computable
procedure to generate all the theorems of <i>T</i>, and no other sentences.
(To see that these two are equivalent, note that "Can we generate the
sentence <i>s</i>?" is a decision procedure, and feeding all possible
sentences, in order, through a decision procedure is a generating procedure.)
Let's see if we can't nibble away at the decision procedure angle.</P>
<P>Any sentence of <i>L</i> can be assigned some natural number, called its
Gödel number, from which we can recover the sentence. Now there are just
as many rational numbers between 0 and 1 as there are natural numbers, so to
each Gödel number there corresponds a certain fraction in the unit
interval. A decision procedure, then, is a function <i>D</i> from the rational
fractions to 0 (non-theorem) or 1 (theorem). As a mathematical object,
<i>D</i> definitely exists alright; there's no question about that. The
content of Gödel's Theorem is that <i>D</i> is not computable, which is to
say that the shortest program to compute it is infinitely long --- <a
href="complexity-measures.html">the Kolmogorov complexity of <i>D</i> is
infinite</a>.</P>
<P>So, we can't actually compute <i>D.</i> But can we learn it? More
precisely, can we reliably learn computable approximations to <i>D</i>?</P>
<P>The obvious answer is "yes, <i>D</i> is just another classification
function". Take our favorite class of (say) <a href="neural-nets.html">neural
networks</a>, which output 0 or 1 for each (real-valued) point on the unit
interval. Now generate random sentences of
<i>L</i>, according to some distribution. (You might generate unbounded random
integers, discarding the ones which are not Gödel numbers, and then
convert the Gödel numbers to rational fractions. Or you might do
something more clever, involving the grammar of <i>L</i>; doesn't matter.) For
each network, there is some probability that, given a random sentence you
generate, it will mis-classify the sentence (outputting 0 when it should've
said 1, or vice versa). Call this probability the network's <em>generalization
error</em>. Take the infimum over generalization errors of all the networks in
our class, and call the result <i>m</i>. This is the best attainable error
rate in the class.</P>
<P>Now, draw some particular sample of sentences. For each network, define the
<em>in-sample error</em> as the fraction of mis-classified sentences in that
sample. Pick out the network, call it <i>N</i>, with the lowest in-sample
error rate. Let <i>e(N)</i> be the <em>generalization</em> error of <i>N</i>.
The fundamental theorem of statistical learning theory, due to Vapnik and
Chervonenkis, says that, as the sample size grows, <i>e(N)</i> converges in
probability on <i>m</i>. So say I ask you to find me a network whose
performance comes within some arbitrary margin of approximation, call it
<i>E</i>, of the best possible --- i.e., I get to pick <i>E</i>, and challenge
you to find me a network whose generalization error is at most <i>m+E</i>.
If <i>e(N)</i> converged on <i>m</i>, period, you could always do this by just
letting the sample size be big enough, and then handing me <i>N</i>.
Unfortunately, the convergence is only "in probability", but that means that,
by letting the sample size get big enough, you can make the chance that
<i>N</i> does not meet my challenge arbitrarily small. More precisely, you can
pick any confidence level <i>p</i> you want, and then, if you just take enough
samples, the probability that <i>e(N) > m + E</i> is, at most, <i>p</i>.</P>
<P>What this means is that, with enough random sentences, we can get a decision
procedure which is, with very high probability, almost as good as the best
decision procedure in that class of neural networks. How good is the best
network? Here is why I specified neural networks, rather than some other kind
of classifier (e.g., decision trees or support vector machines). Neural
networks are well known to be universal function approximators, meaning that if
we want to approximate (essentially) any real function to arbitrarily close
tolerance, there's some neural network which will do it, provided we'll give it
enough nodes. The function we want to approximate is <i>D</i>, who's
generalization error would be zero, so the infimum of error rates over all
possible neural networks is, in fact, zero. As we let the number of nodes in
our network grow, we get larger and larger classes of networks, which contain
successively closer and closer approximations of <i>D</i>. More complex
networks require more samples than simple ones to reach the same level of
probable approximate correctness, which in this case means probable approximate
truth, but one can still push the probability and the approximation up
arbitrarily high. We don't know the best attainable error for any given degree
of complexity, but we don't have to --- we just have to know that it's going
down (which we do), and know the rate of convergence of the best in-sample
network to the best possible network (which only depends on the class of
networks, through a quantity called the VC dimension).</P>
<P>So here is our final strategy ("structural risk minimization") for learning
something which is probably approximately a decision procedure for a formally
undecidable theory <i>T</i> in the language <i>L</i>. Start with a big sample
of random sentences in <i>L</i>. For each degree of neural network complexity
(one hidden node, two, etc.), find the network with the best in-sample error
rate. Pick a confidence level <i>p</i>, and discount the in-sample
performances by the expected degree of over-fitting, which depends on the VC
dimension and on <i>p</i>. Chose the network with the smallest discounted
generalization error. Then by (e.g.) Theorem 4.1 in Vapnik's <cite>Nature of
Statistical Learning Theory</cite>, the resulting network's generalization
error converges in probability on the infimum of the error over all possible
networks, which we've just shown is zero. Hence the network converges in
probability on <i>D</i>, even though, at every stage, the network is a
computable function. Q.E.D.</P>
<P>This procedure is easily modified so as to eliminate the possibility of
either false positives (non-theorems classified as theorems) or false negatives
(theorems rejected as non-theorems), as desired. Furthermore, it would be easy
to extended it by applying any of the various methods for combining classifiers
known in the literature, e.g., boosting.</P>
<P><a name="footnote">*</a>: Yet more formally, we can state the theorem as: No
theory can be at once consistent, complete, axiomatizable and an extension of
Peano arithmetic. Hence, if <i>T</i> is axiomatizable (which it is, by
hypothesis), and an extension of Peano arithmetic (which it is, by hypothesis),
then it cannot be both consistent and complete. So <i>T</i> must be
inconsistent or incomplete. <a href="#returnfromnote">(Back)</a></P>