Probably Algorithmically True
17 Nov 2003 19:28
The following is merely an amusement; it contains (at least) one fatal error, one conclusion which is valid but phrased so as to mislead the unwary, and one unwarranted technical assumption (which may actually be OK but I haven't checked). It may help if you've read the notebook on Gödel's Theorem, and/or some actual math on that subject; likewise for computational and statistical learning theory. [With thanks to Ben Wieland for correcting an embarrassing mix-up between truths and theorems in my initial statement of the theorem. Sheesh.]
Let us suppose we have a formal language L and a set of axioms A in L. There are some sentences of L which will be true under any interpretation in which A is true; these are the theorems following from A, and we'll call them, collectively, T. Let us suppose further that T is sufficiently powerful that it can model ordinary (Peano) arithmetic, i.e., there is some interpretation of arithmetic in terms of L such that all the theorems of arithmetic are included in T. Then Gödel's Theorem tells us that the system must be inconsistent, or incomplete, or both. If it is inconsistent, then there are sentences s of L such that both s and its negation, ~s, are in T. If it is incomplete, then there are sentences s such that neither s nor ~s is in T. In any given interpretation of L, any such s is either true or false, but s is true in some interpretations in which the axioms are true, and false in others. In other words, the truth of these sentences depends on the particular interpretation, and not (just) on the truth of the axioms. * Since inconsistent formal systems are uninteresting, let's ignore them.
The real substance of incompleteness can be restated in several ways. One is that there is no algorithmic, computable decision procedure for T --- no algorithm which, given a sentence s in L, can determine whether or not s belongs in T, and so is necessarily true if the axioms hold. Another is that there is no algorithmic, computable procedure to generate all the theorems of T, and no other sentences. (To see that these two are equivalent, note that "Can we generate the sentence s?" is a decision procedure, and feeding all possible sentences, in order, through a decision procedure is a generating procedure.) Let's see if we can't nibble away at the decision procedure angle.
Any sentence of L can be assigned some natural number, called its Gödel number, from which we can recover the sentence. Now there are just as many rational numbers between 0 and 1 as there are natural numbers, so to each Gödel number there corresponds a certain fraction in the unit interval. A decision procedure, then, is a function D from the rational fractions to 0 (non-theorem) or 1 (theorem). As a mathematical object, D definitely exists alright; there's no question about that. The content of Gödel's Theorem is that D is not computable, which is to say that the shortest program to compute it is infinitely long --- the Kolmogorov complexity of D is infinite.
So, we can't actually compute D. But can we learn it? More precisely, can we reliably learn computable approximations to D?
The obvious answer is "yes, D is just another classification function". Take our favorite class of (say) neural networks, which output 0 or 1 for each (real-valued) point on the unit interval. Now generate random sentences of L, according to some distribution. (You might generate unbounded random integers, discarding the ones which are not Gödel numbers, and then convert the Gödel numbers to rational fractions. Or you might do something more clever, involving the grammar of L; doesn't matter.) For each network, there is some probability that, given a random sentence you generate, it will mis-classify the sentence (outputting 0 when it should've said 1, or vice versa). Call this probability the network's generalization error. Take the infimum over generalization errors of all the networks in our class, and call the result m. This is the best attainable error rate in the class.
Now, draw some particular sample of sentences. For each network, define the in-sample error as the fraction of mis-classified sentences in that sample. Pick out the network, call it N, with the lowest in-sample error rate. Let e(N) be the generalization error of N. The fundamental theorem of statistical learning theory, due to Vapnik and Chervonenkis, says that, as the sample size grows, e(N) converges in probability on m. So say I ask you to find me a network whose performance comes within some arbitrary margin of approximation, call it E, of the best possible --- i.e., I get to pick E, and challenge you to find me a network whose generalization error is at most m+E. If e(N) converged on m, period, you could always do this by just letting the sample size be big enough, and then handing me N. Unfortunately, the convergence is only "in probability", but that means that, by letting the sample size get big enough, you can make the chance that N does not meet my challenge arbitrarily small. More precisely, you can pick any confidence level p you want, and then, if you just take enough samples, the probability that e(N) > m + E is, at most, p.
What this means is that, with enough random sentences, we can get a decision procedure which is, with very high probability, almost as good as the best decision procedure in that class of neural networks. How good is the best network? Here is why I specified neural networks, rather than some other kind of classifier (e.g., decision trees or support vector machines). Neural networks are well known to be universal function approximators, meaning that if we want to approximate (essentially) any real function to arbitrarily close tolerance, there's some neural network which will do it, provided we'll give it enough nodes. The function we want to approximate is D, who's generalization error would be zero, so the infimum of error rates over all possible neural networks is, in fact, zero. As we let the number of nodes in our network grow, we get larger and larger classes of networks, which contain successively closer and closer approximations of D. More complex networks require more samples than simple ones to reach the same level of probable approximate correctness, which in this case means probable approximate truth, but one can still push the probability and the approximation up arbitrarily high. We don't know the best attainable error for any given degree of complexity, but we don't have to --- we just have to know that it's going down (which we do), and know the rate of convergence of the best in-sample network to the best possible network (which only depends on the class of networks, through a quantity called the VC dimension).
So here is our final strategy ("structural risk minimization") for learning something which is probably approximately a decision procedure for a formally undecidable theory T in the language L. Start with a big sample of random sentences in L. For each degree of neural network complexity (one hidden node, two, etc.), find the network with the best in-sample error rate. Pick a confidence level p, and discount the in-sample performances by the expected degree of over-fitting, which depends on the VC dimension and on p. Chose the network with the smallest discounted generalization error. Then by (e.g.) Theorem 4.1 in Vapnik's Nature of Statistical Learning Theory, the resulting network's generalization error converges in probability on the infimum of the error over all possible networks, which we've just shown is zero. Hence the network converges in probability on D, even though, at every stage, the network is a computable function. Q.E.D.
This procedure is easily modified so as to eliminate the possibility of either false positives (non-theorems classified as theorems) or false negatives (theorems rejected as non-theorems), as desired. Furthermore, it would be easy to extended it by applying any of the various methods for combining classifiers known in the literature, e.g., boosting.
*: Yet more formally, we can state the theorem as: No theory can be at once consistent, complete, axiomatizable and an extension of Peano arithmetic. Hence, if T is axiomatizable (which it is, by hypothesis), and an extension of Peano arithmetic (which it is, by hypothesis), then it cannot be both consistent and complete. So T must be inconsistent or incomplete. (Back)