The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi   120

Truth from Trash

How Learning Makes Sense

by Chris Thornton

Complex Adaptive Systems series
MIT Press, 2000

Two Cheers for Trash

We have here two books for the price of one.

The first is an engaging and fairly non-technical survey of the field of machine learning, somewhere between the level of New Scientist and a friendly intro. textbook (which doesn't seem to exist for the field). This book is excellent and highly recommendable, if sometimes a bit too cute for its own good.

The other book claims to draw out the implications of machine learning, its achievements and limitations, for a host of Deep Issues --- induction and the reliability of knowledge, the nature of the scientific endeavor, questions of representation and embodiment in cognitive science, the nature of pattern or regularity, even the reality of theoretical entities. It would be kind to describe the thinking displayed in this book as ``sloppy''.

Unfortunately, it is not easy to disentangle the two books. I'll start with the good one.

The first big point Thornton makes is that machine learning is equivalent to prediction. That is, whenever we can specify some learning task sufficiently precisely to get a machine to do it, it's equivalent to having the machine learn to do predictions of a certain kind. We always want the machine to learn a rule of the form ``If X, then Y''; which is the same, from its perspective, as learning to predict Y when it sees X. In what is called ``concept learning,'' for instance, the output Y is a binary variable, 1 if X is an instance of a certain concept (e.g., ``edible fruit'') and 0 otherwise. Learning the concept is equivalent to learning to predict 0 or 1, depending on X. And so on for more complicated classifications, regression, learning behaviors, and so forth. All this is quite sound. I actually thought this went without saying, but Thornton evidently felt a need to convince his fellow computer scientists of it.
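
To make the equivalence concrete, here is a minimal sketch (mine, not Thornton's) of concept learning cast as binary prediction; the fruit features and the rule are invented purely for illustration.

    # Concept learning as binary prediction: the "concept" becomes a function
    # from feature vectors to {0, 1}.  Features and rule are made up for
    # illustration; learning would mean arriving at some such rule from
    # labeled examples, and using it is just prediction.
    def edible_fruit(features):
        weight_grams, smells_sweet = features
        return 1 if (smells_sweet and weight_grams < 500) else 0

    print(edible_fruit((120, True)))    # -> 1: an instance of the concept
    print(edible_fruit((800, False)))   # -> 0: not an instance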

The second big point is that, for a broad class of learning/prediction problems, we have methods which work quite well, and are pretty simple at their core. The easiest is just what's called ``nearest neighbor'': given an unknown input, predict the output of the most similar input whose value is known. One can elaborate such similarity-based methods considerably, but in essence they all involve what Thornton calls ``fence-and-fill'': carve up the input space into simple blocks (with ``fences''), and treat every point in the block the same way (``fill''). The refinements of fence-and-fill methods have to do with how partitions are created and shifted in response to training corrections. (A particularly simple case of fence-and-fill learning is dividing the input space by straight lines; if this is successful, the problem is said to be linearly separable, and this special case seems to motivate a lot of Thornton's thinking, as we'll see.)
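
Here is a minimal sketch of the nearest-neighbor idea, the simplest of the similarity-based methods; the toy data and the squared-Euclidean distance are my own placeholders, not anything from the book.

    # 1-nearest-neighbor prediction: give an unknown input the output of the
    # most similar input whose value is already known.
    def nearest_neighbor_predict(train, query):
        """train: list of (input_vector, output) pairs; query: input_vector."""
        def squared_distance(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        _, output = min(train, key=lambda pair: squared_distance(pair[0], query))
        return output

    training_data = [((1.0, 1.0), "red"), ((5.0, 5.0), "blue"), ((1.5, 0.5), "red")]
    print(nearest_neighbor_predict(training_data, (0.9, 1.2)))   # -> "red"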

The third big point is that some problems are not very amenable to fence-and-fill learning. He calls these ``relational'' problems, which I think is a bad name --- but let us leave that for discussing the other book within these covers. The prototype of these problems, to Thornton's mind, is parity: given a number, guess whether it is even or odd. The kicker here is that guessing a number's parity from its nearest neighbor's is always wrong, and no finite number of (contiguous) partitions will work. If fence-and-fill won't work, what will?
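
A quick sketch of why parity defeats the similarity guess: the integers nearest to n are n - 1 and n + 1, and both have the opposite parity, so the nearest-neighbor prediction is wrong every single time. (The even/odd split of the data below is my own toy setup.)

    # Parity vs. nearest neighbor: predict a number's parity from that of its
    # nearest neighbor among the numbers we have already seen.
    def parity(n):
        return n % 2

    def nearest_neighbor_parity_guess(n, known):
        return parity(min(known, key=lambda m: abs(m - n)))

    known = list(range(0, 100, 2))       # we have seen the even numbers
    queries = list(range(1, 100, 2))     # and are asked about the odd ones
    wrong = sum(nearest_neighbor_parity_guess(n, known) != parity(n) for n in queries)
    print(wrong, "wrong out of", len(queries))   # -> 50 wrong out of 50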

Two traditional strategies that Thornton considers, but dismisses as not really adequate, are what he calls ``supercharged fence-and-fill'' and ``pick-and-mix''. The first uses various tricks to make fence-and-fill easier to apply, in the hopes of getting some mileage out of it. One can expand the dimensionality of the input space (which in a sense makes it more likely that it clusters), or increase the number and complexity of the fences available. On the other hand, pick-and-mix seeds the learner with some basic relations which it may find helpful, and has it construct increasingly elaborate combinations of these relations in the hopes of predicting more accurately. The problem with supercharged fence-and-fill is that it doesn't really address the nature of the problem (or so Thornton says); the problem with pick-and-mix is that it demands prior knowledge of what is likely to work in this problem domain. Thornton regards neither of these as satisfactory.

What he does regard as satisfactory is the ``truth from trash'' procedure of his title. It works like this:

  1. Use the chosen fence-and-fill method to process the current input data, that is, to find regions of uniformly labeled stimuli.
  2. Recode the current input data so as to draw identically labeled data points closer together. ...
  3. Assign the derived data to be the current data and apply the procedure recursively until data are generated that can be satisfactorily processed solely by the fence-and-fill method. [p. 171]
As Thornton notes, truth-from-trash does not have to assume that the clusters it develops at each stage are meaningful in themselves, just that they are going to be useful in deriving a yet-more-accurate cluster at the next step. Thornton doesn't prove that this kind of ``proto-representational learning'' will do any better than supercharged fence-and-fill, much less than appropriately chosen pick-and-mix, but I agree that it does have a better feel to it than the former, and avoids the latter's need for expert priming. (Curiously, Thornton does not refer to any more formal publication on truth-from-trash, either here or on his web-site.)
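
For concreteness, here is a skeletal reading of the quoted procedure, emphatically my own paraphrase and not Thornton's code; the base learner, the recoding step, and the stopping test are all placeholders to be supplied.

    # Truth-from-trash, as I read the three quoted steps: cluster with a
    # fence-and-fill learner, recode the data so identically labeled points
    # move closer together, and recurse until fence-and-fill alone suffices.
    def truth_from_trash(data, fence_and_fill, recode, good_enough, max_depth=10):
        """data: list of (input_vector, label) pairs; the other arguments are
        placeholder functions standing in for whatever one actually uses."""
        for _ in range(max_depth):
            regions = fence_and_fill(data)      # step 1: regions of uniform label
            if good_enough(regions, data):      # stop when fence-and-fill suffices
                return regions
            data = recode(data, regions)        # steps 2-3: re-represent, recurse
        return fence_and_fill(data)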

One of the nicer aspects of truth-from-trash is that it shows how to construct the kind of abstract, decoupled, ``symbolic'' representation beloved of Good Old-Fashioned AI from the more dynamical, input-driven, connectionist devices of ``nouvelle'' AI. (Pace Thornton, truth-from-trash is a symbol system within the meaning of the act, or at any rate of Herbert Simon's definition. So are connectionist systems.) This is all to the good, since (as emphasized by Daniel Dennett, Andy Clark, and others on the side of the angels) cognition and the nervous system plainly employ both reactive, embodied, nouvelle-ish devices and abstractive, symbol-manipulating, disembodied devices, and no good can come of pretending that one or the other doesn't exist, or that they can't work together. *

Let me now turn to the bad book, which begins about where Thornton starts talking about hard learning problems, ones fence-and-fill doesn't work very well on. As mentioned, he calls them ``relational,'' but it will become clear why I think that's not a very good name for what he has in mind. He offers three characterizations of these problems, one of which is definitely not equivalent to the other two, which in turn are not obviously equivalent to each other, though they may be. He does not use ``relation'' in the mathematical-logical sense of a set of ordered pairs (or more generally n-tuples). After all, what similarity-learning does is learn a relationship between the input values and the output, in this sense. So he needs to mean something more.

The first characterization he offers of what he means is that a relational problem is one where the absolute values of the inputs are uninformative as to the output, but the presence or absence of some relation among the inputs is. This can be characterized perfectly straight-forwardly in information-theoretic terms (though he does not do so): a problem is perfectly relational in this sense if the mutual information between the output variable and any one of the input variables is zero (and the mutual information between the output and the whole set of the input variables is maximal). Moreover, this certainly captures an obvious sense of ``relational''. It's not obvious that this must be much harder than what similarity-learning does, however.
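
The information-theoretic reading is easy to check numerically. A small sketch of my own, using a brute-force plug-in estimate: for two-bit parity (XOR), either input bit alone carries zero information about the output, while the pair together carries the full bit.

    from collections import Counter
    from itertools import product
    from math import log2

    def mutual_information(pairs):
        """Plug-in estimate of I(A; B) in bits from a list of (a, b) samples."""
        n = len(pairs)
        joint = Counter(pairs)
        pa = Counter(a for a, _ in pairs)
        pb = Counter(b for _, b in pairs)
        return sum((c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
                   for (a, b), c in joint.items())

    # Two-bit parity: Y = X1 XOR X2, inputs uniform over {0,1}^2.
    samples = [((x1, x2), x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]
    print(mutual_information([(x[0], y) for x, y in samples]))   # I(X1; Y) = 0.0
    print(mutual_information([(x[1], y) for x, y in samples]))   # I(X2; Y) = 0.0
    print(mutual_information(samples))                           # I((X1, X2); Y) = 1.0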

The second characterization is given by his measure of ``geometric separability,'' defined as ``the proportion of data-points whose nearest neighbors share the same output''. This clearly means that problems with low geometric separability are not amenable to similarity-methods, and perhaps not to fence-and-fill at all (though I'd be much more cautious about that, absent a proof, than he is). But this is not equivalent to his first characterization (a counter-example will be presented presently).
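
The measure is simple enough to compute; here is a sketch of the definition as quoted (my code, not his), and on the parity problem it comes out at zero, as it should.

    # Geometric separability: the proportion of points whose nearest (other)
    # neighbor shares the same output.
    def geometric_separability(data):
        """data: list of (input_vector, output) pairs."""
        def squared_distance(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        hits = 0
        for i, (x, y) in enumerate(data):
            neighbor = min((pair for j, pair in enumerate(data) if j != i),
                           key=lambda pair: squared_distance(pair[0], x))
            hits += (neighbor[1] == y)
        return hits / len(data)

    # Parity on 0..99: every integer's nearest neighbor has the opposite parity.
    parity_data = [((n,), n % 2) for n in range(100)]
    print(geometric_separability(parity_data))   # -> 0.0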

Finally, he characterizes relational problems as those where the proper partitions are not ``geometric or spatial'' (p. 119). Absent a characterization of what he means by ``geometric or spatial,'' this merely indicates an impoverished view of geometry! (I say this despite being an almost entirely verbal thinker.) Given his preoccupation with linear separability and the like, I guess that what he has in mind is something on the order of problems where the correct partition cannot be represented by a finite number of simply-connected compact sets.

Let me demonstrate that his characterizations are not equivalent with a counter-example. Consider the relationship <, ``less than,'' among integers (positive and negative). Our input space is now the x, y plane, and we color a point red if x < y, and white otherwise. Every point on or below the 45 degree diagonal will be white, and every point above that line will be red. Now, clearly just knowing the x coordinate of a point tells us absolutely nothing about its color. Thus, as we'd hope, learning the relationship < is a relational problem in sense (1). Unfortunately, in senses (2) and (3), it is not relational. Confined to a finite range of inputs, the geometric separability is very high, since only points on either side of the line x = y have neighbors who differ in value from themselves; in the limit that we allow x and y to range over all integers, the geometric separability approaches 1. And not only can the problem be tackled geometrically, it is even linearly separable.
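
The counter-example is easy to check numerically with the separability measure above; on a modest grid of integer pairs (my own choice of range), the ``less than'' coloring comes out almost perfectly separable, and only the points hugging the diagonal can miss.

    from itertools import product

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Geometric separability of the "x < y" coloring on a finite integer grid.
    points = [((x, y), int(x < y)) for x, y in product(range(-10, 11), repeat=2)]
    hits = 0
    for i, (p, label) in enumerate(points):
        _, neighbor_label = min((pt for j, pt in enumerate(points) if j != i),
                                key=lambda pt: squared_distance(pt[0], p))
        hits += (neighbor_label == label)
    print(hits / len(points))   # about 0.95, and it rises toward 1 on larger grids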

I conclude that what Thornton really needs is either (2) or (3), but that neither of these should be called ``relational''. (I do not have a handy counter-example to the equivalence of senses (2) and (3), and for all I know they may be equivalent.)

There is also a numerical argument made, that in a non-relational problem the number of partitions to be considered is finite, whereas in a relational one it is infinite. This is not so, but he seems to have been thinking of the following point. The number of binary partitions of a single variable, which has N possible values, is the number of subsets of the input space, which is 2^N. The number of possible relations among two variables, taking on N and M possible values, is 2^(NM), since it is just the number of subsets of the product space, which has cardinality NM. For realistic values of N and M, 2^(NM) is much, much larger than 2^N (or 2^M for that matter). But really in the case that concerns Thornton, the number of values of the input vector as a whole is what's relevant in the first case, and that's NM, so the number of possible partitions is the same.
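
Just to fix ideas about the sizes involved (the particular numbers below are mine, not Thornton's):

    # Counting candidate binary partitions: a single variable with N values
    # admits 2**N labelings of its value space; a pair of variables with N and
    # M values admits 2**(N*M), since that is the size of the product space.
    N, M = 16, 16                      # illustrative sizes only
    print(2 ** N)                      # 65536
    print(len(str(2 ** (N * M))))      # 2**256 runs to 78 decimal digits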

Whatever you call these problems, they clearly exist and need to be solved. Supercharging exploits the fact that many problems will partially yield to fence-and-fill, and regards that as better than nothing. Thornton is rightly skeptical of this, and has an elaborate parable of a poker-playing machine to make his point. Pick-and-mix will work, but only if the primitive relationships it's equipped with are adapted to the task. Thornton seems to doubt that we can arrange for such adaptation in general, as well as rejecting it on the grounds that it doesn't guarantee learning will work. Unfortunately, as he himself points out, one can show that no such guarantee is possible. For any problem where a given learning method does better than blind search, there is another problem where it does worse, and one cannot guarantee, in a completely a priori fashion, that a malevolent demon (or, as the computer scientists suggestively put it, an Adversary) will not arrange for us to always try our luck on problems ill-suited to our tools. This is what is called the ``No Free Lunch'' theorem in machine learning. Its relevance to the actual world is, to say the least, not immediately obvious, since we do not sample the space of all possible learning problems uniformly, or even under the direction of an Adversary. It is thus, pace Thornton, not at all absurd to suggest that we can learn what learning algorithms have above-chance performance on the kinds of problems we actually encounter. In other words, the background knowledge for pick-and-mix is acquirable. (I learned this counter to the no-free-lunch argument from John Holland; so far as I know it's original to him.)

That the products of one level of learning or neural processing are the raw materials of subsequent levels, which Thornton makes out to be something of a breakthrough associated with truth-from-trash, is in fact a very old notion. He actually quotes John Locke to this effect, without quite realizing, it seems, how ancient and commonplace this notion is. I will merely repeat one of my favorite passages from William James, who puts it quite nicely:

[T]he mind is at every stage a theater of simultaneous possibilities. Consciousness consists in the comparison of these with each other, the selection of some, and the suppression of the rest by the reinforcing and inhibiting agency of attention. The highest and most elaborated mental products are filtered from the data chosen by the faculty next beneath, out of the mass offered by the faculty below that, which mass in turn was sifted from a still larger amount of yet simpler material, and so on. The mind, in short, works on the data it receives very much as a sculptor works on his block of stone. In a sense the statue stood there from eternity. But there were a thousand different ones beside it, and the sculptor alone is to thank for having extricated this one from the rest. Just so the world of each of us, how so ever different our several views of it may be, all lay embedded in the primordial chaos of sensations, which gave the mere matter to the thought of all of us indifferently. We may, if we like, by our reasonings unwind things back to that black and jointless continuity of space and moving clouds of swarming atoms which science calls the only real world. But all the while the world we feel and live in will be that which our ancestors and we, by slowly cumulative strokes of choice, have extricated out of this, like sculptors, by simply removing portions of the given stuff. Other sculptors, other statues from the same stone! Other minds, other worlds from the same monotonous and inexpressive chaos! My world is but one in a million alike embedded, alike real to those who may abstract them. How different must be the worlds in the consciousness of ant, cuttlefish, or crab! [Principles, vol. I, ch. IX, p. 288]
Thornton regards this by-now elementary fact as delivering a crushing blow to the objectivity of perception and to scientific realism. Of course it does nothing of the kind; it is perfectly possible that the ``mental products'' we have constructed are, in fact, the right ones, the ones in which the world is best described. Information-theoretically, that would mean that our concepts are the ones in which prediction is both maximally accurate and minimally complex.

Thornton rightly connects the kind of change-of-representation which truth-from-trash works with creativity, or at any rate with one aspect of it. His speculations about the sources of aesthetic pleasure seem eminently dismissable (art does not aim at accurate representation, and probably for most of its history creativity as such has not been valued), and he doesn't seem to be aware of a large and respectable literature on computational modeling of creativity and change-of-representation, e.g., the excellent book of Margaret Boden (who is, as it happens, Thornton's colleague in the evolutionary and adaptive systems group at the University of Sussex!). The quotations from Einstein about the superiority of imagination to knowledge are more fit for dorm-room posters than serious discussion.

The trick of mushing the input space into a feature space where easy learning methods work better is not original to truth-from-trash. Thornton surely knows this --- it's featured in tomes on pattern recognition since time out of mind, i.e., the early 1970s at least, and is the core of the ``support vector machine'' approach of Vapnik and co. --- but I think any naive reader would come away thinking otherwise. The main novelty here is the emphasis on a recursive, data-driven recoding, whereas, e.g., the SVM crowd tend to emphasize the use of fixed, hand-crafted kernels.
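
For readers who have not met the trick, here is a toy version of it, not drawn from the book: the XOR problem is not linearly separable in its raw two-bit inputs, but recoding each input as (x1, x2, x1*x2) makes a single linear fence suffice.

    # Recoding into a feature space where a linear fence works: XOR is not
    # linearly separable on (x1, x2), but it is on (x1, x2, x1*x2).
    def recode(x1, x2):
        return (x1, x2, x1 * x2)

    def linear_fence(features):
        # In the recoded space the hyperplane x1 + x2 - 2*(x1*x2) = 0.5 separates XOR.
        x1, x2, x1x2 = features
        return 1 if (x1 + x2 - 2 * x1x2) > 0.5 else 0

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print((x1, x2), linear_fence(recode(x1, x2)), "target:", x1 ^ x2)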

Outside his immediate domain of machine learning, Thornton is not an altogether reliable informant. For instance (pp. 45--46), after correctly giving the formula for the (Shannon) information contained in a random variable, he tells us that this equation is ``identical --- save for the leading minus --- to the entropy equation from statistical thermodynamics. Hence the name `negentropy' '' (his italics) for information. Alas, as a glance at any text on statistical mechanics would show, we physicists use the minus sign too, and in fact thermodynamic entropy is the Shannon information of the atomic-level physical variables. (This is actually a rather profound and important point in our discipline.) --- The name ``negentropy,'' of course, arose in quite a different way, and is in any case now obsolete, e.g., it does not appear in the index to Cover and Thomas's definitive book.
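
For the record, the standard textbook forms of the two quantities (supplied here by me, not quoted from the book) both carry the minus sign, and differ only in units and the base of the logarithm:

    H[X] = - \sum_i p_i \log_2 p_i       (Shannon information/entropy)
    S    = - k_B \sum_i p_i \ln p_i      (Gibbs entropy in statistical mechanics)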

Or again, consider his description of Karl Popper's philosophy of science (pp. 138--139; all italics Thornton's):

Popper argues that since confirmation of inductively derived theories and laws is out of the question, science should proceed by focusing on refutation. A single counterexample is sufficient to falsify a theory with absolute certainty. Science can therefore build secure foundations by accumulating falsifications. The methodological implications follow automatically: respectable scientists should attempt to select new theories according to their degree of falsifiability rather than according to their degree of plausibility.

Unfortunately, although Popper's scheme has a certain zany appeal, it is not entirely practical. If we take his proposal literally, there is nothing to prevent us --- as scientists --- from selecting a long sequence of incorrect theories. Provided each theory is falsifiable, we are proceeding in an entirely proper, Popperian manner. But the net effect is, of course, that no real progress is ever made.

Now, one can justly criticize Popper's philosophy of science on many grounds, but claiming that it will not prevent us from making mistakes is hardly one which can be taken seriously. More importantly, this completely misses Popper's point, which is that we should prefer theories which have survived a long series of serious attempts to falsify them --- theories which would not have survived those attempts, had they been false. A scientist who merely succeeds in falsifying lots of stupid theories (e.g., ``The Moon is made of green cheese''; ``The Moon is made of blue cheese''; ``The Moon is made of munster''; ``The Moon is mostly cheddar, except for the Sea of Tranquility, where it is parmesan'') is not proceeding in a good Popperian manner.

The discussion of quantum mechanics is appalling. Wave-particle duality is a thoroughly obsolete notion, a needlessly misleading way of saying that models of quantum phenomena have features reminiscent of models of classical waves and of models of classical particles. But they are mathematically consistent beasts in their own right, and saying that this is paradoxical, much less saying that they're both waves and particles as observation demands, is a bit like saying that human beings exhibit squirrel-ostrich duality --- we're squirrels in that we are mammals, and ostriches in as much as we are flightless bipeds. (The poet is right when she calls us ``hybrids of plants and ghosts,'' but that's metaphor, not ontology.) Nor does the uncertainty principle have anything to do with measurements disturbing the measured; it doesn't have anything to do with measurements at all! The worst of it is, he doesn't need anything about quantum mechanics at all! Thornton is trying to make the point that good scientific theories generally do not make sense of things using our everyday, ``lifeworld'' concepts, a point which can be made by using any developed science. Classical mechanics would do: those who have taught it, or remember learning it, or know the history of its development, can all testify that its notions are not intuitive ones. The same is true, as I said, of every science which has reached a creditable stage of development; employing intuitive notions (as do, e.g., some branches of sociology and psychology) is a prima facie sign that the discipline is not genuinely explanatory at all. But I would say that this is, if anything, reason to think that scientific concepts do tell us something about reality, independent of ourselves. If we were just fooling ourselves, physics and biology would look like psychotherapy, or literary criticism, or libertarian political theory, which once upon a time --- before 1600, roughly, for physics --- they did.

Let me finally come around to the Deepest Issue that Thornton deals with, namely induction and the reliability of empirical knowledge (ch. 10). The quandary is this. All our knowledge of the real world --- of ``matters of fact and existence,'' in David Hume's words --- derives from empirical evidence and induction. But induction cannot be deductively justified a priori, cannot be justified from an argument with an empirical premise (on pain of regress), and cannot even be justified inductively, on the grounds that it's generally worked in the past, since that is transparently a vicious circle. (Skeptical arguments against induction are ancient; Hume seems to have been the first to give the vicious-circle argument.) Therefore, induction cannot be justified at all, and we have no real reason --- as opposed to mere animal faith --- to think that any of our generalizations about the real world will continue to hold.

Bertrand Russell once said that a special section of Hell is reserved for those who claim to have refuted Hume on this point. Thornton is not in danger of perdition (not on that count anyway), but a longish spell in Purgatory seems fair. He claims that the Humean predicament can be evaded by worrying about data compression instead. We know that we can compress data, and we can easily tell whether one compression scheme works better than another on a given hunk of data. Every compression scheme more or less explicitly works by extracting regularities, so where's the problem? But this misses the point, which is that we don't just want to compress this particular hunk of data; we want a way of getting regularities worthy of the name, ones which will continue to hold good for new data, and compression as such does not provide this. Certainly schemes like Rissanen's minimum description length principle do not. One can prove that, if certain regularities continue to hold, then a good compression scheme for our training data will, very likely, remain a good way of compressing test data, but that does not address the point.

In the same connection, Thornton justifies Occam's Razor, not as a heuristic but as a tautology, in a really extraordinary piece of reasoning:

Simpler inductive inferences necessarily make fewer assumptions about the data. Every assumption has an even chance of being invalid. Thus simpler inductive inferences will, in general, be predictively more successful. If it has any effect at all, then, the application of Occam's Razor within any sort of inductive process must improve predictive accuracy. [p.162, his italics]
This is absolutely astonishing, coming from somebody who certainly knows about the existence and importance of the accuracy-complexity tradeoff in modeling. If what he says here were literally the case, then we would have no reason to worry about overfitting... (A simple counter-example from my own field: When modeling stochastic processes, the simplest model, by a very well-defined metric, is the one which always treats the process as a sequence of independent, identically-distributed random variables, i.e. as biased coin-tossing or dice-rolling. This is admirably simple, but its accuracy is, in general, dismal.) So he can't mean this literally; but then what does he mean?
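
A toy version of that counter-example, with data and models invented purely for illustration: on a strongly patterned binary sequence, the ``simplest'' model, which treats the symbols as independent coin flips, predicts at chance, while a slightly more complex model that conditions on the previous symbol is nearly perfect.

    # Simplest is not most accurate: predicting a patterned binary sequence.
    import random

    random.seed(0)
    seq, state = [], 0
    for _ in range(10000):                 # alternates with probability 0.95
        state = 1 - state if random.random() < 0.95 else state
        seq.append(state)

    # "IID" model: ignore order, always predict the overall majority symbol.
    majority = max(set(seq), key=seq.count)
    iid_accuracy = sum(s == majority for s in seq) / len(seq)

    # Markov model: predict the flip of the previous symbol.
    markov_accuracy = sum((1 - seq[i - 1]) == seq[i]
                          for i in range(1, len(seq))) / (len(seq) - 1)

    print(round(iid_accuracy, 3), round(markov_accuracy, 3))   # roughly 0.5 vs 0.95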

I'm only going to touch on the very strange idea that every assumption has an even chance of being wrong, by saying that I don't think it makes any sense. To quote C. S. Peirce, ``if universes were as plenty as blackberries, if we could put a quantity of them in a bag, shake them well up, draw out a sample and examine them,'' then we could give some kind of meaning to the chance of an assumption being wrong. (Even from a Bayesian, subjective-probability view, there is no reason to suppose that all those probabilities are 50-50, and indeed much, like the ``convergence to truth'' theorem, to doubt it.) This assumes that it's possible to individuate assumptions --- to find some sensible way of counting how many of them an inductive method makes.

Let me say that I don't think Thornton is a dumb man or even a bad researcher. On the contrary, I came to this book with high expectations because his professional papers show he is a good researcher. This book does not do him justice. He has attempted to combine popular exposition, a series of new technical proposals, and philosophical reflections in a single short volume, and, as might've been expected, all his projects have suffered. I have no doubt that he is aware of the bulk of the related research I have faulted him for not mentioning, and only very little doubt that he would have cited it had he not also been aiming for a popular audience. His carelessness --- and, as the case of negentropy shows, confabulations --- with material from other disciplines is much more troubling, but again, if he were writing for a more hard-bitten (not to say embittered) audience of researchers, I think that would be better controlled. On the other hand, he certainly can write popular science, though he is a bit too fond of using cute stories to make his points. What he cannot do (here I perform induction on a single instance) is succeed at writing for both audiences at once. I will certainly continue to follow his technical papers with every expectation of profiting from them, and if he writes purely popular books I'll happily read them. (Warning: an awful and unfair but inevitable pun follows.) As for this hybrid, I'm afraid that he has done too little to separate the truth from the trash.


Disclaimer: I asked for, and got, a review copy of this book from the publisher. Needless to say, I have no stake in its success.
Errata: p. 12, for ``commodities'' read ``securities''; p. 12, for ``wheat stocks,'' read ``wheat futures''; p. 35, Kepler had to make the shells of his Platonic-solids model of the universe thick to accommodate, not the elliptical nature of the orbits (which he hadn't recognized yet), but the epicycles; p. 146, ``Max Planck discovered that excited electrons emit and absorb energy only in specific amounts, which he termed quanta'' (his italics): electrons, excited or otherwise, played no role in Planck's derivation of the black-body law and the quantization of the electromagnetic field (the black body spectrum, in fact, contains photons of all energies); p. 150, ``the best code for a data set is one in which the lengths of codes are inversely proportional to the frequency of the corresponding encoded sequences'' (his italics): actually they are proportional to the logarithm of the inverse of the frequencies (see, e.g., Cover and Thomas, chapter 5). In all URLs printed in the text, the tilde character ~ is rendered as a tilde accent over the next alphabetic character, as would result from the TeX error of writing \~x instead of $\sim$x.
xii + 204 pp., index of names and subjects (sparse), end-notes for each chapter, bibliography, numerous tables and graphs, a few black and white photos (or rather, quite poorly digitized reproductions of photos)
Artificial Life and Agents / Cognitive Science / Computers and Computing / Philosophy of Science / Popular Science / Probability and Statistics
Currently in print as a hardback, US$29.95, ISBN 0-262-20127-5, LoC Q325.4 T47.
19--20 June 2000
Thanks to Bill Tozier and Danny Yee for advice.