The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi   116


Processes of Inference, Learning, and Discovery

by John H. Holland, Keith J. Holyoak, Richard E. Nisbett and Paul R. Thagard

Computational Models of Cognition and Perception series

MIT Press, 1986

The Best-Laid Schemes o' Mice an' Men

A brief description of this book (like the one on its cover) sounds like the beginning of a very academic shaggy dog story: "So one day, see, a computer scientist, a philosopher and two psychologists write this book about induction..." While there are some amusing bits (for a very academic value of amusement, admittedly), this is a serious book, and in fact one of the best I know of on how induction actually works in animals and might be made to work in machines. Our authors are good American pragmatists; for them induction isn't a contemplative process of discovering patterns in the world so much as an active one of learning how to get the better of it. In part they want to design things which will do this, and in part they want to understand how animals already do this; what follows will similarly mix forward and reverse engineering. How then do we design an inductive beastie?

The first forward requirement is that our ideas should be implementable, that they shouldn't involve any hand-waving or "and then a miracle happens." But that means induction needs to be accomplished by effective procedures, i.e. by computational systems. So we need to choose a computational architecture to start with. After due consideration, they reject both Good Old-Fashioned AI (too brittle) and artificial neural nets (too amorphous), settling on something inspired by "production systems" in computer science but formally distinct. The main elements are messages, and rules which act on them to produce new messages. Some messages come from sensors tuned to the environment; some messages trigger effectors, which do things to the environment. All the rules are of the form "If there is a message (or combination of messages) with property X, then post a message with property Y". Rules which get to post their favorite message are said to be "active." In a standard production system, the rules would be consistent, and only one of them would be active at any one time. But these are very restrictive assumptions, and our authors explicitly repudiate them; this means that fully-specified beasties will need ways to decide between contradictory rules, and to decide which rules get activated at any one time. In general they propose two such mechanisms. One is that rules with stricter conditions should trump more latitudinarian ones, effectively giving a set of nested rules-and-exceptions which they call a "default hierarchy". (As Brazilian president Getúlio Vargas is supposed to have proclaimed "For friends, anything; for enemies, the law".) The other choice-mechanism is that rules should have a "strength" which indicates how well they have done in the past, and stronger rules should be more likely to be activated than weaker ones. 
(If we have several rules with the same antecedent but mutually exclusive consequents, their strengths effectively capture a conditional probability distribution, without, however, mapping on to probabilities in any particularly simple way.) Letting rules have strengths forces more design decisions upon us: we need to figure out how rules get strengths in the first place, how strengths change (the "credit assignment" problem), and how strength and specificity interact.
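The interplay of strength and specificity can be made concrete in a small sketch. Everything below (class names, the bidding formula, the bird/penguin rules) is my illustration of the general idea, not the authors' specification:

```python
import random

# Hypothetical sketch of strength-plus-specificity conflict resolution,
# in the spirit of the book's rule-based architecture.

class Rule:
    def __init__(self, condition, message, strength=1.0):
        self.condition = condition   # set of features a message must contain
        self.message = message       # message posted when this rule wins
        self.strength = strength     # running record of past usefulness

    def matches(self, message):
        return self.condition <= message   # condition is a subset of message

    def specificity(self):
        # Stricter conditions count for more, giving the nested
        # rules-and-exceptions of a "default hierarchy".
        return len(self.condition)

def compete(rules, message):
    """Pick one rule to activate among those whose conditions match."""
    matching = [r for r in rules if r.matches(message)]
    if not matching:
        return None
    # Bid = strength weighted by specificity, so a strong exception
    # can trump an equally strong default.
    weights = [r.strength * (1 + r.specificity()) for r in matching]
    return random.choices(matching, weights=weights)[0]

# A default rule ("birds fly") and its more specific exception.
default = Rule({"bird"}, "post:flies")
exception = Rule({"bird", "penguin"}, "post:does-not-fly")
rules = [default, exception]

# Only the default matches an ordinary bird.
assert compete(rules, {"bird", "robin"}) is default
```

For a penguin, both rules match, and the exception's greater specificity makes it the likelier winner; because strengths are only probabilistic weights, the default is not silenced outright, which is one way the "conditional distribution" reading of strengths shows up.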

Because the input of one rule can be the output from another, clusters of interconnected rules can represent concepts --- expectations about what features of the world go together, what features do not, and what to do about them. Combined with handling probabilities through strengths, such concepts can even include information about how much variability there is in patterns of association. If representations are allowed to persist over time, rule-clusters can even represent temporal patterns and engage in planning, or for that matter improvisation. This is made easier still by bridging rules, which activate other rules as well as keeping themselves active until goals have been achieved. (Rules can be activated by several different sorts of representation.)

What is needed to make a beastie like this inductive rather than just, as it were, responsive are means of producing new rules, perhaps just by tinkering with old ones. The better-suited the new rules are to past experience, the better the beastie is at induction. Clearly this will need a way of measuring how good a rule is. A hardy perennial criterion is that surprises are Bad Things, and the closer the fit between the (possibly implicit) prediction of the rules and reality, the better. If we want our beastie to perform some particular task, say in an engineering application, then a different specification of utility would be appropriate. What real beasties look for in their rules is harder to figure out; our authors do rather well using just the hardy perennial.
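The hardy perennial can itself be sketched as a one-line update: reward a rule when its prediction matched reality, penalize it when it produced a surprise. The update scheme below is illustrative only, not the book's credit-assignment algorithm:

```python
# Hypothetical sketch of scoring rules by (lack of) surprise.

def update_strength(strength, predicted, observed, rate=0.1):
    """Move a rule's strength toward 1 when prediction matched reality,
    toward 0 when the rule produced a surprise."""
    error = abs(predicted - observed)   # size of the surprise
    reward = 1.0 - error                # perfect fit -> reward of 1
    return strength + rate * (reward - strength)

s = 0.5
# A rule that keeps predicting correctly drifts toward full strength.
for _ in range(50):
    s = update_strength(s, predicted=1.0, observed=1.0)
print(round(s, 2))   # approaches 1.0
```

A task-specific utility, as in an engineering application, would simply replace the `reward` term with whatever payoff the designer cares about.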

Our authors give reasonably detailed computational specifications for two different machines which qualify as inductive in their sense: classifier systems and a program called PI. I've gone on about how classifier systems work at some length in an earlier review; the only real difference between those in the other book and those here is that the authors (mostly, I suspect, Holland) insist on allowing multiple rules to post messages, which, along with bridging rules, solves the delayed-reinforcement problem of Dorigo and Colombetti, or at least should.

PI is a more demanding machine computationally, but also one whose actions bear more resemblance to those of human thinkers (which isn't saying much). Unlike the classifier system, which acts on raw binary strings, PI rules and representations are Lisp expressions, and there are more structured representations of concepts, which are however still basically rule clusters. Again unlike the classifier system, which gets new rules by using a genetic algorithm on existing ones, PI has a number of specific inductive mechanisms --- specialization, generalization, abduction, and concept formation --- which are intended to be reminiscent of conscious experience, unlike the GA.

Classifiers are more often used in practice than PI --- in fact, PI seems to only have been used by Holyoak and his collaborators --- but both can give quite impressive results in the way of discovering both un-obvious patterns and useful rules of behavior. (In one experiment described in ch. 11, PI in effect invented the wave theory of sound, à la Vitruvius, De Architectura, V 3 vi f.) I think we can take these successes as proof-of-concept: such rule-based mechanisms can do induction, and one obvious direction from here is to make rule-based beasties that induce better.

No doubt, says the voice of skepticism, this is all very inspiring for the engineers. But is there any reason to think that induction in the wild, or even in the psych. lab, is also rule-based? To dispel these doubts, our authors turn to the two pillars of psychological knowledge, the rat and the American undergraduate. Let's start with the argument from the rat, which is the one that convinced me.

Since the early 1970s, it's been pretty securely established that conditioning, whether of the classical-Pavlovian or the operant-Skinnerian sort, is not adequately explained by the theories of Pavlov and Skinner. The theory which displaced theirs, due to Rescorla and Wagner, is a bit more complicated but in rather better empirical shape. In essence (I shan't go into the details) it views conditioning as an animal detecting covariation among those variables in the environment which it's primed by evolution to attend to (which it "encodes", as our authors say). It's a very slick theory which accounts for lots of experimental and even field results in many species (not just rats). What our authors show in their ch. 5 is that a rule-based theory, paying attention only to predictive success, can not only account for everything the Rescorla-Wagner model gets right, but also mop up the phenomena it gets wrong. This is, I think, very powerful evidence that our lab animals rely on things awfully like predictive rules in learning and decision-making. And it's usually a safe bet that what rats can do, undergraduates can do also...
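For readers who want the covariation-detection idea in concrete form, here is the Rescorla-Wagner update in a few lines. Parameter names follow the usual textbook presentation; the "blocking" demonstration at the end is my illustration of a classic effect the model predicts, not an example taken from the book:

```python
# Sketch of the Rescorla-Wagner model: all cues present on a trial
# share a single prediction error.

def rescorla_wagner(trials, alpha=0.3, lam=1.0):
    """trials: list of (set of cues present, reward 0 or 1)."""
    V = {}                                       # associative strength per cue
    for cues, reward in trials:
        total = sum(V.get(c, 0.0) for c in cues)
        error = (lam if reward else 0.0) - total  # shared prediction error
        for c in cues:
            V[c] = V.get(c, 0.0) + alpha * error
    return V

# Phase 1: a light alone predicts food.  Phase 2: light + tone predict food.
trials = [({"light"}, 1)] * 40 + [({"light", "tone"}, 1)] * 40
V = rescorla_wagner(trials)

# The pre-trained light "blocks" learning about the tone, because the
# light already absorbs nearly all of the prediction error.
assert V["light"] > 0.9 and V["tone"] < 0.1
```

It is exactly this shared-error structure that lets the model handle phenomena, like blocking, that a simple pairing-counts account of conditioning cannot.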

In fact the argument from undergraduates rests not so much on their ability to learn as on their peculiar, reproducible and stubborn failures to do so. It's difficult to teach them the basic principles of physics (freely moving objects travel in straight lines) or social psychology (situations have more effect on behavior than attitudes do) and make them stick. Students may learn to repeat the facts well enough on exams, but when it comes to thinking in daily life, one might as well have been talking to an equal mass of frozen halibut (which, indeed, early-morning classes sometimes resemble). Now in these cases, students come to us with strongly-entrenched ideas about how the relevant bits and pieces of the world work; that is (say our authors) they have high-strength rules about how bodies move and about what makes people tick, rules which work well enough in daily life but are generally not available for conscious inspection and do not survive critical testing. The global inadequacy of these rules will not cause them to be replaced by something better; at most, the points where they break down will be patched by more specialized rules geared to those exceptions.

This doesn't mean that default hierarchies and the rest are bad; indeed, they have many virtues, not least that they can strike a good balance between accuracy and parsimony. It's even hard to fault the if-it-ain't-broke-don't-fix-it behavior of students' learning mechanisms, particularly when it comes to domains which literally involve day-to-day survival (as intuitive physics and intuitive psychology manifestly do). The problem is that the local tinkering these mechanisms encourage makes it hard to implement a radically new set of rules, even when they would be both more accurate and more parsimonious than the old ones. What seems to be called for in these cases is convincing and repeated demonstration that the old rules really are inadequate and cannot be patched up; as our authors note, this is both easier and more ethical to arrange in the physics classroom than in that of psychology.

One can argue that intuitive physics and psychology are, under ordinary conditions, very good at prediction and explanation, while being simpler to work with than scientific theories, so that retaining them is rational, much as it's rational for engineers to use classical mechanics in their calculations rather than relativistic theories. Maybe our students are demonstrating the Cunning of Reason, rather than cautious learning mechanisms stuck in a local optimum. But this sort of argument can't apply when the entrenched, ineradicable notions are not, in fact, good at prediction and explanation, and our authors offer several examples of this, in both substantive and formal domains. One of the substantive examples is particularly gross, and gives me a chance to ride one of my hobby-horses, so I'll deal with it before getting to the formal examples and (I promise) bringing this review to some sort of conclusion.

Consider the work of clinical psychologists Loren and Jean Chapman on psychiatric diagnosis (pp. 191--195). In the 1960s, it was still common for psychiatrists to use "projective" tests, most famously the Rorschach cards, in diagnosis; for instance, to tell whether patients were homosexual. Empirical studies showed that card-responses --- "signs" --- which seemed, to Freudian intuition, to indicate homosexuality had no real validity, but that there were a few unintuitive signs which did correlate (slightly) with homosexuality. The Chapmans showed that even very experienced clinicians put great faith in the plausible but invalid signs, and ignored the valid but unintuitive ones. They also showed that clinical judgment about what the signs meant didn't differ significantly from that of laypeople asked to classify them at high speed. In other words, clinical judgment here was simply background cultural prejudice, preserved by ad hoc exception-handling much as intuitive physics is. The Chapmans went on to show that if you removed all the plausible, misleading signs from the ones laypeople were exposed to, they would, eventually, begin to twig to the valid signs. (They didn't run that experiment on psychiatrists.) That is, once the rules involving plausible signs had no chance to be exercised, normal inductive propensities did begin to pick up genuine covariation. In addition to bringing home, once again, the fact that shrinks do not know what they are talking about (the aforementioned hobby-horse), this work suggests a way of getting at the cognitive roots of ideology --- the mechanisms which allow utter bilge to sustain itself in the teeth of contrary experience almost forever. But of course the reason why that bilge was generated and accepted in the first place will be not so much cognitive as emotional and cultural.

Finally, rules of inference, logical and statistical. Our authors claim that people rarely use the purely formal or syntactic rules of deductive logic, employing "pragmatic reasoning schemata" instead. They argue as follows. Suppose people did employ strictly formal inferential rules, operating on substantive premises. Then if we gave experimental subjects two problems which differ in substance but can be solved by applying the same deductive rule, they should do equally well at both problems; performance should be content-neutral. But it's easy to find pairs of problems where this isn't the case, and to show that the difference isn't due to (say) one topic being more familiar and so easier to apply known rules of logic to; the crucial difference seems to be the subject-matter of the problems. For instance, people correctly apply the rule of inference known as modus ponens to problems when they are couched as being about catching people cheating on social conventions, even if the conventions are very foreign to them, but flub logically-equivalent problems about abstract rules. Pragmatic reasoning schemata, then, are a kind of hybrid of formal and substantive information --- "If you're looking for someone cheating on rule thus-and-such, then do this-and-so" --- and their use would explain why people are at once successful in negotiating daily life and bad at reasoning abstractly or in unfamiliar contexts. This would also explain why logic courses rarely do much to make people reason better, while training in substantive areas does help people think straight --- about those subjects.

More provocatively, our authors suggest that rules implementing purely-formal deductive logic may actually be less accurate than pragmatic ones. Consider the following train of thought:

I once landed at a seaport in a Turkish province; and, as I was walking up to the house which I was to visit, I met a man upon horseback, surrounded by four horsemen holding a canopy over his head. As the governor of the province was the only personage I could think of who would be so greatly honored, I inferred that this was he. [C. S. Peirce, quoted by Ian Hacking, The Taming of Chance, p. 207]

This sort of thing, which Peirce called "hypothesis," or (later, and less felicitously) "abduction," is very useful to us, and presumably to any beastie dealing with a world where some causes are not directly observable. It even has a simple formal structure: If A, then B, and B, so A. The only problem is that it's deductively invalid, and is actually what logic teachers call the fallacy of affirming the consequent. (Maybe part of a traditional wedding in that province is for the groom to ride to the bride's house while four friends hold a canopy over him.) Yet a rule-cluster which combines accurate information about environmental dependencies with hypotheses will do better than one which scrupulously avoids affirming the consequent...
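The formal structure ("If A, then B; B; so, plausibly, A") is simple enough to render as a toy lookup over known dependencies. The rules and weights below are invented for illustration, loosely following Peirce's canopy example:

```python
# Toy rendering of abduction: from "if A then B" rules and an observed B,
# guess at A.  Deductively this affirms the consequent, but over a world
# of known dependencies it is often the right bet.

rules = [
    # (cause, effect, how often this cause explains this effect -- invented)
    ("governor passing", "canopy held over rider", 0.7),
    ("wedding procession", "canopy held over rider", 0.3),
]

def abduce(observation):
    """Return candidate causes of an observed effect, best guess first."""
    candidates = [(w, cause) for cause, effect, w in rules
                  if effect == observation]
    return [cause for w, cause in sorted(candidates, reverse=True)]

guesses = abduce("canopy held over rider")
assert guesses[0] == "governor passing"   # Peirce's inference
assert "wedding procession" in guesses    # ...with the rival kept in reserve
```

Note that the rival hypothesis is ranked, not discarded: a beastie that keeps weaker explanations on hand can recover when its best guess turns out to be a wedding after all.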

These last few paragraphs may have made Induction seem more pessimistic about human knowledge than it really is. Our authors are not interested (as the poet says) "in proving foresight may be vain". They are fascinated that foresight is so often not vain, and examine its failings to try to better understand it (much as we learn about the workings of the brain by studying the victims of stroke and head-wounds). This understanding has intrinsic value, and also holds out the practical hope of making things that learn well, and even of improving our own thinking.

This is an extremely rich book, and I can't do justice to everything in it without going on to truly unreasonable lengths. I am therefore going to pass over in silence the fascinating chapters on analogy and on scientific discovery. I am going to punt on where this book fits in the internal debates of AI and cognitive science (GOFAI vs. connectionism vs. embodiment vs. ...), and on what it can tell us about the philosophical problems of knowledge and induction. I am even going to hold my tongue about how to relate this work to more rigorous and limited approaches to induction and pattern-discovery arising from statistics, computational learning theory and dynamics. I will say that people interested in any of those subjects --- as well, of course, as the ones I have talked about --- will find the time they spend reading Induction well rewarded.

Disclaimer: I know John Holland slightly, and have admired his work for years. I have no stake, however, in the success of this book.
xvi + 398 pp., numerous black-and-white diagrams, bibliography, index of names and subjects (analytical for subjects)
Artificial Life and Agents / Cognitive Science / Computers and Computing / Philosophy of Science
Currently in print as a hardback, US$42, ISBN 0-262-08160-1, and as a paperback, US$24, ISBN 0-262-58096-9, LoC BF441 I53.
5 March 2000; modified 7 October 2009 (thanks to Nicolás Della Penna for answering a trivia question)