The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi   111

Robot Shaping

An Experiment in Behavior Engineering

by Marco Dorigo and Marco Colombetti

Intelligent Robotics and Autonomous Agents series, vol. 2.
MIT Press, 1998

Crawling Towards the Light

This book is about how to get (fairly) real robots to learn to do things by trial, error and reward --- by ``shaping,'' a word borrowed from B. F. Skinner, along with the vocabulary of ``reinforcement.'' Teaching --- i.e., explicit programming --- is eschewed. Mostly it is concerned with fairly specific actual experiments, but there are some more general remarks here, and the work itself is exciting.

Dorigo and Colombetti (and their collaborators at the Politecnico di Milano) mostly worked with fairly small mobile robots, equipped with sensors and motors, controlled by software running on desktop computers, to which the robots were connected in real time by various sorts of modem or cable. (Sometimes they merely used simulations of such machines.) Those controllers were classifier systems. Classifier systems have, basically, three parts: a list of input messages, a list of output messages, and a set of classifiers. Each classifier consists of two parts, a condition and an action. Each classifier scans the list of input messages, and when it sees one which matches its condition, it is triggered, and adds its action to the list of output messages. Input messages come either from sensors, or from the output of other classifiers; output messages are intended either for the system, or for the effectors. In the case of conflicts, the action taken is decided by comparing the strength of the rules proposing the different actions (the stronger rule generally, but not always, wins, and the winner gets its strength reduced). When conflicts have been resolved, the output messages are sent to the effectors, and the internal memory is wiped clean for the next time-step. A genetic algorithm produces new classifiers from ones which have high strengths.
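
To make the cycle concrete, here is a minimal Python sketch of the match-and-act loop; the class names, the # wildcard convention, and the one-winner-per-cycle simplification are mine, not details of ICS.

    import random

    class Classifier:
        """One condition/action rule with an adaptive strength."""
        def __init__(self, condition, action, strength=1.0):
            self.condition = condition   # e.g. "1#0#"; '#' matches either bit
            self.action = action         # an output message, e.g. "0110"
            self.strength = strength

        def matches(self, message):
            return all(c == '#' or c == m
                       for c, m in zip(self.condition, message))

    def step(classifiers, input_messages):
        """One sense-match-act cycle of a memoryless classifier system."""
        triggered = [cl for cl in classifiers
                     if any(cl.matches(m) for m in input_messages)]
        if not triggered:
            return []
        # Stronger rules are more likely, but not certain, to win a conflict;
        # the strength changes themselves come from the bucket brigade, sketched below.
        winner = random.choices(triggered,
                                weights=[cl.strength for cl in triggered])[0]
        return [winner.action]   # output messages go to the effectors (or back in as input)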

The trick is to make useful classifiers have high strengths. What Dorigo and Colombetti do is to offer up a constant stream of positive or negative reinforcements to the robot; these add to (or subtract from) the strength of the rules which caused the latest action, and parts of those changes in strength get passed back to the classifiers which caused them to be active, and so on. (This is called the ``bucket brigade'' procedure; like classifier systems and genetic algorithms, it's a brainchild of John Holland.) A separate piece of code, the reinforcement program, is responsible for delivering these rewards and punishments. (Sadly, rewards and punishments are more effective than rewards alone.) Having a good reinforcement program is obviously a key to successful learning, and we'll return to the principles of designing one presently. For now, just note that it exists, and alternately whacks the robot and tosses it a biscuit.
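
Continuing the sketch above, one round of the bucket brigade might look roughly like this; the constant, and the single previous winner standing in for a whole chain of earlier classifiers, are again my simplifications.

    BID_FRACTION = 0.1   # fraction of its strength a winning rule stakes; value assumed

    def bucket_brigade_update(winner, previous_winner, reinforcement):
        """One round of credit assignment, much simplified: the winning rule pays
        a bid, which goes to whichever rule activated it on the previous step, and
        the trainer's reward or punishment goes to the winner itself."""
        bid = BID_FRACTION * winner.strength
        winner.strength -= bid                # the winner pays to act...
        if previous_winner is not None:
            previous_winner.strength += bid   # ...and the rule that set it up collects
        winner.strength += reinforcement      # plus the biscuit or the whack

Over many cycles, a reward delivered at the end of a useful chain of actions trickles backward, bid by bid, to the classifiers that started it.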

There are lots of ways to implement classifier systems, usually known by uninformative acronyms ending in -CS. Dorigo and Colombetti use an implementation of their own, called ICS, the I standing optimistically for ``improved''; it modifies the genetic algorithm somewhat in ways that, empirically, seem to speed up learning. Moreover, they've parallelized ICS (they call the result ALECSYS), and run it on parallel hardware (transputers). Actually, in most applications they use Alecsys to run several separate ICSs, each one trying to learn to do a separate piece of the task. (We'll return to this presently, too.)

This basic sort of classifier system retains no memory from iteration to iteration. To teach behavior sequences, either successive actions need to modify the environment enough to cue the next action (cf. James, Principles of Psychology, ch. iv), or the robot needs to retain some kind of internal ``state''. (Our authors point out that, at least when this state is very simple and unstructured, it's much of a muchness whether we think of it as ``memory,'' ``representation,'' ``goal'' or the like; ``state'' will do just as well, and carries less baggage.) Given some kind of internal state, which can be altered either by the experimenter (which seems, frankly, like cheating), or by the classifiers themselves, Alecsys is capable of learning sequences of behaviors without environmental cuing. It may, however, be necessary for the reinforcement program to take into account the current value of the state.
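
Carrying state over is a small change to the earlier sketch; in the toy version below I adopt the purely illustrative convention that messages whose first bit is 1 are internal, and survive into the next cycle instead of being wiped.

    def step_with_state(classifiers, sensor_messages, internal_messages):
        """Like step() above, but the winner may post an internal message that
        persists to the next cycle instead of driving the effectors."""
        inputs = sensor_messages + internal_messages
        triggered = [cl for cl in classifiers
                     if any(cl.matches(m) for m in inputs)]
        if not triggered:
            return [], internal_messages
        winner = random.choices(triggered,
                                weights=[cl.strength for cl in triggered])[0]
        if winner.action.startswith('1'):              # internal message: new state
            return [], [winner.action]
        return [winner.action], internal_messages      # effector message: state kept

A reinforcement program that needs to consult the state can simply be handed the internal messages as an extra argument.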

The actual tasks the robots are given have a somewhat classic feel to them --- follow lights, run home in the presence of noise, gather ``food'' and return home, hunt around for a light hidden behind an obstacle. (Readers of Braitenberg, or even of W. Grey Walter, will feel at home.) The robots themselves have motors turning their wheels at various speeds, photo-sensors which report light intensity averaged over broad fields (generally 180 degrees), bumpers, directional sonar. (One model, AutonoMouse II, featured on the cover, is cute enough that it would probably sell well as a toy. The others are less fetching.)

The advantage to analyzing (or, as they say, ``factoring'') complex behaviors into simple ones, and then assigning a separate classifier system to each basic behavior, is that it drastically contracts the size of the search space, and so speeds up learning, even when we include the time needed to train extra classifier systems to coordinate the basic ones. (Cf. Simon, Sciences of the Artificial, ch. 8.) Consider a robot which is supposed to chase after lights, except when it hears a loud noise, in which case it should run for shelter. Rather than trying to train up this behavior pattern as a whole (``holistic shaping''), it is much easier to first train one module to follow lights and another to run from noises, and then a third which decides whether to go with the chasing-module or the hiding-module. The reason such ``modular shaping'' works better is that we can build the following-module so it ignores noises and the position of shelter, and the hiding-module so it ignores lights, shrinking the space of possible classifiers they need to search through. (The coordinator doesn't have to worry about sensory input at all, really; it just has to learn that hiding trumps chasing.) In fact, the coordinator --- or coordinators, for even more complex behavior patterns --- can be trained after all the basic behaviors have been honed to an acceptable degree of skill, and then ``frozen,'' their learning algorithms disabled. Modular shaping works better with a ``distributed'' architecture (multiple specialized classifier systems) than it does with ``monolithic architectures,'' which, conversely, do better with holistic shaping. (``Distributed'' is perhaps not a happy choice of adjective, since a monolithic behavioral architecture could be implemented on distributed processing hardware.) The drawback to modular shaping is that it requires the reinforcer to have some knowledge of how the desired behavior is implemented by the robot. The authors advise very strongly that complex desired behaviors be analyzed into simpler ones, and that this be carried as far as possible, mostly for reasons of learning efficiency, but also so that the final behavior of the robot is at least partially comprehensible to its creator. (Another advantage, not remarked on so far as I saw, is the possibility of using the same basic behavior, or even low-level complex behavior, in multiple robots --- producing interchangeable behavioral parts.)
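
As a caricature of the distributed architecture (all names and interfaces are mine, not the book's), each basic module sees only the slice of the senses it needs, and the coordinator merely decides whose proposal reaches the motors.

    def chase_or_hide(chase_module, hide_module, coordinator, percept):
        """Toy coordinated step.  The first three arguments are callables
        standing in for separately trained (and frozen) modules; the
        coordinator has only learned which module's proposal to pass on."""
        proposals = {
            "chase": chase_module(percept["lights"]),    # never sees noises or the shelter
            "hide":  hide_module(percept["noise"]),      # never sees the lights
        }
        return proposals[coordinator(percept["noise"])]  # in effect: hide whenever it's loud

    # For instance, with the modules reduced to stubs:
    action = chase_or_hide(
        lambda lights: "steer toward the brighter side",
        lambda noise: "run for the shelter",
        lambda noise: "hide" if noise > 0.5 else "chase",
        {"lights": (0.2, 0.9), "noise": 0.8})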

The other half of their ``behavior analysis and training'' method (or BAT) has to do with the design of the reinforcement program. If we are to engage in modular shaping, we need to first devise an appropriate scheme of reinforcement for each basic behavior --- which is often not too hard. The real work often comes when we need to devise the reinforcement schedule for the coordinators, since here we need to make sure that purely local decisions about what to reward will lead to correct global behavior. Hard as this is, ``solving'' the specification of the desired behavior for an explicit program which will implement it is generally much, much harder, so we still come out ahead by using reinforcement and learning. Moreover, the reinforcement program need only worry about the gross decomposition of the behavior into its components; how those are implemented by the controller is of no concern to it. In principle, the reinforcement program is completely indifferent to whether the modules of the controller are classifier systems, some other kind of production system, neural networks, pieces of Lisp code in a genetic programming system, hidden Markov models, or little blobs of green proteinaceous goo from Area 51. Our authors' use of classifier systems is, as they note, merely a matter of practical convenience.
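
For instance, a per-step reinforcement program for the basic follow-the-light behavior might amount to nothing more than the following; this is a guess at the flavor, not the book's actual schedule, and the point is that nothing about the controller's implementation appears in it.

    def follow_light_reinforcement(previous_intensity, current_intensity):
        """Hypothetical per-step trainer for one basic behavior: a small reward
        for gaining on the light, a small punishment for losing ground."""
        if current_intensity > previous_intensity:
            return +1.0    # biscuit
        if current_intensity < previous_intensity:
            return -1.0    # whack
        return 0.0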

(It would be nice if we could design a reinforcement program which just worried about the global correctness of the robot's behavior, rather than delivering constant small nudges which may or may not end up moving it in the right direction. Unfortunately, ``delayed reinforcement'' learning is a can of worms for many reasons. Probably the most intractable one is the ``credit assignment'' problem: how to decide which classifiers, perhaps active a long time ago, were responsible for this particular reward. Our authors basically shake their heads sadly over this, but I wonder if some of the techniques used to teach sequential behaviors couldn't be adapted to this end.)

The experimental results are impressive. The robots fairly swiftly achieve high degrees of performance --- naturally higher for the basic behaviors than for the complex ones. Simulated robots learn better than real ones: this was expected, because real components never work quite like they're supposed to, but the learning procedure is able to adapt to the idiosyncrasies of real components, and even to fairly drastic mutilations (``lesions'') of the robot, with success, if not perhaps aplomb. To show that the technique has no special affinity for their hand-built robots, they also report successful attempts to use it to teach commercially available machines, including a vicious-looking industrial manipulator arm, here called CRAB.

This book is rather light on theory. (For instance, there are no quantitative explanations for why different learning regimes work better than others, just the empirical observation that they do.) The BAT ``methodology'' is basically the (sound) injunction to divide and conquer, plus (sensible) advice on some good ways to divide up your problem. (Some of this advice may be specific enough to turn into code; we'll see.) Still, ``behavior engineering'' is an idea whose time is coming --- there're simply too many uses for artificial autonomous devices (physical or computational) for people not to try to make them, and we will figure out design principles for them. Robot Shaping presents practical engineering advice in clear prose with a minimum of technicality, along with detailed and inspirational descriptions of experiments that worked. I earnestly recommend it to anyone interested in agent-based modeling, adaptive control and learning, and, of course, robotics and artificial life.


Errata: p. 53, for ``does not operate any information abstraction,'' read ``does not abstract any information''; p. 99, for ``reward received by the trainer,'' read ``reward received from the trainer''; p. 185, for ``rationality motivations,'' read ``rational motivations''.
203 pp., numerous black-and-white diagrams, a few black-and-white photos, end-notes, bibliography, index of names and subjects
Artificial Life / Computers and Computing
Currently in print as a hardback, ISBN 0-262-04164-2, US$37.50, LoC TJ211.35 D67
17 November 1999