Singapore: World Scientific, 1989

Rissanen begins with a chapter trying to justify his approach, talking about
general issues of methodology and the meaning and ends of statistics. He
rejects the idea of a true distribution, even of a true model, but is curiously
willing (like the Bayesians) to take his data on trust. He equally rejects the
idea of using probabilities to represent degrees of belief; prior distributions
play a role in his system, as we'll see, but in a quite non-Bayesian way. The
aim of statistical inquiry is to "learn from the data", which he takes to
mean *find a minimal model* capable of generating the data; essentially,
we are to subordinate everything to what the old empirio-critics like Ernst
Mach and Karl Pearson called "economy of thought". (He displays no
awareness of this line of intellectual descent.) If this is to be more than a
slogan, we need a way to gauge the size or complexity of models, which Rissanen
will presently supply. If we are to agree that this is the correct approach,
we need to be convinced either that minimal models are also going to be good or
even optimal for the other jobs we want to do with statistics, like prediction
or (to be very old-fashioned) explanation; or that those aren't jobs worth
doing. Rissanen takes the latter tack, but provides little more than bluster,
and conventional doubts about our ability to extrapolate from the past to the
future.

The second chapter gets down to mathematical business. He first defines
"codes" in general, as relations between one species of symbol (in which, say
we record our data) to another (say, binary strings), and some particularly
important kinds of codes. Probability distributions over symbols induce codes
with certain optimality properties --- basically, the most frequently occurring
symbols get encoded by the shortest strings. Equally, codes with those
optimality properties define probability distributions (in which the symbols
corresponding to the shortest code-strings are deemed most
probable). He then goes on to define a number of measures of information
and/or complexity in terms of coding schemes. The first of these is the
Shannon entropy of a probability distribution, what's usually called
information in communication-theory contexts --- essentially the mean length of
words in the optimal code. Then comes Kolmogorov or algorithmic complexity,
the length of the shortest computer program which will produce a
*particular* string. (We've gone over both in another place.) Neither of these, particularly
the latter, is satisfactory for purposes of statistical inquiry:

It has sometimes been enthusiastically claimed that the algorithmic complexity
provides an ideal solution to the inductive inference problem, and that "all"
we need is to find an approximation to the non-computable algorithmic
complexity and use the result to do prediction and the other inferences of
interest. Well, this is a tall order, for there is nothing in a universal
computer that helps us to find a good model of a string. In fact, if we
already know the relevant properties of the string we can always write good
programs for it, but we don't learn the properties by writing programs in the
hope of finding short ones! The central task in inductive inference is and
remains to find good models, and whether they are in the form of a computer
program or models of a different kind, even non-computable [!], is quite
immaterial. As a matter of fact, it is the main thesis in this text that by
far the most useful models; ie, [sic] ones that we are likely to find most
easily, are parametric probability models. We also suspect that in artificial
intelligence, where models are to be found by a computer, the best strategy is
to fit models in the probabilistic class, rather than embarking on a blind
search for an approximation to the algorithmic complexity.

This naturally leads to the question of how to define the complexity of a
probabilistic model. Rissanen's suggestion is to first settle on a coding
scheme.
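The duality between distributions and codes that underlies all of this can be
sketched numerically. The following is my own minimal illustration, not
anything from the book: Shannon's rule that a symbol of probability *p* ideally
gets a code word of about -log2(*p*) bits, so that the entropy is the mean
ideal code length.

```python
import math

def shannon_code_lengths(probs):
    """Kraft-optimal code lengths: a symbol of probability p ideally gets
    a code word of -log2(p) bits (rounded up for an actual prefix code)."""
    return {s: math.ceil(-math.log2(p)) for s, p in probs.items()}

def entropy(probs):
    """Shannon entropy: the mean ideal code length, in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs.values())

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(shannon_code_lengths(probs))  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(entropy(probs))               # 1.75 bits per symbol
```

For dyadic probabilities, as here, the rounded-up lengths achieve the entropy
exactly; in general the rounding costs less than one bit per symbol.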

Chapter three formulates this "minimum description length" (MDL) principle
in a number of different ways (all at least asymptotically equivalent). All of
them involve a class of models M, and a scheme for encoding its parameters.
The coding scheme implicitly sets the probability of each member of M, and this
is our prior distribution. (The Bayesian prior, by contrast, records our
prejudices about the various models, including the ones "queerer than we
*can* suppose.") Since the members of M are themselves probability
distributions, each in turn defines an optimal code over possible data-streams.
So the description-length of a particular data-stream *x,* relative to a
given model *m,* is the description-length of *x* in *m*'s
code, plus the description-length of *m* in M's code (plus some small
corrections). We can thus find the member of M which minimizes the
description-length of a given data-stream, and if we have several classes of
models we can find the model-class with the minimum description length. An
interesting variant on this is to look at "predictive coding schemes", ones
which effectively try to guess what the next item in the data-stream will be.
Rissanen is tempted to adopt this as his basic criterion, but refrains, among
other things because of what we might call the Permutation City problem: "[One] defect
with the predictive code length is that the data must be ordered. There is of
course the very best order, the one for which the predictive code length is
minimized, but to find that we need to examine *n*! orders", *n*
being the number of data-points we possess. Rissanen particularly stresses the
ability of MDL methods to accommodate families of models with a variable number
of parameters.
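The two-part description length can be illustrated on a toy model-selection
problem. The sketch below is my simplification, not Rissanen's code; it
assumes the standard MDL parameter cost of (k/2) log2 n bits for k real-valued
parameters, and compares a zero-parameter fair-coin model with a one-parameter
fitted Bernoulli model.

```python
import math

def two_part_length(x, k, neg_log_lik):
    """Two-part MDL code length: bits to encode the data under the fitted
    model, plus (k/2) * log2(n) bits to encode k real parameters
    (the standard quantization; an assumption of this sketch)."""
    return neg_log_lik + 0.5 * k * math.log2(len(x))

def bernoulli_bits(x, p):
    """Code length of a binary sequence under Bernoulli(p): -log2 likelihood."""
    ones = sum(x)
    zeros = len(x) - ones
    return -(ones * math.log2(p) + zeros * math.log2(1 - p))

x = [1] * 70 + [0] * 30                                   # 100 tosses, 70 heads
fair = two_part_length(x, 0, bernoulli_bits(x, 0.5))      # 0-parameter model
phat = sum(x) / len(x)
fitted = two_part_length(x, 1, bernoulli_bits(x, phat))   # 1-parameter model
print(fair, fitted)  # the fitted model wins despite paying for its parameter
```

With an unbiased-looking sample, by contrast, the parameter cost would tip the
balance back to the simpler model, which is how MDL penalizes extra parameters.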

Chapter four applies the MDL criterion to standard problems of statistical inference, i.e., to parameter estimation and to hypothesis testing. In both cases the standard techniques, due to Jerzy Neyman and Egon Pearson (Karl's son), are derided as unnatural, and we are told to always pick the parameter-value or hypothesis-class which gives us the smallest description length. (This reproduces the Neyman-Pearson result when testing a simple hypothesis against a simple alternative, but not for more complicated cases, and we lose the freedom to set a significance level.) In general the discussion here seems very forced, and very remote from the actual uses to which frequentist statistical inference is put in the lab. (For instance, one of his objections to Neyman-Pearson hypothesis testing is that, with probability 1, every hypothesis will eventually be rejected at any confidence level we please; readers may amuse themselves by detecting the fallacy involved.) Fair's fair, however: Rissanen does castigate various Bayesian excesses, particularly idolatry of the maximum likelihood principle.
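For the simple-against-simple case, the equivalence with the Neyman-Pearson
rule is easy to see: each simple hypothesis assigns the data a code length of
-log2 of its likelihood, and MDL picks the shorter, which is just the
likelihood-ratio test with a fixed threshold. A hypothetical sketch, with
made-up data:

```python
import math

def code_length_bits(x, p):
    """Code length of a binary sequence under a simple Bernoulli(p)
    hypothesis: -log2 of its likelihood."""
    ones = sum(x)
    zeros = len(x) - ones
    return -(ones * math.log2(p) + zeros * math.log2(1 - p))

# Testing H0: p = 0.5 against H1: p = 0.8 on an illustrative sample.
x = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
l0 = code_length_bits(x, 0.5)
l1 = code_length_bits(x, 0.8)
# MDL prefers whichever hypothesis encodes the data more briefly;
# the difference l0 - l1 is exactly the log2 likelihood ratio.
print("prefer H1" if l1 < l0 else "prefer H0")
```

Note that the MDL rule fixes the threshold at a likelihood ratio of one; the
freedom to trade off error types by choosing a significance level, which the
review notes is lost, is precisely the freedom to move that threshold.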

Chapters five and six apply the MDL criterion to various ways of doing linear regression, and of predicting time-series, respectively. He can't resist claiming that his techniques give more accurate results than others, but betrays his principles somewhat by doing so. In the case of time-series, Rissanen leans towards a variation on predictive coding called "predictive least squares". He also advocates re-arranging the series so as to make prediction as easy as possible, but this ignores the coding cost of re-ordering the series. Finally, chapter seven looks at the MDL solution to various kinds of classification problem.
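The predictive-least-squares idea can be made concrete: predict each point
from a model fit only to the earlier points, and score the model by its
accumulated squared one-step prediction errors. The toy version below is my
own construction, using two deliberately simple predictors rather than
Rissanen's autoregressive models.

```python
def pls_running_mean(x):
    """Predictive least squares for the constant model: predict each point
    by the mean of all earlier points, accumulating squared errors."""
    total, running_sum = 0.0, 0.0
    for t, xt in enumerate(x):
        if t > 0:                        # no prediction for the first point
            total += (xt - running_sum / t) ** 2
        running_sum += xt
    return total

def pls_persistence(x):
    """PLS for the persistence model: predict each point by its predecessor."""
    return sum((x[t] - x[t - 1]) ** 2 for t in range(1, len(x)))

x = [1, 2, 3, 4, 5, 6, 7, 8]             # a trending series
print(pls_persistence(x), pls_running_mean(x))
```

On this trending series the persistence predictor accumulates far less
squared error than the running mean, so the PLS criterion would select it;
note that both scores, and hence the selection, depend on the order of the
data, which is the objection raised above.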

If Rissanen's brand of statistical inference has substantial advantages over
the traditional sort, they elude me. Not only do we still need to guess at a
good model class (what Neyman called picking the "class of admissible
hypotheses"), we also need to guess at a good coding scheme for its members.
Generally, I don't see anything in either the MDL criterion, or Rissanen's
general *diktat,* to keep us from using totally different codes on each
distinct occasion, always giving the particular
data-stream we have in hand a very short length indeed. This procedure is
totally useless for doing anything but summarizing, for saying "thus and so"
concisely ("ecce data"?), but what short summaries we'll have! I don't of
course mean that Rissanen would approve of such behavior, just that he's left
himself no way to rebuke it. (Of course, with the right choice of model class,
this isn't a problem, but Rissanen has no principled way of forcing us to use a
decent model class.)

Rather than leading to the all-embracing reform of statistics which Rissanen
envisages, his ideas seem to me to be relevant to two limited (but important)
tasks. The first is statistical or "lossy" compression, the compression of
ensembles, which is handled badly, if at all, by traditional techniques. (One
could equally well call this a species of machine learning. Either would
justify labeling this a computer science book.) Second, it *is*
possible to produce minimal predictive models without invoking prior
distributions or coding schemes (but still making some assumptions about the
model class); these models tell us about the causal structure of what we're
observing, at least as that's reflected in the variables we look at. The
relevant complexity measure here is closely akin to, and inspired by, his
stochastic complexity, but not quite the same. (See these references.) So: not *another*
reformation of statistics (that would make at least three this century, when
one is enough for most disciplines), but a useful addition to the tool-kits of
statisticians and complexity-wallahs.

vi + 178 pp., bibliography (Rissanen being sole or joint author on a sixth of the entries), grossly inadequate index

Probability and Statistics / Self-Organization, Complexity, etc.

In print as a hardback, ISBN 9971-50-859-1, US$58. LoC QA267 R57

25 January 1999

Thanks to Margaret Alexander, Jim Crutchfield and Erik van Nimwegen