Rissanen begins with a chapter trying to justify his approach, talking about general issues of methodology and the meaning and ends of statistics. He rejects the idea of a true distribution, even of a true model, but is curiously willing (like the Bayesians) to take his data on trust. He equally rejects the idea of using probabilities to represent degrees of belief; prior distributions play a role in his system, as we'll see, but in a quite non-Bayesian way. The aim of statistical inquiry is to "learn from the data", which he takes to mean find a minimal model capable of generating the data; essentially, we are to subordinate everything to what the old empirio-critics like Ernst Mach and Karl Pearson called "economy of thought". (He displays no awareness of this line of intellectual descent.) If this is to be more than a slogan, we need a way to gauge the size or complexity of models, which Rissanen will presently supply. If we are to agree that this is the correct approach, we need to be convinced either that minimal models are also going to be good or even optimal for the other jobs we want to do with statistics, like prediction or (to be very old-fashioned) explanation; or that those aren't jobs worth doing. Rissanen takes the latter tack, but provides little more than bluster, and conventional doubts about our ability to extrapolate from the past to the future.
The second chapter gets down to mathematical business. He first defines "codes" in general, as relations between one species of symbol (in which, say, we record our data) and another (say, binary strings), and some particularly important kinds of codes. Probability distributions over symbols induce codes with certain optimality properties --- basically, the most frequently occurring symbols get encoded by the shortest strings. Equally, codes with those optimality properties define probability distributions (in which the symbols corresponding to the shortest code-strings are deemed most probable). He then goes on to define a number of measures of information and/or complexity in terms of coding schemes. The first of these is the Shannon entropy of a probability distribution, what's usually called information in communication-theory contexts --- essentially the mean length of words in the optimal code. Then comes Kolmogorov or algorithmic complexity, the length of the shortest computer program which will produce a particular string. (We've gone over both in another place.) Neither of these, and particularly not the latter, is satisfactory for purposes of statistical inquiry:
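That correspondence between codes and distributions is easy to make concrete. A minimal Python sketch (my illustration, not the book's): under a distribution p, the ideal codeword length for a symbol s is -log2 p(s) bits, so the commonest symbols get the shortest codewords, and the Shannon entropy is just the mean ideal length.

```python
from math import log2

def ideal_code_lengths(dist):
    """Ideal codeword lengths in bits: frequent symbols get short codes."""
    return {s: -log2(p) for s, p in dist.items() if p > 0}

def entropy_bits(dist):
    """Shannon entropy: the mean ideal codeword length under dist."""
    return sum(p * -log2(p) for p in dist.values() if p > 0)

dist = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = ideal_code_lengths(dist)   # "a" gets 1 bit, "d" gets 3
H = entropy_bits(dist)               # 1.75 bits per symbol
```

For dyadic probabilities like these the ideal lengths are whole numbers and the entropy is achieved exactly; in general one rounds up and pays a fraction of a bit.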
"It has sometimes [been] enthusiastically claimed that the algorithmic complexity provides an ideal solution to the inductive inference problem, and that 'all' we need is to find an approximation to the non-computable algorithmic complexity and use the result to do prediction and the other inferences of interest. Well, this is a tall order, for there is nothing in a universal computer that helps us to find a good model of a string. In fact, if we already know the relevant properties of the string we can always write good programs for it, but we don't learn the properties by writing programs in the hope of finding short ones! The central task in inductive inference is and remains to find good models, and whether they are in the form of a computer program or models of a different kind, even non-computable [!], is quite immaterial. As a matter of fact, it is the main thesis in this text that by far the most useful models; ie, [sic] ones that we are likely to find most easily, are parametric probability models. We also suspect that in artificial intelligence, where models are to be found by a computer, the best strategy is to fit models in the probabilistic class, rather than embarking on a blind search for an approximation to the algorithmic complexity."

This naturally leads to the question of how to define the complexity of a probabilistic model. Rissanen's suggestion is first to settle on a coding scheme D, which (remember) implicitly defines the probability of getting any particular data-stream. Then if we see the stream x, we say the stochastic complexity is I(x|D) = - log P(x|D). It may be objected that this is still incomputable, for the same reasons as the algorithmic complexity. Rissanen's solution is simply to restrict the model class, to say that only certain models are allowed, and that of these we should choose the one which minimizes the description length for the given data; and this is a solvable problem.
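A toy sketch of this restricted-minimization recipe (my example; the model class, data, and uniform model-code are all invented for illustration): allow only a handful of Bernoulli coin models, charge log2 of the class size to name a model, and pick the model giving the shortest total description of the data.

```python
from math import log2

def stochastic_complexity(x, p):
    """-log2 P(x|model): the code length of x under the model's optimal code."""
    return -sum(log2(p if bit else 1.0 - p) for bit in x)

# Restricting the model class makes the minimization solvable: a uniform
# code over k allowed models costs log2(k) bits to name the chosen one.
models = [0.1, 0.3, 0.5, 0.7, 0.9]
x = [1, 1, 0, 1, 1, 1, 0, 1]              # a toy binary data-stream

def total_length(p):
    return log2(len(models)) + stochastic_complexity(x, p)

best = min(models, key=total_length)      # the MDL choice from this class
```

For this stream (six ones out of eight) the winner is the 0.7 coin: the 0.9 coin pays too dearly for the two zeros, and the fair coin wastes bits on the ones.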
Chapter three formulates this "minimum description length" (MDL) principle in a number of different ways (all at least asymptotically equivalent). All of them involve a class of models M, and a scheme for encoding its parameters. The coding scheme implicitly sets the probability of each member of M, and this is our prior distribution. (The Bayesian prior, by contrast, records our prejudices about the various models, including the ones "queerer than we can suppose.") Since the members of M are themselves probability distributions, each in turn defines an optimal code over possible data-streams. So the description-length of a particular data-stream x, relative to a given model m, is the description-length of x in m's code, plus the description-length of m in M's code (plus some small corrections). We can thus find the member of M which minimizes the description-length of a given data-stream, and if we have several classes of models we can find the model-class with the minimum description length. An interesting variant on this is to look at "predictive coding schemes", ones which effectively try to guess what the next item in the data-stream will be. Rissanen is tempted to adopt this as his basic criterion, but refrains, among other things because of what we might call the Permutation City problem: "[One] defect with the predictive code length is that the data must be ordered. There is of course the very best order, the one for which the predictive code length is minimized, but to find that we need to examine n! orders", n being the number of data-points we possess. Rissanen particularly stresses the ability of MDL methods to accommodate families of models with a variable number of parameters.
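A minimal sketch of the two-part scheme, with the variable parameter count in play (my example; I use Rissanen's asymptotic (k/2) log2 n bits as the cost of encoding k estimated parameters): on a strongly alternating binary stream, a first-order Markov model earns its second parameter, beating an i.i.d. coin on total description length.

```python
from math import log2

def nll_order0(x):
    """ML negative log-likelihood in bits under an i.i.d. Bernoulli model."""
    n, ones = len(x), sum(x)
    def h(p):
        return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))
    return n * h(ones / n)

def nll_order1(x):
    """ML negative log-likelihood in bits under a first-order Markov model."""
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for prev, nxt in zip(x, x[1:]):
        counts[(prev, nxt)] += 1
    bits = 0.0
    for a in (0, 1):
        total = counts[(a, 0)] + counts[(a, 1)]
        for b in (0, 1):
            c = counts[(a, b)]
            if c:
                bits += -c * log2(c / total)
    return bits

def two_part_length(x, nll, k):
    """Data cost plus Rissanen's asymptotic (k/2) log2 n parameter cost."""
    return nll(x) + 0.5 * k * log2(len(x))

x = [0, 1] * 20                           # strongly alternating toy stream
L0 = two_part_length(x, nll_order0, k=1)  # coin: 40 + 1.33 bits
L1 = two_part_length(x, nll_order1, k=2)  # Markov: 0 + 2.66 bits; L1 < L0
```

The coin sees half ones and charges a full bit per symbol; the Markov model predicts each transition perfectly, and its extra parameter costs only (1/2) log2 n more.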
Chapter four applies the MDL criterion to standard problems of statistical inference, i.e., to parameter estimation and to hypothesis testing. In both cases the standard techniques, due to Jerzy Neyman and Egon Pearson (Karl's son), are derided as unnatural, and we are told to always pick the parameter-value or hypothesis-class which gives us the smallest description length. (This reproduces the Neyman-Pearson result when testing a simple hypothesis against a simple alternative, but not for more complicated cases, and we lose the freedom to set a significance level.) In general the discussion here seems very forced, and very remote from the actual uses to which frequentist statistical inference is put in the lab. (For instance, one of his objections to Neyman-Pearson hypothesis testing is that, with probability 1, every hypothesis will eventually be rejected at any confidence level we please; readers may amuse themselves by detecting the fallacy involved.) Fair's fair, however: Rissanen does castigate various Bayesian excesses, particularly idolatry of the maximum likelihood principle.
Chapters five and six apply the MDL criterion to various ways of doing linear regression, and to predicting time-series, respectively. He can't resist claiming that his techniques give more accurate results than others, but betrays his principles somewhat by doing so. In the case of time-series, Rissanen leans towards a variation on predictive coding called "predictive least squares". He even advocates re-arranging the series so as to make prediction as easy as possible, though this ignores the coding cost of the re-ordering itself. Finally, chapter seven looks at the MDL solution to various kinds of classification problems.
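Predictive coding is easiest to sketch for a binary stream (a toy analogue only: predictive least squares proper scores regression models by squared prediction error, and the Laplace estimator here is my stand-in predictor, not Rissanen's): each symbol is encoded at the probability estimated from the symbols before it, so no parameters ever need to be transmitted.

```python
from math import log2

def predictive_code_length(x):
    """Prequential code length in bits: each symbol is encoded using a
    probability estimated only from the symbols that came before it."""
    bits, ones = 0.0, 0
    for t, bit in enumerate(x):
        p_one = (ones + 1) / (t + 2)       # Laplace estimator from the past
        bits += -log2(p_one if bit else 1.0 - p_one)
        ones += bit
    return bits

x = [0] * 10 + [1] * 10
L_pred = predictive_code_length(x)         # about 21.9 bits
L_hindsight = 20.0                         # code length at the ML estimate
# The gap, roughly (1/2) log2 n, is the price of predicting without hindsight.
```

The predictive length exceeds the hindsight (maximum-likelihood) length by about the same parameter cost the two-part code charges explicitly, which is why the two formulations agree asymptotically.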
If Rissanen's brand of statistical inference has substantial advantages over the traditional sort, they elude me. Not only do we still need to guess at a good model class (what Neyman called picking the "class of admissible hypotheses"), we also need to guess at a good coding scheme for its members. I don't see anything in either the MDL criterion or Rissanen's general diktat to keep us from using a totally different code on each distinct occasion, always giving the particular data-stream we have in hand a very short length indeed. This procedure is useless for doing anything but summarizing, for saying "thus and so" concisely ("ecce data"?), but what short summaries we'll have! I don't of course mean that Rissanen would approve of such behavior, just that he's left himself no way to rebuke it. (With the right choice of model class, this isn't a problem, but Rissanen has no principled way of forcing us to use a decent model class.)
Rather than leading to the all-embracing reform of statistics which Rissanen envisages, his ideas seem to me to be relevant to two limited (but important) tasks. The first is statistical or "lossy" compression, the compression of ensembles, which is handled badly, if at all, by traditional techniques. (One could equally well call this a species of machine learning. Either would justify labeling this a computer science book.) Second, it is possible to produce minimal predictive models without invoking prior distributions or coding schemes (but still making some assumptions about the model class); these models tell us about the causal structure of what we're observing, at least as that's reflected in the variables we look at. The relevant complexity measure here is closely akin to, and inspired by, his stochastic complexity, but not quite the same. (See these references.) So: not another reformation of statistics (that would make at least three this century, when one is enough for most disciplines), but a useful addition to the tool-kits of statisticians and complexity-wallahs.