## December 11, 2003

### Intermittent Finds in Complex Systems and Stuff, No. 1

IFiCSaS will be an irregular series of comments on papers I've read recently, and which feel like they have some kind of connection to my aggressively ill-defined field. I borrowed the name from John C. Baez's This Week's Finds in Mathematical Physics, from which I, like many other physicists of my generation, derived a great deal of my scientific education. This isn't going to be as regular (hence "intermittent", not "this week's"), or as focused (hence "and stuff"), or even as cohesive within each installment (because that's too much work). Whether it will be worth anyone's time, I dunno.

Oh, one more thing before we begin. Speculative, anti-reductionist and essayistic writing making broad claims is an important part of the tradition of complex systems, and I'd like to ask anyone who's looking for that to leave, right now.

Dominik Janzing and Daniel J. L. Herrmann, "Reliable and Efficient Inference of Bayesian Networks from Sparse Data by Statistical Learning Theory", cs.LG/0309015; also Pawel Wocjan, Dominik Janzing and Thomas Beth, "Required sample size for learning sparse Bayesian networks with many variables", cs.LG/0204052.

Graphical probability models, a.k.a. Bayesian networks, are a way of representing the statistical dependencies among multiple variables which facilitaties all kinds of calculations, and lets one prove probabilistic results by manipulating the graphs. (See my notebook, and the books recommended therein.) Some computer scientists call them "Bayesian networks", because they've been misinformed that any application of conditional probability is "Bayesian". (If that last sentence means nothing to you, consult Prof. Mayo.) It would be very nice to be able to learn graphical models from data, in a statistically reliable fashion. The very nice work Janzing et al. have done here is to provide bounds on the VC dimension of different classes of graphs, in terms of the number of nodes and their in-degree, i.e., the number of variables and the number of connections per variable. The lead paper uses this to give a structural risk minimization algorithm, efficient in both data-size and computational time, for learning sparse graphical models.

Dean P. Foster and H. Peyton Young, "Learning, hypothesis testing, and Nash equilibrium," Games and Economic Behavior 45 (2003): 73--96; pdf

Peyton Young will be familiar to constant readers for his work on evolutionary game theory and the self-organization of social institutions. Dean Foster does very nice work on information theory, learning and statistics. This extremely cool paper shows how agents can learn themselves into Nash equilibria, or very nearly so:

Consider a finite stage game G that is repeated infinitely often. At each time, the players have hypotheses about their opponents' repeated game strategies. They frequently test their hypotheses against the opponents' recent actions. When a hypothesis fails a test, a new one is adopted. Play is almost rational in the sense that, at each point in time, the players' strategies are e-best replies to their beliefs. We show that, at least 1-e of the time t these hypothesis testing strategies constitute an e-equilibrium of the repeated game from t on; in fact the strategies are close to being subgame perfect for long stretches of time. This approach solves the problem of learning to play equilibrium with no prior knowledge (even probabilistic knowledge) of the opponents' strategies or their payoffs.

The hypotheses have the form of assuming that the other players in the game are basing their moves on some function of the last m moves, for some fixed and finite m. This isn't right, because they're doing hypothesis-test learning, too, but in equilibrium it looks right, and near-equilibrium configurations are persistent and tend to drift into equilibrium. These are proper frequentist hypothesis tests, too, which is nice, since there is some evidence from experimental psychology (not discussed in this paper) that people in sequential prediction tasks act very much as though they were doing a kind of learning-through-testing.

Beong Soo So, "Maximized log-likelihood updating and model selection", Statistics and Probability Letters 64 (2003): 293--303

This is an interesting attempt to make better statistical sense of Jorma Rissanen's minimum description length principle, especially in its "predictive MDL" avatar. Rissanen defines the predictive description length, within a given model class, as $S(t) = \sum_{t=1}^{n}{l_t(\hat{\theta}(t-1))}$, where $l_t(\theta)$ is the negative log-likelihood of the data seen up to time $t$, under model $\theta$, and $\hat{\theta}(t-1)$ is the likelihood-maximizing value of the parameter within the class, given the data up to time $t-1$. Think of this as starting with some guess about a model, predicting one step ahead, then revising your model in light of what actually happened, and so on. The predictive description length is the sum of the log likelihoods you assigned at each step to what actually happened, so it measures the cumulative length of the encoding you'd assign at each step as you went along, with the subtlety that the code you're using gets revised (within a fixed class) at each step.

What So does is to decompose the predictive description length into two terms. One is simply the summed log-likelihoods, i.e. the accumulated coding lengths, using the parameter estimated from the total data (rather than re-estimating at each step). The other term is a penalty, equal to the sum of $S_t(\hat{\theta}_{t-1}) - S_t(\hat{\theta}_t)$. At each step, this is the improvement in coding made possible by knowing $x_t$, over and above knowing the previous values of $x$. Thus we are penalizing model classes which are hard to estimate, in the sense that they continue to show big changes in the optimal parameter value as more data arrives. So never spells out that intuition, but does show how to re-write his penalty term in terms of the derivative of the log-likelihood with respect to the parameters, and ultimately a Fisher-information-like matrix of the derivatives of the coding lengths with respect to the parameters.

Note that So gives the wrong citation for the paper of Rissanen's on which he draws, both in the abstract and in the bibliography; the correct citation is Journal of the Royal Statistical Society B 49 (1987): 223--239.

Peter Mandik and Andy Clark, "Selective Representing and World-Making", Minds and Machines 12 (2002): 383--395 [CiteSeer has a copy of the PDF]

Ernest Gellner used to say that a large part of modern philosophy consists of the "care and feeding of Cartesian demons", the creatures which threaten us with the prospect that everything we know is wrong, or at best a dream. This paper sets out to starve one species of evolutionary demon, namely the one which says that organisms evolve to represent only those aspects of the world which are relevant to their ecological niches, therefore no organism truly represents the world. This is a variant of the obviously stupid argument which David Stove identified at the heart of (all) idealism and (most) relativism, and dubbed "the Gem". In Mandik and Clark's formulation, "the only world that we represent is a world that is represented by us", therefore "it depends on being represented by us". Mandik and Clark say that can't possibly be what their opponents mean, but I think they're wrong in saying so (though it speaks highly of their courtesy). They argue (correctly) that not only is there no incompatibility between the fact of selective representation and "the realist conception of a mind-independent world", but that "the latter provides the most powerful perspective from which to motivate and understand the differing perceptual and cognitive profiles" of different organisms. They go on to note, sounding one of Clark's recurring themes, that the thesis seems particularly inapplicable to people, given out apptitude for expanding our perceptual and cognitive systems through new technologies and social interaction.

Incidentally, Clark should really update his publication list, and Mandik should revive his blog. Moreover, I am extremely skeptical of the claim, which they take as given, that a tick's representation of the world consists of "three receptor cues" (buytric acid, pressure and temperature) and " three effector cues" (dropping, running about and borrowing). The beasts could hardly reproduce, if that was the case, and in any event some of them have eyes. (A quick google turns up this handy guide to the anatomy of ixodid ticks, for instance.) The arguments don't turn on this point, however.

Jochen Bröcker and Ulrich Parlitz, "Analyzing communication schemes using methods from nonlinear filtering",Chaos 13 (2003): 195--208 [PDF]

Here's the abstract: "We investigate a certain class of communication schemes including chaotic systems. Nonlinear filtering theory is employed to obtain a representation of the optimal receiver. Using known results on the filtering process we investigate the bit error probability. It is well known that in general there is no closed form expression of the nonlinear filter. Therefore, in practice approximations are necessary for the nonlinear filter in general and the optimal receiver in particular. We obtain bounds on the approximation error using stability properties of the filter. These bounds also apply to approximations of the optimal receiver."

The basic idea of filtering, a.k.a. state estimation, goes like this. There is some state $X(t)$, of which we observe a noisy function $Y$. Assuming the dynamics of the hidden state is known, and we have a sequence of observations of $Y$, how should we estimate $X$? That is, what filter should we apply to the $Y$ signal to recover the state? In the 1940s, Norbert Wiener and Andrei Kolmogorov (independently) found the solution which gives the optimal linear, time-invariant filter. In the early 1960s, Kalman and Bucy (together) found the optimal filter for systems with linear dynamics and Gaussian noise, and later that decade Stratonovich and Kushner (independently) solved the problem of the optimal nonlinear filter, which gives not just a point-estimate (like the Wiener or Kalman filters), but the whole probability distribution of the state conditional on all previous observations. Their nonlinear filter has the very nice property that it's recursive, meaning that our estimate at time $t+1$ is a function of our estimate at time $t$ and the new observation we take at $t+1$ --- we don't need to keep around the entire previous measurement history. There's been a lot of work on this theory in probabiliy and stochastics (see e.g. this site by R. W. R Darling, who has some nice papers on the uses of differential geometry in filtering), but people in nonlinear dynamics don't attend to it much.

Bröcker and Parlitz's paper is a nice illustration of why we should: it lets us solve some tricky problems! In particular they give a very nice analysis of some of the recently-popular schemes for encoding bits into dynamical systems, which has applications in communications and especially cryptography. Despite a few lapses in English grammar, this is a very well-written and well-laid-out paper, moving from a broad overview of the problems of communication theory, and the differences between the Shannon and the Wiener approaches thereto, to nonlinear filtering and applications, and even including, as an appendix, a review of the basics of the ergodic theory of Markov processes.

Gregory L. Eyink, "A Variational Formulation of Optimal Nonlinear Estimation", physics/0011049

While on the subject of nonlinear filtering, this somewhat older paper includes a very nice introduction to the problem and the Stratonovich-Kushner solution, as well as proposing a tractable numerical approximation scheme. In addition to straight-forward applications to our problems, complex-systems wallahs should be interested in the (sound) analogies Eyink draws to non-equilibrium statistical mechanics and turbulence.

Vladimir K. Vanag and Irving R. Epstein, "Segmented spiral waves in a reaction-diffusion system", Proceedings of the National Academy of Sciences (USA) 100 (2003): 14635--14638 [link]

Abstract: "Pattern formation in reaction-diffusion systems is often invoked as a mechanism for biological morphogenesis. Patterns in chemical systems typically occur either as propagating waves or as stationary, spatially periodic, Turing structures. The spiral and concentric (target) waves found to date in spatially extended chemical or physical systems are smooth and continuous; only living systems, such as seashells, lichens, pine cones, or flowers, have been shown to demonstrate segmentation of these patterns. Here, we report observations of segmented spiral and target waves in the Belousov-Zhabotinsky reaction dispersed in water nanodroplets of a water-in-oil microemulsion. These highly ordered chemical patterns, consisting of short wave segments regularly separated by gaps, form a link between Turing and trigger wave patterns and narrow the disparity between chemistry and biology. They exhibit aspects of such fundamental biological behavior as self-replication of structural elements and preservation of morphology during evolutionary development from a simpler precursor to a more complex structure."

This is an extremely cool experimental result, and the movies published as supporting information are not to be missed.

Next time, I'll try to talk about distributed information in networks, but no promises.