\documentstyle[aps,epsf]{revtex}
\newcommand{\Past} { \stackrel{\leftarrow} {S} }
\newcommand{\past} { \stackrel{\leftarrow} {s} }
\newcommand{\Future} { \stackrel{\rightarrow}{S} }
\newcommand{\future} { \stackrel{\rightarrow}{s} }
\newcommand{\PastL} { {\stackrel{\leftarrow} {s}}^L }
\newcommand{\PastBlock} { {\stackrel{\leftarrow} {S}}^L }
\newcommand{\FutureL} { {\stackrel{\rightarrow}{s}}^L }
\newcommand{\FutureBlock} { {\stackrel{\rightarrow}{S}}^L }
\newcommand{\PastLprime} { {\stackrel{\leftarrow} {s}}^{L^\prime} }
\newcommand{\FutureLprime}{ {\stackrel{\rightarrow}{s}}^{L^\prime} }
\newcommand{\AllPasts} { \stackrel{\leftarrow} {\rm {\bf S}} }
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\begin{document}
\title{Two Lectures on Computational Mechanics:\\
I. Mostly Leading Up to Causal States}
\author{Cosma Rohilla Shalizi}
\address{Physics Department, University of Wisconsin, Madison, WI 53706\\
and the Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501\\
Electronic address: shalizi@santafe.edu}
\date{18 June 1998}
\maketitle
\begin{quotation}
One man's rigor is another man's mortis.
--- Craig Bohren and Bruce Albrecht, {\it Atmospheric Thermodynamics},
introduction
\end{quotation}
This document is on-line at http://www.santafe.edu/$\sim$shalizi/comp-mech-lectures/.
\tableofcontents
\section{Information Theory}
\subsection{Entropy Defined}
Given a random variable $S: \Omega \mapsto {\cal A},$ $\Omega$ a probability
space and ${\cal A}$ a countable set, define the entropy of $S$ to be
\begin{eqnarray}
H\left[S\right] & \equiv & - \sum_{s \in {\cal A}}{P(S = s) \log{P(S = s)}}
\end{eqnarray}
with the convention that capital letters are random variables and lower-case
letters their particular values. (Notice that $H[S]$ is the expectation value
of $-\log{P(S = s)}.$)
$H\left[S\right]$ is interpreted as the {\it uncertainty in $S$}, as the mean
number of yes-or-no questions needed to pick out the value of $S$ on repeated
trials, if the questions are chosen as well as possible. This is perhaps
dubious, but justified by working well.
Any well-behaved probability distribution has an associated entropy.
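As a concrete check of the definition, here is a minimal sketch in Python (the function name and the toy distributions are our own, chosen for illustration); logarithms are taken base 2, so the entropy comes out in bits:

```python
from math import log2

def entropy(dist):
    """H[S] = -sum_s P(S = s) log2 P(S = s), in bits.
    `dist` maps each value s in the alphabet A to P(S = s)."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# A fair coin takes exactly one yes-or-no question to pin down:
print(entropy({"heads": 0.5, "tails": 0.5}))   # 1.0
# A biased coin is less uncertain:
print(entropy({"heads": 0.9, "tails": 0.1}))   # approximately 0.469
```

The `if p > 0` guard implements the usual convention $0 \log 0 = 0$.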
\subsection{Joint and Conditional Entropies}
We define the joint entropy of two variables $X$ (taking values in ${\cal A}$)
and $Y$ (taking values in ${\cal B}$) in the obvious way,
\begin{eqnarray}
H\left[X, Y\right] & \equiv & - \sum_{(x,y) \in ({\cal A} \times {\cal B})}{P(X
= x \cdot Y = y) \log{P(X = x \cdot Y = y)}}
\end{eqnarray}
and define the conditional entropy of one random variable on another from their joint
entropy,
\begin{eqnarray}
H\left[X | Y\right] & \equiv & H\left[X, Y\right] - H\left[Y\right]
\end{eqnarray}
which follows naturally from the definition of conditional probability, $P(X =
x | Y = y) \equiv {P(X = x \cdot Y = y) \over P(Y = y)}.$
We interpret $H\left[X|Y\right]$ as the uncertainty remaining in $X$ once we
know $Y.$
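The identity $H[X|Y] = H[X,Y] - H[Y]$ can be checked directly in code; the joint distribution below is an arbitrary example of ours:

```python
from math import log2

def H(dist):
    """Entropy in bits of a distribution given as {value: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# An arbitrary joint distribution P(X = x, Y = y) over {0,1} x {0,1}:
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginalize out X to get P(Y = y):
marg_y = {}
for (x, y), p in joint.items():
    marg_y[y] = marg_y.get(y, 0.0) + p

H_xy = H(joint)            # joint entropy H[X, Y]
H_y = H(marg_y)            # H[Y]
H_x_given_y = H_xy - H_y   # conditional entropy H[X | Y], by definition
print(round(H_x_given_y, 3))   # 0.722: less than H[X] = 1 bit
```

Since $X$ and $Y$ are correlated here, knowing $Y$ leaves less than the full bit of uncertainty in $X$.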
\subsection{Useful Inequalities}
The following inequalities and implications will prove useful. They're all
pretty intuitive, given our interpretation, and they can all be proved with
little more than straight algebra; see \cite[ch. 2]{cover-and-thomas}
\begin{eqnarray}
H\left[f(X)\right] & \leq & H\left[X\right]
\label{function-is-more-certain} \\
H\left[X, Y\right] & \geq & H\left[X\right] \\
H\left[X, Y\right] & \geq & H\left[Y\right] \\
H\left[X|Y\right] & \leq & H\left[X\right]
\label{conditioning-reduces-entropy} \\
H\left[X|Y\right] = 0 & {\rm iff} & X = f(Y)
\label{functions-are-certain} \\
H\left[X|Y\right] = H\left[X\right] & {\rm iff} & X {\rm \ is\ independent\ of\ }
Y \\
H\left[f(X)|Y\right] & \leq & H\left[X|Y\right] \\
H\left[X|f(Y)\right] & \geq & H\left[X|Y\right]
\label{conditioning-on-function} \\
H\left[X, Y|Z\right] & \geq & H\left[X|Z\right] \\
H\left[X|Y, Z\right] & \leq & H\left[X|Y\right]
\label{more-conditioning-reduces-entropy-more} \\
H\left[X, Y\right] & = & H\left[X\right] + H\left[Y|X\right] \\
H\left[X, Y|Z\right] & = & H\left[X|Z\right] + H\left[Y|X, Z\right]
\label{conditional-chain-rule}
\end{eqnarray}
The last two formul\ae, called the ``chain rules for entropies,'' are not of
course inequalities, but will be handy later on anyhow. (Strictly speaking,
Ineq. \ref{functions-are-certain} should have some ``except for cases of
measure 0'' language tacked on.)
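These inequalities are easy to spot-check numerically. The sketch below (with randomly generated joint distributions of our own devising) verifies two of the substantive ones over many trials:

```python
import random
from math import log2

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

random.seed(0)
for _ in range(200):
    # A random joint distribution over a 3 x 3 alphabet:
    w = [random.random() for _ in range(9)]
    z = sum(w)
    joint = {(x, y): w[3 * x + y] / z for x in range(3) for y in range(3)}
    mx = {x: sum(joint[x, y] for y in range(3)) for x in range(3)}
    my = {y: sum(joint[x, y] for x in range(3)) for y in range(3)}
    # Joint entropy dominates each marginal: H[X, Y] >= H[X].
    assert H(joint) >= H(mx) - 1e-9
    # Conditioning reduces entropy: H[X | Y] = H[X, Y] - H[Y] <= H[X].
    assert H(joint) - H(my) <= H(mx) + 1e-9
print("inequalities hold on all trials")
```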
\section{Patterns}
The general idea is that some object ${\cal O}$ has a pattern ${\cal P}$ (or a
pattern represented, described, captured, etc. by ${\cal P}$ --- insert your
favorite metaphysics here) iff we can use ${\cal P}$ to predict or compress
${\cal O}.$ (Ability to predict implies ability to compress, but not
vice-versa; we will mostly worry about prediction.)
\subsection{Ancestor-Worship}
This general notion goes back to Kolmogorov, who was interested in the {\it
exact} reproduction of fixed objects, in particular of binary numbers. His
candidates for ${\cal P}$ were universal Turing machine programs; in
particular, the shortest program which can produce ${\cal O}.$ (Anything which
is Turing-equivalent and where a notion of ``length'' makes good sense will do,
since we can convert from one such system to another --- say, from C to Turing
machines --- with only a finite description of the second system, and such
constants will be assimilated in a moment.)
In particular, look at the first $n$ digits of ${\cal O},$ ${\cal O}_{n},$ and
the shortest program ${\cal P}_{n}$ to produce them. What happens to the limit
\begin{equation}
\lim_{n \rightarrow \infty}{{|{\cal P}_{n}| \over n}}
\end{equation}
If there is a fixed-length program which can generate arbitrarily many digits
of the number, then this limit goes to 0. Most of our interesting numbers,
rational or irrational (e.g. $\pi, e, \sqrt{2}$), are of this sort. These
numbers are eminently compressible: the program is the compressed description,
the pattern which the sequence obeys. If the limit goes to 1, on the other
hand, we have a completely incompressible sequence.
There are many problems with the Kolmogorov complexity. It's uncomputable in
general (owing to the halting problem); it's maximal for random sequences; it
only applies to a single sequence; it makes no allowance for noise, demanding
exact reproduction.
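Kolmogorov complexity itself cannot be computed, but any off-the-shelf compressor gives an upper bound on description length, which is enough for a crude illustration of the $|{\cal P}_{n}|/n$ idea. The strings below are our own toy examples, and zlib is a stand-in, not the shortest program:

```python
import random
import zlib

def compressed_fraction(s):
    """Compressed length over original length -- a computable stand-in
    for |P_n| / n.  (A true shortest program would be shorter still.)"""
    data = s.encode()
    return len(zlib.compress(data, 9)) / len(data)

random.seed(0)
periodic = "01" * 50_000                                      # obeys a pattern
noise = "".join(random.choice("01") for _ in range(100_000))  # patternless
print(compressed_fraction(periodic))  # tiny: "print 01, n times" suffices
print(compressed_fraction(noise))     # much larger: no pattern to exploit
# (zlib still shrinks the noise somewhat, since each ASCII character here
# carries only one bit of actual information.)
```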
\subsection{Patterns with Noise}
An obvious next step is to allow some noise, in exchange for
shorter descriptions. Some work along these lines has been done by
philosophers like Dennett \cite{dennett}. The intuition is that we should be
able to get very simple models of really random processes --- ``to
model coin-tossing, toss a coin.'' But this is an unstable
intermediate position, because allowing noise brings in probability, and the
natural setting for probability is ensembles.
\subsection{Patterns in Ensembles}
Here a pattern ${\cal P}$ is something which lets us predict the future of
sequences drawn from the ensemble ${\cal O}$ at better than chance rates: it
has to be statistically accurate, and confer some leverage, some advantage, as
well. Let's fix some notation, and make some assumptions that will later let
us prove neat theorems.
Let ${S}_{i}$ be a stationary stochastic process which, as before, takes values
from a countable set ${\cal A}.$ We let $i$ range over all the integers, so we
get a bi-infinite sequence, which we can break at any point we choose into a
semi-infinite past, $\Past,$ and a semi-infinite future, $\Future.$ (The last
$L$ values of the process are $\PastBlock,$ the next $L$ values $\FutureBlock.$) We want
to predict all or part of $\Future$ from some function of some part of $\Past.$
Call the set of all pasts $\AllPasts.$ Essentially our job is to divide this
up into {\it equivalence classes}, classes of pasts which are equivalent for
purposes of predicting the future. Any function on the pasts will induce such
classes --- simply say two pasts are equivalent if they give the same value for
the function --- so we'll talk indifferently about classes of pasts, indices
for classes of pasts, and states. In general, the function from histories to
states will be written $\eta(\Past)$ and will take us to a state ${\cal R}$:
${\cal R} = \eta(\Past).$ These collections of states are {\it partitions}
of $\AllPasts$ in the technical set-theoretic sense, a fact which will be
useful later. The class of all these partitions is sometimes called ``Occam's
pool'' --- the joke will make more sense in a little bit.
We say that $\eta$ captures a pattern iff, for some $L$,
\begin{equation}
H\left[\FutureBlock | {\cal R}\right] <
H\left[\FutureBlock \right]
\end{equation}
This says that $\eta$ captures a pattern iff it tells us something about
futures of length $L$. (You can use the inequalities to show that it then
tells us something about futures of any length ${L}^{\prime} > L$.)
Weaker notions of pattern are of course possible, but the
reasons for our using this one (essentially, we can prove theorems about it)
will be apparent shortly. Note that sticking to finite-length futures
means that we only have to deal with finite entropies.
Since $H\left[X|Y, Z\right] \leq H\left[X|Y\right]$
(Ineq. \ref{more-conditioning-reduces-entropy-more}), it follows that $\forall
n, H\left[\FutureBlock | {\Past}^{n}\right] \geq H\left[\FutureBlock |
{\Past}^{n+1}\right],$ and so $\forall n, H\left[\FutureBlock | {\Past}^{n}\right]
\geq H\left[\FutureBlock|\Past\right].$ Since $H\left[X|f(Y)\right] \geq
H\left[X|Y\right]$ (Ineq. \ref{conditioning-on-function}), for any $\eta,$
\begin{equation}
H\left[\FutureBlock|{\cal R}\right] = H\left[\FutureBlock|\eta(\Past)\right] \geq
H\left[\FutureBlock|\Past\right].
\label{cant-beat-the-past}
\end{equation}
That is, conditioning on the whole of the past reduces the uncertainty in the
future to as small a value as possible; but carrying the whole semi-infinite
past around is a rather bulky and uncomfortable prospect. (Put a bit
differently: We're Americans and would like to forget as much of the past as we
possibly can.)
Let's invoke Occam's razor: ``It is vain to do with more what can be done with
less.'' To use it we need to fix what is to be done, and what ``more'' and
``less'' mean. Now, the job we want done is getting $H\left[\FutureBlock|{\cal
R}\right]$ down as far as possible, for all lengths $L$ at once, to the
just-established limit of $H\left[\FutureBlock|\Past\right].$ But we want to do
this as simply as possible, with as little as possible. Now, because $P(\Past
= \past)$ is well-defined, there's an induced measure on the $\eta$-states,
i.e. $P({\cal R}=r)$ is well-defined. But this means we can calculate the
entropy of the distribution over states, $H\left[{\cal R}\right].$ This
uncertainty in our current state is the average amount of memory, in bits, that
we retain about the past; we would like to do with as little of this as
possible. (Again, this is America.) The quantity $H\left[{\cal R}\right]$ is
called the ``statistical complexity'' or ``machine complexity'' of the set of
states $\{ r \}$; so we want to minimize statistical complexity, subject to the
constraint of maximally accurate prediction. (The idea behind calling the
class of all partitions of $\AllPasts$ Occam's pool should now be clear: we
want to find the shallowest point in the pool.)
\section{Causal States}
It would be nice if we could take our two constraints (maximally accurate
prediction and minimal statistical complexity), make them axioms, and show that there's only
one set of states (and so one function from histories to those states) which
satisfy them; at least, I suppose it would be nice if you like axiomatic
systems. No such proof is known. What we can prove is that states defined in
a certain way satisfy those constraints. (I suspect they're the {\it only} set
of states which satisfy those constraints, but haven't shown it yet.)
\subsection{Definition of Causal States}
We define the function $\epsilon$ from histories to classes of histories thus:
\begin{eqnarray}
\label{def-of-causal-states}
\epsilon(\past) & \equiv & \left\{ {\past}^{\prime} | \forall \future
\left(P\left(\Future = \future | \Past = \past\right) = P\left(\Future = \future
| \Past = {\past}^{\prime}\right)\right) \right\}
\end{eqnarray}
The range of $\epsilon$ consists of the {\it causal states} of the process.
Alternatively and equivalently (exercise!), we could define an equivalence
relation $\sim$ such that two histories are equivalent iff they have the same
conditional distribution of futures, and then define causal states as the
equivalence classes generated by $\sim.$ Either way, the divisions of this
partition of $\AllPasts$ are made between regions which leave us in different
degrees of ignorance about the future.
\begin{figure}
\epsfxsize=2.7in
\begin{center}
\leavevmode
\epsffile{EpsilonPartition.eps}
\end{center}
\caption{A schematic representation of the partitioning of the set $\AllPasts$
of all histories into causal states $\{ {\cal S}_i \}.$ Within each causal
state all the individual histories $\Past$ have the same conditional
distribution $P(\Future | \Past)$ for future observables. Note that
the ${\cal S}_i$ need not form compact sets; we have simply drawn them that way
here for clarity.}
\label{epsilon-partition}
\end{figure}
In the statistical inference and statistical explanation literature, they'd say
that causal states are the ``statistical-relevance basis for causal
explanations,'' where the elements of the basis are maximal combinations of
independent variables with statistically different distributions for the
dependent variables. (See \cite{salmon} for the gory details.)
More colloquially: The causal states record every distinction that makes a
difference.
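The construction can be carried out empirically for a toy process. The sketch below uses our own example, a two-state Markov chain, estimates conditional future distributions from a simulated path, and groups histories whose estimates agree to one decimal place (a crude surrogate for exact equality of distributions); the groups approximate the causal states, and the entropy of the induced state distribution is the statistical complexity:

```python
import random
from collections import defaultdict
from math import log2

random.seed(0)

# Toy stationary process (our example): a Markov chain over {0, 1} with
# P(next = 1 | current = 0) = 0.5 and P(next = 1 | current = 1) = 0.8.
T = {0: 0.5, 1: 0.8}

n, x, series = 500_000, 0, []
for _ in range(n):
    x = 1 if random.random() < T[x] else 0
    series.append(x)

# Estimate P(next symbol | last three symbols) for every length-3 history.
counts = defaultdict(lambda: [0, 0])
for i in range(3, n):
    counts[tuple(series[i - 3:i])][series[i]] += 1

# Group histories whose estimated conditional distributions agree (to one
# decimal place).  The groups approximate the causal states: here, pasts
# ending in 0 versus pasts ending in 1.
states = defaultdict(list)
for past, (c0, c1) in counts.items():
    states[round(c1 / (c0 + c1), 1)].append(past)
for p1, pasts in sorted(states.items()):
    print(p1, sorted(pasts))

# Statistical complexity: the entropy of the induced state distribution.
total = sum(c0 + c1 for c0, c1 in counts.values())
p_state = [sum(sum(counts[p]) for p in pasts) / total
           for pasts in states.values()]
C = -sum(p * log2(p) for p in p_state)
print(round(C, 2))   # roughly 0.86 bits for these transition probabilities
```

Longer histories would land in the same two groups: for a Markov chain, only the last symbol makes a difference.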
\subsection{Proving that Causal States Are the Funk}
I'm now going to try to convince you that causal states are the neatest thing
since sliced bread, by proving three optimality results about them and some
related lemmas. The first two results show that they satisfy the constraints
we got from Occam, of accurate prediction and minimal complexity; the third
shows that there's a sense in which they're maximally deterministic. All of
these results involve comparing causal states, generated by $\epsilon,$ with
other sets of states, generated by some other function $\eta,$ and showing that
none of these other sets of states --- none of the other patterns --- can
out-perform the causal states.
{\it More notation fixing}: ${\cal S}$ is the random variable for the current
causal state, $S_1$ is the next ``observable'' we get from the original
stochastic process, ${\cal S}^{\prime}$ is the next causal state, ${\cal R}$ is
the current state according to $\eta,$ ${\cal R}^{\prime}$ the next
$\eta$-state. Since \LaTeX\ doesn't have lower-case calligraphic characters,
$s$ will stand for a particular value of ${\cal S},$ $r$ for a particular value
of ${\cal R}.$ When I need to quantify over alternatives to the causal states,
I will quantify over ${\cal R}.$
\begin{theorem}[Causal States Are Maximally Prescient]
$\forall {\cal R},$ $H\left[\FutureBlock|{\cal R}\right] \geq H\left[\FutureBlock|{\cal
S}\right].$
\label{optimal-prediction-theorem}
\end{theorem}
{\it Proof.} We've already seen that $H\left[\FutureBlock|{\cal R}\right] \geq
H\left[\FutureBlock|\Past\right].$ But by construction
(Eq. \ref{def-of-causal-states}), $P(\Future = \future|\Past = \past) =
P(\Future = \future | {\cal S} = \epsilon(\past)).$ Since entropies depend
only on the probability distribution, $H\left[\FutureBlock|{\cal S}\right] =
H\left[\FutureBlock|\Past\right].$ Thus $H\left[\FutureBlock|{\cal R}\right] \geq
H\left[\FutureBlock|{\cal S}\right].$
That is to say, causal states are as good at predicting the future --- are as
{\it prescient} --- as complete histories; they satisfy the first requirement
we took from Occam.
All our subsequent results will concern rivals which are as prescient as the
causal states.
\begin{figure}
\epsfxsize=2.7in
\begin{center}
\leavevmode
\epsffile{EpsilonAndBadEtaPartition.eps}
\end{center}
\caption{An alternative set $\{ {\cal R}_i \}$ of states that partition
$\AllPasts$ overlaid on the causal states. The collection of all such
alternative partitions forms Occam's ``pool''. Note again that the ${\cal R}_i$
need not be compact.}
\label{epsilon-and-bad-eta-partition}
\end{figure}
\begin{lemma}[Rivals of Causal States Are Refinements]
$\forall {\cal R},$ if $H\left[\FutureBlock|{\cal R}\right] = H\left[\FutureBlock|{\cal
S}\right],$ then $\forall r \exists s$ such that $r \subseteq s.$
\label{refinement-lemma}
\end{lemma}
{\it Proof.} We invoke a trivial extension of theorem 2.7.3 of
\cite{cover-and-thomas}: If $X_1,$ $X_2,$ \ldots $X_n$ are random variables
over the same set ${\cal A}$ with distinct probability distributions, $\Theta$
is a random variable over the integers from 1 to $n$ such that $P(\Theta = i)$
= $\lambda_i,$ and $Z$ is a random variable over ${\cal A}$ such that $Z =
{X}_{\Theta},$ then $H\left[Z\right] = H\left[\sum_{i}{\lambda_i
{X}_{i}}\right] \geq \sum_{i}{\lambda_i H\left[{X}_{i}\right]}.$ (The first
sum is purely symbolic; the second is for real.) This becomes an
equality iff all the $\lambda_i$ are either 0 or 1, since $H$ is strictly
concave, because $x \log{x}$ is strictly convex for $x \geq 0.$
Define ${\phi}_{sr} \equiv P({\cal S} = s | {\cal R} = r).$
Then
\begin{eqnarray}
H\left[\FutureBlock|{\cal R} = r\right] & = &
H\left[\sum_{s}{{\phi}_{sr}P(\FutureBlock|{\cal S} = s)}\right] \\
& \geq & \sum_{s}{{\phi}_{sr}H\left[\FutureBlock|{\cal S} = s\right]} \\
H\left[\FutureBlock|{\cal R}\right] & = & \sum_{r}{P({\cal R} = r)
H\left[\FutureBlock|{\cal R} = r\right]} \\
& \geq & \sum_{r}{P({\cal R} = r) \sum_{s}{{\phi}_{sr} H\left[\FutureBlock|{\cal S}
= s\right]}} \\
& = & \sum_{sr}{P({\cal R} = r) {\phi}_{sr} H\left[\FutureBlock|{\cal S} = s\right]}
\\
& = & \sum_{sr}{P({\cal S} = s \cdot {\cal R} = r) H\left[\FutureBlock|{\cal S} =
s\right]} \\
& = & \sum_{s}{P({\cal S} = s) H\left[\FutureBlock|{\cal S} = s\right]} \\
& = & H\left[\FutureBlock|{\cal S}\right]
\end{eqnarray}
That is to say,
\begin{eqnarray}
H\left[\FutureBlock| {\cal R}\right] & \geq & H\left[\FutureBlock | {\cal S}\right]
\end{eqnarray}
with equality if and only if every ${\phi}_{sr}$ is either 0 or 1. Thus, if
$H\left[\FutureBlock|{\cal R}\right] = H\left[\FutureBlock|{\cal S}\right],$ every $r$ is
entirely contained within some $s$ (except for possible sub-sets of measure 0).
(We cannot work the proof the other way around, and show that the causal states
have to be a refinement of the equally-prescient $\eta$-states, because
applying the theorem we stole from Cover and Thomas hinges on being able to
reduce uncertainty by specifying {\it which} distribution we're picking from.
Since the causal states are constructed so that the distribution of futures is
the same for all their sub-sets, this isn't so; Eq. \ref{def-of-causal-states}
and Theorem \ref{optimal-prediction-theorem} together protect us.)
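The borrowed theorem is easy to sanity-check numerically; the two distributions and the mixing weights below are made up for the purpose:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

X1, X2 = [0.9, 0.1], [0.2, 0.8]   # two distinct distributions over {0, 1}

for lam in (0.0, 0.3, 1.0):
    mix = [lam * a + (1 - lam) * b for a, b in zip(X1, X2)]
    lhs = H(mix)                           # H[sum_i lambda_i X_i]
    rhs = lam * H(X1) + (1 - lam) * H(X2)  # sum_i lambda_i H[X_i]
    # lhs >= rhs always; equality only at the degenerate weights 0 and 1.
    print(lam, lhs >= rhs, abs(lhs - rhs) < 1e-12)
```

With `lam = 0.3` the inequality is strict, which is what drives the refinement lemma: a state that genuinely mixes two causal states pays for it in predictive uncertainty.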
\begin{theorem}[Causal States Have Minimal Complexity]
$\forall {\cal R},$ if $H\left[\FutureBlock|{\cal R}\right] = H\left[\FutureBlock|{\cal
S}\right],$ then $H\left[{\cal R}\right] \geq H\left[{\cal S}\right].$
\label{minimality-theorem}
\end{theorem}
{\it Proof.} By Lemma \ref{refinement-lemma}, every $r \subseteq$ some $s.$
This means that there is a many-to-one relation from the $\eta$-states to the
causal states, i.e., that the causal state is a function of the $\eta$-state,
${\cal S} = g({\cal R}).$ But $H\left[f(X)\right] \leq H\left[X\right]$
(Ineq. \ref{function-is-more-certain}) so $H\left[{\cal S}\right] =
H\left[g({\cal R})\right] \leq H\left[{\cal R}\right].$
Recall that earlier we defined the entropy in a set of states used for
prediction, for a pattern, to be the statistical complexity of that pattern.
We have just established that no rival pattern, which is as good at predicting
as the causal states, is any simpler than the causal states. Occam therefore
tells us that we should use the causal states.
\begin{figure}
\epsfxsize=2.7in
\begin{center}
\leavevmode
\epsffile{RefinedPartition.eps}
\end{center}
\caption{Any alternative partition $\{ {\cal R}_i \}$ that is as prescient as the
causal states must be a refinement of the causal-state partition. That is,
each ${\cal R}_i$ must be a (possibly improper) subset of some ${\cal S}_j.$
Otherwise, at least one ${\cal R}_i$ would have to contain parts of at least
two causal states. And so using this ${\cal R}_i$ to predict the future
observables leads to more uncertainty about $\Future$ than using the causal
states.}
\label{refined-partition}
\end{figure}
The entropy of the causal states is also conventionally written ${C}_{\mu},$
with the $\mu$ to remind us that it's a metric (i.e., measure-dependent)
property, depending on the distribution over states.
It is here that it becomes important that we're trying to predict $\Future$ and
not just some $\FutureBlock.$ Suppose two histories $\past$ and
${\past}^{\prime}$ have the same conditional distribution for $\FutureBlock,$
but differ at some point after the next $L$ steps into the future. They would
then belong to different causal states. An $\eta$-state which merged those two
causal states, however, would have just as much ability to predict
$\FutureBlock$ as the causal states would, but would be simpler (the
uncertainty in the current state would be lower). Causal states are optimal
--- but for the hardest job.
\begin{lemma}[Causal States Are Deterministic Automata]
There exists a function $N$ such that $s^{\prime} = N (s,
{s}_{1}).$
\label{automatic-determinism-lemma}
\end{lemma}
In automata theory, a set of states is said to be {\it deterministic} if the
current state and the next input (here, the next result from the original
stochastic process) together fix the next state.
{\it Proof.} The lemma is equivalent to asserting that
$\forall {s}_{1}, \past, {\past}^{\prime},$ if $\past \sim {\past}^{\prime},$
then $\past{s}_{1} \sim {\past}^{\prime}{s}_{1},$ where $\past{s}_{1}$ is to be
understood as the semi-infinite sequence which we get from tacking ${s}_{1}$ to
the end of $\past.$ (This is just another history, and belongs to some
causal state or other.) Suppose this were not true. Then there would have to
exist at least one future $\future$ such that
\begin{eqnarray}
P(\Future = \future | \Past = \past{s}_{1}) & \neq & P(\Future = \future |
\Past = {\past}^{\prime}{s}_{1})
\end{eqnarray}
But this would imply that
\begin{eqnarray}
P(\Future = {s}_{1}\future | \Past = \past) & \neq & P(\Future = {s}_{1}\future
| \Past = {\past}^{\prime})
\end{eqnarray}
where we read ${s}_{1}\future$ as the semi-infinite string which begins
${s}_{1}$ and continues as $\future.$ (Remember, we assumed that the point at
which we break the stochastic process into a past and a future is arbitrary.)
But this is to say that there's a future which has different probabilities
depending on whether we conditioned on $\past$ or on ${\past}^{\prime},$ which
is contrary to our assumption that the two histories belong to the same causal
state. Therefore, there is no such future $\future,$ and the alternative
statement of the lemma is true, so the lemma is true.
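The lemma is worth exhibiting on a process whose causal state is not simply the last symbol. A standard example (our choice here) is the even process, in which 1s occur only in blocks of even length; its two causal states track the parity of the current run of 1s, and a hand-coded transition function $N$ matches the state map:

```python
# The "even process" (our illustrative example): 1s occur only in blocks
# of even length, so the causal state is the parity of the current run of
# 1s -- not a function of any fixed number of recent symbols.
def eps(past):
    """Map a history to its causal state: A = even number of trailing 1s,
    B = odd number (mid-block, so a 1 must follow)."""
    trailing = 0
    for symbol in reversed(past):
        if symbol != 1:
            break
        trailing += 1
    return "A" if trailing % 2 == 0 else "B"

# The deterministic transition function N(s, s1) asserted by the lemma.
# (B, 0) is absent: state B cannot emit a 0.
N = {("A", 0): "A", ("A", 1): "B", ("B", 1): "A"}

# Check determinism on some allowed histories: the causal state of the
# extended history is fixed by the current state and the new symbol alone.
for past in [(0, 1, 1, 0), (0, 0, 1), (1, 1, 0, 1), (0,)]:
    for s1 in (0, 1):
        s = eps(past)
        if (s, s1) in N:
            assert eps(past + (s1,)) == N[s, s1]
print("next causal state = N(current state, next symbol)")
```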
\begin{theorem}[Causal States Are Maximally Deterministic]
$\forall {\cal R},$ if $H\left[\FutureBlock|{\cal R}\right] = H\left[\FutureBlock|{\cal
S}\right],$ then $H\left[{\cal R}^{\prime}|{\cal R}\right] \geq H\left[{\cal
S}^{\prime}|{\cal S}\right].$
\label{determinism-theorem}
\end{theorem}
{\it Proof.} From lemma \ref{automatic-determinism-lemma}, ${\cal S}^{\prime} = N({\cal S},
{S}_{1}),$ therefore $H\left[{\cal S}^{\prime}|{\cal S}, {S}_{1}\right] = 0$
(by Ineq. \ref{functions-are-certain}). Therefore, from the chain rule for
entropies, Eq. \ref{conditional-chain-rule},
\begin{eqnarray}
H\left[{S}_{1}|{\cal S}\right] & = & H\left[{\cal S}^{\prime}, {S}_{1}|{\cal
S}\right]
\end{eqnarray}
We have no result like the lemma for the rival states ${\cal R},$ but entropies
are always non-negative, $H\left[{\cal R}^{\prime}|{\cal R}, {S}_{1}\right]
\geq 0.$ Since $H\left[\FutureBlock|{\cal R}\right] = H\left[\FutureBlock|{\cal
S}\right]$ (by hypothesis), $H\left[{S}_{1}|{\cal R}\right] =
H\left[{S}_{1}|{\cal S}\right]$ (by forming the appropriate marginal
distributions and then taking their entropies). Now we apply the chain rule
again,
\begin{eqnarray}
H\left[{\cal R}^{\prime},{S}_{1}|{\cal R}\right] & = & H\left[{S}_{1}|{\cal
R}\right] + H\left[{\cal R}^{\prime}|{S}_{1}, {\cal R}\right] \\
& \geq & H\left[{S}_{1}|{\cal R}\right] \\
& = & H\left[{S}_{1}|{\cal S}\right] \\
& = & H\left[{\cal S}^{\prime}, {S}_{1}|{\cal S}\right] \\
& = & H\left[{\cal S}^{\prime}|{\cal S}\right] + H\left[{S}_{1}|{\cal
S}^{\prime}, {\cal S}\right]
\end{eqnarray}
where in the last step we have used the chain rule once more.
Using the chain rule one last time (only with feeling),
\begin{eqnarray}
H\left[{\cal R}^{\prime},{S}_{1}|{\cal R}\right] & = & H\left[{\cal
R}^{\prime}|{\cal R}\right] + H\left[{S}_{1}|{\cal R}^{\prime}, {\cal
R}\right].
\end{eqnarray}
Putting these two expansions together, we get
\begin{eqnarray}
H\left[{\cal R}^{\prime}|{\cal R}\right] + H\left[{S}_{1}|{\cal R}^{\prime},
{\cal R}\right] & \geq & H\left[{\cal S}^{\prime}|{\cal S}\right] +
H\left[{S}_{1}|{\cal S}^{\prime}, {\cal S}\right] \\
H\left[{\cal R}^{\prime}|{\cal R}\right] - H\left[{\cal S}^{\prime}|{\cal
S}\right] & \geq & H\left[{S}_{1}|{\cal S}^{\prime}, {\cal S}\right] -
H\left[{S}_{1}|{\cal R}^{\prime}, {\cal R}\right]
\end{eqnarray}
From lemma \ref{refinement-lemma}, we know that ${\cal S} = g({\cal R}),$ which
means there's another function ${g}^{\prime}$ from ordered pairs of
$\eta$-states to ordered pairs of causal states, $({\cal S}^{\prime}, {\cal S})
= {g}^{\prime}({\cal R}^{\prime}, {\cal R}).$ Therefore, inequality
\ref{conditioning-on-function} implies $H\left[{S}_{1}|{\cal S}^{\prime}, {\cal
S}\right] \geq H\left[{S}_{1}|{\cal R}^{\prime}, {\cal R}\right]$ and so
\begin{eqnarray}
H\left[{S}_{1}|{\cal S}^{\prime}, {\cal S}\right] - H\left[{S}_{1}|{\cal
R}^{\prime}, {\cal R}\right] & \geq & 0 \\
H\left[{\cal R}^{\prime}|{\cal R}\right] - H\left[{\cal S}^{\prime}|{\cal
S}\right] & \geq & 0 \\
H\left[{\cal R}^{\prime}|{\cal R}\right] & \geq & H\left[{\cal
S}^{\prime}|{\cal S}\right]
\end{eqnarray}
Q.E.D.
What the theorem says is that there is no more uncertainty about the next
causal state, given the current causal state, than there is about the next
state given the current state for any other set of states which are as
prescient. Or, perhaps slightly less of a mouthful: the causal states approach
as closely to perfect determinism (in the usual sense!) as any rival which is
as good at predicting the future.
\subsection{Some Restrictions May Apply\ldots}
Let's catalogue all the restrictive assumptions we've made so far.
\begin{enumerate}
\item The observed process takes on discrete values.
\item The process is discrete in time.
\item The process is a pure time-series, without spatial extension.
\item The observed process is stationary.
\item Prediction can only be based on the past of the process, not on any
outside source of information.
\end{enumerate}
Can any of these be relaxed without much trouble?
The first probably can; the information-theoretic quantities we've been
using are defined for continuous random variables; somebody (probably me\ldots)
will just have to grind through the math and make sure everything checks out.
The second also looks solvable, since there's a lot of math on continuous
stochastic processes, but it may involve some funky probability theory or
functional analysis. There are already tricks to make spatially extended
systems look like time-series (essentially, one looks at all the paths
through space-time, treating each one like a time-series), which the lecture
on CAs (if it happens) will cover.
We don't know how to relax the assumption of stationarity.
Finally, I'd say that the last restriction is a {\it feature} when it
comes to thinking about patterns and the intrinsic structure of a process.
``Pattern'' is a vague word of course, but I {\it think} it's only supposed
to involve things {\it inside} the process, not the rest of the universe.
In any case, if we relax that assumption, lots of troubling sources of
information present themselves. Imagine that one Sunday afternoon you wander
over to your friend's house and find him watching a ball game on TV. He offers
you a bet on the next play, which you take and lose; and he keeps on offering
you bets all through the game, which you keep on losing. Is he unnaturally
lucky or skilled? No; the game was really broadcast two hours earlier and
you are watching a videotape. Your friend can obtain quite remarkable
accuracy by using this source of information outside the current process;
but that's cheating, and doesn't tell us very much about the pattern of the
game. But this scruple is only appropriate if what you care about is the
pattern; if you just want the best prediction you can possibly get, watch the
videotape!
\begin{thebibliography}{99}
\bibitem{cover-and-thomas} Thomas M. Cover and Joy A. Thomas (1991), {\it
Elements of Information Theory.} New York: Wiley.
\bibitem{dennett} Daniel C. Dennett (1991), ``Real Patterns,'' {\it Journal of
Philosophy} {\bf 88 (1)}, 27--51. Reprinted in Daniel Dennett, {\it
Brainchildren: Essays on Designing Minds}, Cambridge, Massachusetts: MIT Press,
1997.
\bibitem{salmon} Wesley C. Salmon (1984), {\it Scientific Explanation and the
Causal Structure of the World}. Princeton, New Jersey: Princeton University
Press.
\end{thebibliography}
\end{document}