Cosma's Notebooks (http://bactra.org/notebooks)

Prediction Processes; Markovian (and Conceivably Causal) Representations of Stochastic Processes
http://bactra.org/notebooks/2022/07/18#prediction-process
\[
\newcommand{\indep}{\mathrel{\perp\llap{\perp}}}
\newcommand{\Prob}[1]{\mathrm{Pr}\left( #1 \right)}
\]
<P>This descends from the slides for a talk I've been giving, in some form,
since working on the papers that became my dissertation in the late 1990s.
<h2>0. "State"</h2>
In classical physics or dynamics, the <strong>state</strong> of a system is the
present variable which fixes all future observables. In quantum mechanics, we
back away from determinism: the state is the present variable which determines
the <em>distribution</em> of observables. We want to find states, in this
sense, for classical stochastic processes. So long as we're asking for things,
it would be nice if the state space was in some sense as small as possible, and
it would be further nice if the states themselves were well-behaved --- say,
homogeneous <a href="markov.html">Markov</a> processes, so we don't have to
know too much about probability theory.
<P>We'll try to construct such states by constructing predictions.
<h2>1. Notation</h2>
Upper-case letters are random variables, lower-case their realizations. We'll
deal with a stochastic process, \( \ldots, X_{-1}, X_0, X_1, X_2, \ldots \).
I'll abbreviate blocks with the notation \( X_{s}^{t} = (X_s, X_{s+1}, \ldots
X_{t-1}, X_t) \). The past up to and including \( t \) is \( X^t_{-\infty} \),
future is \( X_{t+1}^{\infty} \). For the moment, no assumption of
stationarity is needed or desired.
<P>(Discrete time is not strictly necessary, but cuts down on measure theory.)
<h2>2. Making a Prediction</h2>
We look at \( X^t_{-\infty} \) , and then make a guess about \(
X_{t+1}^{\infty} \). The most general guess we can make is a probability
distribution over future events, so let's try to do that unless we find our way
blocked. (We will find our way open.)
<P>When we make such a guess, we are going to attend to selected aspects of \(
X^t_{-\infty} \) (mean, variance, phases of the first three Fourier modes, ...). This
means that our guess is a function, or formally a <strong>statistic</strong>,
of \( X^t_{-\infty} \).
<P>What's a good statistic to use?
<h2>3. Predictive Sufficiency</h2>
We appeal to <a href="information-theory.html">information theory</a>, especially
the notion of mutual information.
For any statistic \( \sigma \),
\[
I[X^{\infty}_{t+1};X_{-\infty}^t] \geq I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)]
\]
(This is just the "data-processing inequality".)
\( \sigma \) is <strong>predictively sufficient</strong> iff
\[
I[X^{\infty}_{t+1};X_{-\infty}^t] = I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)]
\]
Sufficient statistics, then, retain all predictive information in the data.
<P>At this point, the only sane response is "so what?", perhaps followed by
making up bizarre-looking functionals and defining statistics to
be <strong>shmufficient</strong> if they maximize those functionals. There is
however a good reason to care about sufficiency, embodied in a theorem of
Blackwell and Girshick: under any loss function, the optimal strategy can
be implemented using <em>only</em> knowledge of a sufficient statistic --- the
full data are not needed. For reasonable loss functions, the better-known
<a href="https://en.wikipedia.org/wiki/Rao%E2%80%93Blackwell_theorem">Rao-Blackwell
theorem</a> says that strategies which use insufficient statistics can be
improved on by ones which use sufficient statistics.
<P>Switching our focus from optimizing prediction to attaining sufficiency
means we don't have to worry about particular loss functions.
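<P>To make sufficiency concrete, here is a minimal numerical sketch (a made-up two-state Markov chain; all the variable names are mine). For a Markov chain, the last observed symbol is predictively sufficient, so it retains all the mutual information a longer past has with the future, while a constant statistic retains none, and the data-processing inequality holds throughout:

```python
from itertools import product
from math import log2

# Toy Markov chain on {0, 1} with its stationary distribution.
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
pi = {0: 0.8, 1: 0.2}   # solves pi = pi P

# Joint law of the length-2 "past" (x1, x2) and the "future" x3.
joint = {(x1, x2, x3): pi[x1] * P[x1][x2] * P[x2][x3]
         for x1, x2, x3 in product([0, 1], repeat=3)}

def mutual_info(pairs):
    """I[A;B] in bits, from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in pairs.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in pairs.items() if p > 0)

def project(stat):
    """Joint distribution of (stat(past), future)."""
    out = {}
    for (x1, x2, x3), p in joint.items():
        key = (stat(x1, x2), x3)
        out[key] = out.get(key, 0) + p
    return out

i_full = mutual_info(project(lambda x1, x2: (x1, x2)))  # the whole past
i_last = mutual_info(project(lambda x1, x2: x2))        # last symbol: sufficient
i_none = mutual_info(project(lambda x1, x2: 0))         # constant: useless
```

Here `i_last` equals `i_full` (up to floating point), exhibiting sufficiency, and `i_none` is zero, with the data-processing inequality sandwiching everything in between.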
<h2>4. Predictive States</h2>
<P>Here's how we can construct such sufficient statistics. (This particular
way of doing it follows Crutchfield and Young [1989], with minor modifications.)
<P>Say that two histories \( u \) and \( v \) are predictively equivalent iff
\[
\Prob{X_{t+1}^{\infty}|X_{-\infty}^{t} = u} = \Prob{X_{t+1}^{\infty}|X_{-\infty}^{t} = v}
\]
That is, two histories are predictively equivalent when they lead to the
same distribution over future events.
<P>This is clearly an equivalence relation (it's reflexive, symmetric and
transitive), so it divides the space of all histories up into equivalence
classes. In the usual symbols, \( [u] \) is the equivalence class containing
the particular history \( u \).
The statistic of interest, the <strong>predictive state</strong>, is
\[
\epsilon(x^t_{-\infty}) = [x^t_{-\infty}]
\]
That is, we just map histories to their equivalence classes. Each point in the
range of \( \epsilon \) is a state. A state is an equivalence class of
histories, and, interchangeably, a distribution over future events (of the \( X
\) process).
<P>In an IID process, conditioning on the past makes no difference to the
future, so every history is equivalent to every other history, and there is
only one predictive state. In a periodic process with (minimal) period \( p
\), there are \( p \) states, one for each phase. In a Markov process, there
is <em>usually</em> a 1-1 correspondence between the states of the chain and
the predictive states. (When are they not in correspondence?)
<P>The \( \epsilon \) function induces a new stochastic process, taking values
in the predictive-state space, where
\[
S_t = \epsilon(X^t_{-\infty})
\]
<center>
<img src="prediction-process-history-space.png">
<br>Set of histories, color-coded by conditional distribution of futures
</center>
<center>
<img src="prediction-process-histories-partitioned-into-causal-states.png">
<br>Partitioning histories into predictive states
</center>
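<P>The construction can be carried out exactly for small examples. Here is a sketch (in Python, names mine) using the "even process", a standard non-Markovian toy example in this literature: blocks of 1s between 0s always have even length. We compute, for every possible finite history, the conditional distribution over the next symbol it induces, and equivalence-class histories by that distribution; many histories collapse into just two predictive states:

```python
from itertools import product
from fractions import Fraction

# Hidden presentation of the even process: from state A, emit 0 and stay
# in A (prob 1/2) or emit 1 and move to B (prob 1/2); from B, emit 1 and
# return to A (prob 1).  There is no way to emit 0 from B.
HALF = Fraction(1, 2)
STEP = {('A', 0): ('A', HALF), ('A', 1): ('B', HALF),
        ('B', 1): ('A', Fraction(1))}
START = {'A': Fraction(1)}

def predictive_dist(history):
    """Pr(next symbol = 1 | history), or None if the history is impossible."""
    post = dict(START)               # unnormalized posterior over hidden states
    for x in history:
        new = {}
        for s, p in post.items():
            if (s, x) in STEP:
                s2, q = STEP[(s, x)]
                new[s2] = new.get(s2, Fraction(0)) + p * q
        post = new
    z = sum(post.values())
    if z == 0:
        return None                  # this history cannot occur
    one = sum(p * STEP[(s, 1)][1] for s, p in post.items() if (s, 1) in STEP)
    return one / z

# Equivalence-class all possible length-4 histories by the distribution
# over futures they induce: this is exactly the map epsilon.
classes = {}
for h in product([0, 1], repeat=4):
    d = predictive_dist(h)
    if d is not None:
        classes.setdefault(d, []).append(h)
```

Despite there being up to sixteen length-4 histories, `classes` has exactly two keys, \( \Prob{\mathrm{next}=1} = 1/2 \) (state A) and \( \Prob{\mathrm{next}=1} = 1 \) (state B).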
<h2>5. Optimality Properties</h2>
<h4>A. Sufficiency</h4>
<P>I promised that these would be sufficient statistics for predicting the
future from the past. That is, I asserted that
\[
I[X^{\infty}_{t+1};X^t_{-\infty}] = I[X^{\infty}_{t+1};\epsilon(X^t_{-\infty})]
\]
This is true, because
\[
\begin{eqnarray*}
\Prob{X^{\infty}_{t+1}|S_t = \epsilon(x^t_{-\infty})} & = & \int_{y \in [x^t_{-\infty}]}{\Prob{X^{\infty}_{t+1}|X^t_{-\infty}=y} \Prob{X^t_{-\infty}=y|S_t = \epsilon(x^t_{-\infty})} dy}\\
& = & \Prob{X^{\infty}_{t+1}|X^t_{-\infty}=x^t_{-\infty}}
\end{eqnarray*}
\]
<h4>B. Markov Properties I: Screening-Off</h4>
<P>Future observations are independent of the past given the predictive state,
\[
X^{\infty}_{t+1} \indep X^{t}_{-\infty} \mid S_{t}
\]
even if the process is not Markov,
\[
\newcommand{\notindep}{\mathrel{\rlap{\ \not}\indep}}
X_{t+1} \notindep X^t_{-\infty} \mid X_t ~.
\]
This is because of sufficiency:
\[
\begin{eqnarray*}
\Prob{X^{\infty}_{t+1}|X^{t}_{-\infty}=x^t_{-\infty}, S_{t} = \epsilon(x^t_{-\infty})} & = & \Prob{X^{\infty}_{t+1}|X^{t}_{-\infty}=x^t_{-\infty}}\\
& = & \Prob{X^{\infty}_{t+1}|S_{t} = \epsilon(x^t_{-\infty})}
\end{eqnarray*}
\]
<h4>C. Recursive Updating/Deterministic Transitions</h4>
The predictive states themselves have recursive transitions:
\[
\epsilon(x^{t+1}_{-\infty}) = T(\epsilon(x^t_{-\infty}), x_{t+1})
\]
If all we remember of the history to time \( t \) is the predictive state, we
can make optimal predictions for the future from \( t \). We might worry,
however, that something which happens at \( t+1 \) might make some
previously-irrelevant piece of the past, which we'd forgotten, relevant again.
This recursive updating property says we needn't be concerned --- we can always
figure out the new predictive state from just the old predictive state and the
new observation.
<P>In <a href="computation.html">automata theory</a>, we'd say that the states
have "deterministic transitions" because of this property (even though there
are probabilities).
<P>(I know I said I'd skip continuous-time complications, but I feel compelled
to mention that the analogous property is that, for any \( h > 0 \),
\[
\epsilon(x^{t+h}_{-\infty}) = T(\epsilon(x^t_{-\infty}),x^{t+h}_{t})
\]
since there is no "next observation".)
<P>To see where the recursive transitions come from, pick any two equivalent
histories \( u \sim v \), any future event \( F \), and any single observation
\( a \). Write \( aF \) for the compound event where \( X_{t+1} = a \) and
then \( F \) happens starting at time \( t+2 \). Because \( aF \) is a perfectly legitimate future event and \( u \sim v \),
\[
\begin{eqnarray*}
\Prob{X^{\infty}_{t+1} \in aF|X^t_{-\infty} = u} & = & \Prob{X^{\infty}_{t+1} \in aF|X^t_{-\infty} = v}\\
\Prob{X_{t+1}= a, X^{\infty}_{t+2} \in F|X^t_{-\infty} = u} & = & \Prob{X_{t+1}= a, X^{\infty}_{t+2} \in F|X^t_{-\infty} = v}
\end{eqnarray*}
\]
By elementary conditional probability,
\[
\begin{eqnarray*}
\Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = ua}\Prob{X_{t+1}= a|X^t_{-\infty} = u}
& = & \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = va}\Prob{X_{t+1}= a|X^t_{-\infty} = v}
\end{eqnarray*}
\]
Canceling the second factors, which are equal because \( u \sim v \) (and assumed positive),
\[
\Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = ua} = \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = va}
\]
Since the future event \( F \) was arbitrary, we've shown \( ua \sim va \).
<P>(If you know enough to worry about how this might go wrong with infinite
sample spaces, continuous time, etc., then you know enough to see how to patch
up the proof.)
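<P>For the even process (the standard toy example where blocks of 1s between 0s have even length), the predictive state is just the parity of the trailing run of 1s, and the update rule \( T \) can be written down and checked exhaustively. A sketch, with my own function names:

```python
from itertools import product

def eps(history):
    """Predictive state of the even process: the parity of the trailing
    run of 1s (0 ~ 'anything can come next', 1 ~ 'next symbol must be 1')."""
    run = 0
    for x in reversed(history):
        if x != 1:
            break
        run += 1
    return run % 2

def T(state, a):
    """Recursive update: new predictive state from (old state, new symbol).
    Returns None for the impossible transition (a 0 ending an odd run)."""
    if a == 1:
        return 1 - state           # another 1 flips the parity
    return 0 if state == 0 else None

# Exhaustive check that eps(u + (a,)) == T(eps(u), a) on short histories:
for u in product([0, 1], repeat=6):
    for a in (0, 1):
        t = T(eps(u), a)
        if t is not None:
            assert eps(u + (a,)) == t
```

The loop is precisely the \( ua \sim va \) argument above, run by brute force: knowing only the old state and the new symbol suffices to know the new state.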
<h4>D. Markov Properties, II: The Predictive States</h4>
\[
S_{t+1}^{\infty} \indep S^{t-1}_{-\infty}|S_t
\]
because
\[
S_{t+1}^{\infty} = T(S_t,X_{t+1}^{\infty})
\]
and
\[
X_{t+1}^{\infty} \indep \left\{ X^{t-1}_{-\infty}, S^{t-1}_{-\infty}\right\} | S_t
\]
One can further show that the transitions are homogeneous, i.e., for any
set \( B \)
of predictive states,
\[
\Prob{S_{t+1} \in B|S_t = s} = \Prob{S_{2} \in B|S_1 = s}
\]
This is because (by the definition of the predictive states)
\[
\Prob{X_{t+1}|S_t=s} = \Prob{X_2|S_1=s}
\]
and \( S_{t+1} = T(S_t, X_{t+1}) \) (by the recursive-updating property).
<h4>E. Minimality</h4>
A statistic is <strong>necessary</strong> if it can be calculated from --- is a
function of --- any sufficient statistic. A statistic which is both necessary
and sufficient is <strong>minimal sufficient</strong>. We have seen that \( \epsilon \) is sufficient; it is also easy to see that it is necessary, which,
to be precise, means that, for any sufficient statistic \( \eta \), there
exists a function \( g \) such that
\[
\epsilon(X^{t}_{-\infty}) = g(\eta(X^t_{-\infty}))
\]
Basically, the reason is that any partition of histories which isn't a
refinement of the predictive-state partition has to lose some predictive power.
<center>
<img src="prediction-process-non-sufficienct-partition.png">
<br>A non-sufficient partition of histories
</center>
<center>
<img src="prediction-process-effect-of-insufficiency-on-prediction.png">
<br>Effect of insufficiency on predictive distributions
</center>
<P>So there certainly can be statistics which are sufficient and not the
predictive states, provided they contain some superfluous detail:
<center>
<img src="prediction-process-sufficient-but-not-minimal.png">
<br>Sufficient, but not minimal, partition of histories
</center>
<P>There can also be coarser partitions than those of the predictive states, but they cannot be sufficient.
<center>
<img src="prediction-process-coarser-but-not-sufficient.png">
<br>Coarser than the predictive states, but not sufficient
</center>
<P>If \( \eta \) is sufficient, by the data-processing inequality of information theory,
\[
I[\epsilon(X^{t}_{-\infty}); X^{t}_{-\infty}] \leq I[\eta(X^{t}_{-\infty}); X^{t}_{-\infty}]
\]
<h4>F. Uniqueness</h4>
There is really no other minimal sufficient statistic.
If \( \eta \) is minimal, there is an \( h \) such that
\[
\eta = h(\epsilon) ~\mathrm{a.s.}
\]
but \( \epsilon = g(\eta) \) (a.s.),
so
\[
\begin{eqnarray*}
g(h(\epsilon)) & = & \epsilon\\
h(g(\eta)) & = & \eta
\end{eqnarray*}
\]
Thus, \( \epsilon \) and \( \eta \) partition histories in the same way (a.s.)
<h4>G. Minimal Stochasticity</h4>
If we have another sufficient statistic, it induces its own stochastic
process of statistic-values, say
\[
R_t = \eta(X^{t}_{-\infty})
\]
Then
\[
H[R_{t+1}|R_t] \geq H[S_{t+1}|S_t]
\]
Which is to say, the predictive states are the closest we can get to a deterministic model, without losing predictive power.
<h4>H. Entropy Rate</h4>
Of course, the predictive states let us calculate the Shannon entropy rate,
\[
\begin{eqnarray*}
h_1 \equiv \lim_{n\rightarrow\infty}{H[X_n|X^{n-1}_1]} & = & \lim_{n\rightarrow\infty}{H[X_n|S_n]}\\
& = & H[X_1|S_1]
\end{eqnarray*}
\]
and so do source coding.
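<P>For the even process (the running toy example), this calculation takes three lines. State A emits 0 or 1 equiprobably, state B emits 1 surely, and the stationary distribution over states is \( (2/3, 1/3) \), giving an entropy rate of \( 2/3 \) of a bit per symbol:

```python
from math import log2

# Predictive-state chain of the even process: emission distributions per
# state, and the stationary distribution over states (taken as given here).
EMIT = {'A': {0: 0.5, 1: 0.5}, 'B': {1: 1.0}}
PI = {'A': 2 / 3, 'B': 1 / 3}

def entropy(dist):
    """Shannon entropy in bits of a {value: probability} dict."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# h_1 = H[X_1 | S_1]: emission entropy averaged over the stationary law.
h1 = sum(PI[s] * entropy(EMIT[s]) for s in PI)
```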
<h2>6. Minimal Markovian Representation</h2>
Let's back up and review. The observed process \( (X_t) \) may be
non-Markovian and ugly. But it is generated from a homogeneous Markov process
\( (S_t) \). After minimization, this representation is (essentially) unique.
There can exist smaller Markovian representations, but then prediction requires
tracking distributions over their states, and those distributions correspond to
the predictive states.
<h2>7. What Sort of Markov Model?</h2>
<P>The common, or garden, hidden Markov model has a strong independence
property:
\[
S_{t+1} \indep X_{t+1}|S_t
\]
But here
\[
S_{t+1} = T(S_t, X_{t+1})
\]
This is a <strong><a href="chains-with-complete-connections.html">chain with complete connections</a></strong>, rather than an ordinary HMM.
<h2>8. Inventions</h2>
Variations on this basic scheme have been re-invented at least six independent
times over the years, in literatures ranging from philosophy of science to
machine learning. (References below.) The oldest version I know of, going
back to at least 1970, is that of the philosopher Wesley Salmon, who called the
equivalence-class construction a "statistical relevance basis" for causal
explanations. (He didn't link it to dynamics or states, but to input-output
relationships.) The most mathematically complete version came in 1975, from
the probabilist Frank Knight, covering all the measure-theoretic intricacies I
have glossed over. Crutchfield and Young gave essentially the construction I
went over above in 1989. (They called the states "causal" rather than
"predictive", which I'll return to below.) This in turn is equivalent to the
"observable operator models" of Jaeger (where the emphasis is on the function I
called \( T \)), and the predictive state representations or PSRs of Littman,
Sutton and Singh. The most recent re-invention I know of is the "sufficient
posterior representation" of Langford, Salakhutdinov and Zhang. I do not claim
this list is exhaustive and I would be interested to hear of others.
<P>(One which might qualify is Furstenberg's work on nonlinear prediction; but
if I understand his book correctly, he begins by <em>postulating</em> a limited
set of predictive states with recursive updating, and <em>shows</em> that some
stochastic processes can be predicted this way, without a general
construction. There are also connections to Lauritzen's work on
"completely" sufficient statistics for stochastic processes.)
<h4>A. How Broad Are These Results?</h4>
Knight gave the most general form of these results which I know of. In his
formulation, the observable process \( X \) just needs to take values in
a <a href="https://en.wikipedia.org/wiki/Polish_space#Lusin_spaces">Lusin
space</a>, time can be continuous, and the process can exhibit arbitrary
non-stationarities. The prediction process \( S \) is nonetheless a
homogeneous strong Markov process with deterministic updating, and in fact it
even has cadlag sample paths (in appropriate topologies on infinite-dimensional
distributions).
<h4>B. A Cousin: The Information Bottleneck</h4>
Tishby, Pereira and Bialek introduced a closely-related idea they called
the <strong>information bottleneck</strong>. This needs an input variable
\( X \) and output variable \( Y \). You get to fix \( \beta > 0 \),
and then find \( \eta(X) \), the <strong>bottleneck variable</strong>, minimizing
\[
I[\eta(X);X] - \beta I[\eta(X);Y]
\]
In this optimization problem, you're willing to give up 1 bit of predictive
information in order to save \( \beta \) bits of memory about \( X \).
<P>Predictive sufficiency, as above, comes as \( \beta \rightarrow \infty \) ,
i.e., as you become unwilling to lose <em>any</em> predictive power.
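<P>For tiny discrete problems, the bottleneck optimization can be carried out by brute force over deterministic partitions of \( X \), which makes the role of \( \beta \) vivid. A sketch, using the minimization form of the Lagrangian, \( I[\eta(X);X] - \beta I[\eta(X);Y] \), on a made-up joint distribution where two values of \( X \) predict \( Y \) identically:

```python
from itertools import product
from math import log2

# Made-up joint distribution for illustration: X uniform on {0, 1, 2};
# x = 0 and x = 1 predict Y identically, x = 2 predicts it differently.
P_Y1 = {0: 0.1, 1: 0.1, 2: 0.9}                     # Pr(Y=1 | X=x)
JOINT = {(x, y): (1 / 3) * (P_Y1[x] if y == 1 else 1 - P_Y1[x])
         for x in range(3) for y in (0, 1)}

def mi(pairs):
    """Mutual information in bits from {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in pairs.items():
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in pairs.items() if p > 0)

def objective(labels, beta):
    """Bottleneck Lagrangian I[eta(X);X] - beta I[eta(X);Y] for the
    deterministic eta given by labels[x]; smaller is better."""
    ex, ey = {}, {}
    for (x, y), p in JOINT.items():
        ex[(labels[x], x)] = ex.get((labels[x], x), 0) + p
        ey[(labels[x], y)] = ey.get((labels[x], y), 0) + p
    return mi(ex) - beta * mi(ey)

def best_partition(beta):
    """Exhaustive search over deterministic maps eta: {0,1,2} -> {0,1,2}."""
    return min(product(range(3), repeat=3),
               key=lambda lab: objective(lab, beta))
```

With large \( \beta \), predictive information is precious, so the optimum keeps exactly the predictive distinction (grouping \( \{0,1\} \) against \( \{2\} \)); with small \( \beta \), memory is precious, and everything is lumped together.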
<h2>9. Extensions</h2>
<h4>A. Input-Output</h4>
<P>Above, I walked through prediction processes for a single (possibly
multivariate) stochastic process, but of course the ideas can be
adapted to handle systems with inputs and outputs.
<P>Take a system which outputs the process \( X \), and is subjected to
inputs \( Y \). Its joint input-and-output history \( x^t_{-\infty},
y^t_{-\infty} \) induces a conditional distribution over the next output \(
x_{t+1} \) for each further input \( y_{t+1} \). We define equivalence classes
over these joint histories, and can then enforce recursive updating and
minimize. The results are <em>internal</em> states of the system --- they don't
try to predict future inputs, they just predict how the system will respond to
those inputs.
<P>(I realize this is sketchy; see the Littman et al. paper on PSRs, or,
immodestly, chapter 7 of my dissertation.)
<h4>B. Space and Time</h4>
We can also extend this to processes which are extended in both space and time.
(I've not seen a really satisfying version for purely spatial processes.) That
is, we have a dynamic random field \( X(\vec{r},t) \). Call the <strong>past
cone</strong> of \( \vec{r}, t \) the set of all earlier points in space-time
which could affect \( X(\vec{r},t) \), and likewise call its <strong>future
cone</strong> the region where \( X(\vec{r},t) \) is relevant to what happens
later. We can now repeat almost all of the arguments above,
equivalence-classing past-cone configurations if they induce the same
distribution over future-cone configurations. This leads to a field \(
S(\vec{r}, t) \), which is a Markov random field enjoying minimal sufficiency,
recursive updating, etc., etc.
<center><img src="prediction-process-light-cones.png"></center>
<h2>10. Statistical Complexity</h2>
Following a hint in Grassberger, and an explicit program in Crutchfield and
Young, we can define the <strong>statistical complexity</strong> or
<strong>statistical forecasting complexity</strong> of the \( X \) process,
as
\[
C_t \equiv I[\epsilon(X^t_{-\infty});X^t_{-\infty}]
\]
(Clearly, this will be constant, \( C \), for stationary processes.)
<P>This is the amount of information about the past of the process needed for
optimal prediction of its future. When the set of predictive states is
discrete, it equals \( H[\epsilon(X^t_{-\infty})] \), the entropy of the state.
It's equal to the expected value of the "algorithmic sophistication" (in the
sense of Gacs et al.); to the log of the period for periodic processes; and to
the log of the geometric mean recurrence time of the states, for stationary
processes. Note that all of these characterizations are properties of the \( X
\) process itself, not of any method we might be employing to calculate
predictions or to identify the process from data; they refer to the objective
dynamics, not a learning problem. An interesting (and so far as I know
unresolved) question is whether high-complexity processes
are <em>necessarily</em> harder to learn than simpler ones.
<P>(My <em>suspicion</em> is that the answer to the last question is "no", in
general, because I can't see any obstacle to creating two distinct
high-complexity processes where one of them gives probability 1 to an event to
which the other gives probability 0, and vice versa. [That is, they are
"mutually singular" as stochastic processes.] If we <em>a priori</em> know
that one of those two must be right, learning is simply a matter of waiting and
seeing. Now there <em>might</em> be issues about how long we have to wait
here, and/or other subtleties I'm missing --- that's why this is a suspicion
and not a proof. Even if that's right, if we don't start out restricting
ourselves to two models, there might be a connection between process complexity
and learning complexity again. In particular,
if <a href="occams-razor.html">Kevin Kelly is right about how to interpret
Occam's Razor</a>, then it might well be that any generally-consistent,
non-crazy method for learning arbitrary prediction processes will take longer
to converge when the truth has a higher process complexity, just because
it will need to go through and discard [<em>aufheben</em>?] simpler,
insufficient prediction processes first. Someone should study this.)
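<P>The discrete characterizations of \( C \) above are easy to check numerically; a sketch (toy examples, with the even process's stationary state distribution \( (2/3, 1/3) \) taken as given from the earlier discussion):

```python
from math import log2

def entropy(ps):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * log2(p) for p in ps if p > 0)

# For a discrete, stationary process, C = H[epsilon(X^t)] = H[S].
C_iid = entropy([1.0])             # IID: one predictive state, C = 0
C_period5 = entropy([0.2] * 5)     # period 5: one state per phase, C = log2(5)
C_even = entropy([2 / 3, 1 / 3])   # even process: states A, B

# For the period-5 process every state recurs in exactly 5 steps, so the
# log geometric-mean recurrence time is log2(5), agreeing with C_period5.
```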
<h2>11. Reconstruction</h2>
<P>Everything so far has been mere math, or mere probability. We have been
pretending that we live in a mythological realm where the Oracle
just <em>tells</em> us the infinite-dimensional distribution of \( X \). Can
we instead do some statistics and find the states?
<P>There are two relevant senses of "find": we could try to learn within a
fixed model, or we could try to discover the right model.
<h4>A. Learning</h4>
Here the problem is: <em>Given</em> a fixed set of states and transitions
between them ( \( \epsilon, T \) ), and a realization \( x_1^n \) of the
process,
<em>estimate</em> \( \Prob{X_{t+1}=x|S_t=s} \). This is just estimation for a
stochastic process, and can be tackled by any of the usual methods. In fact,
it's easier than estimation for ordinary HMMs, because \( S_t \) is a known
function of the trajectory \( x_1^t \). In the case of discrete states and observations, at
least, one actually has an exponential family, and everything is ideally
tractable.
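<P>Concretely, when \( \epsilon \) and \( T \) are given, estimation reduces to conditional frequency counting along the one observed trajectory: no EM-style latent-variable machinery is needed, because \( S_t \) can be read off from \( x_1^t \). A sketch, with the even process as the assumed model:

```python
import random

# Simulate the even process (toy example): predictive state 0 ('A') emits
# 0 or 1 equiprobably, state 1 ('B') emits 1 surely; the known update T
# is: a 1 flips the state, a 0 resets it to 0.
rng = random.Random(42)
seq, s = [], 0
for _ in range(20000):
    x = rng.choice([0, 1]) if s == 0 else 1
    seq.append(x)
    s = (1 - s) if x == 1 else 0

# Estimation: since S_t is a known function of x_1^t (tracked here by the
# same update T), just count emissions per state and normalize.
counts = {0: {0: 0, 1: 0}, 1: {0: 0, 1: 0}}
state = 0
for x in seq:
    counts[state][x] += 1
    state = (1 - state) if x == 1 else 0

est = {st: {x: n / sum(c.values()) for x, n in c.items()}
       for st, c in counts.items()}
```

Here `est[0]` comes out near the true \( \{0: 1/2,\ 1: 1/2\} \), and `est[1]` is exactly \( \{0: 0,\ 1: 1\} \), since state B never emits a 0.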
<h4>B. Discovery</h4>
Now the problem is: <em>Given</em> a realization \( x_1^n \) of the process,
<em>estimate</em> \( \epsilon, T, \Prob{X_{t+1}=x|S_t=s} \). The inspiration
for this comes from the "geometry from a time series"
or <a href="state-space-reconstruction.html">state-space reconstruction</a>
practiced with deterministic nonlinear dynamical systems. My favorite approach
(the CSSR algorithm) uses a lot of conditional independence tests, and so is
reminiscent of the "PC" algorithm for learning graphical causal models (which
is not a coincidence); Langford et al. have advocated a function-learning
approach; Pfau et al. have something which is at least a step towards a
nonparametric Bayesian procedure. I'll also give some references below
to discovery algorithms for spatio-temporal processes. Much, much more
can be done here than has been.
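<P>In the simplest cases, discovery can be caricatured in a few lines: estimate the conditional next-symbol distribution for each short suffix of the data, and merge suffixes whose distributions agree. (This is only the flavor of CSSR --- the real algorithm uses proper hypothesis tests and grows the suffix length adaptively. The "golden mean" process below, in which a 1 is always followed by a 0, is a toy example, and the threshold is arbitrary.)

```python
import random
from collections import defaultdict

# Simulate the golden mean process: after a 1 the next symbol must be 0;
# after a 0, a fair coin.  Its predictive states are "last symbol was 0"
# and "last symbol was 1".
rng = random.Random(7)
seq, last = [], 0
for _ in range(50000):
    x = 0 if last == 1 else rng.choice([0, 1])
    seq.append(x)
    last = x

# Empirical Pr(next = 1 | suffix) for each length-2 suffix that occurs.
counts = defaultdict(lambda: [0, 0])
for i in range(2, len(seq)):
    counts[tuple(seq[i - 2:i])][seq[i]] += 1

# Merge suffixes with (nearly) equal predictive distributions -- a crude
# tolerance standing in for CSSR's hypothesis tests.
states = []     # each entry: [representative probability, member suffixes]
for suffix, (n0, n1) in counts.items():
    p1 = n1 / (n0 + n1)
    for st in states:
        if abs(st[0] - p1) < 0.05:
            st[1].append(suffix)
            break
    else:
        states.append([p1, [suffix]])
```

The procedure recovers two states: suffixes ending in 0, with \( \Prob{\mathrm{next}=1} \approx 1/2 \), and the suffix \( (0,1) \), with \( \Prob{\mathrm{next}=1} = 0 \); the suffix \( (1,1) \) never occurs in the data, mirroring the impossible histories in the partition figures above.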
<h2>12. Causality</h2>
The term "causal states" was introduced by Crutchfield and Young in their 1989
invention of the construction. They are physicists, who are accustomed to
using the word "causal" much more casually than, say, statisticians. (I say
this with all due respect for two of my mentors; I owe almost everything I am
as a scientist to Crutchfield, one way or another.) Their construction, as
I hope it's clear from my sketch above, is all about probabilistic
rather than counterfactual prediction --- about selecting
sub-ensembles of naturally-occurring trajectories, not <em>making</em>
certain trajectories happen. Still, those screening-off properties
are <em>really suggestive</em>.
<h4>A. Back to Physics</h4>
<P>(What follows draws heavily on my paper with Cris Moore, so he gets the credit for everything useful.)
<P>Suppose we have a physical system with a microscopic state \( Z_t \in
\mathcal{Z} \) , and an evolution operator \( f \). Assume that these
micro-states do support counterfactuals, but also that we never get to see \(
Z_t \), and instead deal with \( X_t = \gamma(Z_t) \). The \( X_t \)
are <strong>coarse-grained</strong> or <strong>macroscopic</strong> variables.
<P>Each macrovariable gives a partition \( \Gamma \) of \( \mathcal{Z} \).
Sequences of \( X_t \) values refine \( \Gamma \):
\[
\Gamma^{(T)} = \bigwedge_{t=1}^{T}{f^{-t} \Gamma}
\]
<P>Now, if we form the predictive states, \( \epsilon \) partitions histories
of \( X \). Therefore, \( \epsilon \) will join cells of the
\( \Gamma^{(\infty)} \) partition. This in turn means that
\( \epsilon \) induces a partition \( \Delta \) of \( \mathcal{Z} \).
<em>This is a new, Markovian coarse-grained variable.</em>
<P>Now consider interventions. Manipulations which move \( z \) from one cell
of \( \Delta \) to another change the distribution of \( X^{\infty}_{t+1} \),
so they have observable consequences. Changing \( z \) inside a cell of \(
\Delta \) <em>might</em> still make a difference, but to something else, not
the distribution of observables. \( \Delta \) is a way of saying "there must
be at least this much structure", and it must be Markovian.
<h4>B. Macro/Micro</h4>
The fullest and most rigorous effort at merging the usual interventional /
counterfactual notions of causality with the computational mechanics approach
is in the paper by Chalupka, Perona and Eberhardt on "Visual causal feature
learning", which also has a detailed account of what it means to
intervene on a macroscopic variable in a system with microscopic variables,
how to relate micro- and macro- scale notions of causality, the
distinction between a predictive partition and a properly causal partition,
and why causal partitions are almost always <em>coarsenings</em> of predictive
partitions. I strongly recommend the paper, even if you don't care about
image classification, its ostensible subject.
<h2>Summary</h2>
<ol>
<li> Your favorite stochastic process has a unique, minimal Markovian representation.
<li> This representation has nice predictive properties.
<li> This representation can also be reconstructed from a realization of the process in some cases, and a lot more could be done on these lines.
<li> The predictive, Markovian states have the right screening-off properties for causal models, even if we can't always guarantee that they're causal.
</ol>
<P>See also:
<a href="computational-mechanics.html">Computational Mechanics</a>
<ul>Recommended, more central:
<li>James P. Crutchfield and Karl Young, "Inferring Statistical Complexity", <cite>Physical Review Letters</cite> <strong>63</strong> (1989): 105--108
[Followed by many more important papers from Crutchfield and collaborators]
<li>Herbert Jaeger, "Observable Operator Models for Discrete Stochastic Time Series", <a href="http://dx.doi.org/10.1162/089976600300015411"><cite>Neural Computation</cite> <strong>12</strong> (2000): 1371--1398</a> [<a href="http://minds.jacobs-university.de/sites/default/files/uploads/papers/oom_neco00.pdf">PDF preprint</a>]
<li>Frank B. Knight [Knight has the oldest version of the full set of
ideas above for stochastic processes, emphasizing an intricate treatment of
continuous time. The 1975 paper is much easier to read than the later book,
but still presumes you have measure-theoretic probability at your fingers'
ends.]
<ul>
<li>"A Predictive View of Continuous Time Processes",
<a href="http://projecteuclid.org/euclid.aop/1176996302"><cite>Annals of Probability</cite> <strong>3</strong> (1975): 573--596</a>
<li><cite>Foundations of the Prediction Process</cite>
</ul>
<li>John Langford, Ruslan Salakhutdinov and Tong Zhang,
"Learning Nonlinear Dynamic Models", in ICML 2009, <a href="http://arxiv.org/abs/0905.3369">arxiv:0905.3369</a>
<li>Michael L. Littman, Richard S. Sutton and Satinder Singh,
"Predictive Representations of State", in <a href="http://papers.nips.cc/paper/1983-predictive-representations-of-state">NIPS 2001</a>
<li>Wesley C. Salmon
<ul>
<li><cite><a href="http://bactra.org/reviews/salmon-relevance.html">Statistical Explanation and Statistical Relevance</a></cite> [The oldest version of the equivalence-class construction I know of, though not applied to stochastic processes. This is a 1971 book whose core chapter
reprints a paper by Salmon first published in 1970, and which was (I gather
from the text) circulating in manuscript form from some point in the 1960s. I have not tracked down the earlier versions to determine exactly how far back this thread goes, so "1970" is a conservative estimate.]
<li><cite>Scientific Explanation and the Causal Structure of the World</cite> [1984. In terms of actual content, a fuller and more useful presentation of the ideas from the 1971 book, elaborated and with responses to objections, etc.]
</ul>
</ul>
<ul>Recommended, more peripheral:
<li>Krzysztof Chalupka, Pietro Perona, Frederick Eberhardt, "Visual Causal Feature Learning", <a href="http://arxiv.org/abs/1412.2309">arxiv:1412.2309</a>
<li>Harry Furstenberg, <cite><a href="http://bactra.org/weblog/algae-2014-10.html#furstenberg">Stationary Processes and Prediction Theory</a></cite>
<li>Georg M. Goerg
<ul>
<li>"Predictive State Smoothing (PRESS): Scalable non-parametric regression for high-dimensional data with variable selection", <a href="https://research.google/pubs/pub46141/">Technical Report, Google Research, 2017</a>
<li>"Classification using Predictive State Smoothing (PRESS): A scalable kernel classifier for high-dimensional features with variable selection", <a href="https://research.google/pubs/pub46767/">Technical report, Google Research, 2018</a>
</ul>
<li>Peter Grassberger, "Toward a Quantitative Theory of Self-Generated Complexity", <a href="https://doi.org/10.1007/BF00668821"><cite>International Journal of Theoretical Physics</cite> <strong>25</strong> (1986): 907--938</a>
<li>Steffen L. Lauritzen
<ul>
<li>"On the Interrelationships among Sufficiency, Total Sufficiency, and Some Related Concepts", preprint 8, Institute of Mathematical Statistics, University of Copenhagen, 1974 [<a href="http://www.stats.ox.ac.uk/~steffen/papers/interrelationships.pdf">PDF reprint via Prof. Lauritzen</a>]
<li><cite><a href="https://doi.org/10.1007/978-1-4612-1023-8">Extremal Families and Systems of Sufficient Statistics</a></cite>
</ul>
<li>David Pfau, Nicholas Bartlett and Frank Wood, "Probabilistic Deterministic Infinite Automata", <a href="https://proceedings.neurips.cc/paper/2010/hash/dabd8d2ce74e782c65a973ef76fd540b-Abstract.html">pp. 1930--1938 in John Lafferty, C. K. I. Williams, John Shawe-Taylor, Richard S. Zemel and A. Culotta (eds.), <cite>Advances in Neural Information Processing Systems 23 [NIPS 2010]</cite></a>
<li>Naftali Tishby, Fernando C. Pereira and William Bialek, "The Information Bottleneck Method", in <cite>Proceedings of the 37th Annual Allerton Conference on
Communication, Control and Computing</cite>, <a href="http://arxiv.org/abs/physics/0004057">arxiv:physics/0004057</a>
<li>Daniel R. Upper, <cite><a href="http://csc.ucdavis.edu/~cmg/compmech/pubs/TAHMMGHMM.htm">Theory and Algorithms for Hidden Markov Models and
Generalized Hidden Markov Models</a></cite>
</ul>
<ul>Modesty forbids me to recommend:
<li>Georg M. Goerg and Cosma Rohilla Shalizi, "LICORS: Light Cone Reconstruction of States for Non-parametric Forecasting of Spatio-Temporal Systems",
<a href="http://arxiv.org/abs/1206.2398">arxiv:1206.2398</a> [<a href="http://bactra.org/weblog/988.html">Self-exposition</a>]
<li>George D. Montanez and CRS, "The LICORS Cabinet: Nonparametric Algorithms for Spatio-temporal Prediction", International Joint Conference on Neural Networks [IJCNN 2017], <a href="http://arxiv.org/abs/1506.02686">arxiv:1506.02686</a>
<li>CRS, "Optimal Nonlinear Prediction of Random Fields on Networks",
<cite>Discrete Mathematics and Theoretical Computer Science</cite>
<strong>AB(DMCS)</strong> (2003): 11--30, <a href="http://arxiv.org/abs/math.PR/0305160">arxiv:math.PR/0305160</a>
<li>CRS and James P. Crutchfield
<ul>
<li>"Computational Mechanics: Pattern and Prediction, Structure and Simplicity", <cite>Journal of Statistical Physics</cite> <strong>104</strong> (2001): 817--879, <a href="http://arxiv.org/abs/cond-mat/9907176">arxiv:cond-mat/9907176</a>
<li>"Information Bottlenecks, Causal States, and Statistical
Relevance Bases: How to Represent Relevant Information in Memoryless
Transduction", <a href="https://doi.org/10.1142/S0219525902000481"><cite>Advances in Complex Systems</cite> <strong>5</strong> (2002): 91--95</a>, <a href="http://arxiv.org/abs/nlin.AO/0006025">arxiv:nlin.AO/0006025</a>
</ul>
<li>CRS and Kristina Lisa Klinkner, "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences", in UAI 2004, <a href="http://arxiv.org/abs/cs.LG/0406011">arxiv:cs.LG/0406011</a>
<li>CRS, Kristina Lisa Klinkner and Robert Haslinger, "Quantifying Self-Organization with Optimal Predictors", <a href="http://dx.doi.org/10.1103/PhysRevLett.93.118701"><cite>Physical Review Letters</cite> <strong>93</strong>
(2004): 118701</a>, <a href="http://arxiv.org/abs/nlin.AO/0409024">arxiv:nlin.AO/0409024</a>
</ul>
<ul>To read:
<li>Byron Boots, Sajid M. Siddiqi, Geoffrey J. Gordon, "Closing the Learning-Planning Loop with Predictive State Representations", <a href="http://arxiv.org/abs/0912.2385">arxiv:0912.2385</a>
<li>Nicolas Brodu, "Reconstruction of Epsilon-Machines in Predictive Frameworks and Decisional States", <a href="http://arxiv.org/abs/0902.0600">arxiv:0902.0600</a>
<li>Christophe Cuny, Dalibor Volny, "A quenched invariance principle for stationary processes", <a href="http://arxiv.org/abs/1202.4875">arxiv:1202.4875</a> ["we obtain a (new) construction of the fact that any stationary process may be seen as a functional of a Markov chain"]
<li>Ahmed Hefny, Carlton Downey, Geoffrey Gordon, "Supervised Learning for Dynamical System Learning", <a href="http://arxiv.org/abs/1505.05310">arxiv:1505.05310</a>
<li>Joseph Kelly, Jing Kong, and Georg M. Goerg, "Predictive State Propensity Subclassification (PSPS): A causal inference method for optimal data-driven propensity score stratification", <a href="https://research.google/pubs/pub49197/">Google Research Technical Report 2020</a>, forthcoming in proceedings of
<a href="https://www.cclear.cc/2022">CLeaR 2022</a> [Closing the circle, in
a way, by bringing these ideas back to estimating causal effects]
<li>Frank B. Knight, <cite><a href="http://projecteuclid.org/euclid.lnms/1215464503">Essays on the Prediction Process</a></cite>
<li>Wolfgang Löhr, Arleta Szkola, Nihat Ay, "Process Dimension of Classical and Non-Commutative Processes", <a href="http://arxiv.org/abs/1108.3984">arxiv:1108.3984</a>
<li>Katalin Marton and Paul C. Shields, "How many future measures can there be?", <a href="http://dx.doi.org/10.1017/S0143385702000123"><cite>Ergodic Theory and Dynamical Systems</cite> <strong>22</strong> (2002): 257--280</a>
<li>Jayakumar Subramanian, Amit Sinha, Raihan Seraj, Aditya Mahajan, "Approximate Information State for Approximate Planning and Reinforcement Learning in Partially Observed Systems", <a href="https://jmlr.org/papers/v23/20-1165.html"><cite>Journal of Machine Learning Research</cite> <strong>23</strong> (2022): 12</a>
<li>David Wingate and Satinder Singh Baveja, "Exponential Family Predictive Representations of State", <a href="https://papers.nips.cc/paper/3177-exponential-family-predictive-representations-of-state">NIPS 2007</a>
</ul>