Notebooks

Universal Prediction Algorithms

Last update: 07 Apr 2026 22:39
First version: 16 February 2008

Given: a single time series, perhaps a very long one, from a stochastic process which is basically unknown; perhaps merely that it is stationary and ergodic.

Desired: a forecast which will converge on the best possible forecast, as the series becomes longer and longer. Or (much weaker): the best possible forecast from within a fixed class of forecasting algorithms.

A solution is called a universal prediction algorithm because it applies equally to all the processes within the class, and is not tailored to any one of them.

This has connections to information theory (via universal compression algorithms), to the problem of finding Markovian representations and inference for Markov models, and to many other topics. Presumably they could be used for bootstrapping time series.

Addendum, September 2024: This is a topic which I began studying in graduate school, then mostly left fallow except for noticing the occasional reference. But probabilistic sequence prediction is having an unexpected day of fame, and I want to think through what this literature might have to contribute...

Addendum, 29 March 2026: The Ornstein-Algoet Idea

Material written for a student project, but suitable for re-use here. This does not explain what Algoet added to Ornstein, nor some subtleties they had to deal with arising from imagining seeing only one, infinitely long sample path, whereas my student will be dealing with multiple paths.
\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)} \] Suppose we observe a sequence of categorical random variables $ X_1, X_2, \ldots X_n $. There is some probability distribution which generates this sequence, which we'll call $ \Prob{ \cdot} $. We'll assume that this distribution is stationary, so \[ \Prob{X_1 = x_1} = \Prob{X_2 = x_1} = \Prob{X_t = x_1} ~, \] and \[ \Prob{X_1=x_1, X_2=x_2} = \Prob{X_2=x_1, X_3=x_2} = \Prob{X_t = x_1, X_{t+1}=x_2} ~ , \] and in general \[ \Prob{X_{1:h}=x_{1:h}} = \Prob{ X_{t+1:t+h}=x_{1:h}} \] for any whole numbers $h, t$. I will introduce a little over-loaded abbreviation, so $ p(k) = \Prob{X_1 = k} $, $ p(k|j) = \Prob{X_2=k | X_1=j} $, etc.

We want to make a probabilistic prediction of $ X_{n+1} $, that is, we want to guess at $ X_{n+1} $ but our guess should take the form of a probability distribution over the alphabet of possible categories. Let's call this guessed distribution $ f $, for "forecast". One natural guess is the empirical distribution observed to date: \[ f_0(k) = \frac{1}{n}\sum_{t=1}^{n}{\Indicator{X_t = k}} \] (The meaning of the subscript 0 will be clear in a moment or two.) For a stationary process, $ f_0(k) \rightarrow p(k) $ as $ n\rightarrow \infty $, by the ergodic theorem. We can also modify this estimate by some amount of smoothing, \[ \hat{f}_0(k) = \frac{\alpha_k + \sum_{t=1}^{n}{\Indicator{X_t = k}}}{n + \sum_{k}{\alpha_k}} \] where the constants $ \alpha_k > 0 $ bias the estimate to give some probability to every symbol. (Often however we just choose $ \alpha_k = c $ for all $ k $, especially $ \alpha_k = 1 $.) All of the other estimates I'll introduce in a moment can be similarly smoothed.
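In code, the empirical forecast $f_0$ and its smoothed version $\hat{f}_0$ look like this (a minimal sketch; the function names are mine, not standard):

```python
from collections import Counter

def empirical_dist(xs, alphabet):
    """f_0(k): the fraction of observations equal to each symbol k."""
    n = len(xs)
    counts = Counter(xs)
    return {k: counts[k] / n for k in alphabet}

def smoothed_dist(xs, alphabet, alpha=1.0):
    """Smoothed \\hat{f}_0(k): add alpha pseudo-counts per symbol
    (taking alpha_k = alpha for all k; alpha = 1 is Laplace smoothing)."""
    n = len(xs)
    counts = Counter(xs)
    denom = n + alpha * len(alphabet)
    return {k: (counts[k] + alpha) / denom for k in alphabet}

xs = list("aababbaaab")
print(empirical_dist(xs, "ab"))  # {'a': 0.6, 'b': 0.4}
print(smoothed_dist(xs, "ab"))   # a -> 7/12, b -> 5/12: pulled toward uniform
```

Note that the smoothed estimate assigns positive probability to every symbol in the alphabet, even one never yet observed.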

Now $ f_0 $ is usually not a very good prediction, because it ignores the context of the immediately preceding values. Specifically, $ X_{n} $ will usually tell us something about which values of $ X_{n+1} $ are more or less likely. Let us suppose that $ X_{n} = j $. Then define \[ I(j) \equiv \left\{ t \in 1:(n-1) ~|~ X_t = j\right\} \] That is, the set of all places in the data where we have previously seen the symbol $ j $, and we know the next symbol. Then \[ f_1(k|j) \equiv \frac{\sum_{t \in I(j)}{\Indicator{X_{t+1}=k}}}{\# I(j)} \] The numerator is the number of times we saw $ j $ followed by $ k $, and the denominator is the number of times we saw $ j $ (regardless of what it was followed by). This will converge to $ p(k|j) $. (To see this, divide both the numerator and denominator by $ n $. Then the numerator converges to $ \Prob{X_t=j, X_{t+1}=k } $ and the denominator to $ \Prob{X_t=j} $, in both cases by the ergodic theorem again.)

We can now define $ f_h $. Suppose that $ X_{n-h+1:n} = w $, for some word $ w $ of length $ h $. Then \[ I(w) \equiv \left\{ t \in h:(n-1) ~| ~ X_{t-h+1:t} = w \right\} \] and \[ f_h(k|w) \equiv \frac{\sum_{t \in I(w)}{\Indicator{X_{t+1}=k}}}{\# I(w)} \] and this, too, converges to the conditional probability $ \Prob{X_{t+1}=k|X_{t-h+1:t}=w} = p(k|w) $.
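A sketch of the general context estimator $f_h(k|w)$, again with names of my own choosing (with $w$ the empty word this reduces to $f_0$, and with $|w| = 1$ to $f_1$):

```python
from collections import Counter

def f_h(xs, w, alphabet):
    """Estimate p(k | w): the distribution of the symbol following
    occurrences of the context word w in the sequence xs."""
    h = len(w)
    # I(w): positions t where the h symbols ending at t spell out w,
    # and a successor X_{t+1} exists; tally those successors.
    follow = Counter(
        xs[t + 1]
        for t in range(h - 1, len(xs) - 1)
        if tuple(xs[t - h + 1 : t + 1]) == tuple(w)
    )
    total = sum(follow.values())  # = #I(w)
    if total == 0:
        return None  # context never seen with a known successor
    return {k: follow[k] / total for k in alphabet}

xs = list("abababbb")
print(f_h(xs, "ab", "ab"))  # "ab" was followed by a, a, b: a -> 2/3, b -> 1/3
```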

Now the difficulty is that we do not know how long a memory $ h $ we should use.

  1. On the one hand, we generally want to use the longest memory we can get away with. Just as $ f_1 $ makes use of more informative context than $ f_0 $, $ f_2 $ uses more context than $ f_1 $, and in general $ f_{h+r} $ uses more than $ f_h $.
    • In fact, let's consider conditioning on either $ w $, of length $ h $, or on $ uw $, of length $ h + r $, and ignore issues of finite data and estimation for the moment. A little algebra with the definition of conditional probability says that $ p(k|w) = \sum_{v: |v|=r}{p(k|vw) p(v|w) } $. The conditional distribution of $ X_{n+1} $ given the actual longer context $ uw $, $ p(k|uw) $, is in that sum, but it's averaged together with a bunch of other distributions, for other contexts which did not bring us to time $ n $. So the true predictive distribution, that we'd actually want, would be the distribution conditional on the infinite past. (Cf.) Conditioning on a shorter history instead mixes or blends that distribution with others, making our forecast less precise and less accurate.
  2. On the other hand, the more often we've seen $ w $, the better $ f_h(k|w) $ will be as an estimate of $p(k|w)$. With finite data, beyond some length we're always better off using a more reliable estimate of a smoothed-out distribution than a very noisy estimate of a more targeted distribution. (If the data is small enough, we'll accept a lot of bias to kill off even more variance.)
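The mixing identity in point 1 above, $ p(k|w) = \sum_{v}{p(k|vw) p(v|w)} $, can be checked numerically; here is a small sketch for the case $|w| = |v| = 1$, using an arbitrary joint distribution on triples $(X_1, X_2, X_3)$ (the identity is just the law of total probability, applied under the conditioning on $X_2 = w$):

```python
import itertools
import random

random.seed(0)
symbols = "ab"
# An arbitrary (normalized) joint distribution over triples (x1, x2, x3).
joint = {t: random.random() for t in itertools.product(symbols, repeat=3)}
z = sum(joint.values())
joint = {t: p / z for t, p in joint.items()}

def marg(pred):
    """Total probability of the triples satisfying the predicate."""
    return sum(p for t, p in joint.items() if pred(t))

for w in symbols:
    for k in symbols:
        # p(k|w): condition on the middle symbol only.
        lhs = marg(lambda t: t[1] == w and t[2] == k) / marg(lambda t: t[1] == w)
        # sum over v of p(k|vw) p(v|w): condition on the longer context.
        rhs = sum(
            (marg(lambda t: t == (v, w, k)) / marg(lambda t: t[:2] == (v, w)))
            * (marg(lambda t: t[:2] == (v, w)) / marg(lambda t: t[1] == w))
            for v in symbols
        )
        assert abs(lhs - rhs) < 1e-12
print("identity verified")
```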

If we knew that the process was a Markov chain, we would use $ h=1 $, because there's no point to anything larger. If we knew it was a Markov chain of higher order, we'd (presumably) use that order. But we usually don't know the order, and most processes aren't Markov chains of any finite order. We could try letting $ h=h(n) $ for some growing function, but there are theoretical results showing that no choice of growth rate, made independently of the process, will work for all processes. In particular, depending on the process, and on $ w $, we might or might not have enough data to make $ f_h(k|w) $ a reliable estimate of $ p(k|w) $. If $ f_h(k|w) $ is a good estimate of $ p(k|w) $, then we should, in that context, use a memory length of at least $ h $. (Maybe we should go on to consider even more context, and use memory length greater than $h$, but we shouldn't use less context.) But how are we to decide on this?

The ingenious idea of Ornstein --- and, following him, Algoet --- starts from the following observation. Suppose we divide the data into multiple parts, and re-estimate $ f_h(k|w) $ separately on every part. If each part is going to infinity, then every estimate should be close to $ p(k|w) $, and so all the estimates will be close to each other. (If each estimate is within $ \epsilon $ of the limit, then the distance between any two estimates is at most $ 2\epsilon $.) Moreover, there's no other limit which the multiple estimates can all converge on.

Turned around, if, after dividing the data into multiple parts, the estimates of $ f_h(k|w) $ from each part are all close to each other, then they are (probably) all close to the limit $ p(k|w) $. (More precisely, the probability that they are close to each other and far from the limit goes to zero as $ n\rightarrow\infty $.) So we can set a sequence of decreasing thresholds $ \epsilon_n \downarrow 0 $, and check whether estimates $ f_h(k|w) $ from different parts of the data all agree within $ \epsilon_n$ . If they do, we go on to try increasing $ h $ by 1. We stop at the longest memory length at which we can still get agreement to within $ \epsilon_n $. We use that $ f_h(k|w) $ (taken over the whole data) as our prediction.
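The split-and-compare procedure can be sketched in code. This is a loose simplification of my own (fixed number of parts, a fixed threshold rather than a schedule $\epsilon_n$, an arbitrary cap on $h$), not Ornstein's or Algoet's exact construction:

```python
from collections import Counter

def f_h(xs, w, alphabet):
    """Estimate p(k | w) from the sequence xs (None if w has no successor)."""
    h = len(w)
    follow = Counter(
        xs[t + 1]
        for t in range(h - 1, len(xs) - 1)
        if tuple(xs[t - h + 1 : t + 1]) == tuple(w)
    )
    total = sum(follow.values())
    if total == 0:
        return None
    return {k: follow[k] / total for k in alphabet}

def choose_memory(xs, alphabet, n_parts=2, eps=0.05, h_max=10):
    """Grow the memory length h as long as the estimates of f_h(.|w),
    re-fit separately on each part of the data, agree to within eps;
    here w is the last h symbols of xs.  Return the chosen h and the
    forecast f_h fit to the whole data."""
    size = len(xs) // n_parts
    parts = [xs[i * size : (i + 1) * size] for i in range(n_parts)]
    best_h = 0
    for h in range(1, h_max + 1):
        w = xs[-h:]
        ests = [f_h(p, w, alphabet) for p in parts]
        if any(e is None for e in ests):
            break  # some part never saw this context
        spread = max(
            abs(ests[i][k] - ests[j][k])
            for k in alphabet
            for i in range(n_parts)
            for j in range(i + 1, n_parts)
        )
        if spread > eps:
            break  # the parts disagree: stop at the previous h
        best_h = h
    return best_h, f_h(xs, xs[-best_h:] if best_h else [], alphabet)

# Usage on a synthetic order-1 chain (illustrative only):
import random
random.seed(1)
xs, cur = [], "a"
for _ in range(2000):
    xs.append(cur)
    cur = cur if random.random() < 0.8 else ("b" if cur == "a" else "a")
h, forecast = choose_memory(xs, "ab")
print(h, forecast)
```

With multiple independent sample paths, as in the student project mentioned above, the "parts" would naturally be the separate paths rather than segments of one long path.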

