Notebooks

Universal Prediction Algorithms

Last update: 07 Apr 2026 22:39
First version: 16 February 2008

Given: a single time series, perhaps a very long one, from a stochastic process which is basically unknown; perhaps merely that it is stationary and ergodic.

Desired: a forecast which will converge on the best possible forecast, as the series becomes longer and longer. Or (much weaker): the best possible forecast from within a fixed class of forecasting algorithms.

A solution is called a universal prediction algorithm because it applies equally to all the processes within the class, and is not tailored to any one of them.

This has connections to information theory (via universal compression algorithms), to the problem of finding Markovian representations and inference for Markov models, and to many other topics. Presumably they could be used for bootstrapping time series.

Addendum, September 2024: This is a topic which I began studying in graduate school, then mostly left fallow except for noticing the occasional reference. But probabilistic sequence prediction is having an unexpected day of fame, and I want to think through what this literature might have to contribute...

Addendum, 29 March 2026: The Ornstein-Algoet Idea

Material written for a student project, but suitable for re-use here. This does not explain what Algoet added to Ornstein, nor some subtleties they had to deal with arising from imagining seeing only one, infinitely long sample path, whereas my student will be dealing with multiple paths.
\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)} \] Suppose we observe a sequence of categorical random variables $ X_1, X_2, \ldots X_n $. There is some probability distribution which generates this sequence, which we'll call $ \Prob{ \cdot} $. We'll assume that this distribution is stationary, so \[ \Prob{X_1 = x_1} = \Prob{X_2 = x_1} = \Prob{X_t = x_1} ~, \] and \[ \Prob{X_1=x_1, X_2=x_2} = \Prob{X_2=x_1, X_3=x_2} = \Prob{X_t = x_1, X_{t+1}=x_2} ~ , \] and in general \[ \Prob{X_{1:h}=x_{1:h}} = \Prob{ X_{t+1:t+h}=x_{1:h}} \] for any whole numbers $h, t$. I will introduce a little over-loaded abbreviation, so $ p(k) = \Prob{X_1 = k} $, $ p(k|j) = \Prob{X_2=k | X_1=j} $, etc.

We want to make a probabilistic prediction of $ X_{n+1} $, that is, we want to guess at $ X_{n+1} $ but our guess should take the form of a probability distribution over the alphabet of possible categories. Let's call this guessed distribution $ f $, for "forecast". One natural guess is the empirical distribution observed to date: \[ f_0(k) = \frac{1}{n}\sum_{t=1}^{n}{\Indicator{X_t = k}} \] (The meaning of the subscript 0 will be clear in a moment or two.) For a stationary process, $ f_0(k) \rightarrow p(k) $ as $ n\rightarrow \infty $, by the ergodic theorem. We can also modify this estimate by some amount of smoothing, \[ \hat{f}_0(k) = \frac{\alpha_k + \sum_{t=1}^{n}{\Indicator{X_t = k}}}{n + \sum_{k}{\alpha_k}} \] where the constants $ \alpha_k > 0 $ bias the estimate to give some probability to every symbol. (Often however we just choose $ \alpha_k = c $ for all $ k $, especially $ \alpha_k = 1 $.) All of the other estimates I'll introduce in a moment can be similarly smoothed.
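In code, the empirical forecast $f_0$ and its smoothed version $\hat{f}_0$ look like this (a minimal sketch; the function names are mine, not standard):

```python
from collections import Counter

def empirical_dist(xs, alphabet):
    """f_0(k): the fraction of observations equal to each symbol k."""
    n = len(xs)
    counts = Counter(xs)
    return {k: counts[k] / n for k in alphabet}

def smoothed_dist(xs, alphabet, alpha=1.0):
    """Smoothed \\hat{f}_0(k): add alpha pseudo-counts per symbol
    (taking alpha_k = alpha for all k; alpha = 1 is Laplace smoothing)."""
    n = len(xs)
    counts = Counter(xs)
    denom = n + alpha * len(alphabet)
    return {k: (counts[k] + alpha) / denom for k in alphabet}

xs = list("aababbaaab")
print(empirical_dist(xs, "ab"))  # {'a': 0.6, 'b': 0.4}
print(smoothed_dist(xs, "ab"))   # a -> 7/12, b -> 5/12: pulled toward uniform
```

Note that the smoothed estimate assigns positive probability to every symbol in the alphabet, even one never yet observed.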

Now $ f_0 $ is usually not a very good prediction, because it ignores the context of the immediately preceding values. Specifically, $ X_{n} $ will usually tell us something about which values of $ X_{n+1} $ are more or less likely. Let us suppose that $ X_{n} = j $. Then define \[ I(j) \equiv \left\{ t \in 1:(n-1) ~|~ X_t = j\right\} \] That is, the set of all places in the data where we have previously seen the symbol $ j $, and we know the next symbol. Then \[ f_1(k|j) \equiv \frac{\sum_{t \in I(j)}{\Indicator{X_{t+1}=k}}}{\# I(j)} \] The numerator is the number of times we saw $ j $ followed by $ k $, and the denominator is the number of times we saw $ j $ (regardless of what it was followed by). This will converge to $ p(k|j) $. (To see this, divide both the numerator and denominator by $ n $. Then the numerator converges to $ \Prob{X_t=j, X_{t+1}=k } $ and the denominator to $ \Prob{X_t=j} $, in both cases by the ergodic theorem again.)

We can now define $ f_h $. Suppose that $ X_{n-h+1:n} = w $, for some word $ w $ of length $ h $. Then \[ I(w) \equiv \left\{ t \in h:(n-1) ~| ~ X_{t-h+1:t} = w \right\} \] and \[ f_h(k|w) \equiv \frac{\sum_{t \in I(w)}{\Indicator{X_{t+1}=k}}}{\# I(w)} \] and this, too, converges to the conditional probability $ \Prob{X_{t+1}=k|X_{t-h+1:t}=w} = p(k|w) $.
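A sketch of the general context estimator $f_h(k|w)$, again with names of my own choosing (with $w$ the empty word this reduces to $f_0$, and with $|w| = 1$ to $f_1$):

```python
from collections import Counter

def f_h(xs, w, alphabet):
    """Estimate p(k | w): the distribution of the symbol following
    occurrences of the context word w in the sequence xs."""
    h = len(w)
    # I(w): positions t where the h symbols ending at t spell out w,
    # and a successor X_{t+1} exists; tally those successors.
    follow = Counter(
        xs[t + 1]
        for t in range(h - 1, len(xs) - 1)
        if tuple(xs[t - h + 1 : t + 1]) == tuple(w)
    )
    total = sum(follow.values())  # = #I(w)
    if total == 0:
        return None  # context never seen with a known successor
    return {k: follow[k] / total for k in alphabet}

xs = list("abababbb")
print(f_h(xs, "ab", "ab"))  # "ab" was followed by a, a, b: a -> 2/3, b -> 1/3
```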

Now the difficulty is that we do not know how long a memory $ h $ we should use.

  1. On the one hand, we generally want to use the longest memory we can get away with. Just as $ f_1 $ makes use of more informative context than $ f_0 $, $ f_2 $ uses more context than $ f_1 $, and in general $ f_{h+r} $ uses more than $ f_h $.
    • In fact, let's consider conditioning on either $ w $, of length $ h $, or on $ uw $, of length $ h + r $, and ignore issues of finite data and estimation for the moment. A little algebra with the definition of conditional probability says that $ p(k|w) = \sum_{v: |v|=r}{p(k|vw) p(v|w) } $. The conditional distribution of $ X_{n+1} $ given the actual longer context $ uw $, $ p(k|uw) $, is in that sum, but it's averaged together with a bunch of other distributions, for other contexts which did not bring us to time $ n $. So the true predictive distribution, that we'd actually want, would be the distribution conditional on the infinite past. (Cf.) Conditioning on a shorter history instead mixes or blends that distribution with others, making our forecast less precise and less accurate.
  2. On the other hand, the more often we've seen $ w $, the better $ f_h(k|w) $ will be as an estimate of $p(k|w)$. With finite data, beyond some length we're always better off using a more reliable estimate of a smoothed-out distribution than a very noisy estimate of a more targeted distribution. (If the data is small enough, we'll accept a lot of bias to kill off even more variance.)
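The mixing identity in point 1 above, $ p(k|w) = \sum_{v}{p(k|vw) p(v|w)} $, can be checked numerically; here is a small sketch for the case $|w| = |v| = 1$, using an arbitrary joint distribution on triples $(X_1, X_2, X_3)$ (the identity is just the law of total probability, applied under the conditioning on $X_2 = w$):

```python
import itertools
import random

random.seed(0)
symbols = "ab"
# An arbitrary (normalized) joint distribution over triples (x1, x2, x3).
joint = {t: random.random() for t in itertools.product(symbols, repeat=3)}
z = sum(joint.values())
joint = {t: p / z for t, p in joint.items()}

def marg(pred):
    """Total probability of the triples satisfying the predicate."""
    return sum(p for t, p in joint.items() if pred(t))

for w in symbols:
    for k in symbols:
        # p(k|w): condition on the middle symbol only.
        lhs = marg(lambda t: t[1] == w and t[2] == k) / marg(lambda t: t[1] == w)
        # sum over v of p(k|vw) p(v|w): condition on the longer context.
        rhs = sum(
            (marg(lambda t: t == (v, w, k)) / marg(lambda t: t[:2] == (v, w)))
            * (marg(lambda t: t[:2] == (v, w)) / marg(lambda t: t[1] == w))
            for v in symbols
        )
        assert abs(lhs - rhs) < 1e-12
print("identity verified")
```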

If we knew that the process was a Markov chain, we would use $ h=1 $, because there's no point to anything larger. If we knew it was a Markov chain of higher order, we'd (presumably) use that order. But we usually don't know the order, and most processes aren't Markov chains of any finite order. We could try letting $ h=h(n) $ for some growing function, but there are theoretical results showing that no choice of growth rate, made independently of the process, will work for all processes. In particular, depending on the process, and on $ w $, we might or might not have enough data to make $ f_h(k|w) $ a reliable estimate of $ p(k|w) $. If $ f_h(k|w) $ is a good estimate of $ p(k|w) $, then we should, in that context, use a memory length of at least $ h $. (Maybe we should go on to consider even more context, and use memory length greater than $h$, but we shouldn't use less context.) But how are we to decide on this?

The ingenious idea of Ornstein --- and, following him, Algoet --- starts from the following observation. Suppose we divide the data into multiple parts, and re-estimate $ f_h(k|w) $ separately on every part. If each part is going to infinity, then every estimate should be close to $ p(k|w) $, and so all the estimates will be close to each other. (If each estimate is within $ \epsilon $ of the limit, then the distance between any two estimates is at most $ 2\epsilon $.) Moreover, there's no other limit which the multiple estimates can all converge on.

Turned around, if, after dividing the data into multiple parts, the estimates of $ f_h(k|w) $ from each part are all close to each other, then they are (probably) all close to the limit $ p(k|w) $. (More precisely, the probability that they are close to each other and far from the limit goes to zero as $ n\rightarrow\infty $.) So we can set a sequence of decreasing thresholds $ \epsilon_n \downarrow 0 $, and check whether estimates $ f_h(k|w) $ from different parts of the data all agree within $ \epsilon_n$ . If they do, we go on to try increasing $ h $ by 1. We stop at the longest memory length at which we can still get agreement to within $ \epsilon_n $. We use that $ f_h(k|w) $ (taken over the whole data) as our prediction.
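The split-and-compare procedure can be sketched in code. This is a loose simplification of my own (fixed number of parts, a fixed threshold rather than a schedule $\epsilon_n$, an arbitrary cap on $h$), not Ornstein's or Algoet's exact construction:

```python
from collections import Counter

def f_h(xs, w, alphabet):
    """Estimate p(k | w) from the sequence xs (None if w has no successor)."""
    h = len(w)
    follow = Counter(
        xs[t + 1]
        for t in range(h - 1, len(xs) - 1)
        if tuple(xs[t - h + 1 : t + 1]) == tuple(w)
    )
    total = sum(follow.values())
    if total == 0:
        return None
    return {k: follow[k] / total for k in alphabet}

def choose_memory(xs, alphabet, n_parts=2, eps=0.05, h_max=10):
    """Grow the memory length h as long as the estimates of f_h(.|w),
    re-fit separately on each part of the data, agree to within eps;
    here w is the last h symbols of xs.  Return the chosen h and the
    forecast f_h fit to the whole data."""
    size = len(xs) // n_parts
    parts = [xs[i * size : (i + 1) * size] for i in range(n_parts)]
    best_h = 0
    for h in range(1, h_max + 1):
        w = xs[-h:]
        ests = [f_h(p, w, alphabet) for p in parts]
        if any(e is None for e in ests):
            break  # some part never saw this context
        spread = max(
            abs(ests[i][k] - ests[j][k])
            for k in alphabet
            for i in range(n_parts)
            for j in range(i + 1, n_parts)
        )
        if spread > eps:
            break  # the parts disagree: stop at the previous h
        best_h = h
    return best_h, f_h(xs, xs[-best_h:] if best_h else [], alphabet)

# Usage on a synthetic order-1 chain (illustrative only):
import random
random.seed(1)
xs, cur = [], "a"
for _ in range(2000):
    xs.append(cur)
    cur = cur if random.random() < 0.8 else ("b" if cur == "a" else "a")
h, forecast = choose_memory(xs, "ab")
print(h, forecast)
```

With multiple independent sample paths, as in the student project mentioned above, the "parts" would naturally be the separate paths rather than segments of one long path.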

