Statistical Inference for Markov and Hidden Markov Models

Last update: 21 Apr 2025 21:17
First version: 28 September 2007

I am concerned here with inferring the parameters and/or the structure of the model, not with the estimation of the hidden state in a hidden Markov model with known parameters; that problem falls under filtering.

Parameter inference (what machine learning types would call "learning") given a known, fixed structure. Determining the structure ("discovery"). Order estimation as a particular case of discovery, and of model selection. (Discovery may need its own notebook soon.)

Markov models naturally form exponential families; what classes of partially-observable Markov processes also do?

Inference for Stochastic Differential Equations (strictly speaking a sub-topic, but one with enough special features to get separate treatment)
Markov models
Mixture Models [An HMM with a finite number of hidden states is like a mixture model with serial dependence between observations...]
Prediction Processes; Markovian (and Conceivably Causal) Representations of Stochastic Processes
Statistics
Time Series
Universal Prediction Algorithms

Patrick Billingsley, Statistical Inference for Markov Chains [Foundational.]
Olivier Cappé Eric Moulines and Tobias Ryden, Inference in Hidden Markov Models [This is a superb book, treating all the main statistical problems connected with HMMs with both clarity and rigor. There is a chapter/appendix which reminds the reader who has forgotten about Markov chains on general state spaces, but remembers measure-theoretic probability very well, about their properties. (The idea that the reader of a book on HMMs may not know what the Viterbi algorithm is, but will of course recall the Hahn-Jordan decomposition, strikes me as very much a product of the French school of probability theory --- from which I have learned much! But it's not so far gone as to make the whole thing an exercise in the "general theory of processes".)]
Andrew Fraser, Hidden Markov Models and Dynamical Systems [Review: The Statistics of Moving Shadows]
Peter Guttorp, Stochastic Modelling of Scientific Data
Kevin McGoff, Sayan Mukherjee, and Natesh Pillai, "Statistical inference for dynamical systems: A review", Statistics Surveys 9 (2015): 209--252, arxiv:1204.6265

Animashree Anandkumar, Daniel Hsu, Sham M. Kakade, "A Method of Moments for Mixture Models and Hidden Markov Models", arxiv:1203.0683
Peter J. Bickel and Ya'acov Ritov, "Inference in hidden Markov models I: Local asymptotic normality in the stationary case", Bernoulli 2 (1996): 199--228 [Not sure if Part II ever appeared...]
David Blackwell and Lambert Koopmans, "On the Identifiability Problem for Functions of Finite Markov Chains", The Annals of Mathematical Statistics 28 (1957): 1011--1015 [An old, but very clear, paper on the problems presented by what we now call hidden Markov models]
Peter Bühlmann and Abraham J. Wyner, "Variable Length Markov Chains", The Annals of Statistics 27 (1999): 480--513 [Preprint available as Berkeley statistics department technical report 479]
Olivier Cappé "Online EM Algorithm for Hidden Markov Models", Journal of Computational and Graphical Statistics 20 (2011): 728--749, arxiv:0908.2359
George Cybenko and Valentino Crespi, "Learning Hidden Markov Models Using Nonnegative Matrix Factorization", IEEE Transactions on Information Theory 57 (2011): 3963--3970, arxiv:0809.4086 [Though it contains an error, at least in the preprint version, about the capacities of our CSSR algorithm --- we can get model structures right with much less data than they think, though we presented examples using more data than was strictly needed.]
Randal Douc, Eric Moulines, Jimmy Olsson, and Ramon van Handel, "Consistency of the maximum likelihood estimator for general hidden Markov models", Annals of Statistics 39 (2011): 474--513
P. Dupont, F. Denis and Y. Esposito, "Links between probabilistic automata and hidden Markov models: probability distributions, learning models and induction algorithms", Pattern Recognition 38 (2005): 1349--1371
Julian Feldman and Joe F. Hanna", "The Structure of Responses to a Sequence of Binary Events", Journal of Mathematical Psychology 3 (1966): 371--387
J. D. Foulkes, "A Class of Machines which Determine the Statistical Structure of a Sequence of Characters", pp. 66-73 of vol. 4 of the Western Electronics [Wescon] Convention Record, 1959 (Institute of Radio Engineers) [PDF]
Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, Alan S. Willsky, "A sticky HDP-HMM with application to speaker diarization", Annals of Applied Statistics 5 (2011): 1020--1056, arxiv:0905.2592
Sandro Gallo and Florencia Leonardi, "Nonparametric statistical inference for the context tree of a stationary ergodic process", Electronic Journal of Statistics 9 (2015): 2076--2098, arxiv:1411.7650
Subhashis Ghosal and Yongqiang Tang, "Bayesian Consistency for Markov Processes", Sankhya 68 (2006): 227--239
Subhashis Ghosal and Aad van der Vaart, "Convergence Rates of Posterior Distributions for Non-IID Observations", Annals of Statistics 35 (2007): 192--223, arxiv:0708.0491
Christopher C. Heyde, Quasi-Likelihood and Its Applications: A General Approach to Optimal Parameter Estimation
J. D. Kalbfleisch and J. F. Lawless, "Least-Squares Estimation of Transition Probabilities from Aggregate Data", The Canadian Journal of Statistics / La Revue Canadienne de Statistique 12 (1984): 169--182 [JSTOR]
Mladen Kolar, Le Song, Amr Ahmed, Eric P. Xing, "Estimating time-varying networks", Annals of Applied Statistics 4 (2010): 94--123, arxiv:0812.5087
Kevin McGoff, Andrew B. Nobel, "Empirical risk minimization and complexity of dynamical models", Annals of Statistics 48 (2020): 2031--2054, arxiv:1611.06173
Gusztav Morvai and Benjamin Weiss
- "Estimating the Lengths of Memory Words", IEEE Transactions on Information Theory 54 (2008): 3804--3807
- "On estimating the memory for finitarily Markovian processes", Ann. I. H. Poincaré-PR 43 (2007): 15--30 arxiv:0712.0105
- "Order estimation of Markov chains", IEEE Transactions on Information Theory 51 (2005): 1496--1497, arxiv:0711.0472
Ioannis Papageorgiou, Ioannis Kontoyiannis, "Context-tree weighting for real-valued time series: Bayesian inference with hierarchical mixture models", arxiv:2106.03023
F. Papangelou, "Large Deviations and the Bayesian Estimation of Higher-Order Markov Transition Functions ", Journal of Applied Probability 33 (1996): 18--27 [JSTOR]
Yuval Peres and Paul Shields, "Two new Markov order estimators", math.ST/0506080
David Pfau, Nicholas Bartlett and Frank Wood, "Probabilistic Deterministic Infinite Automata", NIPS 23 (2010) [PDF. Really, mixtures over finite automata with arbitrarily many states.]
George G. Roussas, Contiguity of Probability Measures: Some Applications in Statistics [Asymptotic theory of approximation, estimation and testing, for discrete-time Markov processes on fairly general state-spaces. Mini-review]
Christopher C. Strelioff, James P. Crutchfield, Alfred W. Hubler, "Inferring Markov Chains: Bayesian Estimation, Model Comparison, Entropy Rate, and Out-of-class Modeling", math.ST/0703715
Daniel R. Upper, Theory and Algorithms for Hidden Markov Models and Generalized Hidden Markov Models [Ph.D. thesis, math dept., Berkeley, 1997; online]

quite

E. Racca, F. Laio, D. Poggi and L. Ridolfi, "Test to determine the Markov order of a time series", Physical Review E 75 (2007): 011126 [The test is to linearly regress x(t+1) on x(t), x(t-1), etc., out to some finite order, and see how far back you have to go before the regression coefficients are insignificantly different from zero. This is not crazy as a first cut idea, but it's not generally valid, and in fact fails for the logistic map.]

Definitely

S. S. Melnyk, O. V. Usatenko, V. A. Yampol'skii and V. A. Golick, "Competition between Two Kinds of Correlations in Literary Texts", physics/0402042 [This has surely got to be one of the ugliest ways of parameterizing a Markov chain I have seen; it's a miracle they don't get probabilities greater than 1, if indeed they don't.]

CRS, "Dynamics of Bayesian Updating with Dependent Data and Misspecified Models", arxiv:0901.1342
CRS and Kristina Lisa Klinkner, "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences", cs.LG/0406011 [CSSR]
The relevant parts of my course on "Data over Space and Time", which were lectures 19 (for Markov models) and 23 (for hidden-Markov models) in the 2020 iteration.

Roberto C. Alamino, Nestor Caticha, "Online Learning in Discrete Hidden Markov Models", arxiv:0708.2377
Armen Allahverdyan, Aram Galstyan, "On Maximum a Posteriori Estimation of Hidden Markov Processes", UAI 2009, arxiv:0906.1980
Enrique E. Alvarez, "Estimation in Stationary Markov Renewal Processes, with Application to Earthquake Forecasting in Turkey", Methodology and Computing in Applied Probability 7 (2005): 119--130
Animashree Anandkumar, Kamalika Chaudhuri, Daniel Hsu, Sham M. Kakade, Le Song, Tong Zhang, "Spectral Methods for Learning Multivariate Latent Tree Structure", arxiv:1107.1283
Sofia Andersson and Tobias Rydén, "Subspace estimation and prediction methods for hidden Markov models", Annals of Statistics 37 (2009): 4131--4152
Ana Arribas-Gil, Elisabeth Gassiat and Catherine Matias, "Parameter estimation in pair hidden Markov models", math.ST/0509280
Yves F. Atchade, "Estimation of Network structures from partially observed Markov random fields", arxiv:1108.2835
A. R. Baigorri, C. R. Goncalves, P. A. A. Resende, "Markov Chain Order Estimation and Relative Entropy", arxiv:0910.0264
Alexandre Belloni, Roberto I. Oliveira, "Approximate group context tree: applications to dynamic programming and dynamic choice models", arxiv:1107.0312
Patrice Bertail and Stéphan Clémençon, "Edgeworth expansions of suitably normalized sample mean statistics for atomic Markov chains", Probability Theory and Related Fields 130 (2004): 388--414 [I need to learn more about Edgeworth expansions anyway]
Byron Boots, "Spectral Approaches to Learning Predictive Representations" [Thesis proposal, CMU, 2011]
Byron Boots, Sajid M. Siddiqi, Geoffrey J. Gordon, "Closing the Learning-Planning Loop with Predictive State Representations", arxiv:0912.2385
Jose Borges and Mark Levene, "Evaluating Variable Length Markov Chain Models for Analysis of User Web Navigation Sessions", cs.AI/0606115
J. Borwanker, G. Kallianpur and B. L. S. Prakasa Rao, "The Bernstein-von Mises Theorem for Markov Processes", The Annals of Mathematical Statistics 42 (1971): 1241--1253 [JSTOR]
Carles Bretó, Daihai He, Edward L. Ionides, Aaron A. King, "Time series analysis via mechanistic models", Annals of Applied Statistics 3 (2009): 319--348, arxiv:0802.0021
Prafulla Chandra, Andrew Thangaraj, Nived Rajaraman, "How good is Good-Turing for Markov samples?", arxiv:2102.01938
Zhiyi Chi, "Effects of statistical dependence on multiple testing under a hidden Markov model", Annals of Statistics 39 (2011): 439--473
Pavel Chigansky, "Maximum Likelihood Estimator for Hidden Markov Models in continuous time", Statistical Inference for Stochastic Processes 12 (2009): 139--163, arxiv:0707.0271
Antonio Anastasio Bruto da Costa, Pallab Dasgupta, "Learning Temporal Causal Sequence Relationships from Real-Time Time-Series", arxiv:1905.12262
Christos Dimitrakakis, "Contex models on sequences of covers", arxiv:1005.2263
C. C. Y. Dorea and L. C. Zhao, "Nonparametric Density Estimation in Hidden Markov Models", Statistical Inference for Stochastic Processes 5 (2002): 55--64
Michael DelRose, Christian Wagner, Philip Frederick, "Evidence Feed Forward Hidden Markov Model: A New Type of Hidden Markov Model", arxiv:1102.0899 [Looks like a "to be shot after a fair trial" paper from the abstract, but then, it deserves a fair trial]
Randal Douc, "Non singularity of the asymptotic Fisher information matrix in hidden Markov models", math.ST/0511631
Randal Douc and Eric Moulines, "Asymptotic properties of the maximum likelihood estimation in misspecified Hidden Markov models", arxiv:1110.0356
Thierry Dumont, "Context Tree Estimation in Variable Length Hidden Markov Models", arxiv:1109.0392
Farzad Eskandari and Mohammad R. Meshkani, "Empirical Bayes analysis of log-linear models for a generalized finite stationary Markov chain", Metrika 59 (2004): 173--191 [abstract]
Florence Forbes and Nathalie Peyrard, "Hidden Markov Random Field Model Selection Criteria Based on Mean Field-Like Approximations", IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003): 1089--1101 [PostScript preprint]
Scott D. Foster and Mark V. Bravington, "Graphical Diagnostics for Markov Models for Categorical Data", Journal of Computational and Graphical Statistics forthcoming (2011)
Halina Frydman and Peter Lakner, "Maximum likelihood estimation of hidden Markov processes", Annals of Applied Probability 13 (2003): 1296--1312
Cheng-Der Fuh, "Asymptotic operating characteristics of an optimal change point detection in hidden Markov models", Annals of Statistics 32 (2004): 2305--2339 = math.ST/0503682
Cheng-Der Fuh and Inchi Hu, "Estimation in hidden Markov models via efficient importance sampling", Bernoulli 13 (2007): 492--513, arxiv:0708.4152
Antonio Galves, Charlotte Galves, Nancy L. Garcia, Florencia Leonardi, "Context tree selection and linguistic rhythm retrieval from written texts", arxiv:0902.3619
Antonio Galves, Aurélien Garivier, Elisabeth Gassiat, "Joint estimation of intersecting context tree models", arxiv:1102.0673
Antonio Galves, Florencia Leonardi, "Exponential inequalities for empirical unbounded context trees", arxiv:0710.5900
Nancy L. Garcia, Lucas Moreira, "Stochastically Perturbed Chains of Variable Memory", arxiv:1305.5747
Steven T. Garren and Richard L. Smith, "Estimating the second largest eigenvalue of a Markov transition matrix", Bernoulli 6 (2000): 215--242
Hugo Harari-Kermadec, "Regenerative block empirical likelihood for Markov chains", arxiv:1102.3107
Daniel Hsu, Sham M. Kakade, Tong Zhang, "A Spectral Algorithm for Learning Hidden Markov Models", arxiv:0811.4413
Masashi Inoue and Naonori Ueda, "Exploitation of Unlabeled Sequences in Hidden Markov Models", IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (2003): 1570--1581
H. Ito, S.-I. Amari and K. Kobayashi, "Identifiability of Hidden Markov Information Sources and Their Minimum Degrees of Freedom", IEEE Transactions on Information Theory 38 (1992): 324--333
Dezhe Z. Jin, Alexay A. Kozhevnikov, "A compact statistical model of the song syntax in Bengalese finch", arxiv:1011.2998
Hans R. Künsch, "State Space and Hidden Markov Models", pp. 109--173 in Ole E. Barndorff-Nielsen, David R. Cox and Claudia Klüppelberg (eds.), Complex Stochastic Systems
J. Lember, A. Koloydenko, "Adjusted Viterbi training for hidden Markov models", arxiv:0709.2317
Florencia Leonardi, "Some upper bounds for the rate of convergence of penalized likelihood context tree estimators", arxiv:0701810
E. Locherbach, "Likelihood Ratio Processes for Markovian Particle Systems with Killing and Jumps", Statistical Inference for Stochastic Processes 5 (2002): 153--177
Claudio Macci, "Large Deviations for Empirical Estimators of the Stationary Distribution of a Semi-Markov Process with Finite State Space", Communications in Statistics: Theory and Methods 37 (2008): 3077--3089
A. Martin, Gadiel Seroussi, and Marcelo J. Weinberger, "Type Classes of Context Trees", IEEE Transactions on Information Theory 58 (2012): 4077--4093
Kevin McGoff, Sayan Mukherjee, Andrew Nobel, "Gibbs posterior convergence and the thermodynamic formalism", arxiv:1901.08641
Kevin McGoff, Sayan Mukherjee, Andrew Nobel, Natesh Pillai, "Consistency of maximum likelihood estimation for some dynamical systems", Annals of Statistics 43 (2015): 1--29, arxiv:1306.5603
Kevin McGoff, Andrew B. Nobel, "Variational analysis of inference from dynamical systems", arxiv:1601.05033
Philipp Metzner, Frank Noe and Christof Schutte, "Estimating the sampling error: Distribution of transition matrices and functions of transition matrices for given trajectory data", Physical Review E 80 (2009): 021106 [I presume they have a good reason for not just using the delta method, and/or bootstrapping, but I'll have to read it to see what that is]
Elchanan Mossel, Sébastien Roch, "Learning nonsingular phylogenies and hidden Markov models", Annals of Applied Probability 16 (2006): 583--614, arxiv:cs.LG/0502076
Ursula U. Müller, Anton Schick, Wolfgang Wefelmeyer, "Blockwise Empirical Likelihood and Efficiency for Markov Chains", Journal of Time Series Analysis forthcoming (2025+)
Michael H. Neumann, Efstathios Paparoditis, "Goodness-of-fit tests for Markovian time series models: Central limit theory and bootstrap approximations", Bernoulli 14 (2008): 14--46, arxiv:0803.0835
Alexander O'Neill, Marcus Hutter, Wen Shao, Peter Sunehag, "Adaptive Context Tree Weighting", arxiv:1201.2056
Roberto Imbuzeiro Oliveira, "Stochastic processes with random contexts: a characterization, and adaptive estimators for the transition probabilities", arxiv:1309.2819
Enza Orlandi, Eva Loecherbach, "On the neighborhood radius estimation in Variable-neighborhood Markov Random Fields", arxiv:1002.4850
Adam Paszkiewicz, "When transition count for a Markov chains is a complete sufficient statistic", Statistics and Probability Letters 76 (2006): 757--763 [When the initial and final states are fixed, for example]
Spiridon Penev, Hanxiang Peng, Atnon Schick and Wolfgang Wefelmeyer, "Efficient estimators for functionals of Markov chains with parametric marginals", Statistics and Probability Letters 66 (2004): 335--345
Ricardo Ríos, Luis-Angel Rodríguez, "Penalized estimate of the number of states in Gaussian linear AR with Markov regime", Electronic Journal of Statistics 2 (2008): 1111--1128, arxiv:0807.2726
Amr Sadek and Nikolaos Limnios, "Nonparametric estimation of reliability and survival function for continuous-time finite Markov processes", Journal of Statistical Planning and Inference 133 (2005): 1--21
Manuel S. Santos, "Consistency properties of a simulation-based estimator for dynamic processes", Annals of Applied Probability 20 (2010): 196--213
Anton Schick and Wolfgang Wefelmeyer, "Estimating Joint Distributions of Markov Chains", Statistical Inference for Stochastic Processes 5 (2002): 1--22
Peter Sunehag, Marcus Hutter, "Consistency of Feature Markov Processes", arxiv:1007.2075
V. B. Tadic, "Analyticity, Convergence, and Convergence Rate of Recursive Maximum-Likelihood Estimation in Hidden Markov Models", IEEE Transactions on Information Theory 56 (2010): 6406--6432, arxiv:0904.4264
Zsolt Talata and Tyrone E. Duncan, "BIC Context Tree Estimation for Stationary Ergodic Processes", IEEE Transactions on Information Theory (2011): 3877--3886
Iuliana Teodorescu, "Maximum Likelihood Estimation for Markov Chains", arxiv:0905.4131
M. J. van der Heyden et al., "Testing the Order of Discrete Markov Chains Using Surrogate Data", Physica D 117 (1998): 299--313
Ramon van Hanel, "On the minimal penalty for Markov order estimation", Probability Theory and Related Fields 150 (2011): 709--738, arxiv:0908.3666
Joel Veness, Kee Siong Ng, Marcus Hutter, Michael Bowling, "Context Tree Switching", arxiv:1111.3182
Elodie Vernet, "Posterior consistency for nonparametric Hidden Markov Models with finite state space", arxiv:1311.3092
Martin J. Wainwright, "Inconsistent parameter estimation in Markov random fields: Benefits in the computation-limited setting", cs.LG/0602092
Christian H. Weiss, "Rule generation for categorical time series with Markov assumptions", Statistics and Computing 21 (2011): 1--16 [VLMMs]
James Zhao, "Hidden Markov Models with Multiple Observation Processes", arxiv:1010.1042
L. C. Zhao, C. C. Y. Dorea and C. R. Gonçalves, "On Determination of the Order of a Markov Chain", Statistical Inference for Stochastic Processes 4 (2001): 273--282
Ming-Jie Zhao and Herbert Jaeger, "Making the Error-Controlling Algorithm of Observable Operator Models Constructive", Neural Computation 21 (2009): 3460--3486