## Statistical Inference for Markov and Hidden Markov Models

*11 Oct 2023 05:07*

I am concerned here with inferring the parameters and/or the structure of the model, not with the estimation of the hidden state in a hidden Markov model with known parameters; that problem falls under filtering.

Parameter inference (what machine learning types would call "learning") given a known, fixed structure. Determining the structure ("discovery"). Order estimation as a particular case of discovery, and of model selection. (Discovery may need its own notebook soon.)

Markov models naturally form exponential families; what classes of partially-observable Markov processes also do?

- See also:
- Inference for Stochastic Differential Equations (strictly speaking a sub-topic, but one with enough special features to get separate treatment)
- Markov models
- Mixture Models [An HMM with a finite number of hidden states is like a mixture model with serial dependence between observations...]
- Prediction Processes; Markovian (and Conceivably Causal) Representations of Stochastic Processes
- Statistics
- Time Series
- Universal Prediction Algorithms

- Recommended (big picture):
- Patrick Billingsley, Statistical Inference for Markov Chains [Foundational.]
- Olivier Cappé Eric Moulines and Tobias Ryden,
Inference in Hidden Markov Models [This is a superb book, treating
all the main statistical problems connected with HMMs with both clarity and
rigor. There is a chapter/appendix which reminds the reader who has forgotten
about Markov chains on general state spaces, but remembers measure-theoretic
probability
*very well*, about their properties. (The idea that the reader of a book on HMMs may not know what the Viterbi algorithm is, but will*of course*recall the Hahn-Jordan decomposition, strikes me as very much a product of the French school of probability theory --- from which I have learned much! But it's not so far gone as to make the whole thing an exercise in the "general theory of processes".)] - Andrew Fraser, Hidden Markov Models and Dynamical Systems [Review: The Statistics of Moving Shadows]
- Peter Guttorp, Stochastic Modelling of Scientific Data
- Kevin McGoff, Sayan Mukherjee, and Natesh Pillai, "Statistical inference for dynamical systems: A review", Statistics Surveys
**9**(2015): 209--252, arxiv:1204.6265

- Recommended (close-ups):
- Animashree Anandkumar, Daniel Hsu, Sham M. Kakade, "A Method of Moments for Mixture Models and Hidden Markov Models", arxiv:1203.0683
- Peter J. Bickel and Ya'acov Ritov, "Inference in hidden Markov models I: Local asymptotic normality in the stationary case", Bernoulli
**2**(1996): 199--228 [Not sure if Part II ever appeared...] - David Blackwell and Lambert Koopmans, "On the Identifiability
Problem for Functions of Finite Markov Chains", The Annals of
Mathematical Statistics
**28**(1957): 1011--1015 [An old, but very clear, paper on the problems presented by what we now call hidden Markov models] - Peter Bühlmann and Abraham J. Wyner, "Variable Length Markov
Chains", The Annals of Statistics
**27**(1999): 480--513 [Preprint available as Berkeley statistics department technical report 479] - Olivier Cappé "Online EM Algorithm for Hidden Markov
Models", Journal of Computational and Graphical Statistics
**20**(2011): 728--749, arxiv:0908.2359 - George Cybenko and Valentino Crespi, "Learning Hidden Markov Models Using
Nonnegative Matrix Factorization", IEEE
Transactions on Information Theory
**57**(2011): 3963--3970, arxiv:0809.4086 [Though it contains an error, at least in the preprint version, about the capacities of our CSSR algorithm --- we can get model structures right with much less data than they think, though we presented examples using more data than was strictly needed.] - Randal Douc, Eric Moulines, Jimmy Olsson, and Ramon van Handel, "Consistency of the maximum likelihood estimator for general hidden Markov models",
Annals of Statistics
**39**(2011): 474--513 - P. Dupont, F. Denis and Y. Esposito, "Links between probabilistic
automata and hidden Markov models: probability distributions, learning models
and induction algorithms", Pattern
Recognition
**38**(2005): 1349--1371 - Julian Feldman and Joe F. Hanna",
"The Structure of Responses to a Sequence of Binary Events", Journal of Mathematical Psychology
**3**(1966): 371--387 - J. D. Foulkes, "A Class of Machines which Determine the Statistical Structure of a Sequence of Characters", pp. 66-73 of vol. 4 of the Western Electronics [Wescon] Convention Record, 1959 (Institute of Radio Engineers) [PDF]
- Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, Alan S. Willsky, "A sticky HDP-HMM with application to speaker diarization", Annals
of Applied Statistics
**5**(2011): 1020--1056, arxiv:0905.2592 - Sandro Gallo and Florencia Leonardi, "Nonparametric statistical inference for the context tree of a stationary ergodic process",
Electronic Journal of Statistics
**9**(2015): 2076--2098, arxiv:1411.7650 - Subhashis Ghosal and Yongqiang Tang, "Bayesian Consistency for Markov Processes", Sankhya
**68**(2006): 227--239 - Subhashis Ghosal and Aad van der Vaart, "Convergence Rates of Posterior Distributions for Non-IID
Observations", Annals of Statistics
**35**(2007): 192--223, arxiv:0708.0491 - Christopher C. Heyde, Quasi-Likelihood and Its Applications: A General Approach to Optimal Parameter Estimation
- J. D. Kalbfleisch and J. F. Lawless, "Least-Squares Estimation of Transition Probabilities from Aggregate Data", The Canadian Journal of Statistics / La Revue Canadienne de Statistique
**12**(1984): 169--182 [JSTOR] - Mladen Kolar, Le Song, Amr Ahmed, Eric P. Xing, "Estimating time-varying networks", Annals of Applied Statistics
**4**(2010): 94--123, arxiv:0812.5087 - Kevin McGoff, Andrew B. Nobel, "Empirical risk minimization and complexity of dynamical models",
Annals of Statistics
**48**(2020): 2031--2054, arxiv:1611.06173 - Gusztav Morvai and Benjamin Weiss
- "Estimating the Lengths of Memory
Words", IEEE
Transactions on Information Theory
**54**(2008): 3804--3807 - "On estimating the memory for finitarily Markovian processes", Ann. I. H. Poincaré-PR
**43**(2007): 15--30 arxiv:0712.0105 - "Order estimation of Markov chains", IEEE Transactions
on Information Theory
**51**(2005): 1496--1497, arxiv:0711.0472

- "Estimating the Lengths of Memory
Words", IEEE
Transactions on Information Theory
- Ioannis Papageorgiou, Ioannis Kontoyiannis, "Context-tree weighting for real-valued time series: Bayesian inference with hierarchical mixture models", arxiv:2106.03023
- F. Papangelou, "Large Deviations and the Bayesian Estimation of Higher-Order Markov Transition Functions ", Journal of Applied Probability
**33**(1996): 18--27 [JSTOR] - Yuval Peres and Paul Shields, "Two new Markov order estimators", math.ST/0506080
- David Pfau, Nicholas Bartlett and Frank Wood, "Probabilistic
Deterministic Infinite Automata", NIPS 23 (2010) [PDF. Really,
mixtures over
*finite*automata with arbitrarily many states.] - George G. Roussas, Contiguity of Probability Measures: Some Applications in Statistics [Asymptotic theory of approximation, estimation and testing, for discrete-time Markov processes on fairly general state-spaces. Mini-review]
- Christopher C. Strelioff, James P. Crutchfield, Alfred W. Hubler, "Inferring Markov Chains: Bayesian Estimation, Model Comparison, Entropy Rate, and Out-of-class Modeling", math.ST/0703715
- Daniel R. Upper, Theory and Algorithms for Hidden Markov Models and Generalized Hidden Markov Models [Ph.D. thesis, math dept., Berkeley, 1997; online]

- Not
- E. Racca, F. Laio, D. Poggi and L. Ridolfi, "Test to determine the
Markov order of a time
series", Physical
Review E
**75**(2007): 011126 [The test is to linearly regress*x(t+1)*on*x(t)*,*x(t-1)*, etc., out to some finite order, and see how far back you have to go before the regression coefficients are insignificantly different from zero. This is not crazy as a first cut idea, but it's not generally valid, and in fact fails for the logistic map.]

*quite*recommended:

- S. S. Melnyk, O. V. Usatenko, V. A. Yampol'skii and V. A. Golick, "Competition between Two Kinds of Correlations in Literary Texts", physics/0402042 [This has surely got to be one of the ugliest ways of parameterizing a Markov chain I have seen; it's a miracle they don't get probabilities greater than 1, if indeed they don't.]

*Definitely*not recommended:

- Modesty forbids me to recommend:
- CRS, "Dynamics of Bayesian Updating with Dependent Data and Misspecified Models", arxiv:0901.1342
- CRS and Kristina Lisa Klinkner, "Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences", cs.LG/0406011 [CSSR]

- To read:
- Roberto C. Alamino, Nestor Caticha, "Online Learning in Discrete Hidden Markov Models", arxiv:0708.2377
- Armen Allahverdyan, Aram Galstyan, "On Maximum a Posteriori Estimation of Hidden Markov Processes", UAI 2009, arxiv:0906.1980
- Enrique E. Alvarez, "Estimation in Stationary Markov Renewal
Processes, with Application to Earthquake Forecasting in Turkey", Methodology and
Computing in Applied Probability
**7**(2005): 119--130 - Animashree Anandkumar, Kamalika Chaudhuri, Daniel Hsu, Sham M. Kakade, Le Song, Tong Zhang, "Spectral Methods for Learning Multivariate Latent Tree Structure", arxiv:1107.1283
- Sofia Andersson and Tobias Rydén, "Subspace estimation and
prediction methods for hidden Markov
models", Annals
of Statistics
**37**(2009): 4131--4152 - Ana Arribas-Gil, Elisabeth Gassiat and Catherine Matias, "Parameter estimation in pair hidden Markov models", math.ST/0509280
- Yves F. Atchade, "Estimation of Network structures from partially observed Markov random fields", arxiv:1108.2835
- A. R. Baigorri, C. R. Goncalves, P. A. A. Resende, "Markov Chain Order Estimation and Relative Entropy", arxiv:0910.0264
- Alexandre Belloni, Roberto I. Oliveira, "Approximate group context tree: applications to dynamic programming and dynamic choice models", arxiv:1107.0312
- Patrice Bertail and Stéphan Clémençon,
"Edgeworth expansions of suitably normalized sample mean statistics for atomic
Markov chains", Probability Theory and
Related Fields
**130**(2004): 388--414 [I need to learn more about Edgeworth expansions anyway] - Byron Boots, "Spectral Approaches to Learning Predictive Representations" [Thesis proposal, CMU, 2011]
- Byron Boots, Sajid M. Siddiqi, Geoffrey J. Gordon, "Closing the Learning-Planning Loop with Predictive State Representations", arxiv:0912.2385
- Jose Borges and Mark Levene, "Evaluating Variable Length Markov Chain Models for Analysis of User Web Navigation Sessions", cs.AI/0606115
- J. Borwanker, G. Kallianpur and B. L. S. Prakasa Rao, "The
Bernstein-von Mises Theorem for Markov Processes", The Annals of
Mathematical Statistics
**42**(1971): 1241--1253 [JSTOR] - Carles Bretó, Daihai He, Edward L. Ionides, Aaron A. King, "Time series analysis via mechanistic models", Annals of Applied Statistics
**3**(2009): 319--348, arxiv:0802.0021 - Zhiyi Chi, "Effects of statistical dependence on multiple testing under a hidden Markov model", Annals of Statistics
**39**(2011): 439--473 - Pavel Chigansky, "Maximum Likelihood Estimator for Hidden Markov Models in continuous time",
Statistical Inference for Stochastic Processes
**12**(2009): 139--163, arxiv:0707.0271 - Antonio Anastasio Bruto da Costa, Pallab Dasgupta, "Learning Temporal Causal Sequence Relationships from Real-Time Time-Series", arxiv:1905.12262
- Christos Dimitrakakis, "Contex models on sequences of covers", arxiv:1005.2263
- C. C. Y. Dorea and L. C. Zhao, "Nonparametric Density Estimation
in Hidden Markov Models", Statistical Inference for Stochastic
Processes
**5**(2002): 55--64 - Michael DelRose, Christian Wagner, Philip Frederick, "Evidence Feed Forward Hidden Markov Model: A New Type of Hidden Markov Model", arxiv:1102.0899 [Looks like a "to be shot after a fair trial" paper from the abstract, but then, it deserves a fair trial]
- Randal Douc, "Non singularity of the asymptotic Fisher information matrix in hidden Markov models", math.ST/0511631
- Randal Douc and Eric Moulines, "Asymptotic properties of the maximum likelihood estimation in misspecified Hidden Markov models", arxiv:1110.0356
- Thierry Dumont, "Context Tree Estimation in Variable Length Hidden Markov Models", arxiv:1109.0392
- Farzad Eskandari and Mohammad R. Meshkani, "Empirical Bayes
analysis of log-linear models for a generalized finite stationary Markov
chain", Metrika
**59**(2004): 173--191 [abstract] - Florence Forbes and Nathalie Peyrard, "Hidden Markov Random Field
Model Selection Criteria Based on Mean Field-Like Approximations", IEEE Transactions on
Pattern Analysis and Machine Intelligence
**25**(2003): 1089--1101 [PostScript preprint] - Scott D. Foster and Mark V. Bravington, "Graphical Diagnostics for Markov Models for Categorical Data", Journal of Computational and Graphical Statistics forthcoming (2011)
- Halina Frydman and Peter Lakner, "Maximum likelihood estimation of
hidden Markov
processes", Annals of
Applied Probability
**13**(2003): 1296--1312 - Cheng-Der Fuh, "Asymptotic operating characteristics of an optimal
change point detection in hidden Markov models", Annals of
Statistics
**32**(2004): 2305--2339 = math.ST/0503682 - Cheng-Der Fuh and Inchi Hu, "Estimation in hidden Markov models via efficient importance sampling", Bernoulli
**13**(2007): 492--513, arxiv:0708.4152 - Antonio Galves, Charlotte Galves, Nancy L. Garcia, Florencia Leonardi, "Context tree selection and linguistic rhythm retrieval from written texts", arxiv:0902.3619
- Antonio Galves, Aurélien Garivier, Elisabeth Gassiat, "Joint estimation of intersecting context tree models", arxiv:1102.0673
- Antonio Galves, Florencia Leonardi, "Exponential inequalities for empirical unbounded context trees", arxiv:0710.5900
- Nancy L. Garcia, Lucas Moreira, "Stochastically Perturbed Chains of Variable Memory", arxiv:1305.5747
- Steven T. Garren and Richard L. Smith, "Estimating the second largest eigenvalue of a Markov transition matrix", Bernoulli
**6**(2000): 215--242 - Hugo Harari-Kermadec, "Regenerative block empirical likelihood for Markov chains", arxiv:1102.3107
- Daniel Hsu, Sham M. Kakade, Tong Zhang, "A Spectral Algorithm for Learning Hidden Markov Models", arxiv:0811.4413
- Masashi Inoue and Naonori Ueda, "Exploitation of Unlabeled
Sequences in Hidden Markov Models", IEEE Transactions on Pattern Analysis
and Machine Intelligence
**25**(2003): 1570--1581 - H. Ito, S.-I. Amari and K. Kobayashi, "Identifiability of Hidden
Markov Information Sources and Their Minimum Degrees of Freedom", IEEE
Transactions on Information Theory
**38**(1992): 324--333 - Dezhe Z. Jin, Alexay A. Kozhevnikov, "A compact statistical model of the song syntax in Bengalese finch", arxiv:1011.2998
- Hans R. Künsch, "State Space and Hidden Markov Models", pp. 109--173 in Ole E. Barndorff-Nielsen, David R. Cox and Claudia Klüppelberg (eds.), Complex Stochastic Systems
- J. Lember, A. Koloydenko, "Adjusted Viterbi training for hidden Markov models", arxiv:0709.2317
- Florencia Leonardi, "Some upper bounds for the rate of convergence of penalized likelihood context tree estimators", arxiv:0701810
- E. Locherbach, "Likelihood Ratio Processes for Markovian Particle
Systems with Killing and Jumps", Statistical Inference for Stochastic
Processes
**5**(2002): 153--177 - Claudio Macci, "Large Deviations for Empirical Estimators of the Stationary Distribution of a Semi-Markov Process with Finite State Space",
Communications in
Statistics: Theory and Methods
**37**(2008): 3077--3089 - A. Martin, Gadiel Seroussi, and Marcelo J. Weinberger, "Type Classes of Context Trees",
IEEE Transactions on Information Theory
**58**(2012): 4077--4093 - Kevin McGoff, Sayan Mukherjee, Andrew Nobel, "Gibbs posterior convergence and the thermodynamic formalism", arxiv:1901.08641
- Kevin McGoff, Sayan Mukherjee, Andrew Nobel, Natesh Pillai, "Consistency of maximum likelihood estimation for some dynamical systems", Annals of Statistics
**43**(2015): 1--29, arxiv:1306.5603 - Kevin McGoff, Andrew B. Nobel, "Variational analysis of inference from dynamical systems", arxiv:1601.05033
- Philipp Metzner, Frank Noe and Christof Schutte, "Estimating
the sampling error: Distribution of transition matrices and functions of transition matrices for given trajectory data", Physical Review E
**80**(2009): 021106 [I presume they have a good reason for not just using the delta method, and/or bootstrapping, but I'll have to read it to see what that is] - Elchanan Mossel, Sébastien Roch, "Learning nonsingular phylogenies and hidden Markov models", Annals of Applied Probability
**16**(2006): 583--614, arxiv:cs.LG/0502076 - Michael H. Neumann, Efstathios Paparoditis, "Goodness-of-fit tests for Markovian time series models: Central limit theory and bootstrap approximations", Bernoulli
**14**(2008): 14--46, arxiv:0803.0835 - Alexander O'Neill, Marcus Hutter, Wen Shao, Peter Sunehag, "Adaptive Context Tree Weighting", arxiv:1201.2056
- Roberto Imbuzeiro Oliveira, "Stochastic processes with random contexts: a characterization, and adaptive estimators for the transition probabilities", arxiv:1309.2819
- Enza Orlandi, Eva Loecherbach, "On the neighborhood radius estimation in Variable-neighborhood Markov Random Fields", arxiv:1002.4850
- Adam Paszkiewicz, "When transition count for a Markov chains is a
complete sufficient statistic", Statistics and
Probability Letters
**76**(2006): 757--763 [When the initial and final states are fixed, for example] - Spiridon Penev, Hanxiang Peng, Atnon Schick and Wolfgang
Wefelmeyer, "Efficient estimators for functionals of Markov chains with
parametric marginals", Statistics and Probability Letters
**66**(2004): 335--345 - Ricardo Ríos, Luis-Angel Rodríguez,
"Penalized estimate of the number of states in Gaussian linear AR with Markov regime", Electronic Journal of Statistics
**2**(2008): 1111--1128, arxiv:0807.2726 - Amr Sadek and Nikolaos Limnios, "Nonparametric estimation of
reliability and survival function for continuous-time finite Markov
processes", Journal of
Statistical Planning and Inference
**133**(2005): 1--21 - Manuel S. Santos, "Consistency properties of a simulation-based estimator for dynamic processes", Annals of Applied Probability
**20**(2010): 196--213 - Anton Schick and Wolfgang Wefelmeyer, "Estimating Joint
Distributions of Markov Chains", Statistical Inference for Stochastic
Processes
**5**(2002): 1--22 - Peter Sunehag, Marcus Hutter, "Consistency of Feature Markov Processes", arxiv:1007.2075
- V. B. Tadic, "Analyticity, Convergence, and Convergence Rate of Recursive Maximum-Likelihood Estimation in Hidden Markov Models",
IEEE Transactions on Information Theory
**56**(2010): 6406--6432, arxiv:0904.4264 - Zsolt Talata and Tyrone E. Duncan, "BIC Context Tree Estimation for Stationary Ergodic Processes", IEEE Transactions on Information Theory
**(2011): 3877--3886** - Iuliana Teodorescu, "Maximum Likelihood Estimation for Markov Chains", arxiv:0905.4131
- M. J. van der Heyden et al., "Testing the Order of Discrete Markov
Chains Using Surrogate Data", Physica D
**117**(1998): 299--313 - Ramon van Hanel, "On the minimal penalty for Markov order estimation", Probability Theory and Related Fields
**150**(2011): 709--738, arxiv:0908.3666 - Joel Veness, Kee Siong Ng, Marcus Hutter, Michael Bowling, "Context Tree Switching", arxiv:1111.3182
- Elodie Vernet, "Posterior consistency for nonparametric Hidden Markov Models with finite state space", arxiv:1311.3092
- Martin J. Wainwright, "Inconsistent parameter estimation in Markov random fields: Benefits in the computation-limited setting", cs.LG/0602092
- Christian H. Weiss, "Rule generation for categorical time
series with Markov assumptions", Statistics
and Computing
**21**(2011): 1--16 [VLMMs] - James Zhao, "Hidden Markov Models with Multiple Observation Processes", arxiv:1010.1042
- L. C. Zhao, C. C. Y. Dorea and C. R. Gonçalves, "On
Determination of the Order of a Markov Chain", Statistical Inference
for Stochastic Processes
**4**(2001): 273--282 - Ming-Jie Zhao and Herbert Jaeger, "Making the Error-Controlling Algorithm of Observable Operator Models Constructive", Neural Computation
**21**(2009): 3460--3486