Sequential Decision-Making Under Stochastic Uncertainty

Last update: 21 Apr 2025 21:17
First version:

Yet Another Inadequate Placeholder.

That said... I'm interested in the theory of optimal decision-making, when you need to make multiple decisions over time, and there is non-trivial stochastic uncertainty, either because the effects of your actions are somewhat random, or because you can only coarsely and noisily measure the state of the system you're acting on. I am particularly interested in the extent to which optimal strategies can be learned, in the usual "probably approximately correct" sense of computational learning theory. Here it seems that there is a potentially very important difference between trying to learn an optimal strategy on the basis of merely historical, haphazard data, versus actually performing experiments. In fact, in some sense the best way to learn about the optimal policy may be to experiment with a totally random policy, because the data you gather from such an experiment is totally free of outside, confounding factors. (Similarly, one way to learn about the properties of a nonlinear system is to measure its response to white noise; this is the Wiener method for transducers.)

Finding the optimal strategy turns out to be a very hard problem, both computationally and statistically, and it seems staggeringly unlikely that most human beings, when faced with such situations, respond in anything like the optimal manner. (This is part of the reason things like DSGE models in macroeconomics are crazy.) Or, rather, if we do act optimally, it's with respect to a non-obvious criterion.

Related or subsidiary topics which will also show up here: Partially-observable Markov decision processes, reinforcement learning, etc., etc.

People sometimes distinguish between "risk", which can be represented stochastically, i.e., as a probability distribution, and "uncertainty", where there is simply no basis for assessing frequencies or the like. Decision-making under such strong or genuine uncertainty is a very different problem, better addressed using the tools of low-regret learning.

David Blackwell and M. A. Girshick, Theory of Games and Statistical Decisions
Bent Jesper Christensen and Nicholas M. Kiefer, Economic Modeling and Inference [Review: An Optimal Path to a Dead End.]
Benjamin Recht, An Outsider's Tour of Reinforcement Learning
Richard S. Sutton, Reinforcement Learning FAQ
Richard S. Sutton and Andrew G. Barto, Reinforcement Learning

Paul H. Algoet [Algoet did some brilliant stuff in the general area of information-theoretic approaches to stochastic processes in the early and mid 1990s, and then just stopped publishing. Whatever happened to him?]
- "Universal Schemes for Prediction, Gambling, and Portfolio Selection," Annals of Probability 20 (1992): 901--941 and an important Correction, 23 (1995): 474--478
- "The Strong Law of Large Numbers for Sequential Decisions Under Uncertainty," IEEE Transactions on Information Theory 40 (1994): 609--633
Doron Blatt, Susan A. Murphy and Ji Zhu, "A-Learning for Approximate Planning", NIPS 2004 [PDF]
Tilman Börgers and Rajiv Sarin, "Learning Through Reinforcement and Replicator Dynamics", Journal of Economic Theory 77 (1997): 1--14
Vivek S. Borkar, "Reinforcement Learning in Markovian Evolutionary Games", Advances in Complex Systems 5 (2002): 55--72
Robert Kleinberg, Alexandru Niculescu-Mizil, Yogeshwer Sharma, "Regret Bounds for Sleeping Experts and Bandits", Machine Learning 80 (2010): 245--272
R. Andrew McCallum, "Instance-Based Utile Distinctions for Reinforcement Learning with Hidden State", pp. 387--395 of Armand Prieditis and Stuart J. Russell (eds.), Proceedings of the Twelth International Machine Learning Conference (ICML 1995)
Susan A. Murphy, "Optimal Dynamic Treatment Regimes", Journal of the Royal Statistical Society B 65 (2003): 331--366 [PDF]
D. I. Simester, P. Sun and J. Tsitsiklis, "Dynamic Catalog Mailing Policies" [PDF]

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare, "Deep Reinforcement Learning at the Edge of the Statistical Precipice", arxiv:2108.13264
John Asmuth, Michael L. Littman, "Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search", arxiv:1202.3699
Marc G. Bellemare, Will Dabney, Mark Rowland, Distributional Reinforcement Learning
Dimitri P. Bertsekas and John Tsitsiklis, Neuro-Dynamic Programming
Byron Boots, Geoffrey J. Gordon, "Predictive State Temporal Difference Learning", arxiv:1011.0041, NIPS 23 (2010)
Sébastien Bubeck, Nicolò Cesa-Bianchi, Sham M. Kakade, "Towards minimax policies for online linear optimization with bandit feedback", arxiv:1202.3079
Bibhas Chakraborty and Susan A. Murphy, "Dynamic Treatment Regimes", Annual Review of Statistics and Its Application 1 (2014): 447--464
Nathaniel D. Daw, John P. O'Doherty, Peter Dayan, Ben Seymour and Raymond J. Dolan, "Cortical substrates for exploratory decisions in humans", Nature 441 (2006): 876--879 [Hey, sometimes people do act like reinforcement-learners!]
A. Philip Dawid and Vanessa Didelez, "Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview", Statistics Surveys 4 (2010): 184--231
Eyal Even-Dar, Shie Mansour and Yishay Mansour, "Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems", Journal of Machine Learning Research 7 (2006): 1079--1105 [Yay for confidence intervals!]
Noah Golowich, Ankur Moitra, Dhruv Rohatgi, "Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning", arxiv:2404.03774
Siva Gorantla, Todd Coleman, "Information-Theoretic Viewpoints on Optimal Causal Coding-Decoding Problems", arxiv:1102.0250
Milos Hauskrecht, "Value-Function Approximation for Partially Observable Markov Decision Processes", Journal of Artificial Intelligence Research 13 (2000): 33--94 [JAIR publishes the full text of articles online for free, but I feel insufficiently motivated to track down the link right now]
F. Y. Hunt, "Sample path optimality for a Markov optimization problem", Stochastic Processes and their Applications 115 (2005): 769--779
Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research 4 (1996): 237--285
Stephen Kelly and Malcolm I. Heywood, "Emergent Solutions to High-Dimensional Multitask Reinforcement Learning", Evolutionary Computation 26 (2018): 347--380
Ivo Kwee, Marcus Hutter and Juergen Schmidhuber, "Market-Based Reinforcement Learning in Partially Observable Worlds," cs.AI/0105025
Cassidy Laidlaw, Stuart J Russell, Anca Dragan, "Bridging RL Theory and Practice with the Effective Horizon", NeurIPS 2023
Felix Leibfried, Sergio Pascual-Diaz, Jordi Grau-Moya, "A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment", arxiv:1907.12392
Susan A. Murphy
- "A Generalization Error for Q-Learning" [PDF]
- "An Experimental Design for the Development of Adaptive Treatment Strategies", Statistics and Medcine forthcoming [PDF]
Sean Meyn, Control Systems and Reinforcement Learning
Thomas M. Moerland, Joost Broekens, Aske Plaat and Catholijn M. Jonker, "Model-based Reinforcement Learning: A Survey", Foundations and Trends in Machine Learning 16 (2023): 1--118
Andrew Patterson, Samuel Neumann, Martha White, Adam White, "Empirical Design in Reinforcement Learning", arxiv:2304.01315 [They mean experimental design...]
Misha Perepelitsa, "A model of discrete choice based on reinforcement learning under short-term memory", arxiv:1908.06133
Leonid Peshkin and Sayan Mukherjee, "Bounds on sample size for policy evaluation in Markov environments," cs.LG/0105027
A. Potapov and M. K. Ali, "Convergence of reinforcement learning algorithms and acceleration of learning," Physical Review E 67 (2003): 026706
Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming
Benjamin Recht, "A Tour of Reinforcement Learning: The View from Continuous Control", Annual Review of Control, Robotics, and Autonomous Systems 2 (2019): 253--279
Philippe Rigollet and Assaf Zeevi, "Nonparametric Bandits with Covariates", arxiv:1003.1630
Sayanti Roy, Emily Kieson, Charles Abramson, Christopher Crick, "Mutual Reinforcement Learning", arxiv:1907.06725
Brian Sallans, Reinforcement Learning for Factored Markov Decision Processes [Abstract, download]
Murray Shanahan, Melanie Mitchell, "Abstraction for Deep Reinforcement Learning", arxiv:2202.05839
R. L. Stratonovich, Conditional Markov Processes and Their Application to the Theory of Optimal Control
Istvan Szita and Andras Lorincz, "The many faces of optimism", arxiv:0810.3451 ["exploration-exploitation dilemma ... of reinforcement learning. 'Optimism in the face of uncertainty' and model building play central roles in advanced exploration methods. ... a fast and simple algorithm ... finds a near-optimal policy in polynomial time... experimental evidence that it is robust and efficient..."]
Alexander G. Tartakovsky, "Asymptotic Optimality of Certain Multihypothesis Sequential Tests: Non-i.i.d. Case", Statistical Inference for Stochastic Processes 1 (1998): 265--295
John Tsitsiklis and collaborators, Papers on Neuro-Dynamic Programming
Pascal Van Hentenryck and Russell Bent, Online Stochastic Combinatorial Optimization