Notebooks

## Sequential Decision-Making Under Stochastic Uncertainty

14 Aug 2023 13:44

Yet Another Inadequate Placeholder.

That said... I'm interested in the theory of optimal decision-making, when you need to make multiple decisions over time, and there is non-trivial stochastic uncertainty, either because the effects of your actions are somewhat random, or because you can only coarsely and noisily measure the state of the system you're acting on. I am particularly interested in the extent to which optimal strategies can be learned, in the usual "probably approximately correct" sense of computational learning theory. Here it seems that there is a potentially very important difference between trying to learn an optimal strategy on the basis of merely historical, haphazard data, versus actually performing experiments. In fact, in some sense the best way to learn about the optimal policy may be to experiment with a totally random policy, because the data you gather from such an experiment is totally free of outside, confounding factors. (Similarly, one way to learn about the properties of a nonlinear system is to measure its response to white noise; this is the Wiener method for transducers.)

Finding the optimal strategy turns out to be a very hard problem, both computationally and statistically, and it seems staggeringly unlikely that most human beings, when faced with such situations, respond in anything like the optimal manner. (This is part of the reason things like DSGE models in macroeconomics are crazy.) Or, rather, if we do act optimally, it's with respect to a non-obvious criterion.

Related or subsidiary topics which will also show up here: Partially-observable Markov decision processes, reinforcement learning, etc., etc.

People sometimes distinguish between "risk", which can be represented stochastically, i.e., as a probability distribution, and "uncertainty", where there is simply no basis for assessing frequencies or the like. Decision-making under such strong or genuine uncertainty is a very different problem, better addressed using the tools of low-regret learning.

Recommended, close-ups:
• Paul H. Algoet [Algoet did some brilliant stuff in the general area of information-theoretic approaches to stochastic processes in the early and mid 1990s, and then just stopped publishing. Whatever happened to him?]
• "Universal Schemes for Prediction, Gambling, and Portfolio Selection," Annals of Probability 20 (1992): 901--941 and an important Correction, 23 (1995): 474--478
• "The Strong Law of Large Numbers for Sequential Decisions Under Uncertainty," IEEE Transactions on Information Theory 40 (1994): 609--633
• Doron Blatt, Susan A. Murphy and Ji Zhu, "A-Learning for Approximate Planning", NIPS 2004 [PDF]
• Tilman Börgers and Rajiv Sarin, "Learning Through Reinforcement and Replicator Dynamics", Journal of Economic Theory 77 (1997): 1--14
• Vivek S. Borkar, "Reinforcement Learning in Markovian Evolutionary Games", Advances in Complex Systems 5 (2002): 55--72
• Robert Kleinberg, Alexandru Niculescu-Mizil, Yogeshwer Sharma, "Regret Bounds for Sleeping Experts and Bandits", Machine Learning 80 (2010): 245--272
• R. Andrew McCallum, "Instance-Based Utile Distinctions for Reinforcement Learning with Hidden State", pp. 387--395 of Armand Prieditis and Stuart J. Russell (eds.), Proceedings of the Twelth International Machine Learning Conference (ICML 1995)
• Susan A. Murphy, "Optimal Dynamic Treatment Regimes", Journal of the Royal Statistical Society B 65 (2003): 331--366 [PDF]
• D. I. Simester, P. Sun and J. Tsitsiklis, "Dynamic Catalog Mailing Policies" [PDF]
• Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare, "Deep Reinforcement Learning at the Edge of the Statistical Precipice", arxiv:2108.13264
• John Asmuth, Michael L. Littman, "Learning is planning: near Bayes-optimal reinforcement learning via Monte-Carlo tree search", arxiv:1202.3699
• Marc G. Bellemare, Will Dabney, Mark Rowland, Distributional Reinforcement Learning
• Dimitri P. Bertsekas and John Tsitsiklis, Neuro-Dynamic Programming
• Byron Boots, Geoffrey J. Gordon, "Predictive State Temporal Difference Learning", arxiv:1011.0041, NIPS 23 (2010)
• Sébastien Bubeck, Nicolò Cesa-Bianchi, Sham M. Kakade, "Towards minimax policies for online linear optimization with bandit feedback", arxiv:1202.3079
• Bibhas Chakraborty and Susan A. Murphy, "Dynamic Treatment Regimes", Annual Review of Statistics and Its Application 1 (2014): 447--464
• Nathaniel D. Daw, John P. O'Doherty, Peter Dayan, Ben Seymour and Raymond J. Dolan, "Cortical substrates for exploratory decisions in humans", Nature 441 (2006): 876--879 [Hey, sometimes people do act like reinforcement-learners!]
• A. Philip Dawid and Vanessa Didelez, "Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview", Statistics Surveys 4 (2010): 184--231
• Eyal Even-Dar, Shie Mansour and Yishay Mansour, "Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems", Journal of Machine Learning Research 7 (2006): 1079--1105 [Yay for confidence intervals!]
• Siva Gorantla, Todd Coleman, "Information-Theoretic Viewpoints on Optimal Causal Coding-Decoding Problems", arxiv:1102.0250
• Milos Hauskrecht, "Value-Function Approximation for Partially Observable Markov Decision Processes", Journal of Artificial Intelligence Research 13 (2000): 33--94 [JAIR publishes the full text of articles online for free, but I feel insufficiently motivated to track down the link right now]
• F. Y. Hunt, "Sample path optimality for a Markov optimization problem", Stochastic Processes and their Applications 115 (2005): 769--779
• Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research 4 (1996): 237--285
• Stephen Kelly and Malcolm I. Heywood, "Emergent Solutions to High-Dimensional Multitask Reinforcement Learning", Evolutionary Computation 26 (2018): 347--380
• Ivo Kwee, Marcus Hutter and Juergen Schmidhuber, "Market-Based Reinforcement Learning in Partially Observable Worlds," cs.AI/0105025
• Felix Leibfried, Sergio Pascual-Diaz, Jordi Grau-Moya, "A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment", arxiv:1907.12392
• Susan A. Murphy
• "A Generalization Error for Q-Learning" [PDF]
• "An Experimental Design for the Development of Adaptive Treatment Strategies", Statistics and Medcine forthcoming [PDF]
• Sean Meyn, Control Systems and Reinforcement Learning
• Thomas M. Moerland, Joost Broekens, Aske Plaat and Catholijn M. Jonker, "Model-based Reinforcement Learning: A Survey", Foundations and Trends in Machine Learning 16 (2023): 1--118
• Misha Perepelitsa, "A model of discrete choice based on reinforcement learning under short-term memory", arxiv:1908.06133
• Leonid Peshkin and Sayan Mukherjee, "Bounds on sample size for policy evaluation in Markov environments," cs.LG/0105027
• A. Potapov and M. K. Ali, "Convergence of reinforcement learning algorithms and acceleration of learning," Physical Review E 67 (2003): 026706
• Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming
• Benjamin Recht, "A Tour of Reinforcement Learning: The View from Continuous Control", Annual Review of Control, Robotics, and Autonomous Systems 2 (2019): 253--279
• Philippe Rigollet and Assaf Zeevi, "Nonparametric Bandits with Covariates", arxiv:1003.1630
• Sayanti Roy, Emily Kieson, Charles Abramson, Christopher Crick, "Mutual Reinforcement Learning", arxiv:1907.06725
• Brian Sallans, Reinforcement Learning for Factored Markov Decision Processes [Abstract, download]
• Murray Shanahan, Melanie Mitchell, "Abstraction for Deep Reinforcement Learning", arxiv:2202.05839
• R. L. Stratonovich, Conditional Markov Processes and Their Application to the Theory of Optimal Control
• Istvan Szita and Andras Lorincz, "The many faces of optimism", arxiv:0810.3451 ["exploration-exploitation dilemma ... of reinforcement learning. 'Optimism in the face of uncertainty' and model building play central roles in advanced exploration methods. ... a fast and simple algorithm ... finds a near-optimal policy in polynomial time... experimental evidence that it is robust and efficient..."]
• Alexander G. Tartakovsky, "Asymptotic Optimality of Certain Multihypothesis Sequential Tests: Non-i.i.d. Case", Statistical Inference for Stochastic Processes 1 (1998): 265--295
• John Tsitsiklis and collaborators, Papers on Neuro-Dynamic Programming
• Pascal Van Hentenryck and Russell Bent, Online Stochastic Combinatorial Optimization