## Sequential Decision-Making Under Stochastic Uncertainty

*27 Feb 2017 16:30*

Yet Another Inadequate Placeholder.

That said... I'm interested in the theory of optimal decision-making, when you need to make multiple decisions over time, and there is non-trivial stochastic uncertainty, either because the effects of your actions are somewhat random, or because you can only coarsely and noisily measure the state of the system you're acting on. I am particularly interested in the extent to which optimal strategies can be learned, in the usual "probably approximately correct" sense of computational learning theory. Here it seems that there is a potentially very important difference between trying to learn an optimal strategy on the basis of merely historical, haphazard data, versus actually performing experiments. In fact, in some sense the best way to learn about the optimal policy may be to experiment with a totally random policy, because the data you gather from such an experiment is totally free of outside, confounding factors. (Similarly, one way to learn about the properties of a nonlinear system is to measure its response to white noise; this is the Wiener method for transducers.)

Finding the optimal strategy turns out to be a very hard problem, both computationally and statistically, and it seems staggeringly unlikely that most human beings, when faced with such situations, respond in anything like the optimal manner. (This is part of the reason things like DSGE models in macroeconomics are crazy.) Or, rather, if we do act optimally, it's with respect to a non-obvious criterion.

Related or subsidiary topics which will also show up here: Partially-observable Markov decision processes, reinforcement learning, etc., etc.

People sometimes distinguish between "risk", which can be represented
stochastically, i.e., as a probability distribution, and "uncertainty", where
there is simply no basis for assessing frequencies or the like.
Decision-making under such strong or genuine uncertainty is a *very
different problem*...

See also: Control Theory; Decision Theory; Filtering and State Estimation; Learning in Games; Low-Regret Learning; Neural Control of Action; Statistics; Universal Prediction

- Recommended, big picture:
- David Blackwell and M. A. Girshick, Theory of Games and Statistical Decisions
- Bent Jesper Christensen and Nicholas M. Kiefer, Economic Modeling and Inference [Review: An Optimal Path to a Dead End.]
- Rich Sutton, Reinforcement Learning FAQ
- Richard S. Sutton and Andrew G. Barto, Reinforcement Learning [Book website, with draft text online]

- Recommended, close-ups:
- Paul H. Algoet [Algoet did some brilliant stuff in the general area
of information-theoretic approaches to
stochastic processes in the early and mid 1990s, and then just stopped
publishing. Whatever happened to him?]
- "Universal Schemes for Prediction, Gambling, and Portfolio
Selection," Annals of Probability
**20**(1992): 901--941 and an important Correction,**23**(1995): 474--478 - "The Strong Law of Large Numbers for Sequential Decisions
Under Uncertainty," IEEE Transactions on Information
Theory
**40**(1994): 609--633

- "Universal Schemes for Prediction, Gambling, and Portfolio
Selection," Annals of Probability
- Doron Blatt, Susan A. Murphy and Ji Zhu, "A-Learning for Approximate Planning", NIPS 2004 [PDF]
- Robert Kleinberg, Alexandru Niculescu-Mizil, Yogeshwer Sharma,
"Regret Bounds for Sleeping Experts and Bandits", Machine Learning
**80**(2010): 245--272 - Susan
A. Murphy, "Optimal Dynamic Treatment Regimes", Journal of the Royal
Statistical Society B
**65**(2003): 331--366 [PDF] - D. I. Simester, P. Sun and J. Tsitsiklis, "Dynamic Catalog Mailing Policies" [PDF]

- To read:
- Dimitri P. Bertsekas and John Tsitsiklis, Neuro-Dynamic Programming
- Byron Boots, Geoffrey J. Gordon, "Predictive State Temporal Difference Learning", arxiv:1011.0041, NIPS 23 (2010)
- Sébastien Bubeck, Nicolò Cesa-Bianchi, Sham M. Kakade, "Towards minimax policies for online linear optimization with bandit feedback", arxiv:1202.3079
- Nathaniel D. Daw, John P. O'Doherty, Peter Dayan, Ben Seymour and
Raymond J. Dolan, "Cortical substrates for exploratory decisions in humans",
Nature
**441**(2006): 876--879 [Hey, sometimes people*do*act like reinforcement-learners!] - A. Philip Dawid and Vanessa Didelez, "Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview", Statistics Surveys
**4**(2010): 184--231 - Eyal Even-Dar, Shie Mansour and Yishay Mansour, "Action Elimination
and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning
Problems", Journal of
Machine Learning Research
**7**(2006): 1079--1105 [Yay for confidence intervals!] - Milos Hauskrecht, "Value-Function Approximation for Partially
Observable Markov Decision Processes", Journal of Artificial Intelligence
Research
**13**(2000): 33--94 [JAIR publishes the full text of articles online for free, but I feel insufficiently motivated to track down the link right now] - F. Y. Hunt, "Sample path optimality for a Markov optimization
problem", Stochastic Processes
and their Applications
**115**(2005): 769--779 - Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore,
"Reinforcement Learning: A Survey," Journal of Artificial Intelligence
Research
**4**(1996): 237--285 - Ivo Kwee, Marcus Hutter and Juergen Schmidhuber, "Market-Based Reinforcement Learning in Partially Observable Worlds," cs.AI/0105025
- Susan A. Murphy
- Leonid Peshkin and Sayan Mukherjee, "Bounds on sample size for policy evaluation in Markov environments," cs.LG/0105027
- A. Potapov and M. K. Ali, "Convergence of reinforcement learning
algorithms and acceleration of learning," Physical Review E
**67**(2003): 026706 - Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming
- Philippe Rigollet and Assaf Zeevi, "Nonparametric Bandits with Covariates", arxiv:1003.1630
- Brian Sallans, Reinforcement Learning for Factored Markov Decision Processes [Abstract, download]
- R. L. Stratonovich, Conditional Markov Processes and Their Application to the Theory of Optimal Control
- Istvan Szita and Andras Lorincz, "The many faces of optimism", arxiv:0810.3451 ["exploration-exploitation dilemma ... of reinforcement learning. 'Optimism in the face of uncertainty' and model building play central roles in advanced exploration methods. ... a fast and simple algorithm ... finds a near-optimal policy in polynomial time... experimental evidence that it is robust and efficient..."]
- Alexander G. Tartakovsky, "Asymptotic Optimality of Certain
Multihypothesis Sequential Tests: Non-i.i.d. Case", Statistical Inference
for Stochastic Processes
**1**(1998): 265--295 - John Tsitsiklis and collaborators, Papers on Neuro-Dynamic Programming
- Pascal Van Hentenryck and Russell Bent, Online Stochastic Combinatorial Optimization