Cosma's Notebooks
http://bactra.org/notebooks

Sequential Decision-Making Under Stochastic Uncertainty
http://bactra.org/notebooks/2019/09/24#sequential-decisions
<P>Yet Another Inadequate Placeholder.
<P>That said... I'm interested in the theory of optimal decision-making, when
you need to make multiple decisions over time, and there is non-trivial
stochastic uncertainty, either because the effects of your actions are somewhat
random, or because you can only coarsely and noisily measure the state of the
system you're acting on. I am particularly interested in the extent to which
optimal strategies can be learned, in the usual "probably approximately
correct" sense of <a href="learning-theory.html">computational learning
theory</a>. Here it seems that there is a potentially very important
difference between trying to learn an optimal strategy on the basis of merely
historical, haphazard data, versus actually performing experiments. In fact,
in some sense the best way to learn about the optimal policy may be to
experiment with a totally random policy, because the data you gather from such
an experiment is totally free of outside, confounding factors. (Similarly, one
way to learn about the properties of a nonlinear system is to measure its
response to white noise; this is the <a href="wiener.html">Wiener</a> method
for <a href="transducers.html">transducers</a>.)
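The virtue of a uniformly random policy can be made concrete: since every action's probability is known (and constant), the value of <em>any</em> candidate policy can be estimated from the logged data by importance weighting, with no correction for confounding. Here is a minimal sketch for a toy contextual bandit; the reward model, the policies, and all names are invented for illustration only:

```python
import random

random.seed(0)
ACTIONS = [0, 1, 2]

def reward(context, action):
    # Toy stochastic reward: the action matching the context is best.
    base = 1.0 if action == context % 3 else 0.2
    return base + random.gauss(0, 0.1)

# 1. Gather data under a uniformly random policy: each action has
#    known propensity 1/|ACTIONS|, so nothing confounds action and context.
log = []
for t in range(10000):
    context = random.randrange(3)
    action = random.choice(ACTIONS)
    log.append((context, action, reward(context, action)))

# 2. Off-policy evaluation of a target policy pi(action | context) by
#    importance sampling: weight each logged reward by pi / (1/|ACTIONS|).
def policy_value(pi, log):
    return sum(r * pi(x, a) * len(ACTIONS) for (x, a, r) in log) / len(log)

good = lambda x, a: 1.0 if a == x % 3 else 0.0        # matches the reward model
bad = lambda x, a: 1.0 if a == (x + 1) % 3 else 0.0   # systematically off by one

print(policy_value(good, log))  # close to 1.0
print(policy_value(bad, log))   # close to 0.2
```

The same estimator applied to data logged under a non-random, context-dependent policy would require knowing (or modeling) the logging propensities; the random experiment makes them known by construction.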
<P>Finding the optimal strategy turns out to be a very hard problem, both
computationally and statistically, and it seems staggeringly unlikely that most
human beings, when faced with such situations, respond in anything like the
optimal manner. (This is part of the reason things
like <a href="dsges.html">DSGE models</a> in macroeconomics are crazy.) Or,
rather, if we do act optimally, it's with respect to a non-obvious criterion.
<P>Related or subsidiary topics which will also show up here:
Partially-observable <a href="markov.html">Markov</a> decision processes,
reinforcement learning, etc., etc.
<P>People sometimes distinguish between "risk", which can be represented
stochastically, i.e., as a probability distribution, and "uncertainty", where
there is simply no basis for assessing frequencies or the like.
Decision-making under such strong or genuine uncertainty is a <em>very
different problem</em>...
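A toy illustration of why it is a different problem (the payoff table and probabilities are invented): under risk, a distribution over states is available and one can maximize expected payoff; under genuine uncertainty there is no distribution, and a distribution-free criterion such as Wald's maximin is one alternative, which can pick a different action from the very same table:

```python
# Payoff table: rows = actions, columns = states of the world.
payoff = {
    "safe":  {"boom": 2, "bust": 2},
    "risky": {"boom": 5, "bust": 0},
}
states = ["boom", "bust"]

# Under *risk*: a probability distribution over states is given,
# so maximize expected payoff.
p = {"boom": 0.6, "bust": 0.4}  # assumed known, for illustration
ev = {a: sum(p[s] * payoff[a][s] for s in states) for a in payoff}
best_under_risk = max(ev, key=ev.get)            # "risky" (EV 3.0 vs 2.0)

# Under genuine *uncertainty*: no distribution exists, so use a
# distribution-free criterion, e.g. Wald's maximin (best worst case).
worst = {a: min(payoff[a][s] for s in states) for a in payoff}
best_under_uncertainty = max(worst, key=worst.get)  # "safe" (worst case 2 vs 0)
```

The two criteria disagree on the same payoffs, which is the point: without a distribution, "optimal" no longer has a single natural meaning.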
<P>See also:
<a href="control.html">Control Theory</a>;
<a href="decision-theory.html">Decision Theory</a>;
<a href="filtering.html">Filtering and State Estimation</a>;
<a href="learning-games.html">Learning in Games</a>;
<a href="low-regret-learning">Low-Regret Learning</a>;
<a href="seriatim.html">Neural Control of Action</a>;
<a href="statistics.html">Statistics</a>;
<a href="universal-prediction.html">Universal Prediction</a>
<ul>Recommended, big picture:
<li>David Blackwell and M. A. Girshick, <cite>Theory of Games and
Statistical Decisions</cite>
<li>Bent Jesper Christensen and Nicholas M. Kiefer, <cite>Economic
Modeling and Inference</cite> [Review: <a href="../reviews/christensen-kiefer/">An Optimal Path to a Dead End</a>.]
<li>Rich Sutton, <a
href="http://www.cs.ualberta.ca/~sutton/RL-FAQ.html">Reinforcement Learning
FAQ</a>
<li>Richard S. Sutton and Andrew G. Barto, <cite>Reinforcement
Learning</cite> [<a
href="http://www.cs.ualberta.ca/~sutton/book/the-book.html">Book website</a>,
with draft text online]
</ul>
<ul>Recommended, close-ups:
<li>Paul H. Algoet [Algoet did some brilliant stuff in the general area
of <a href="information-theory.html">information-theoretic</a> approaches to
stochastic processes in the early and mid 1990s, and then just stopped
publishing. Whatever happened to him?]
<ul>
<li>"Universal Schemes for Prediction, Gambling, and Portfolio
Selection," <cite>Annals of Probability</cite> <strong>20</strong> (1992):
901--941 and an important Correction, <strong>23</strong> (1995): 474--478
<li>"The Strong Law of Large Numbers for Sequential Decisions
Under Uncertainty," <cite>IEEE Transactions on Information
Theory</cite> <strong>40</strong> (1994): 609--633
</ul>
<li><a href="http://www-personal.engin.umich.edu/~dblatt/">Doron
Blatt</a>, Susan A. Murphy and <a
href="http://www.stat.lsa.umich.edu/~jizhu/">Ji Zhu</a>, "A-Learning for
Approximate Planning", NIPS 2004 [<a
href="http://www.stat.lsa.umich.edu/~samurphy/papers/AlearningNips2004.pdf">PDF</a>]
<li>Robert Kleinberg, Alexandru Niculescu-Mizil, Yogeshwer Sharma,
"Regret Bounds for Sleeping Experts and Bandits", <a href="http://dx.doi.org/10.1007/s10994-010-5178-7"><cite>Machine Learning</cite> <strong>80</strong> (2010): 245--272</a>
<li><a href="http://www.stat.lsa.umich.edu/~samurphy/">Susan
A. Murphy</a>, "Optimal Dynamic Treatment Regimes", <cite>Journal of the Royal
Statistical Society B</cite> <strong>65</strong> (2003): 331--366 [<a
href="http://www.stat.lsa.umich.edu/~samurphy/papers/optimal.pdf">PDF</a>]
<li>D. I. Simester, P. Sun and J. Tsitsiklis, "Dynamic Catalog Mailing
Policies" [<a
href="http://web.mit.edu/jnt/www/Papers/P-03-sun-catalog-rev.pdf">PDF</a>]
</ul>
<ul>To read:
<li>Dimitri P. Bertsekas and John Tsitsiklis, <cite>Neuro-Dynamic
Programming</cite>
<li>Byron Boots, Geoffrey J. Gordon, "Predictive State Temporal Difference Learning", <a href="http://arxiv.org/abs/1011.0041">arxiv:1011.0041</a>,
<a href="http://books.nips.cc/papers/files/nips23/NIPS2010_1174.pdf">NIPS 23 (2010)</a>
<li>Sébastien Bubeck, Nicolò Cesa-Bianchi, Sham M. Kakade, "Towards minimax policies for online linear optimization with bandit feedback", <a href="http://arxiv.org/abs/1202.3079">arxiv:1202.3079</a>
<li>Bibhas Chakraborty and Susan A. Murphy, "Dynamic Treatment Regimes",
<a href="https://doi.org/10.1146/annurev-statistics-022513-115553"><cite>Annual Review of Statistics and Its Application</cite> <strong>1</strong> (2014): 447--464</a>
<li>Nathaniel D. Daw, John P. O'Doherty, Peter Dayan, Ben Seymour and
Raymond J. Dolan, "Cortical substrates for exploratory decisions in humans",
<a
href="http://dx.doi.org/10.1038/nature04766"><cite>Nature</cite> <strong>441</strong>
(2006): 876--879</a> [Hey, sometimes people <em>do</em> act like
reinforcement-learners!]
<li>A. Philip Dawid and Vanessa Didelez, "Identifying the consequences of dynamic treatment strategies: A decision-theoretic overview", <a href="http://projecteuclid.org/euclid.ssu/1289579930"><cite>Statistics Surveys</cite> <strong>4</strong> (2010): 184--231</a>
<li>Eyal Even-Dar, Shie Mannor and Yishay Mansour, "Action Elimination
and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning
Problems", <a
href="http://jmlr.csail.mit.edu/papers/v7/evendar06a.html"><cite>Journal of
Machine Learning Research</cite> <strong>7</strong> (2006): 1079--1105</a>
[Yay for confidence intervals!]
<li>Milos Hauskrecht, "Value-Function Approximation for Partially
Observable Markov Decision Processes", <cite>Journal of Artificial Intelligence
Research</cite> <strong>13</strong> (2000): 33--94 [JAIR publishes the full
text of articles online for free, but I feel insufficiently motivated to track
down the link right now]
<li>F. Y. Hunt, "Sample path optimality for a Markov optimization
problem", <a
href="http://dx.doi.org/10.1016/j.spa.2004.12.005"><cite>Stochastic Processes
and their Applications</cite> <strong>115</strong> (2005): 769--779</a>
<li>Leslie Pack Kaelbling, Michael L. Littman and Andrew W. Moore,
"Reinforcement Learning: A Survey," <cite>Journal of Artificial Intelligence
Research</cite> <strong>4</strong> (1996): 237--285
<li>Stephen Kelly and Malcolm I. Heywood, "Emergent Solutions to High-Dimensional Multitask Reinforcement Learning", <a href="https://doi.org/10.1162/evco_a_00232"><cite>Evolutionary Computation</cite> <strong>26</strong> (2018): 347--380</a>
<li>Ivo Kwee, Marcus Hutter and Juergen Schmidhuber, "Market-Based
Reinforcement Learning in Partially Observable Worlds," <a
href="http://arXiv.org/abs/cs/0105025">cs.AI/0105025</a>
<li>Felix Leibfried, Sergio Pascual-Diaz, Jordi Grau-Moya, "A Unified Bellman Optimality Principle Combining Reward Maximization and Empowerment", <a href="http://arxiv.org/abs/1907.12392">arxiv:1907.12392</a>
<li>Susan A. Murphy
<ul>
<li>"A Generalization Error for Q-Learning" [<a
href="http://www.stat.lsa.umich.edu/~samurphy/papers/QLearning.pdf">PDF</a>]
<li>"An Experimental Design for the Development of Adaptive
Treatment Strategies", <cite>Statistics and Medcine</cite> forthcoming [<a
href="http://www.stat.lsa.umich.edu/~samurphy/papers/ExperimentalEvidence.pdf">PDF</a>]
</ul>
<li>Misha Perepelitsa, "A model of discrete choice based on reinforcement learning under short-term memory", <a href="http://arxiv.org/abs/1908.06133">arxiv:1908.06133</a>
<li>Leonid Peshkin and Sayan Mukherjee, "Bounds on sample size for
policy evaluation in Markov environments,"
<a href="http://arXiv.org/abs/cs/0105027">cs.LG/0105027</a>
<li>A. Potapov and M. K. Ali, "Convergence of reinforcement learning
algorithms and acceleration of learning," <cite>Physical Review E</cite>
<strong>67</strong> (2003): 026706
<li>Martin L. Puterman, <cite>Markov Decision Processes: Discrete
Stochastic Dynamic Programming</cite>
<li>Benjamin Recht, "A Tour of Reinforcement Learning: The View from Continuous Control", <a href="https://doi.org/10.1146/annurev-control-053018-023825"><cite>Annual Review of Control, Robotics, and Autonomous Systems</cite> <strong>2</strong> (2019): 253--279</a>
<li>Philippe Rigollet and Assaf Zeevi, "Nonparametric Bandits with
Covariates", <a href="http://arxiv.org/abs/1003.1630">arxiv:1003.1630</a>
<li>Sayanti Roy, Emily Kieson, Charles Abramson, Christopher Crick, "Mutual Reinforcement Learning", <a href="http://arxiv.org/abs/1907.06725">arxiv:1907.06725</a>
<li>Brian Sallans, <cite>Reinforcement Learning for Factored Markov
Decision Processes</cite> [<a
href="http://www.ai.univie.ac.at/~brian/pthesis/pthabstract.html">Abstract</a>,
<a href="http://www.ai.univie.ac.at/~brian/pthesis/">download</a>]
<li>R. L. Stratonovich, <cite>Conditional Markov Processes and Their
Application to the Theory of Optimal Control</cite>
<li>Istvan Szita and Andras Lorincz, "The many faces of optimism",
<a href="http://arxiv.org/abs/0810.3451">arxiv:0810.3451</a> ["exploration-exploitation dilemma ... of reinforcement learning. 'Optimism in the face of uncertainty' and model building play central roles in advanced exploration methods. ... a fast and simple algorithm ... finds a near-optimal policy in polynomial time... experimental evidence that it is robust and efficient..."]
<li>Alexander G. Tartakovsky, "Asymptotic Optimality of Certain
Multihypothesis Sequential Tests: Non-i.i.d. Case", <cite>Statistical Inference
for Stochastic Processes</cite> <strong>1</strong> (1998): 265--295
<li>John Tsitsiklis and collaborators, <a
href="http://web.mit.edu/jnt/www/ndp.html">Papers on Neuro-Dynamic
Programming</a>
<li>Pascal Van Hentenryck and Russell Bent, <cite><a href="http://mitpress.mit.edu/0-262-22080-6">Online Stochastic
Combinatorial Optimization</a></cite>
</ul>