(Decision, Classification, Regression, Prediction) Trees in Statistics and Machine Learning
Last update: 07 Dec 2024 23:36
First version: 10 October 2024
Yet Another Inadequate Placeholder, because I don't feel like re-writing a chapter from my textbook.
---

One of the things that interests me about trees as statistical models is that they're extremely useful, highly interpretable, can approximate just about anything to arbitrary precision, etc., etc., but I literally cannot think of a situation where saying "a tree is a well-specified model for this system" has any plausibility at all. (It's been long enough since I read Breiman's "Two Cultures" essay that I can't remember if he was explicit about that, but I can't imagine it didn't shape his thoughts.)
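A minimal sketch of the approximation point, assuming numpy and scikit-learn are available (the target function, noise level, and depths are made up for illustration):

```python
# Fit piecewise-constant regression trees of growing depth to a smooth
# target. The training error approaches the noise floor as depth grows,
# even though no finite tree is ever a well-specified model for sin(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2 * np.pi, size=(1000, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=1000)

for depth in (2, 5, 10):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    mse = np.mean((tree.predict(X) - y) ** 2)
    print(f"depth {depth:2d}: training MSE = {mse:.4f}")
# The MSE falls toward the noise variance (0.01): arbitrary-precision
# approximation by an always-misspecified model class.
```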
- See also:
- Data Mining
- Ensemble Methods in Machine Learning
- Machine Learning, Statistical Inference, and Induction
- Regression
- Recommended, big picture:
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction [Website, with full text free in PDF]
- Brian D. Ripley, Pattern Recognition and Neural Networks, ch. 7
- Recommended, close-ups:
- Nathaniel D. Phillips, Hansjörg Neth, Jan K. Woike and Wolfgang Gaissmaier, "FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees", Judgment and Decision Making 12 (2017): 344--368
- Yong Wang, Ilze Ziedins, Mark Holmes, Neil Challands, "Tree models for difference and change detection in a complex environment", Annals of Applied Statistics 6 (2012): 1162--1184, arxiv:1202.1561 [In an ordinary classification tree, we are interested in the distribution of the class labels \( Y \) given the predictors \( X \), i.e., \( \Pr(Y|X) \), and make splits on \( X \) so that (in essence) the conditional entropy \( H[Y|X] \) becomes small. This is of course equivalent to making splits so that the divergence of \( \Pr(Y|X) \) from \( \Pr(Y) \) is maximized. What they are interested in is not classification but describing how the different classes are distinct, so the relevant distribution is \( \Pr(X|Y) \), and they want a big divergence between \( \Pr(X) \) and \( \Pr(X|Y) \). A numerical check of the entropy/divergence equivalence appears after this list.]
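The promised check: a small numerical sketch (mine, with a made-up joint distribution) of the identity that the entropy reduction \( H[Y] - H[Y|X] \) from a split equals the average divergence of \( \Pr(Y|X) \) from \( \Pr(Y) \), in the same Python as the sketch above:

```python
# For a binary split, the information gain H[Y] - H[Y|X] equals the
# KL divergence of Pr(Y|X) from Pr(Y), averaged over Pr(X).
import numpy as np

# made-up joint distribution of (X branch, Y class); rows = split sides
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])
px = joint.sum(axis=1)            # Pr(X)
py = joint.sum(axis=0)            # Pr(Y)
py_given_x = joint / px[:, None]  # Pr(Y|X), one row per branch

def entropy(p):
    return -np.sum(p * np.log2(p))

info_gain = entropy(py) - np.sum(px * np.array([entropy(r) for r in py_given_x]))
avg_kl = np.sum(px * np.sum(py_given_x * np.log2(py_given_x / py), axis=1))
print(f"information gain = {info_gain:.4f} bits")
print(f"average KL div.  = {avg_kl:.4f} bits")  # same number
```

Both quantities are the mutual information \( I[X;Y] \); by its symmetry, averaging the divergence of \( \Pr(X|Y) \) from \( \Pr(X) \) over \( \Pr(Y) \) gives the same number, though Wang et al. put the per-class divergences to a different use.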
- Modesty forbids me to recommend:
- The chapter on trees in Advanced Data Analysis from an Elementary Point of View
- To read, big picture:
- Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone, Classification and Regression Trees [1984. I need to finish one of these years.]
- Konstantinos V. Katsikopoulos, Özgür Şimşek, Marcus Buckmann and Gerd Gigerenzer, Classification in the Wild: The Science and Art of Transparent Decision Making [Preliminary comments]
- Charles C. Ragin, The Comparative Method: Moving Beyond Qualitative and Quantitative Strategies [1987. From what I can work out by reading about this book, it expounds a somewhat eccentric algorithm for constructing classification trees, sublimely detached from the rest of the literature...]
- To read, close ups:
- Maria-Florina Balcan and Dravyansh Sharma, "Learning Accurate and Interpretable Decision Trees", UAI 2024, pp. 288--307
- Moulinath Banerjee and Ian W. McKeague, "Confidence sets for split points in decision trees", Annals of Statistics 35 (2007): 543--574, arxiv:0708.1820
- Jayanta Basak, "Online Adaptive Decision Trees", Neural Computation 16 (2004): 1959--1981
- Hendrik Blockeel and Jan Struyf, "Efficient algorithms for decision tree cross-validation," cs.LG/0110036
- Alex Goldstein, Andreas Buja, "Penalized Split Criteria for Interpretable Trees", arxiv:1310.5677
- Hemant Ishwaran, "Variable importance in binary regression trees and forests", Electronic Journal of Statistics 1 (2007): 519--537, arxiv:0711.2434
- Jason M. Klusowski, "Analyzing CART", arxiv:1906.10086
- Jason M. Klusowski, Peter M. Tian, "Large Scale Prediction with Decision Trees", arxiv:2104.13881
- Sebastian Nowozin, "Improved Information Gain Estimates for Decision Tree Induction", arxiv:1206.4620
- Daria Sorokina, Rich Caruana and Mirek Riedewald, "Additive Groves of Regression Trees", ECML 2007 [PDF]
- Daria Sorokina, Rich Caruana, Mirek Riedewald and Daniel Fink, "Detecting Statistical Interactions with Additive Groves of Trees", ICML 2008 [PDF]
- Eiji Takimoto and Akira Maruoka, "Top-down decision tree learning as information based boosting", Theoretical Computer Science 292 (2002): 447--464
- N. Denizcan Vanli, Suleyman S. Kozat, "A Comprehensive Approach to Universal Piecewise Nonlinear Regression Based on Trees", arxiv:1311.6392
- Arman Zharmagambetov, Suryabhan Singh Hada, Miguel Á. Carreira-Perpiñán, Magzhan Gabidolla, "An Experimental Comparison of Old and New Decision Tree Algorithms", arxiv:1911.03054
- Ruoqing Zhu, Donglin Zeng, and Michael R. Kosorok, "Reinforcement Learning Trees", University of North Carolina at Chapel Hill Department of Biostatistics Technical Report No. 33 (2012) [That is, using reinforcement learning to grow better trees, rather than using trees in sequential decision-making.]