Ensemble Methods in Machine Learning
Last update: 08 Dec 2024. First version: 26 July 2006
Boosting, bagging, binning, stacking, mixtures of experts, ...
I have an idea about how to use model averaging to cope with non-stationary time series forecasting, but need to find time to work on it. [Update: Shalizi et al., 2011, link below.]
Value of diversity.
My recommendations here are more than usually scattered and inadequate.
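As a reminder of the basic mechanics, here is a minimal sketch of bagging (bootstrap aggregation) for regression, using only numpy; the base learner and every name here are illustrative stand-ins of my own, not anyone's reference implementation:

```python
# A toy bagging (bootstrap aggregation) regressor, self-contained in numpy.
# "MeanSplitStump" is a deliberately crude base learner; any fit/predict
# pair would do. All names are illustrative, not from any library.
import numpy as np

class MeanSplitStump:
    """Predict one of two means, depending on which side of the median x lies."""
    def fit(self, x, y):
        self.split = np.median(x)
        left = x <= self.split
        self.left_mean, self.right_mean = y[left].mean(), y[~left].mean()
        return self

    def predict(self, x):
        return np.where(x <= self.split, self.left_mean, self.right_mean)

def bagged_predict(x_train, y_train, x_test, n_boot=200, seed=None):
    """Fit the base learner to n_boot bootstrap resamples; average the predictions."""
    rng = np.random.default_rng(seed)
    n = len(x_train)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample the data with replacement
        stump = MeanSplitStump().fit(x_train[idx], y_train[idx])
        preds.append(stump.predict(x_test))
    return np.mean(preds, axis=0)              # ensemble = average of the members

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 200)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=200)
    print(bagged_predict(x, y, np.linspace(0, 1, 5), seed=2))
```

Averaging over resamples mostly buys variance reduction, which is one (partial) story about why diversity among the members helps.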
- See also:
- Collective Cognition
- Learning Theory
- Model Selection
- (Decision/Prediction/Classification/Regression) Trees
- Recommended, bigger picture:
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review]
- Leo Breiman, "Random Forests", Machine Learning 45 (2001): 5--32
- Gerda Claeskens and Nils Lid Hjort, Model Selection and Model Averaging [Review: How Can You Pick Just One?]
- Pedro Domingos, "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery, 3 (1999) [Online. Ensemble methods as an apparent violation of Occam's Razor.]
- Anders Krogh and Jesper Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning", NIPS 7 (1994): 231--238 [Almost none of this is specific to neural networks. Their error-vs.-ambiguity decomposition is checked numerically in the sketch just after this list.]
- Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Review: Weak Learners of the World, Unite!]
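The Krogh--Vedelsby point about diversity is a pointwise identity: for a weighted-average ensemble, the ensemble's squared error equals the weighted average of the members' squared errors minus the "ambiguity" (the weighted spread of the members around the ensemble prediction), so, holding individual accuracy fixed, disagreement among members can only help. A quick numerical check, with arbitrary made-up numbers:

```python
# Numerical check of the Krogh-Vedelsby "ambiguity decomposition" at one point:
# (ensemble error)^2 = weighted mean of member squared errors - weighted ambiguity.
# The target, member predictions, and weights below are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
y = 1.7                                    # target value at a single test point
f = rng.normal(size=6)                     # six member predictions
w = rng.dirichlet(np.ones(6))              # convex-combination weights

fbar = w @ f                               # ensemble prediction
lhs = (fbar - y) ** 2                      # ensemble squared error
mean_err = w @ (f - y) ** 2                # weighted mean of member squared errors
ambiguity = w @ (f - fbar) ** 2            # weighted spread around the ensemble
print(lhs, mean_err - ambiguity)           # the two numbers coincide
```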
- Recommended (totally inadequate, what happened to come to mind cleaning up my files):
- Pierre Alquier and Olivier Wintenberger, "Model selection and randomization for weakly dependent time series forecasting", Bernoulli 18 (2012): 883--913, arxiv:0902.2924
- Sanjeev Arora, Elad Hazan and Satyen Kale, "The Multiplicative Weights Update Method: a Meta Algorithm and Applications" [PDF preprint. This is an interesting kind of result, which promises performance that comes close to that achieved by any strategy within a fixed class, no matter what sequence of data is observed --- but it's performance on that sequence, which, as the saying goes, "is no guarantee of future results". Cesa-Bianchi and Lugosi's book has a lot more along these lines. A toy exponential-weights sketch follows this list.]
- Daniel Berend and Aryeh Kontorovich, "Consistency of weighted majority votes", arxiv:1312.0451
- Peter Bühlmann and Sara van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications [For the extensive treatment of boosting. Mini-review]
- Marie Devaine, Pierre Gaillard, Yannig Goude, Gilles Stoltz, "Forecasting electricity consumption by aggregating specialized experts", arxiv:1207.1965
- Bruce E. Hansen
- "Least Squares Model Averaging", Econometrica 75 (2007): 1175--1189 [Reprint via Prof. Hansen]
- "Least Squares Forecast Averaging", Journal of Econometrics 146 (2008): 342--350 [Reprint via Prof. Hansen]
- Elad Hazan and Satyen Kale, "Extracting certainty from uncertainty: regret bounded by variation in costs", Machine Learning 80 (2010): 165--188
- Robert Kleinberg, Alexandru Niculescu-Mizil, Yogeshwer Sharma, "Regret Bounds for Sleeping Experts and Bandits", Machine Learning 80 (2010): 245--272
- J. Zico Kolter and Marcus A. Maloof
- "Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts", Journal of Machine Learning Research 8 (2007): 2755--2790
- "Using Additive Expert Ensembles to Cope with Concept Drift", ICML 2005 [PDF reprint via Kolter]
- Wei-Yin Ko, Daniel D'souza, Karina Nguyen, Randall Balestriero, Sara Hooker, "FAIR-Ensemble: When Fairness Naturally Emerges From Deep Ensembling", arxiv:2303.00586
- A. Juditsky, P. Rigollet, A. B. Tsybakov, "Learning by mirror averaging", arxiv:math/0511468 = Annals of Statistics 36 (2008): 2183--2206
- G. Langer and U. Parlitz, "Modeling parameter dependence from time series", Physical Review E 70 (2004): 056217 [Interesting use of ensemble methods in state space modeling]
- Guillaume Lecué and Charles Mitchell, "Oracle inequalities for cross-validation type procedures", Electronic Journal of Statistics 6 (2012): 1803--1837 [Much of this is actually about model averaging]
- Laurence K. Saul and Michael I. Jordan, "Mixed Memory Markov Models: Decomposing Complex Stochastic Processes as Mixtures of Simpler Ones", Machine Learning 37 (1999): 75--87
- Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee, "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods", Annals of Statistics 26 (1998): 1651--1686
- Kyupil Yeon, Moon Sup Song, Yongdai Kim, Hosik Choi, Cheolwoo Park, "Model averaging via penalized regression for tracking concept drift", Journal of Computational and Graphical Statistics online before print (2010)
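To make the multiplicative-weights idea from the Arora--Hazan--Kale entry above concrete, here is a minimal exponential-weights forecaster under squared-error loss; the toy experts, parameter values, and function names are all made up for illustration:

```python
# Toy exponential-weights ("multiplicative weights") forecaster, squared-error
# loss, predictions and outcomes in [0, 1]. Names and constants are made up.
import numpy as np

def exponential_weights(expert_preds, outcomes, eta=0.5):
    """expert_preds: (T, K) array of per-round expert predictions;
    outcomes: length-T array of realized values; returns forecaster predictions."""
    T, K = expert_preds.shape
    log_w = np.zeros(K)                        # log-weights, start uniform
    forecasts = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())        # shift for numerical stability
        w /= w.sum()
        forecasts[t] = w @ expert_preds[t]     # predict with the weighted average
        losses = (expert_preds[t] - outcomes[t]) ** 2
        log_w -= eta * losses                  # shrink each weight by exp(-eta*loss)
    return forecasts

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    T = 500
    outcomes = (np.sin(np.arange(T) / 20.0) + 1) / 2
    experts = np.clip(np.column_stack(
        [outcomes + rng.normal(0, s, T) for s in (0.05, 0.3, 0.6)]), 0, 1)
    fc = exponential_weights(experts, outcomes)
    print("forecaster mean loss:", np.mean((fc - outcomes) ** 2))
    print("best expert mean loss:",
          np.min(np.mean((experts - outcomes[:, None]) ** 2, axis=0)))
```

For bounded losses and a suitably tuned eta, the standard bounds (see Cesa-Bianchi and Lugosi) keep the forecaster's cumulative loss within roughly sqrt(T log K) of the best single expert's --- again, on the observed sequence.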
- Modesty forbids me to recommend:
- CRS, the lectures on ensemble methods in my data mining class (13 and 14 in the 2022 iteration)
- CRS, Abigail Z. Jacobs, Kristina L. Klinkner, and Aaron Clauset, "Adapting to Non-stationarity with Growing Expert Ensembles", arxiv:1103.0949
- To read:
- Ran Avnimelech and Nathan Intrator, "Boosted Mixture of Experts: An Ensemble Learning Scheme", Neural Computation 11 (1999): 483--497
- Larry M. Bartels, "Specification Uncertainty and Model Averaging", American Journal of Political Science
- Gérard Biau, "Analysis of a Random Forests Model", Journal of Machine Learning Research 13 (2012): 1063--1095
- Gérard Biau, Luc Devroye and Gábor Lugosi, "Consistency of Random Forests and Other Averaging Classifiers", Journal of Machine Learning Research 9 (2008): 2015--2033 ["In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent."]
- Gavin Brown, Jeremy L. Wyatt and Peter Tino, "Managing Diversity in Regression Ensembles", Journal of Machine Learning Research 6 (2005): 1621--1650
- Peter Bühlmann and Torsten Hothorn, "Boosting Algorithms: Regularization, Prediction and Model Fitting", Statistical Science 22 (2007): 477--505, arxiv:0804.2752 [with commentary following]
- Bruno Caprile, Cesare Furlanello and Stefano Merler, "The Dynamics of AdaBoost Weights Tells You What's Hard to Classify," cs.LG/0201014
- Kamalika Chaudhuri, Yoav Freund, Daniel Hsu, "A parameter-free hedging algorithm", arxiv:0903.2851 [Doing about as well as a given fraction of the ensemble]
- Zhuo Chen and Yuhong Yang, "Time Series Models for Forecasting: Testing or Combining?", Studies in Nonlinear Dynamics and Econometrics 11:1 (2007): 3
- Matthieu Cornec, "Estimating Subbagging by cross-validation", arxiv:1011.5142
- Alicia Curth, Alan Jeffares, Mihaela van der Schaar, "Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers", arxiv:2402.01502
- M. Di Marzio and C. C. Taylor, "Kernel density classification and boosting: an L2 analysis", Statistics and Computing 15 (2005): 113--123
- Narayanan U. Edakunni, Gary Brown, Tim Kovacs, "Boosting as a Product of Experts", UAI 2011, arxiv:1202.3716
- John Ehrlinger and Hemant Ishwaran, "Characterizing L2 Boosting", Annals of Statistics 40 (2012): 1074--1101
- Yoav Freund, "A more robust boosting algorithm", arxiv:0905.2138
- Yoav Freund, Yishay Mansour and Robert E. Schapire, "Generalization bounds for averaged classifiers", Annals of Statistics 32 (2004): 1698--1722 = math.ST/0410092
- Yoav Freund, Robert E. Schapire, Yoram Singer and Manfred K. Warmuth, "Using and combining predictors that specialize" [PDF preprint]
- Jerome H. Friedman, Bogdan E. Popescu, "Predictive learning via rule ensembles", arxiv:0811.1679
- G. Fumera and F. Roli, "A Theoretical and Experimental Analysis of Linear Combiners for Multiple Classifier Systems", IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005): 942--956
- Stéphane Gaïffas and Guillaume Lecué, "Hyper-Sparse Optimal Aggregation", Journal of Machine Learning Research 12 (2011): 1813--1833
- Nicolas Garcia-Pedrajas, Cesar Garcia-Osorio and Colin Fyfe, "Nonlinear Boosting Projections for Ensemble Construction", Journal of Machine Learning Research 8 (2007): 1--33
- Alexander Goldenshluger, "A universal procedure for aggregating estimators", arxiv:0704.2500 = Annals of Statistics 37 (2009): 542--568
- Etienne Grossmann, "A Theory of Probabilistic Boosting, Decision Trees and Matryoshki", cs.LG/0607110
- Bettina Grün, Ioannis Kosmidis, Achim Zeileis, "Extended Beta Regression in R: Shaken, Stirred, Mixed, and Partitioned", Journal of Statistical Software 48 (2012): 11
- Haijie Gu, John Lafferty, "Sequential Nonparametric Regression", arxiv:1206.6408
- S. Gualdi, A. De Martino, "How does informational heterogeneity affect the quality of forecasts?", arxiv:0906.0552
- Jakob Vogdrup Hansen, Combining Predictors: Meta Machine Learning Methods and Bias/Variance & Ambiguity Decompositions [Ph.D. thesis, University of Aarhus, 2000; on-line]
- Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation 14 (2002): 1771--1800.
- Benjamin Hofner, Torsten Hothorn, Thomas Kneib, and Matthias Schmid, "A Framework for Unbiased Model Selection Based on Boosting", Journal of Computational and Graphical Statistics forthcoming (2011)
- Marcus Hutter and Jan Poland, "Adaptive Online Prediction by Following the Perturbed Leader", cs.AI/0504078 = Journal of Machine Learning Research 6 (2005): 639--660
- Robert A. Jacobs, "Bias/Variance Analyses of Mixtures-of-Experts Architectures", Neural Computation 9 (1997): 369--383
- Wenxin Jiang, "Boosting with Noisy Data: Some Views from Statistical Theory", Neural Computation 16 (2004): 789--810
- Rie Johnson, Tong Zhang, "Learning Nonlinear Functions Using Regularized Greedy Forest", arxiv:1109.0887
- Nicole Kraemer, "Boosting for Functional Data", math.ST/0605751
- Ludmila I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms
- Kasper Green Larsen, "Bagging is an Optimal PAC Learner", arxiv:2212.02264
- Tri M. Le, Bertrand S. Clarke, "Model Averaging Is Asymptotically Better Than Model Selection For Prediction", Journal of Machine Learning Research 23 (2022): 33
- Guillaume Lecué
- "Lower Bounds and Aggregation in Density Estimation", Journal of Machine Learning Research 7 (2006): 971--981
- "Empirical risk minimization is optimal for the convex aggregation problem", Bernoulli 19 (2013): 2153--2166
- David Mease, Abraham J. Wyner and Andreas Buja, "Boosted Classification Trees and Class Probability/Quantile Estimation", Journal of Machine Learning Research 8 (2007): 409--439
- Nicolai Meinshausen, "Forest Garrote", arxiv:0906.3590
- David J. Miller and Siddharth Pal, "Transductive Methods for the Distributed Ensemble Classification Problem", Neural Computation 19 (2007): 856--884
- Andriy Norets, "Approximation of conditional densities by smooth mixtures of regressions", Annals of Statistics 38 (2010): 1733--1766, arxiv:1010.0581
- L. Nunes and E. Oliveira, "On Learning by Exchanging Advice," cs.LG/0203010
- Fernando C. Pereira and Yoram Singer, "An Efficient Extension to Mixture Techniques for Prediction and Decision Trees", Machine Learning 36 (1999): 183--199
- Evgueni Petrov, "Constraint-based analysis of composite solvers," cs.AI/0302036
- Benedikt M. Pötscher, "The distribution of model averaging estimators and an impossibility result regarding its estimation", arxiv:math/0702781
- Philippe Rigollet, "Maximum likelihood aggregation and misspecified generalized linear models", Annals of Statistics 40 (2012): 639--665, arxiv:0911.2919
- Stephanie Sapp, Mark J. van der Laan, and John Canny, "Subsemble: An Ensemble Method for Combining Subset-Specific Algorithm Fits", working paper 313, Berkeley dept. of biostatistics (2013)
- Ville A. Satopää, Shane T. Jensen, Robin Pemantle, Lyle H. Ungar, "Partial Information Framework: Aggregating Estimates from Diverse Information Sources", arxiv:1505.06472
- Yoram Singer, "Adaptive Mixtures of Probabilistic Transducers", Neural Computation 9 (1997): 1711--1733 [PS.gz preprint]
- David S. Siroky, "Navigating Random Forests and related advances in algorithmic modeling", Statistics Surveys 3 (2009): 147--163
- Eiji Takimoto and Akira Maruoka, "Top-down decision tree learning as information based boosting," Theoretical Computer Science 292 (2002): 447--464
- Brandon M. Turner, Mark Steyvers, Edgar C. Merkle, David V. Budescu, Thomas S. Wallsten, "Forecast aggregation via recalibration", Machine Learning 95 (2014): 261--289
- Peter Welinder, Steve Branson, Serge Belongie and Pietro Perona, "The Multidimensional Wisdom of Crowds", NIPS 2011 (NIPS 23) [PDF reprint]
- Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach, "Transcendence: Generative Models Can Outperform The Experts That Train Them", arxiv:2406.11741
- Héla Zouari, Laurent Heutte and Yves Lecourtier, "Controlling the diversity in classifier ensembles through a measure of agreement", Pattern Recognition 38 (2005): 2195--2199