Ensemble Methods in Machine Learning
07 Oct 2024 15:05
Boosting, bagging, binning, stacking, mixtures of experts, ...
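Of the methods just named, bagging is the simplest to state: fit the same base learner to many bootstrap resamples of the training set and average the resulting predictions. A minimal sketch, purely my own illustration (assuming numpy arrays and scikit-learn's DecisionTreeRegressor as a convenient base learner):

```python
# Minimal bagging sketch: average the predictions of base learners fit to
# bootstrap resamples of the training data. Illustrative only; the choice of
# base learner (a scikit-learn regression tree) is just a convenient default.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_test, n_estimators=100, seed=None):
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)      # bootstrap resample, with replacement
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)             # ensemble prediction = simple average
```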
I have an idea about how to use model averaging to cope with non-stationary time series forecasting, but need to find time to work on it. [Update: Shalizi et al., 2011, link below.]
Value of diversity.
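One standard way to make the value of diversity precise is the "ambiguity decomposition" from the Krogh-Vedelsby paper recommended below: for a weighted-average ensemble under squared error, the ensemble's error equals the average error of the members minus their average spread around the ensemble prediction, so, holding individual accuracy fixed, disagreement among the members can only help. In symbols (my transcription of their identity, with non-negative weights summing to one):

```latex
% Ambiguity decomposition (Krogh & Vedelsby, 1994):
% ensemble \bar{f}(x) = \sum_i w_i f_i(x), weights w_i \geq 0, \sum_i w_i = 1, target y.
(\bar{f}(x) - y)^2 \;=\; \sum_i w_i (f_i(x) - y)^2 \;-\; \sum_i w_i (f_i(x) - \bar{f}(x))^2
```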
My recommendations here are more than usually scattered and inadequate.
- See also:
- Collective Cognition
- Learning Theory
- Model Selection
- (Decision/Prediction/Classification/Regression) Trees
- Recommended, bigger picture:
- Nicolò Cesa-Bianchi and Gábor Lugosi, Prediction, Learning, and Games [Mini-review]
- Gerda Claeskens and Nils Lid Hjort, Model Selection and Model Averaging [Review: How Can You Pick Just One?]
- Pedro Domingos, "The Role of Occam's Razor in Knowledge Discovery," Data Mining and Knowledge Discovery, 3 (1999) [Online. Ensemble methods as an apparent violation of Occam's Razor.]
- Anders Krogh and Jesper Vedelsby, "Neural Network Ensembles, Cross Validation, and Active Learning", NIPS 7 (1994): 231--238 [Almost none of this is specific to neural networks]
- Robert E. Schapire and Yoav Freund, Boosting: Foundations and Algorithms [Review: Weak Learners of the World, Unite!]
- Recommended (totally inadequate; just what happened to come to mind while cleaning up my files):
- Pierre Alquier and Olivier Wintenberger, "Model selection and randomization for weakly dependent time series forecasting", Bernoulli 18 (2012): 883--913, arxiv:0902.2924
- Sanjeev Arora, Elad Hazan and Satyen Kale, "The Multiplicative Weights Update Method: a Meta-Algorithm and Applications" [PDF preprint. This is an interesting kind of result, which promises performance that comes close to that achieved by any strategy within a fixed class, no matter what sequence of data is observed --- but it is performance on that very sequence, which, as the saying goes, "is no guarantee of future results". Cesa-Bianchi and Lugosi's book has a lot more along these lines; a minimal sketch of the basic exponential-weights update appears at the end of this list.]
- Daniel Berend and Aryeh Kontorovich, "Consistency of weighted majority votes", arxiv:1312.0451
- Peter Bühlmann and Sara van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications [For the extensive treatment of boosting. Mini-review]
- Marie Devaine, Pierre Gaillard, Yannig Goude, Gilles Stoltz, "Forecasting electricity consumption by aggregating specialized experts", arxiv:1207.1965
- Bruce E. Hansen
- "Least Squares Model Averaging", Econometrica 75 (2007): 1175--1189 [Reprint via Prof. Hansen]
- "Least Squares Forecast Averaging", Journal of Econometrics 146 (2008): 342--350 [Reprint via Prof. Hansen]
- Elad Hazan and Satyen Kale, "Extracting certainty from uncertainty: regret bounded by variation in costs", Machine Learning 80 (2010): 165--188
- Robert Kleinberg, Alexandru Niculescu-Mizil, Yogeshwer Sharma, "Regret Bounds for Sleeping Experts and Bandits", Machine Learning 80 (2010): 245--272
- J. Zico Kolter and Marcus A. Maloof
- "Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts", Journal of Machine Learning Research 8 (2007): 2755--2790
- "Using Additive Expert Ensembles to Cope with Concept Drift", ICML 2005 [PDF reprint via Kolter]
- Wei-Yin Ko, Daniel D'souza, Karina Nguyen, Randall Balestriero, Sara Hooker, "FAIR-Ensemble: When Fairness Naturally Emerges From Deep Ensembling", arxiv:2303.00586
- A. Juditsky, P. Rigollet, A. B. Tsybakov, "Learning by mirror averaging", arxiv:math/0511468 = Annals of Statistics 36 (2008): 2183--2206
- G. Langer and U. Parlitz, "Modeling parameter dependence from time series", Physical Review E 70 (2004): 056217 [Interesting use of ensemble methods in state space modeling]
- Guillaume Lecué and Charles Mitchell, "Oracle inequalities for cross-validation type procedures", Electronic Journal of Statistics 6 (2012): 1803--1837 [Much of this is actually about model averaging]
- Laurence K. Saul and Michael I. Jordan, "Mixed Memory Markov Models: Decomposing Complex Stochastic Processes as Mixtures of Simpler Ones", Machine Learning 37 (1999): 75--87
- Robert E. Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee, "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods", Annals of Statistics 26 (1998): 1651--1686
- Kyupil Yeon, Moon Sup Song, Yongdai Kim, Hosik Choi, Cheolwoo Park, "Model averaging via penalized regression for tracking concept drift", Journal of Computational and Graphical Statistics online before print (2010)
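As promised under the Arora-Hazan-Kale entry above, here is a minimal sketch of the exponentially weighted average forecaster, the prototype of the multiplicative-weights schemes those references analyze. This is my own illustration, not code from any of the papers; for bounded losses and a suitably small learning rate, its cumulative regret against the best of the N experts grows only like the square root of T log N over T rounds. Variants of this update (adding, dropping, or re-weighting experts over time) are the starting point for the concept-drift and growing-ensemble papers listed here.

```python
# Minimal sketch of the exponentially weighted average forecaster
# (a.k.a. Hedge / multiplicative weights) for combining expert forecasts.
# Illustrative only; see Cesa-Bianchi and Lugosi for the actual theory.
import numpy as np

def exponential_weights(expert_preds, outcomes, eta=0.1):
    """expert_preds: (T, N) array of expert forecasts; outcomes: length-T array."""
    T, N = expert_preds.shape
    log_w = np.zeros(N)                      # log-weights; all experts start equal
    forecasts = np.empty(T)
    for t in range(T):
        w = np.exp(log_w - log_w.max())      # exponentiate in a numerically stable way
        w /= w.sum()
        forecasts[t] = w @ expert_preds[t]   # combined forecast = weighted average
        losses = (expert_preds[t] - outcomes[t]) ** 2   # squared-error loss per expert
        log_w -= eta * losses                # multiplicative (exponential) down-weighting
    return forecasts
```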
- Modesty forbids me to recommend:
- CRS, the lectures on ensemble methods in my data mining class (13 and 14 in the 2022 iteration)
- CRS, Abigail Z. Jacobs, Kristina L. Klinkner, and Aaron Clauset, "Adapting to Non-stationarity with Growing Expert Ensembles", arxiv:1103.0949
- To read:
- Ran Avnimelech and Nathan Intrator, "Boosted Mixture of Experts: An Ensemble Learning Scheme", Neural Computation 11 (1999): 483--497
- Larry M. Bartels, "Specification Uncertainty and Model Averaging", American Journal of Political Science
- Gérard Biau, "Analysis of a Random Forests Model", Journal of Machine Learning Research 13 (2012): 1063--1095
- Gérard Biau, Luc Devroye and Gábor Lugosi, "Consistency of Random Forests and Other Averaging Classifiers", Journal of Machine Learning Research 9 (2008): 2015--2033 ["In the last years of his life, Leo Breiman promoted random forests for use in classification. He suggested using averaging as a means of obtaining good discrimination rules. The base classifiers used for averaging are simple and randomized, often based on random samples from the data. He left a few questions unanswered regarding the consistency of such rules. In this paper, we give a number of theorems that establish the universal consistency of averaging rules. We also show that some popular classifiers, including one suggested by Breiman, are not universally consistent."]
- Gavin Brown, Jeremy L. Wyatt and Peter Tiňo, "Managing Diversity in Regression Ensembles", Journal of Machine Learning Research 6 (2005): 1621--1650
- Peter Bühlmann and Torsten Hothorn, "Boosting Algorithms: Regularization, Prediction and Model Fitting", Statistical Science 22 (2007): 477--505, arxiv:0804.2752 [with commentary following]
- Bruno Caprile, Cesare Furlanello and Stefano Merler, "The Dynamics of AdaBoost Weights Tells You What's Hard to Classify," cs.LG/0201014
- Kamalika Chaudhuri, Yoav Freund, Daniel Hsu, "A parameter-free hedging algorithm", arxiv:0903.2851 [Doing about as well as a given fraction of the ensemble]
- Zhuo Chen and Yuhong Yang, "Time Series Models for Forecasting: Testing or Combining?", Studies in Nonlinear Dynamics and Econometrics 11:1 (2007): 3
- Matthieu Cornec, "Estimating Subbagging by cross-validation", arxiv:1011.5142
- Alicia Curth, Alan Jeffares, Mihaela van der Schaar, "Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers", arxiv:2402.01502
- M. Di Marzio and C. C. Taylor, "Kernel density classification and boosting: an L2 analysis", Statistics and Computing 15 (2005): 113--123
- Narayanan U. Edakunni, Gary Brown, Tim Kovacs, "Boosting as a Product of Experts", UAI 2011, arxiv:1202.3716
- John Ehrlinger and Hemant Ishwaran, "Characterizing L2Boosting", Annals of Statistics 40 (2012): 1074--1101
- Yoav Freund, "A more robust boosting algorithm", arxiv:0905.2138
- Yoav Freund, Yishay Mansour and Robert E. Schapire, "Generalization bounds for averaged classifiers", Annals of Statistics 32 (2004): 1698--1722 = math.ST/0410092
- Yoav Freund, Robert E. Schapire, Yoram Singer and Manfred K. Warmuth, "Using and combining predictors that specialize" [PDF preprint]
- Jerome H. Friedman, Bogdan E. Popescu, "Predictive learning via rule ensembles", arxiv:0811.1679
- G. Fumera and F. Roli, "A Theoretical and Experimental Analysis of Linear Combiners for Multiple Classifier Systems", IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005): 942--956
- Stéphane Gaïffas and Guillaume Lecué, "Hyper-Sparse Optimal Aggregation", Journal of Machine Learning Research 12 (2011): 1813--1833
- Nicolas Garcia-Pedrajas, Cesar Garcia-Osorio and Colin Fyfe, "Nonlinear Boosting Projections for Ensemble Construction", Journal of Machine Learning Research 8 (2007): 1--33
- Alexander Goldenshluger, "A universal procedure for aggregating estimators", arxiv:0704.2500 = Annals of Statistics 37 (2009): 542--568
- Etienne Grossmann, "A Theory of Probabilistic Boosting, Decision Trees and Matryoshki", cs.LG/0607110
- Bettina Grün, Ioannis Kosmidis, Achim Zeileis, "Extended Beta Regression in R: Shaken, Stirred, Mixed, and Partitioned", Journal of Statistical Software 48 (2012): 11
- Haijie Gu, John Lafferty, "Sequential Nonparametric Regression", arxiv:1206.6408
- S. Gualdi, A. De Martino, "How does informational heterogeneity affect the quality of forecasts?", arxiv:0906.0552
- Jakob Vogdrup Hansen, Combining Predictors: Meta Machine Learning Methods and Bias/Variance & Ambiguity Decompositions [Ph.D. thesis, University of Aarhus, 2000; on-line]
- Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation 14 (2002): 1771--1800.
- Benjamin Hofner, Torsten Hothorn, Thomas Kneib, and Matthias Schmid, "A Framework for Unbiased Model Selection Based on Boosting", Journal of Computational and Graphical Statistics forthcoming (2011)
- Marcus Hutter and Jan Poland, "Adaptive Online Prediction by Following the Perturbed Leader", cs.AI/0504078 = Journal of Machine Learning Research 6 (2005): 639--660
- Robert A. Jacobs, "Bias/Variance Analyses of Mixtures-of-Experts Architectures", Neural Computation 9 (1997): 369--383
- Wenxin Jiang, "Boosting with Noisy Data: Some Views from Statistical Theory", Neural Computation 16 (2004): 789--810
- Rie Johnson, Tong Zhang, "Learning Nonlinear Functions Using Regularized Greedy Forest", arxiv:1109.0887
- Nicole Kraemer, "Boosting for Functional Data", math.ST/0605751
- Ludmila I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms
- Kasper Green Larsen, "Bagging is an Optimal PAC Learner", arxiv:2212.02264
- Tri M. Le, Bertrand S. Clarke, "Model Averaging Is Asymptotically Better Than Model Selection For Prediction", Journal of Machine Learning Research 23 (2022): 33
- Guillaume Lecué
- "Lower Bounds and Aggregation in Density Estimation", Journal of Machine Learning Research 7 (2006): 971--981
- "Empirical risk minimization is optimal for the convex aggregation problem", Bernoulli 19 (2013): 2153--2166
- David Mease, Abraham J. Wyner and Andreas Buja, "Boosted Classification Trees and Class Probability/Quantile Estimation", Journal of Machine Learning Research 8 (2007): 409--439
- Nicolai Meinshausen, "Forest Garrote", arxiv:0906.3590
- David J. Miller and Siddharth Pal, "Transductive Methods for the Distributed Ensemble Classification Problem", Neural Computation 19 (2007): 856--884
- Andriy Norets, "Approximation of conditional densities by smooth mixtures of regressions", Annals of Statistics 38 (2010): 1733--1766, arxiv:1010.0581
- L. Nunes and E. Oliveira, "On Learning by Exchanging Advice," cs.LG/0203010
- Fernando C. Pereira and Yoram Singer, "An Efficient Extension to Mixture Techniques for Prediction and Decision Trees", Machine Learning 36 (1999): 183--199
- Evgueni Petrov, "Constraint-based analysis of composite solvers," cs.AI/0302036
- Benedikt M. Pötscher, "The distribution of model averaging estimators and an impossibility result regarding its estimation", arxiv:math/0702781
- Philippe Rigollet, "Maximum likelihood aggregation and misspecified generalized linear models", Annals of Statistics 40 (2012): 639--665, arxiv:0911.2919
- Stephanie Sapp, Mark J. van der Laan, and John Canny, "Subsemble: An Ensemble Method for Combining Subset-Specific Algorithm Fits", working paper 313, Berkeley dept. of biostatistics (2013)
- Ville A. Satopää, Shane T. Jensen, Robin Pemantle, Lyle H. Ungar, "Partial Information Framework: Aggregating Estimates from Diverse Information Sources", arxiv:1505.06472
- Yoram Singer, "Adaptive Mixtures of Probabilistic Transducers", Neural Computation 9 (1997): 1711--1733 [PS.gz preprint]
- David S. Siroky, "Navigating Random Forests and related advances in algorithmic modeling", Statistics Surveys 3 (2009): 147--163
- Eiji Takimoto and Akira Maruoka, "Top-down decision tree learning as information based boosting," Theoretical Computer Science 292 (2002): 447-464
- Brandon M. Turner, Mark Steyvers, Edgar C. Merkle, David V. Budescu, Thomas S. Wallsten, "Forecast aggregation via recalibration", Machine Learning 95 (2014): 261--289
- Peter Welinder, Steve Branson, Serge Belongie and Pietro Perona, "The Multidimensional Wisdom of Crowds", NIPS 2010 (NIPS 23) [PDF reprint]
- Edwin Zhang, Vincent Zhu, Naomi Saphra, Anat Kleiman, Benjamin L. Edelman, Milind Tambe, Sham M. Kakade, Eran Malach, "Transcendence: Generative Models Can Outperform The Experts That Train Them", arxiv:2406.11741
- Héla Zouari, Laurent Heutte and Yves Lecourtier, "Controlling the diversity in classifier ensembles through a measure of agreement", Pattern Recognition 38 (2005): 2195--2199