"Statistical Estimation with Random Forests" (This Week at the Statistics Seminar)
Attention conservation notice: Only of interest if you (1) are
interested in seeing machine learning methods turned (back) into ordinary inferential statistics, and (2) will be in Pittsburgh on Wednesday.
Leo Breiman's random forests have long been one of the poster children for
what he called
"algorithmic models", detached from his "data models" of data-generating
processes. I am not sure whether developing classical, data-model
statistical-inferential theory for random forests would please him or have him
spinning in his grave, but either way I'm sure it will make for an interesting talk.
- Stefan Wager, "Statistical Estimation with Random Forests"
- Abstract: Random forests, introduced by Breiman (2001), are
among the most widely used machine learning algorithms today, with applications
in fields as varied as ecology, genetics, and remote sensing. Random forests
have been found empirically to fit complex interactions in high dimensions, all
while remaining strikingly resilient to overfitting. In principle, these
qualities ought to also make random forests good statistical estimators.
However, our current understanding of the statistics of random forest
predictions is not good enough to make random forests usable as a part of a
standard applied statistics pipeline: in particular, we lack robust consistency
guarantees and asymptotic inferential tools. In this talk, I will present some
recent results that seek to overcome these limitations. The first half of the
talk develops a Gaussian theory for random forests in low dimensions that
allows for valid asymptotic inference, and applies the resulting methodology to
the problem of heterogeneous treatment effect estimation. The second half of
the talk then considers high-dimensional properties of regression trees and
forests in a setting motivated by the work of Berk et al. (2013) on valid
post-selection inference; at a high level, we find that the amount by which a
random forest can overfit to training data scales only logarithmically in the
ambient dimension of the problem.
- (This talk is based on joint work with Susan Athey, Brad Efron,
Trevor Hastie, and Guenther Walther.)
- Time and place: 4--5 pm on Wednesday, 11 November 2015 in Doherty Hall 1112
As always, the talk is free and open to the public.
Enigmas of Chance
Posted at November 09, 2015 16:23