November 09, 2015

"Inference in the Presence of Network Dependence Due to Contagion" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about statistical inference with network data, and (2) will be in Pittsburgh next week.

A (perhaps) too-skeptical view of statistics is that we should always think we have $ n=1 $, because our data set is a single, effectively irreproducible, object. With a lot of care and trouble, we can obtain things very close to independent samples in surveys and experiments. When we get to time series or spatial data, independence becomes a myth we must abandon, but we still hope that we can break up the data set into many nearly-independent chunks. To make those ideas plausible, though, we need to have observations which are widely separated from each other. And those asymptotic-independence stories themselves seem like myths when we come to networks, where, famously, everyone is close to everyone else. The skeptic would, at this point, refrain from drawing any inference whatsoever from network data. Fortunately for the discipline, Betsy Ogburn is not such a skeptic.
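How badly dependence erodes the effective sample size can be seen in a toy simulation (mine, not anything from the talk): draw a strongly autocorrelated AR(1) series and compare the actual standard error of the sample mean, across many replications, to the $\sigma/\sqrt{n}$ we would report if the $n$ observations were independent.

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps, rho = 1000, 2000, 0.9

# Many replications of a stationary AR(1) series with unit variance:
# strongly dependent draws, but "n = 1000" on paper.
x = np.empty((reps, n))
x[:, 0] = rng.normal(size=reps)
for t in range(1, n):
    x[:, t] = rho * x[:, t - 1] + np.sqrt(1 - rho**2) * rng.normal(size=reps)

means = x.mean(axis=1)
naive_se = 1.0 / np.sqrt(n)   # SE if the n draws were independent
actual_se = means.std()       # Monte Carlo SE of the sample mean

print(naive_se, actual_se)    # dependence inflates the SE well past naive_se
```

For this process the variance of the mean is roughly $(1+\rho)/(1-\rho)$ times the i.i.d. value, so at $\rho = 0.9$ the thousand dependent observations are worth only about fifty independent ones. Network dependence raises the same issue without the convenient ordering in time that lets us do this bookkeeping.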

Elizabeth Ogburn, "Inference in the Presence of Network Dependence Due to Contagion"
Abstract: Interest in and availability of social network data have led to increasing attempts to make causal and statistical inferences using data collected from subjects linked by social network ties. But inference about all kinds of estimands, starting with simple sample means, is challenging when only a single network of non-independent observations is available. There is a dearth of principled methods for dealing with the dependence that such observations can manifest. We describe methods for causal and semiparametric inference when the dependence is due solely to the transmission of information or outcomes along network ties.
Time and place: 4--5 pm on Monday, 16 November 2015, in 1112 Doherty Hall

As always, the talk is free and open to the public.

Enigmas of Chance; Networks

Posted at November 09, 2015 22:14 | permanent link

"Statistical Estimation with Random Forests" (This Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) are interested in seeing machine learning methods turned (back) into ordinary inferential statistics, and (2) will be in Pittsburgh on Wednesday.

Leo Breiman's random forests have long been one of the poster children for what he called "algorithmic models", detached from his "data models" of data-generating processes. I am not sure whether developing classical, data-model statistical-inferential theory for random forests would please him, or have him spinning in his grave, but either way I'm sure it will make for an interesting talk.
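For readers who know the name but not the estimator, here is a minimal sketch of random-forest regression (using scikit-learn's RandomForestRegressor on made-up data; the data and settings are my illustration, not anything from the talk):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
# Response with an interaction and a nonlinearity, plus a little noise.
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + 0.1 * rng.normal(size=n)

# A forest averages many deep trees, each grown on a bootstrap resample.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X[:250], y[:250])

r2 = forest.score(X[250:], y[250:])  # held-out R^2
print(r2)
```

This is just the point estimator; what the talk supplies is what the off-the-shelf version lacks, namely consistency guarantees and asymptotically valid standard errors for predictions like these.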

Stefan Wager, "Statistical Estimation with Random Forests"
Abstract: Random forests, introduced by Breiman (2001), are among the most widely used machine learning algorithms today, with applications in fields as varied as ecology, genetics, and remote sensing. Random forests have been found empirically to fit complex interactions in high dimensions, all while remaining strikingly resilient to overfitting. In principle, these qualities ought to also make random forests good statistical estimators. However, our current understanding of the statistics of random forest predictions is not good enough to make random forests usable as a part of a standard applied statistics pipeline: in particular, we lack robust consistency guarantees and asymptotic inferential tools. In this talk, I will present some recent results that seek to overcome these limitations. The first half of the talk develops a Gaussian theory for random forests in low dimensions that allows for valid asymptotic inference, and applies the resulting methodology to the problem of heterogeneous treatment effect estimation. The second half of the talk then considers high-dimensional properties of regression trees and forests in a setting motivated by the work of Berk et al. (2013) on valid post-selection inference; at a high level, we find that the amount by which a random forest can overfit to training data scales only logarithmically in the ambient dimension of the problem.
(This talk is based on joint work with Susan Athey, Brad Efron, Trevor Hastie, and Guenther Walther.)
Time and place: 4--5 pm on Wednesday, 11 November 2015, in 1112 Doherty Hall

As always, the talk is free and open to the public.

Enigmas of Chance

Posted at November 09, 2015 16:23 | permanent link

Three-Toed Sloth