Attention conservation notice: Only of interest if you (1) care about large-scale data analysis and/or taxis, and (2) will be in Pittsburgh on ~~Thursday~~ Friday.

The last but by no means least seminar talk this week:

- Taylor Arnold, "Analyzing large-scale data: Taxi Tipping behavior in NYC"
*Abstract:* Statisticians are increasingly tasked with providing insights from large streaming data sources, which can quickly grow to be terabytes or petabytes in size. In this talk, I explore novel approaches for applying classical and emerging techniques to large-scale datasets. Specifically, I discuss methodologies for expressing estimators in terms of the (weighted) Gramian matrix and other easily distributed summary statistics. I then present an abstraction layer for implementing chunk-wise algorithms that are interoperable over many parallel and distributed software frameworks. The utility and insights garnered from these methods are shown through an application to an event based dataset provided by the New York City Taxi and Limousine Commission. I have joined these observations, which detail every registered taxicab trip from 2009 to the present, with external sources such as weather conditions and demographics. I use the aforementioned techniques to explore factors associated with taxi demand and the tipping behavior of riders. My focus is on developing novel techniques to facilitate interactive exploratory data analysis and to construct interpretable models at scale.
*Time and place:* ~~4:30--5:30 pm on Thursday, 25 February 2016, in Baker Hall A51~~

4:30--5:30 pm on Friday, 26 February 2016, in Baker Hall A51
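
For readers wondering what "expressing estimators in terms of the Gramian matrix" buys you, here is a minimal sketch of the general idea. It is emphatically not Dr. Arnold's abstraction layer, and every function name in it is made up for illustration. Each chunk of data is reduced to $X^T X$ and $X^T y$; those small summaries add up across chunks, and the least-squares fit needs nothing else.

```python
from functools import reduce

# Chunk-wise simple regression y = intercept + slope * x.
# All names here are hypothetical; this only illustrates the Gramian trick.

def chunk_summaries(xs, ys):
    """Return (X'X, X'y) for one chunk, with design rows (1, x)."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return [[n, sx], [sx, sxx]], [sy, sxy]

def combine(a, b):
    """Add two (X'X, X'y) summaries element-wise."""
    (ga, va), (gb, vb) = a, b
    gram = [[ga[i][j] + gb[i][j] for j in range(2)] for i in range(2)]
    xty = [va[i] + vb[i] for i in range(2)]
    return gram, xty

def solve_ols(summary):
    """Solve the 2x2 normal equations (X'X) beta = X'y by Cramer's rule."""
    (g00, g01), (g10, g11) = summary[0]
    v0, v1 = summary[1]
    det = g00 * g11 - g01 * g10
    intercept = (v0 * g11 - g01 * v1) / det
    slope = (g00 * v1 - g10 * v0) / det
    return intercept, slope

def fit_chunks(chunks):
    """Fit the regression from an iterable of (xs, ys) chunks."""
    total = reduce(combine, (chunk_summaries(xs, ys) for xs, ys in chunks))
    return solve_ols(total)
```

Because `combine` is just addition, the chunks can be summarized in any order, or on different machines, before the final solve.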

As always, the talk is free and open to the public.

**Update**: Dr. Arnold's talk has been pushed back a day due to
travel delays.

Posted at February 25, 2016 11:16 | permanent link

Attention conservation notice: A half-clever dig at one of the more serious and constructive attempts to do something about an important problem that won't go away on its own. It doesn't even explain the idea it tries to undermine.

Jerzy's "cursory overview of differential privacy" post brings back to mind an idea which I doubt is original, but whose source I can't remember. (It's not Bambauer et al.'s "Fool's Gold: an Illustrated Critique of Differential Privacy" [ssrn/2326746], though they do make a related point about multiple queries.)

The point of differential privacy is to guarantee that adding or removing any one person from the data base can't change the likelihood function by more than a certain factor; that is, the log-likelihood changes by at most $\pm \epsilon$. This is achieved by adding noise with a Laplace (double-exponential) distribution to the output of any query from the data base, with the magnitude of the noise being inversely related to the required bound $\epsilon$. (Tighter privacy bounds require more noise.)
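
A minimal sketch of that mechanism for a counting query, assuming the usual calibration of the Laplace scale to (sensitivity)/$\epsilon$; the function names are invented for illustration and come from no particular library.

```python
import random

def laplace_noise(scale, rng):
    """One Laplace(0, scale) draw, as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Noisy answer to 'how many records satisfy predicate?'.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1/epsilon:
    smaller epsilon means more noise.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def mean_abs_error(epsilon, trials=20000, seed=0):
    """Average |noise| over repeated queries; should be close to 1/epsilon."""
    rng = random.Random(seed)
    records = list(range(20))            # toy data base
    is_even = lambda r: r % 2 == 0       # true count is 10
    errors = [abs(private_count(records, is_even, epsilon, rng) - 10)
              for _ in range(trials)]
    return sum(errors) / trials
```

Comparing `mean_abs_error(1.0)` with `mean_abs_error(0.1)` shows the trade-off: a ten-fold tighter bound costs roughly ten times the typical error.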

The tricky bit is that these $\epsilon$s are *additive* across
queries. If the $i^{\mathrm{th}}$ query can change the log-likelihood by up to
$\pm \epsilon_i$, a series of queries can change the log-likelihood by up to
$\sum_{i}{\epsilon_i}$. If the data-base owner allows a constant $\epsilon$
per query, we can then break the privacy by making lots of queries.
Conversely, if the $\epsilon$ per query is not to be too tight, we can only
allow a small number of constant-$\epsilon$ queries. A final option is to
gradually ramp down the $\epsilon_i$ so that their sum remains finite, e.g.,
$\epsilon_i \propto i^{-2}$. This would mean that early queries were subject
to little distortion, but later ones were more and more noisy.
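
To put rough numbers on the ramp-down option, suppose $\epsilon_i = \epsilon_1 i^{-2}$, and suppose (my notation, for illustration) that any one person can change the answer to a query by at most $\Delta$, so that the Laplace noise on the $i^{\mathrm{th}}$ query has scale $b_i = \Delta/\epsilon_i$. Then

$$ \sum_{i=1}^{\infty}{\epsilon_i} = \epsilon_1 \sum_{i=1}^{\infty}{i^{-2}} = \frac{\pi^2}{6}\epsilon_1 \approx 1.645\,\epsilon_1, \qquad b_i = \frac{\Delta}{\epsilon_i} = \frac{\Delta\, i^2}{\epsilon_1} $$

so the total privacy loss stays bounded, but the tenth query already carries 100 times the noise scale of the first.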

One side effect of any of these schemes, which is what I want to bring out, is that they offer a way to make the database unusable, or nearly unusable, for everyone else. I make the queries I want (if any), and then flood the server with random, pointless queries about the number of cars driven by left-handed dentists in Albuquerque (or whatever). Either the server has a fixed $\epsilon$ per query, and so a fixed upper limit on the number of queries, or the per-query $\epsilon$ shrinks, and so the noise grows, after each query. In the first case, the server has to stop answering others' queries; in the second, eventually they get only noise. Or --- more plausibly --- whoever runs the server has to abandon their differential privacy guarantee.
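
The fixed-$\epsilon$ case is just bookkeeping, which a toy sketch makes plain. The class and function names are hypothetical, and no real server's interface is being described.

```python
from fractions import Fraction

class BudgetedServer:
    """Toy query server: fixed epsilon per query, fixed total privacy budget."""

    def __init__(self, total_budget, epsilon_per_query):
        self.remaining = total_budget
        self.epsilon = epsilon_per_query

    def query(self):
        """Return True if the query is answered, None once the budget is spent."""
        if self.remaining < self.epsilon:
            return None                   # server must refuse to answer
        self.remaining -= self.epsilon    # epsilons add up across queries
        return True

def flood(server, attacker_queries, honest_queries):
    """Attacker asks first; return how many honest queries still get answered."""
    for _ in range(attacker_queries):
        server.query()
    return sum(1 for _ in range(honest_queries) if server.query() is not None)
```

With a total budget of 1 and $\epsilon = 1/10$ per query (exact fractions, to avoid floating-point drift in the accounting), ten pointless queries from me are enough to leave everyone else with nothing.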

This same attack would also work, by the way, against the "re-usable
holdout". That paper
(not surprisingly, given the authors) is basically about creating a testing
set, and then answering predictive models' queries about it while guaranteeing
differential privacy. To keep the distortion from blowing up, only a limited
number of queries can be asked of the testing-set server. That is, the server
is explicitly allowed to return `NA`, rather than a proper answer, and
it will always do so after enough questions. In the situation they imagine,
though, of the server being a "leaderboard" in a competition among models, the
simple way to win is to put in a model early (even a decent model, for form's
sake), and then *keep* putting trivial variants of it in, as often as
possible, as quickly as possible. This is because each time I submit a model,
I deprive all my possible opponents of one use of the testing set, and if I'm
fast enough I can keep them from ever having their models tested at all.

Posted at February 25, 2016 11:09 | permanent link

Attention conservation notice: A ponderous elaboration of an acerbic line by Upton Sinclair. Written so long ago I've honestly forgotten what incident provoked it, then left to gather dust and re-discovered by accident.

A common defense of experts consulting for sometimes nefarious characters in
legal cases is that the money isn't corrupting, if the expert happens to agree
with the position already anyway. So, for instance, if someone with relevant
expertise has doubts about the link between cigarette smoking and cancer, or
between fossil-fuel burning and global warming, what harm does it do if they
accept money from Philip Morris or Exxon, to defray the costs of advocating
this? By assumption, they're not *lying* about their expert opinion.

The problem with this excuse is that it pretends people never change their
ideas. When we deal with each other as more-or-less honest people — when
we treat what others say as *communications* rather than
as *manipulations* — we do assume that those we're listening to
are telling us things more-or-less as they see them. But we are *also*
assuming that if the way they saw things changed, what they said would track
that change. If they encountered new evidence, or even just new arguments,
they would respond to them, they would evaluate them, and if they found them
persuasive, they would not only change their minds, they would admit that they
had done so.
We know that it can be *galling* for anyone to admit that they were wrong,
but that's part of what we're asking for when we trust experts.

And now the problem with the on-going paid advocacy relationship becomes
obvious. It adds material injury to emotional insult as a reason not to admit
that one has changed one's mind. The human animal being what it is, this
becomes a reason *not to change one's mind* --- to ignore, or to explain
away, new evidence and new argument.

Sometimes the new evidence is ambiguous, the new argument has real
weaknesses, and then this desire not to be persuaded by it can perform a real
intellectual function, with each
side sharpening the other. (You could call this "the cunning of reason"
if you wanted to be really pretentious.) But how is the non-expert to know
whether your objections are really sound, or whether you are desperately BS-ing
to preserve your retainer? Maybe they could figure it out, with a lot of work,
but they would be *right* to be suspicious.

Posted at February 24, 2016 00:26 | permanent link

Attention conservation notice: A failed attempt at a dialogue, combining the philosophical sophistication and easy approachability of statistical theory with the mathematical precision and practical application of epistemology, dragged out for 2500+ words (and equations). You have better things to do than read me vent about manuscripts I volunteered to referee.

**Scene**: *We ascend, by a dubious staircase, to the
garret loft space of Confectioner-Stevedore Hall, at
Robberbaron-Bloodmoney University, where we find two fragments of the author's
consciousness, temporarily incarnated as perpetual post-docs from the
Department of Statistical Data Science, sharing an unheated office.*

Q: Are you unhappy with the manuscript you're reviewing?

A: Yes, but I don't see why you care.

Q: The stabbing motions of your pen are both ostentatious and distracting. If I listen to you rant about it, will you go back to working without the semaphore?

A: I think that just means you find it easier to ignore my words than anything else, but I'm willing to try.

Q: So, what is getting you worked up about the manuscript?

A: They take a perfectly reasonable — though not obviously appropriate-to-the-problem — regularized estimator, and then go through immense effort to Bayesify it. They end up with about seven levels of hierarchical priors. Simple Metropolis-Hastings Monte Carlo would move as slowly as a continental plate, so they put vast efforts into speeding it up, and in a real technical triumph they get something which moves like a glacier.

Q: Isn't that rather fast these days?

A: If they try to scale up, my back-of-the-envelope calculation suggests they really will enter the regime where each data set will take a single Ph.D. thesis to analyze.

Q: So do you think that they're just masochists who're into frequentist pursuit, or do they have some reason for doing all these things that annoy you?

A: Their fondness for tables over figures does give me pause, but no, they
claim to have a point. If they do all this work, they say, they can use their
posterior distributions to *quantify uncertainty* in their estimates.

Q: That sounds like something statisticians should want to do. Haven't you been very pious about just that, about how handling uncertainty is what really sets statistics apart from other traditions of data analysis? Haven't I heard you say to students that they don't know anything until they know where the error bars go?

A: I suppose I have, though I don't recall that exact phrase. It's not the goal I object to, it's the way quantification of uncertainty is supposed to follow automatically from using Bayesian updating.

Q: You have to admit, the whole "posterior probability distribution over
parameter values" thing certainly *looks* like a way of expressing
uncertainty in quantitative form. In
fact, last time we went around
about this, didn't you admit that Bayesian agents *are* uncertain
about parameters, though not about the probabilities of observable events?

A: I did, and they are, though that's very different from agreeing that they
quantify uncertainty in any useful way — that they handle
uncertainty *well*.

Q: Fine, I'll play the straight man and offer a concrete proposal for you to poke holes in. Shall we keep it simple and just consider parametric inference?

A: By all means.

Q: Alright, then, I start with some prior probability distribution over a finite-dimensional vector-valued parameter \( \theta \), say with density \( \pi(\theta) \). I observe \( x \) and have a model which gives me the likelihood \( L(\theta) = p(x;\theta) \), and then my posterior distribution is fixed by \[ \pi(\theta|X=x) \propto L(\theta) \pi(\theta) \] This is my measure-valued estimate. If I want a set-valued estimate of \( \theta \), I can fix a level \( \alpha \) and choose a region \( C_{\alpha} \) with \[ \int_{C_{\alpha}}{\pi(\theta|X=x) d\theta} = \alpha \] Perhaps I even preferentially grow \( C_{\alpha} \) around the posterior mode, or something like that, so it looks pretty. How is \( C_{\alpha} \) not a reasonable way of quantifying my uncertainty about \( \theta \)?

A: To begin with, I don't know the probability that the true \( \theta \in C_{\alpha} \).

Q: How is it *not* \( \alpha \), like it says right there on the label?

A: Again, I don't understand what that means.

Q: Are you attacking subjective probability? Is that where this is going? OK: sometimes, when a Bayesian agent and a bookmaker love each other very much, the bookie will offer the Bayesian bets on whether \( \theta \in C_{\alpha} \), and the agent will be indifferent so long as the odds are \( \alpha : 1-\alpha \). And even if the bookie is really a damn dirty Dutch gold-digger, the agent can't be pumped dry of money. What part of this do you not understand?

A: I hardly know where to begin. I will leave aside the color commentary.
I will leave aside the internal issues
with Dutch
book arguments for conditionalization. I will not pursue the fascinating,
even
*revealing* idea that something which is supposedly a universal
requirement of rationality needs such very historically-specific institutions
and ideas as *money* and *making book* and *betting odds*
for its expression. The important thing is that you're telling me that \(
\alpha \), the level of credibility or confidence, is really about your betting
odds.

Q: Yes, and?

A: I do not see why *I* should care about the odds at
which *you* might bet. It's even worse than that, actually: I do not see
why I should care about the odds at which a machine you programmed with the
saddle-blanket prior (or, if we were doing nonparametrics, an
Afghan jirga process
prior) would bet. I fail to see how those odds help me learn anything about
the world, or even reasonably-warranted uncertainties in inferences about the
world.

Q: May I indulge in mythology for a moment?

A: Keep it clean, students may come by.

Q: That leaves out all the best myths, but very well. Each morning, when woken by rosy-fingered Dawn, the goddess Tyche picks \( \theta \) from (what else?) an urn, according to \( \pi(\theta) \). Tyche then draws \( x \) from \( p(X;\theta) \), and \( x \) is revealed to us by the Sibyl or the whisper of oak leaves or sheep's livers. Then we calculate \( \pi(\theta|X=x) \) and \( C_{\alpha} \). In consequence, the fraction of days on which \( \theta \in C_{\alpha} \) is about \( \alpha \). \( \alpha \) is how often the credible set is right, and \( 1-\alpha \) is one of those error rates you like to go on about. Does this myth satisfy you?
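
(Q's myth is easy to act out on a computer. The sketch below assumes a toy model which is not in the dialogue: a \( \mathcal{N}(0,\tau^2) \) prior and a single observation \( x \sim \mathcal{N}(\theta, 1) \), with the central credible interval standing in for \( C_{\alpha} \). All names are made up for illustration.)

```python
import random
from statistics import NormalDist

def credible_interval(x, tau, alpha):
    """Central level-alpha credible interval for theta.

    Assumed toy model: theta ~ N(0, tau^2) a priori, x ~ N(theta, 1).
    The posterior is then N(w*x, w), with w = tau^2 / (1 + tau^2).
    """
    w = tau**2 / (1.0 + tau**2)
    z = NormalDist().inv_cdf(0.5 + alpha / 2.0)
    half_width = z * w**0.5
    return w * x - half_width, w * x + half_width

def tyche_hit_rate(days, tau=1.0, alpha=0.9, seed=1):
    """Fraction of days on which the interval contains that day's theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(days):
        theta = rng.gauss(0.0, tau)   # Tyche draws theta from the prior
        x = rng.gauss(theta, 1.0)     # ...and then x given theta
        lo, hi = credible_interval(x, tau, alpha)
        hits += lo <= theta <= hi
    return hits / days
```

(So long as \( \theta \) really is redrawn from the prior every morning, `tyche_hit_rate` comes out close to \( \alpha \), just as Q says.)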

A: Not really. I get that "Bayesian analysis treats the parameters as random". In fact, that myth suggests a very simple yet universal Monte Carlo scheme for sampling from any posterior distribution whatsoever, without any Markov chains or burn-in.

Q: Can you say more?

A: I should actually write it up. But now let's try to de-mythologize. I want to know what happens if we get rid of Tyche, or at least demote her from resetting \( \theta \) every day to just picking \( x \) from \( p(x;\theta) \), with \( \theta \) fixed by Zeus.
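
(The dialogue leaves the sampling scheme unwritten. One scheme which fits the description, offered as a sketch under assumptions and not necessarily the one A has in mind, is to re-enact the myth: draw \( \theta \) from the prior, simulate data from \( p(\cdot;\theta) \), and keep \( \theta \) only when the simulated data equal the observed data. For discrete data the kept draws are exact posterior samples. The binomial model, the uniform prior and the function name are all chosen purely for illustration.)

```python
import random

def posterior_by_myth(observed_k, n, draws, seed=2):
    """Posterior samples for a binomial proportion, by re-enacting the myth.

    Assumed toy setup: theta ~ Uniform(0, 1), k ~ Binomial(n, theta).
    No Markov chain and no burn-in: every kept draw is independent.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(draws):
        theta = rng.random()                               # prior draw
        k = sum(rng.random() < theta for _ in range(n))    # simulated data
        if k == observed_k:                                # exact match only
            kept.append(theta)
    return kept
```

(The catch is the acceptance rate, which is the prior predictive probability of seeing exactly the observed data; that is tolerable for ten coin flips and hopeless for almost anything bigger.)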

Q: I think you mean Ananke; Zeus would meddle with the parameters to cheat on Hera. Anyway, what do you think happens?

A: Well, \( C_{\alpha} \) depends on the data, it's really \( C_{\alpha}(x) \). Since \( x \) is random, \( X \sim p(\cdot;\theta) \), so is \( C_{\alpha} \). It follows a distribution of its own, and we can ask about \( Pr_{\theta}(\theta \in C_{\alpha}(X) ) \).

Q: Haven't we just agreed that that probability is just \( \alpha \) ?

A: No, we've seen that \[ \int{Pr_{\theta}(\theta \in C_{\alpha}(X) ) \pi(\theta) d\theta} = \alpha \] but that is a very different thing.

Q: How different could it possibly be?

A: As different as we like, at any particular \( \theta \).

Q: Could the 99% credible sets contain \( \theta \) only, say, 1% of the time?

A: Absolutely. This is the scenario
of Larry's
playlet,
but he wrote that up because it actually *happened* in a project he was
involved in.
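
(A toy calculation, not the example from that project, shows how far apart the two quantities can be. Assume a \( \mathcal{N}(0,\tau^2) \) prior, one observation \( X \sim \mathcal{N}(\theta,1) \), and the central credible interval; then \( Pr_{\theta}(\theta \in C_{\alpha}(X)) \) has a closed form. Function names are made up for illustration.)

```python
from statistics import NormalDist

def coverage_at(theta, tau=1.0, alpha=0.99):
    """Pr_theta(theta in C_alpha(X)) for the central credible interval.

    Assumed toy model: prior theta ~ N(0, tau^2), X ~ N(theta, 1).
    The interval is w*X +/- z*sqrt(w) with w = tau^2 / (1 + tau^2);
    with theta held fixed, theta - w*X ~ N((1 - w)*theta, w^2).
    """
    std = NormalDist()
    w = tau**2 / (1.0 + tau**2)
    z = std.inv_cdf(0.5 + alpha / 2.0)
    shift = (1.0 - w) * theta
    upper = (z * w**0.5 - shift) / w
    lower = (-z * w**0.5 - shift) / w
    return std.cdf(upper) - std.cdf(lower)

def prior_averaged_coverage(tau=1.0, alpha=0.99, grid=4001, span=10.0):
    """Riemann-sum approximation to the integral of coverage against the prior."""
    prior = NormalDist(0.0, tau)
    step = 2.0 * span * tau / (grid - 1)
    total = 0.0
    for i in range(grid):
        theta = -span * tau + i * step
        total += coverage_at(theta, tau, alpha) * prior.pdf(theta) * step
    return total
```

(Under these assumptions the prior-averaged coverage of the 99% set is 99%, as it must be, while the coverage at a fixed \( \theta \) six prior standard deviations from the prior mean is already below 1%.)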

Q: Isn't it a bit artificial to worry about the long-run proportion of the time you're right about parameters?

A: The same argument works if you estimate many parameters at once. When
the brain-imaging people do fMRI experiments, they estimate how tens of
thousands of little regions in the brain ("voxels") respond to stimuli. That
means estimating tens of thousands of parameters. I don't think they'd be
happy if their 99% intervals turned out to contain the right answer for only 1%
of the voxels. But posterior betting odds don't have to have *anything*
to do with how often bets are right, and usually they don't.

Q: Isn't "usually" very strong there?

A: No, I don't think so. D. A. S. Fraser has a wonderful paper, which should be better known, called "Is Bayes Posterior Just Quick and Dirty Confidence", and his answer to his own question is basically "Yes. Yes it is." More formally, he shows that the conditions for Bayesian credible sets to have correct coverage, to be confidence sets, are incredibly restrictive.

Q: But what about the Bernstein-von Mises theorem? Doesn't it say we don't have to worry for big samples, that credible sets are asymptotically confidence sets?

A: Not really. It says that if you have a fixed-dimensional model, and the
usual regularity conditions for maximum likelihood estimation hold, so that \(
\hat{\theta}_{MLE} \rightsquigarrow \mathcal{N}(\theta, n^{-1}I^{-1}(\theta)) \),
and some *more* regularity conditions hold, then the posterior
distribution is also asymptotically \( \mathcal{N}(\theta, n^{-1}I^{-1}(\theta)) \).
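
(A one-parameter check of what the theorem promises, under assumptions not in the dialogue: coin-tossing with a uniform prior, so the posterior is a Beta distribution, compared against the normal approximation centered at the MLE with variance given by the inverse Fisher information over \( n \). Function names are made up for illustration.)

```python
def beta_posterior_mean_sd(k, n):
    """Mean and sd of the Beta(k+1, n-k+1) posterior (uniform prior)."""
    a, b = k + 1, n - k + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var ** 0.5

def normal_approx_mean_sd(k, n):
    """MLE and its asymptotic sd, sqrt(theta_hat * (1 - theta_hat) / n)."""
    theta_hat = k / n
    return theta_hat, (theta_hat * (1 - theta_hat) / n) ** 0.5

def relative_sd_gap(k, n):
    """Relative difference between the posterior sd and the asymptotic sd."""
    _, sd_post = beta_posterior_mean_sd(k, n)
    _, sd_norm = normal_approx_mean_sd(k, n)
    return abs(sd_post - sd_norm) / sd_norm
```

(With 3 heads in 10 tosses the two spreads differ by nearly a tenth; with 3000 in 10000 the difference is about one part in ten thousand. This is the fixed-dimension, well-behaved case, which is exactly where the theorem applies.)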

Q: Wait, so the theorem says that when it applies, if I want to be Bayesian I might as well just skip all the MCMC and maximize the likelihood?

A: You might well think that. You might very well think that. I couldn't possibly comment.

Q: !?!

A: Except to add that the theorem breaks down in the high-dimensional regime
where the number of parameters grows with the number of samples,
and goes to hell
in the non-parametric regime of infinite-dimensional parameters. (In fact,
Fraser gives *one*-dimensional examples where the mis-match between
Bayesian credible levels and actual coverage is asymptotically \( O(1) \).)
As Freedman said,
if you want a confidence set, you need to build a confidence set, not mess
around with credible sets.

Q: But surely coverage — "confidence" — isn't all that's needed? Suppose I have only a discrete parameter space, and for each point I flip a coin which comes up heads with probability \( \alpha \). Now my \( C_{\alpha} \) is all the parameter points where the coin came up heads. Its expected coverage is \( \alpha \), as claimed. In fact, if I can come up with a Gygax test, say using the low-significance digits of \( x \), I could invert that to get my confidence set, and get coverage of \( \alpha \) exactly. What then?

A: I never said that coverage was *all* we needed from a set-valued
estimator. It should also be consistent: as we get more data, the set should
narrow in on \( \theta \), no matter what \( \theta \) happens to be. Your
Gygax sets won't do that. My point is that if you're going to use
probabilities, they ought to *mean* something, not just refer to some
imaginary gambling racket going on in your head.

Q: I am not going to let this point go so easily. It seems like you're insisting on calibration for Bayesian credible sets, that the fraction of them covering the truth be (about) the stated probability, right?

A: That seems like a pretty minimal requirement for treating supposed probabilities seriously. If (as on Discworld) "million to one chances turn up nine times out of ten", they're not really million to one.

Q: Fine — but isn't the Bayesian agent calibrated with probability 1?

A: With *subjective* probability 1.
But failure of calibration is actually
typical or generic, in the topological sense.

Q: But maybe the world we live in isn't "typical" in that weird sense of the topologists?

A: Maybe! The fact that Bayesian agents put probability 1 on the "meager" set of sample paths where they are calibrated implies that lots of stochastic processes are supported on topologically-atypical sets of paths. But now we're leaning a lot on a pre-established harmony between the world and our prior-and-model.

Q: Let me take another tack. What if calibration typically fails, but typically fails just a little — say probabilities are really \( p(1 \pm \epsilon ) \) when we think they're \( p \). Would you be very concerned, if \( \epsilon \) were small enough?

A: Honestly, no, but I have no good reason to think that, in general,
approximate calibration or coverage is much more common than exact calibration.
Anyway, we *know* that credible probabilities can be radically,
dismally off as coverage probabilities, so it seems like a moot point.

Q: So what sense do you make of the uncertainties which come out of Bayesian procedures?

A: "If we started with a population of guesses distributed like *this*,
and then selectively bred them to match the data, here's the dispersion of
the final guesses."

Q: You don't think that sounds both thin and complicated?

A: Of course it's both. (And it only gets more complicated if I explain "selective breeding" and "matching the data".) But it's the best sense I can make, these days, of Bayesian uncertainty quantification as she is computed.

Q: And what's your alternative?

A: I want to know about how differently the experiment, the estimate, could
have turned out, even if the underlying reality were the same. Standard errors
— or median absolute errors, etc. — and confidence sets are
about *that* sort of uncertainty, about re-running the experiment. You
might mess up, because your model is wrong, but at least there's a sensible
notion of probability in there, referring to things happening in the world.
The Bayesian alternative is some sort of sub-genetic-algorithm evolutionary
optimization routine you are supposedly running in your mind, while I run a
different one in my mind, etc.

Q: But what about all the criticisms of p-values and null hypothesis significance tests and so forth?

A: They *all* have Bayesian counterparts, as people like Andy Gelman
and Christian
Robert know
very well. The difficulties aren't about not being Bayesian, but about
things like testing stupid hypotheses, not accounting for multiple testing or
model
search, selective
reporting, insufficient
communication, etc. But now we're in danger of drifting really far from
our starting point about uncertainty in estimation.

Q: Would you sum that up then?

A: I don't believe the uncertainties you get from just slapping a prior on
something, even if you've chosen your prior so the MAP or the posterior mean
matches some reasonable penalized estimator. Give me some reason to think that
your posterior probabilities have *some* contact with reality, or I'll
just see them as "quick and dirty confidence" — only often not so quick
and *very* dirty.

Q: Is that what you're going to put in your referee report?

A: I'll be more polite.

**Disclaimer**: *Not a commentary on any specific talk,
paper, or statistician. One reason this is a failed attempt at a dialogue is
that there is more Q could have said in defense of the Bayesian approach, or at
least in objection to A. (I take some comfort in the fact that
it's traditional
for characters in dialogues to engage in high-class trolling.) Also, the
non-existent Robberbaron-Bloodmoney University is not to be confused with the
very real Carnegie Mellon University; for instance, the latter lacks a
hyphen.*

Posted at February 21, 2016 20:23 | permanent link

*Attention conservation notice:* Only of interest if (1) you care about statistics and complex systems, and (2) will be in Pittsburgh on Wednesday.

- Sumanta Basu, "Learning Dynamics of Complex Systems from High-Dimensional Datasets"
*Abstract:* The problem of learning interrelationships among the components of large, complex systems from noisy, high-dimensional datasets is common in many areas of modern economic and biological sciences. Examples include macroeconomic policy making, financial risk management, gene regulatory network reconstruction and elucidating functional roles of epigenetic regulators driving cellular mechanisms. In addition to their inherent computational challenges, principled statistical analyses of these big data problems often face unique challenges emerging from temporal and cross-sectional dependence in the data and complex dynamics (heterogeneity, nonlinear and high-order interactions) among the system components.
- In this talk, I will start with network Granger causality --- a framework for structure learning and forecasting of large dynamic systems from multivariate time series and panel datasets using regularized estimation of high-dimensional vector autoregressive models. I will discuss theoretical properties of the proposed estimates and demonstrate their advantages on a motivating application from financial econometrics --- system-wide risk monitoring of the U.S. financial sector before, during and after the crisis of 2007--2009. I will conclude with some of my ongoing works on learning nonlinear and potentially high-order interactions in high-dimensional, heterogeneous settings. I will introduce iterative Random Forest (iRF), a supervised learning algorithm based on randomized decision tree ensembles, that achieves predictive accuracy comparable to state-of-the-art learning machines and provides insight into high-order interaction relationships among features. I will demonstrate the usefulness of iRF on a motivating application from systems biology - learning epigenetic landscape of enhancer elements in *Drosophila melanogaster* from next generation sequencing datasets.
*Time and place:* 4--5 pm on Wednesday, 24 February 2016, place TBA

As always, the talk is free and open to the public.

Posted at February 20, 2016 20:47 | permanent link

Attention conservation notice: Only of interest if you (1) care about evidence on *how* inequality matters for health, and (2) will be in Pittsburgh on Tuesday.

- Therri Usher, "Likelihood-Based Methods of Mediation Analysis in the Context of Health Disparities"
*Abstract:* African-Americans experience higher incidences of death and disability compared to non-Hispanic whites. Much of the existing research has focused on identifying the existence of health disparities, as methodological issues have hampered the development of health disparities research. In order to create solutions to eliminate health disparities, research must understand the mechanisms powering their existence.
- Existing causal inference tools are not suitable for studying racial health disparities, as race cannot be manipulated or changed. For the same reason, mediators stand to be useful in creating avenues to intervene on existing health disparities. Structural equation modeling (SEM) may be a more promising tool for quantifying the causal framework of health disparities.
- One of the most widely-used tests for assessing mediation is the Sobel test (Sobel, 1982; MacKinnon et al, 2007). However, it has disadvantages, including lower power at smaller sample sizes. Therefore, this work focuses on three varying methods for assessing mediation and compares their performance to the Sobel test.
- The first method is an adjustment of the Sobel test that utilizes variance estimation using random covariates. The second method utilizes the joint distribution of the mediator and the outcome to determine profile likelihoods for the estimands of interest in order to derive distributions for their estimates. Finally, the third method utilizes Bayesian modeling techniques to fit the structural equation models and estimate the probability of mediation through quantile estimation. Simulations provided evidence that all three methods demonstrated comparable estimated statistical power compared to the Sobel test, often showcasing superior power at smaller sample sizes while providing more tools of inference into the presence of mediation.
- The methods were applied to assess whether diet mediates the relationship between race and blood pressure in non-Hispanic black and white subjects in the National Health and Nutrition Examination Survey (NHANES) from 1999-2004.
*Time and place:* 4:30--5:30 pm on Tuesday, 23 February 2016, in Baker Hall A51

As always, the talk is free and open to the public.

Posted at February 20, 2016 20:46 | permanent link

Attention conservation notice: Only of interest if you (1) want to do high-dimensional regressions without claiming lots of discoveries which turn out to be false, and (2) will be in Pittsburgh on Monday.

"But Cosma", I hear you asking, "how can you be five talks into the spring seminar series without having had a single talk about false discovery rate control? Is the CMU department feeling quite itself?" I thank you for your concern, and hope this will set your mind at ease:

- Weijie Su, "Multiple Testing and Adaptive Estimation via the Sorted L-One Norm"
*Abstract:* In many real-world statistical problems, we observe a large number of potentially explanatory variables of which a majority may be irrelevant. For this type of problem, controlling the false discovery rate (FDR) guarantees that most of the discoveries are truly explanatory and thus replicable. In this talk, we propose a new method named SLOPE to control the FDR in sparse high-dimensional linear regression. This computationally efficient procedure works by regularizing the fitted coefficients according to their ranks: the higher the rank, the larger the penalty. This is analogous to the Benjamini-Hochberg procedure, which compares more significant p-values with more stringent thresholds. Whenever the columns of the design matrix are not strongly correlated, we show empirically that SLOPE obtains FDR control at a reasonable level while offering substantial power.
- Although SLOPE is developed from a multiple testing viewpoint, we show the surprising result that it achieves optimal squared errors under Gaussian random designs over a wide range of sparsity classes. An appealing feature is that SLOPE does not require any knowledge of the degree of sparsity. This adaptivity to unknown sparsity has to do with the FDR control, which strikes the right balance between bias and variance. The proof of this result presents several elements not found in the high-dimensional statistics literature.
*Time and place:* 4--5 pm on Monday, 22 February 2016, in Scaife Hall 125.
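
For anyone who wants the two ingredients of the abstract in concrete form, here is a minimal sketch of the Benjamini-Hochberg step-up rule and of a sorted-L1 penalty. This is only the penalty, not a SLOPE solver and not Dr. Su's code; the function names are made up, and the sequence of `lambdas` is left to the caller.

```python
def benjamini_hochberg(pvalues, q):
    """Indices rejected by the BH step-up procedure at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k = rank    # largest rank whose p-value clears its threshold
    return sorted(order[:k])

def sorted_l1_penalty(coefs, lambdas):
    """Sum of lambda_i * |b|_(i): the largest |coefficient| gets the largest lambda."""
    if any(l1 < l2 for l1, l2 in zip(lambdas, lambdas[1:])):
        raise ValueError("lambdas must be non-increasing")
    magnitudes = sorted((abs(b) for b in coefs), reverse=True)
    return sum(lam * mag for lam, mag in zip(lambdas, magnitudes))
```

With all the `lambdas` equal, the penalty reduces to an ordinary (scaled) L1 norm, i.e., the lasso penalty; the ranking only matters once the sequence actually decreases.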

As always, the talk is free and open to the public.

Posted at February 20, 2016 20:45 | permanent link

Attention conservation notice: A distinguished but elderly scientist philosophizes in public.

As a Judea Pearl fanboy, it is inevitable that I would help promote this:

- Judea Pearl, "Science, Counterfactuals and Free Will" (Dickson Prize Lecture)
*Abstract:* Counterfactuals, or fictitious changes, are the building blocks of scientific thought and the oxygen of moral behavior. The ability to reflect back on one's past actions and envision alternative scenarios is the basis of learning, free will, responsibility and social adaptation.
- Recent progress in the algorithmization of counterfactuals has advanced our understanding of this mode of reasoning and has brought us a step closer toward equipping machines with similar capabilities. Dr. Pearl will first describe a computational model of counterfactual reasoning, and then pose some of the more difficult problems that counterfactuals present: why evolution has endowed humans with the illusion of free will, and how it manages to keep that illusion so vivid in our brain.
*Time and place:* noon--1 pm on Monday, 29 February 2016, in McConomy Auditorium, University Center

As always, the talk is free and open to the public.

ObLinkage 1: Clark Glymour, "We believe in freedom of the will so that we can learn".

ObLinkage 2: Mightn't that "illusion of free will" be the only sort
worth wanting?

Posted at February 15, 2016 16:49 | permanent link

Attention conservation notice: Only of interest if you (1) care about statistical methods for causal inference, and (2) will be in Pittsburgh on Thursday.

There are whole books on causal inference which make it seem like the
subject is exhausted by comparing the effect of The Treatment to the control
condition.
(*cough* Imbens
and Rubin *cough*) But any approach to causal inference which can't
grasp a dose-response curve might be sound but is not complete. Nor is there
any reason, in this day and age, to stick to simple regression. Fortunately,
we don't have to:

- Edward Kennedy, "Robust Causal Inference with Continuous Exposures" (arxiv:1507.00747)
*Abstract:*Continuous treatments (e.g., doses) arise often in practice, but standard causal effect estimators are limited: they either employ parametric models for the effect curve, or else do not allow for doubly robust covariate adjustment. Double robustness allows one of two nuisance estimators to be misspecified, and is important for protecting against model misspecification as well as reducing sensitivity to the curse of dimensionality. In this work we develop a novel approach for causal dose-response curve estimation that is doubly robust without requiring any parametric assumptions, and which naturally incorporates general off-the-shelf machine learning. We derive asymptotic properties for a kernel-based version of our approach and propose a method for data-driven bandwidth selection. The methods are illustrated via simulation and in a study of the effect of hospital nurse staffing on excess readmissions penalties.
*Time and place:*4:30--5:30 pm on Thursday, 18 February 2016, in Baker Hall A51

As always, the talk is free and open to the public.

Posted at February 15, 2016 16:31 | permanent link

Attention conservation notice:Only of interest if you (1) wish you could use the power of modern optimization to allocate an online advertising budget, and (2) will be in Pittsburgh on Tuesday.

I can think of at least two ways an Internet Thought Leader could make a splash by combining Tuesday's seminar with Maciej Ceglowski's "The Advertising Bubble". Fortunately for all concerned, I am not an Internet Thought Leader.

- Courtney Paulson, "Optimal Large-Scale Internet Media Selection" [PDF preprint]
*Abstract:*Although Internet advertising is vital in today's business world, research on optimal Internet media selection has been sparse. Firms face considerable challenges in their budget allocation decisions, including the large number of websites they may potentially choose, the vast variation in traffic and costs across websites, and the inevitable correlations in viewership among these sites. Due to these unique features, Internet advertising problems are actually a subset of a more diverse, general class of problems: penalized and constrained optimization. Generally, attempting to select the optimal subset of websites among all possible combinations is an NP-hard problem; as such, existing non-approaches can only handle Internet media selection in settings on the order of ten websites. Further, these approaches are not generalizable. Although generalizable penalized methodology exists to handle large-scale problems, this methodology cannot incorporate natural advertising constraints, such as budget allocation to particular websites or demographic weighting. We propose an optimization method that is computationally feasible to allocate advertising budgets among thousands of websites while also incorporating these common constraints. The method performs similarly to extant approaches in settings scalable to prior methods, but the method is also flexible enough to accommodate practical Internet advertising considerations such as targeted consumer demographics, mandatory media coverage to matched content websites, and target frequency of ad exposure.
*Time and place:*~~4:30--5:30 pm on Tuesday, 16 February 2016, in Baker Hall A51~~ Due to winter travel delays, the talk will take place at 4:30 on Wednesday the 17th, room TBA

As always, the talk is free and open to the public.

Posted at February 11, 2016 20:11 | permanent link

Attention conservation notice:Only of interest if you (1) care about allocating precise fractions of a whole belief over a set of mathematical models when you know none of them is actually believable, and (2) will be in Pittsburgh on Monday.

As someone who thinks Bayesian inference is only worth considering under mis-specification, next week's first talk is of intense interest.

- Jeff Miller, "Robust Bayesian inference via coarsening" (arxiv:1506.06101)
*Abstract:*The standard approach to Bayesian inference is based on the assumption that the distribution of the data belongs to the chosen model class. However, even a small violation of this assumption can have a large impact on the outcome of a Bayesian procedure, particularly when the data set is large. We introduce a simple, coherent approach to Bayesian inference that improves robustness to small departures from the model: rather than conditioning on the observed data exactly, one conditions on the event that the model generates data close to the observed data, with respect to a given statistical distance. When closeness is defined in terms of relative entropy, the resulting "coarsened posterior" can be approximated by simply raising the likelihood to a certain fractional power, making the method computationally efficient and easy to implement in practice. We illustrate with real and simulated data, and provide theoretical results.
*Time and place:*4 pm on Monday, 15 February 2016, in 125 Scaife Hall
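
To make "raising the likelihood to a certain fractional power" concrete, here is a minimal sketch in the simplest conjugate setting, a Beta prior with Bernoulli data. It is not the speaker's code. The exponent `zeta` is taken as a user-supplied input because the abstract does not say how it is chosen, and the function names are hypothetical.

```python
def power_posterior_beta(successes, failures, zeta, a=1.0, b=1.0):
    """Beta(a, b) prior with a Bernoulli likelihood raised to the power zeta.

    Tempering multiplies the sufficient statistics by zeta, so the result
    is Beta(a + zeta * successes, b + zeta * failures). Setting zeta = 1
    recovers the ordinary posterior.
    """
    if not 0.0 < zeta <= 1.0:
        raise ValueError("zeta should lie in (0, 1]")
    return a + zeta * successes, b + zeta * failures


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


# 60 successes and 40 failures, with and without tempering.
standard = power_posterior_beta(60, 40, zeta=1.0)
tempered = power_posterior_beta(60, 40, zeta=0.5)
print(standard, beta_variance(*standard))
print(tempered, beta_variance(*tempered))
```

With `zeta` below 1 the data count for less, so the tempered posterior is more spread out than the standard one, which is the sense in which the procedure hedges against the model being slightly wrong.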

As always, the talk is free and open to the public.

Posted at February 10, 2016 00:24 | permanent link

Attention conservation notice:Only of interest if you (1) care about factor analysis and Bayesian nonparametrics, and (2) will be in Pittsburgh on Monday.

Constant readers, knowing of my love-hate relationship with both factor analysis and Bayesian methods, will appreciate that the only way I could possibly be more ambivalent about our next seminar would be if it also involved power-law distributions.

- Veronika Ročková, "Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity" [preprint, preprint supplement]
*Abstract:*Rotational post-hoc transformations have traditionally played a key role in enhancing the interpretability of factor analysis. Regularization methods also serve to achieve this goal by prioritizing sparse loading matrices. In this work, we bridge these two paradigms with a unifying Bayesian framework. Our approach deploys *intermediate* factor rotations throughout the learning process, greatly enhancing the effectiveness of sparsity inducing priors. These automatic rotations to sparsity are embedded within a PXL-EM algorithm, a Bayesian variant of parameter-expanded EM for posterior mode detection. By iterating between soft-thresholding of small factor loadings and transformations of the factor basis, we obtain (a) dramatic accelerations, (b) robustness against poor initializations and (c) better oriented sparse solutions. To avoid the pre-specification of the factor cardinality, we extend the loading matrix to have infinitely many columns with the Indian Buffet Process (IBP) prior. The factor dimensionality is learned from the posterior, which is shown to concentrate on sparse matrices. Our deployment of PXL-EM performs a dynamic posterior exploration, outputting a solution path indexed by a sequence of spike-and-slab priors. For accurate recovery of the factor loadings, we deploy the Spike-and-Slab LASSO prior, a two-component refinement of the Laplace prior (Rockova 2015). A companion criterion, motivated as an integral lower bound, is provided to effectively select the best recovery. The potential of the proposed procedure is demonstrated on both simulated and real high-dimensional gene expression data, which would render posterior simulation impractical.
*Time and place:*4 pm on Monday, 8 February 2016, in 125 Scaife Hall

As always, the talk is free and open to the public.

Posted at February 03, 2016 22:32 | permanent link