Attention conservation notice: Only of interest if you (1) care about large-scale data analysis and/or taxis, and (2) will be in Pittsburgh on Friday.
The last but by no means least seminar talk this week:
As always, the talk is free and open to the public.
Update: Dr. Arnold's talk has been pushed back a day due to travel delays.
Posted at February 25, 2016 11:16 | permanent link
Attention conservation notice: A half-clever dig at one of the more serious and constructive attempts to do something about an important problem that won't go away on its own. It doesn't even explain the idea it tries to undermine.
Jerzy's "cursory overview of differential privacy" post brings back to mind an idea which I doubt is original, but whose source I can't remember. (It's not Bambauer et al.'s "Fool's Gold: an Illustrated Critique of Differential Privacy" [ssrn/2326746], though they do make a related point about multiple queries.)
The point of differential privacy is to guarantee that adding or removing any one person from the data base can't change the likelihood function by more than a certain factor; that the log-likelihood remains within $\pm \epsilon$. This is achieved by adding noise with a Laplace (double-exponential) distribution to the output of any query from the data base, with the magnitude of the noise being inversely related to the required bound $\epsilon$. (Tighter privacy bounds require more noise.)
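For concreteness, here is a minimal sketch of the noise-adding step, assuming a counting query whose answer changes by at most 1 when one person is added or removed (the "sensitivity", a term the paragraph above doesn't use). The function name `laplace_mechanism` and the count 1234 are made up for illustration; this is not anybody's production implementation.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng, size=None):
    """Return true_answer plus Laplace noise of scale sensitivity / epsilon.

    Smaller epsilon (a tighter privacy bound) means a larger noise scale.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale, size=size)

rng = np.random.default_rng(0)
true_count = 1234  # made-up answer to a counting query
for eps in (1.0, 0.1, 0.01):
    noisy = laplace_mechanism(true_count, 1.0, eps, rng, size=10_000)
    print(f"epsilon={eps}: std of noisy answers = {noisy.std():.1f}")
```

The standard deviation of a Laplace distribution is $\sqrt{2}$ times its scale, so each ten-fold tightening of $\epsilon$ should make the printed spread about ten times larger.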
The tricky bit is that these $\epsilon$s are additive across queries. If the $i^{\mathrm{th}}$ query can change the log-likelihood by up to $\pm \epsilon_i$, a series of queries can change the log-likelihood by up to $\sum_{i}{\epsilon_i}$. If the data-base owner allows a constant $\epsilon$ per query, we can then break the privacy by making lots of queries. Conversely, if the $\epsilon$ per query is not to be too tight, we can only allow a small number of constant-$\epsilon$ queries. A final option is to gradually ramp down the $\epsilon_i$ so that their sum remains finite, e.g., $\epsilon_i \propto i^{-2}$. This would mean that early queries were subject to little distortion, but later ones were more and more noisy.
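To make the ramp-down option concrete, here is a worked case, assuming (my choice of constants, nothing canonical) that $\epsilon_i = c\, i^{-2}$ and that each query has sensitivity 1, so that Laplace noise of scale $1/\epsilon_i$ is what's called for: \[ \sum_{i=1}^{\infty}{\epsilon_i} = c \sum_{i=1}^{\infty}{i^{-2}} = \frac{c \pi^2}{6} \] A total budget of $\epsilon_{\mathrm{total}}$ therefore fixes $c = 6 \epsilon_{\mathrm{total}} / \pi^2$, and the noise scale on the $i^{\mathrm{th}}$ query is \[ \frac{1}{\epsilon_i} = \frac{\pi^2 i^2}{6 \epsilon_{\mathrm{total}}} \] So the tenth query is a hundred times noisier than the first, and the hundredth is ten thousand times noisier.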
One side effect of any of these schemes, which is what I want to bring out, is that they offer a way to make the database unusable, or nearly unusable, for everyone else. I make the queries I want (if any), and then flood the server with random, pointless queries about the number of cars driven by left-handed dentists in Albuquerque (or whatever). Either the server has a fixed $\epsilon$ per query, and so a fixed upper limit on the number of queries, or the $\epsilon_i$ shrink, and so the noise grows, after each query. In the first case, the server has to stop answering others' queries; in the second, eventually they get only noise. Or --- more plausibly --- whoever runs the server has to abandon their differential privacy guarantee.
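A toy version of the fixed-$\epsilon$-per-query case, as a sketch only: the class `BudgetedServer`, its parameters, and the budget numbers are all hypothetical, chosen so the arithmetic is exact in floating point, and real differentially-private systems do their accounting in more careful ways.

```python
import numpy as np

class BudgetedServer:
    """Toy query server with a fixed total privacy budget (hypothetical design)."""

    def __init__(self, total_epsilon, epsilon_per_query, seed=0):
        self.epsilon_per_query = epsilon_per_query
        # Additivity of the epsilons caps the number of answerable queries.
        self.queries_left = int(total_epsilon // epsilon_per_query)
        self.rng = np.random.default_rng(seed)

    def query(self, true_answer, sensitivity=1.0):
        """Answer with Laplace noise, or return None once the budget is spent."""
        if self.queries_left <= 0:
            return None
        self.queries_left -= 1
        scale = sensitivity / self.epsilon_per_query
        return true_answer + self.rng.laplace(0.0, scale)

server = BudgetedServer(total_epsilon=1.0, epsilon_per_query=0.125)
junk = [server.query(true_answer=0.0) for _ in range(8)]  # the flood of pointless queries
honest = server.query(true_answer=42.0)                   # someone else's real question
print(sum(a is not None for a in junk), "junk queries answered; honest query got", honest)
```

Nothing in the server distinguishes my junk from your research question, which is the whole problem: the budget is a shared resource, and the first person to exhaust it wins.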
This same attack would also work, by the way, against the "re-usable holdout". That paper (not surprisingly, given the authors) is basically about creating a testing set, and then answering predictive models' queries about it while guaranteeing differential privacy. To keep the distortion from blowing up, only a limited number of queries can be asked of the testing-set server. That is, the server is explicitly allowed to return NA, rather than a proper answer, and it will always do so after enough questions. In the situation they imagine, though, of the server being a "leaderboard" in a competition among models, the simple way to win is to put in a model early (even a decent model, for form's sake), and then keep putting trivial variants of it in, as often as possible, as quickly as possible. This is because each time I submit a model, I deprive all my possible opponents of one use of the testing set, and if I'm fast enough I can keep them from ever having their models tested at all.
Posted at February 25, 2016 11:09 | permanent link
Attention conservation notice: A ponderous elaboration of an acerbic line by Upton Sinclair. Written so long ago I've honestly forgotten what incident provoked it, then left to gather dust and re-discovered by accident.
A common defense of experts consulting for sometimes nefarious characters in legal cases is that the money isn't corrupting if the expert already happens to agree with the position anyway. So, for instance, if someone with relevant expertise has doubts about the link between cigarette smoking and cancer, or between fossil-fuel burning and global warming, what harm does it do if they accept money from Philip Morris or Exxon, to defray the costs of advocating this? By assumption, they're not lying about their expert opinion.
The problem with this excuse is that it pretends people never change their ideas. When we deal with each other as more-or-less honest people, when we treat what others say as communications rather than as manipulations, we do assume that those we're listening to are telling us things more-or-less as they see them. But we are also assuming that if the way they saw things changed, what they said would track that change. If they encountered new evidence, or even just new arguments, they would respond to them, they would evaluate them, and if they found them persuasive, they would not only change their minds, they would admit that they had done so. (Cf.) We know that it can be galling for anyone to admit that they were wrong, but that's part of what we're asking for when we trust experts.
And now the problem with the on-going paid advocacy relationship becomes obvious. It adds material injury to emotional insult as a reason not to admit that one has changed one's mind. The human animal being what it is, this becomes a reason not to change one's mind --- to ignore, or to explain away, new evidence and new argument.
Sometimes the new evidence is ambiguous, the new argument has real weaknesses, and then this desire not to be persuaded by it can perform a real intellectual function, with each side sharpening the other. (You could call this "the cunning of reason" if you wanted to be really pretentious.) But how is the non-expert to know whether your objections are really sound, or whether you are desperately BS-ing to preserve your retainer? Maybe they could figure it out, with a lot of work, but they would be right to be suspicious.
Posted at February 24, 2016 00:26 | permanent link
Attention conservation notice: A failed attempt at a dialogue, combining the philosophical sophistication and easy approachability of statistical theory with the mathematical precision and practical application of epistemology, dragged out for 2500+ words (and equations). You have better things to do than read me vent about manuscripts I volunteered to referee.
Scene: We ascend, by a dubious staircase, to the garret loft space of Confectioner-Stevedore Hall, at Robberbaron-Bloodmoney University, where we find two fragments of the author's consciousness, temporarily incarnated as perpetual post-docs from the Department of Statistical Data Science, sharing an unheated office.
Q: Are you unhappy with the manuscript you're reviewing?
A: Yes, but I don't see why you care.
Q: The stabbing motions of your pen are both ostentatious and distracting. If I listen to you rant about it, will you go back to working without the semaphore?
A: I think that just means you find it easier to ignore my words than anything else, but I'm willing to try.
Q: So, what is getting you worked up about the manuscript?
A: They take a perfectly reasonable — though not obviously appropriate-to-the-problem — regularized estimator, and then go through immense effort to Bayesify it. They end up with about seven levels of hierarchical priors. Simple Metropolis-Hastings Monte Carlo would move as slowly as a continental plate, so they put vast efforts into speeding it up, and in a real technical triumph they get something which moves like a glacier.
Q: Isn't that rather fast these days?
A: If they try to scale up, my back-of-the-envelope calculation suggests they really will enter the regime where each data set will take a single Ph.D. thesis to analyze.
Q: So do you think that they're just masochists who're into frequentist pursuit, or do they have some reason for doing all these things that annoy you?
A: Their fondness for tables over figures does give me pause, but no, they claim to have a point. If they do all this work, they say, they can use their posterior distributions to quantify uncertainty in their estimates.
Q: That sounds like something statisticians should want to do. Haven't you been very pious about just that, about how handling uncertainty is what really sets statistics apart from other traditions of data analysis? Haven't I heard you say to students that they don't know anything until they know where the error bars go?
A: I suppose I have, though I don't recall that exact phrase. It's not the goal I object to, it's the way quantification of uncertainty is supposed to follow automatically from using Bayesian updating.
Q: You have to admit, the whole "posterior probability distribution over parameter values" thing certainly looks like a way of expressing uncertainty in quantitative form. In fact, last time we went around about this, didn't you admit that Bayesian agents are uncertain about parameters, though not about the probabilities of observable events?
A: I did, and they are, though that's very different from agreeing that they quantify uncertainty in any useful way — that they handle uncertainty well.
Q: Fine, I'll play the straight man and offer a concrete proposal for you to poke holes in. Shall we keep it simple and just consider parametric inference?
A: By all means.
Q: Alright, then, I start with some prior probability distribution over a finite-dimensional vector-valued parameter \( \theta \), say with density \( \pi(\theta) \). I observe \( x \) and have a model which gives me the likelihood \( L(\theta) = p(x;\theta) \), and then my posterior distribution is fixed by \[ \pi(\theta|X=x) \propto L(\theta) \pi(\theta) \] This is my measure-valued estimate. If I want a set-valued estimate of \( \theta \), I can fix a level \( \alpha \) and choose a region \( C_{\alpha} \) with \[ \int_{C_{\alpha}}{\pi(\theta|X=x) d\theta} = \alpha \] Perhaps I even preferentially grow \( C_{\alpha} \) around the posterior mode, or something like that, so it looks pretty. How is \( C_{\alpha} \) not a reasonable way of quantifying my uncertainty about \( \theta \)?
A: To begin with, I don't know the probability that the true \( \theta \in C_{\alpha} \).
Q: How is it not \( \alpha \), like it says right there on the label?
A: Again, I don't understand what that means.
Q: Are you attacking subjective probability? Is that where this is going? OK: sometimes, when a Bayesian agent and a bookmaker love each other very much, the bookie will offer the Bayesian bets on whether \( \theta \in C_{\alpha} \), and the agent will be indifferent so long as the odds are \( \alpha : 1-\alpha \). And even if the bookie is really a damn dirty Dutch gold-digger, the agent can't be pumped dry of money. What part of this do you not understand?
A: I hardly know where to begin. I will leave aside the color commentary. I will leave aside the internal issues with Dutch book arguments for conditionalization. I will not pursue the fascinating, even revealing idea that something which is supposedly a universal requirement of rationality needs such very historically-specific institutions and ideas as money and making book and betting odds for its expression. The important thing is that you're telling me that \( \alpha \), the level of credibility or confidence, is really about your betting odds.
Q: Yes, and?
A: I do not see why I should care about the odds at which you might bet. It's even worse than that, actually: I do not see why I should care about the odds at which a machine you programmed with the saddle-blanket prior (or, if we were doing nonparametrics, an Afghan jirga process prior) would bet. I fail to see how those odds help me learn anything about the world, or even reasonably-warranted uncertainties in inferences about the world.
Q: May I indulge in mythology for a moment?
A: Keep it clean, students may come by.
Q: That leaves out all the best myths, but very well. Each morning, when woken by rosy-fingered Dawn, the goddess Tyche picks \( \theta \) from (what else?) an urn, according to \( \pi(\theta) \). Tyche then draws \( x \) from \( p(X;\theta) \), and \( x \) is revealed to us by the Sibyl or the whisper of oak leaves or sheep's livers. Then we calculate \( \pi(\theta|X=x) \) and \( C_{\alpha} \). In consequence, the fraction of days on which \( \theta \in C_{\alpha} \) is about \( \alpha \). \( \alpha \) is how often the credible set is right, and \( 1-\alpha \) is one of those error rates you like to go on about. Does this myth satisfy you?
A: Not really. I get that "Bayesian analysis treats the parameters as random". In fact, that myth suggests a very simple yet universal Monte Carlo scheme for sampling from any posterior distribution whatsoever, without any Markov chains or burn-in.
Q: Can you say more?
A: I should actually write it up. But now let's try to de-mythologize. I want to know what happens if we get rid of Tyche, or at least demote her from resetting \( \theta \) every day to just picking \( x \) from \( p(x;\theta) \), with \( \theta \) fixed by Zeus.
Q: I think you mean Ananke, Zeus would meddle with the parameters to cheat on Hera. Anyway, what do you think happens?
A: Well, \( C_{\alpha} \) depends on the data, it's really \( C_{\alpha}(x) \). Since \( x \) is random, \( X \sim p(\cdot;\theta) \), so is \( C_{\alpha} \). It follows a distribution of its own, and we can ask about \( Pr_{\theta}(\theta \in C_{\alpha}(X) ) \).
Q: Haven't we just agreed that that probability is just \( \alpha \) ?
A: No, we've seen that \[ \int{Pr_{\theta}(\theta \in C_{\alpha}(X) ) \pi(\theta) d\theta} = \alpha \] but that is a very different thing.
Q: How different could it possibly be?
A: As different as we like, at any particular \( \theta \).
Q: Could the 99% credible sets contain \( \theta \) only, say, 1% of the time?
A: Absolutely. This is the scenario of Larry's playlet, but he wrote that up because it actually happened in a project he was involved in.
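(Stepping outside the dialogue for a moment: the gap between prior-averaged coverage and coverage at a fixed \( \theta \) is easy to see by simulation. The sketch below assumes the simplest conjugate set-up I could think of, \( \theta \sim \mathcal{N}(0, \tau^2) \) and a single observation \( X|\theta \sim \mathcal{N}(\theta, \sigma^2) \), with central credible intervals; the function names and the particular numbers are mine, not anything from Larry's playlet.)

```python
import numpy as np
from statistics import NormalDist

def credible_interval(x, tau, sigma, alpha):
    """Central level-alpha credible interval when theta ~ N(0, tau^2), x | theta ~ N(theta, sigma^2)."""
    w = tau**2 / (tau**2 + sigma**2)   # shrinkage weight put on the data
    post_mean = w * x
    post_sd = np.sqrt(w) * sigma       # posterior variance is tau^2 sigma^2 / (tau^2 + sigma^2)
    z = NormalDist().inv_cdf(0.5 + alpha / 2)
    return post_mean - z * post_sd, post_mean + z * post_sd

def coverage(theta, tau, sigma, alpha, n_sims, rng):
    """Fraction of simulated data sets whose credible interval contains the truth.

    theta is either a scalar (one fixed truth) or an array of n_sims truths.
    """
    thetas = np.broadcast_to(np.asarray(theta, dtype=float), (n_sims,))
    x = thetas + sigma * rng.standard_normal(n_sims)
    lo, hi = credible_interval(x, tau, sigma, alpha)
    return np.mean((lo <= thetas) & (thetas <= hi))

rng = np.random.default_rng(0)
tau, sigma, alpha, n_sims = 1.0, 1.0, 0.95, 200_000
prior_draws = tau * rng.standard_normal(n_sims)   # Tyche resets theta every morning
print("averaged over the prior:", coverage(prior_draws, tau, sigma, alpha, n_sims, rng))
for theta in (0.0, 2.0, 5.0):                     # theta held fixed
    print(f"theta fixed at {theta}:", coverage(theta, tau, sigma, alpha, n_sims, rng))
```

(Averaged over the prior, the coverage comes out at about \( \alpha \), as the integral identity says it must. At a fixed \( \theta \) near the prior mean it is higher than advertised, and at a fixed \( \theta \) far out in the prior's tails it collapses to a few percent, because the interval keeps getting shrunk towards zero.)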
Q: Isn't it a bit artificial to worry about the long-run proportion of the time you're right about parameters?
A: The same argument works if you estimate many parameters at once. When the brain-imaging people do fMRI experiments, they estimate how tens of thousands of little regions in the brain ("voxels") respond to stimuli. That means estimating tens of thousands of parameters. I don't think they'd be happy if their 99% intervals turned out to contain the right answer for only 1% of the voxels. But posterior betting odds don't have to have anything to do with how often bets are right, and usually they don't.
Q: Isn't "usually" very strong there?
A: No, I don't think so. D. A. S. Fraser has a wonderful paper, which should be better known, called "Is Bayes Posterior Just Quick and Dirty Confidence", and his answer to his own question is basically "Yes. Yes it is." More formally, he shows that the conditions for Bayesian credible sets to have correct coverage, to be confidence sets, are incredibly restrictive.
Q: But what about the Bernstein-von Mises theorem? Doesn't it say we don't have to worry for big samples, that credible sets are asymptotically confidence sets?
A: Not really. It says that if you have a fixed-dimensional model, and the usual regularity conditions for maximum likelihood estimation hold, so that \( \hat{\theta}_{MLE} \rightsquigarrow \mathcal{N}(\theta, n^{-1}I^{-1}(\theta)) \), and some more regularity conditions hold, then the posterior distribution is also asymptotically \( \mathcal{N}(\theta, n^{-1}I^{-1}(\theta)) \).
Q: Wait, so the theorem says that when it applies, if I want to be Bayesian I might as well just skip all the MCMC and maximize the likelihood?
A: You might well think that. You might very well think that. I couldn't possibly comment.
Q: !?!
A: Except to add that the theorem breaks down in the high-dimensional regime where the number of parameters grows with the number of samples, and goes to hell in the non-parametric regime of infinite-dimensional parameters. (In fact, Fraser gives one-dimensional examples where the mis-match between Bayesian credible levels and actual coverage is asymptotically \( O(1) \).) As Freedman said, if you want a confidence set, you need to build a confidence set, not mess around with credible sets.
Q: But surely coverage — "confidence" — isn't all that's needed? Suppose I have only a discrete parameter space, and for each point I flip a coin which comes up heads with probability \( \alpha \). Now my \( C_{\alpha} \) is all the parameter points where the coin came up heads. Its expected coverage is \( \alpha \), as claimed. In fact, if I can come up with a Gygax test, say using the low-significance digits of \( x \), I could invert that to get my confidence set, and get coverage of \( \alpha \) exactly. What then?
A: I never said that coverage was all we needed from a set-valued estimator. It should also be consistent: as we get more data, the set should narrow in on \( \theta \), no matter what \( \theta \) happens to be. Your Gygax sets won't do that. My point is that if you're going to use probabilities, they ought to mean something, not just refer to some imaginary gambling racket going on in your head.
Q: I am not going to let this point go so easily. It seems like you're insisting on calibration for Bayesian credible sets, that the fraction of them covering the truth be (about) the stated probability, right?
A: That seems like a pretty minimal requirement for treating supposed probabilities seriously. If (as on Discworld) "million to one chances turn up nine times out of ten", they're not really million to one.
Q: Fine — but isn't the Bayesian agent calibrated with probability 1?
A: With subjective probability 1. But failure of calibration is actually typical or generic, in the topological sense.
Q: But maybe the world we live in isn't "typical" in that weird sense of the topologists?
A: Maybe! The fact that Bayesian agents put probability 1 on the "meager" set of sample paths where they are calibrated implies that lots of stochastic processes are supported on topologically-atypical sets of paths. But now we're leaning a lot on a pre-established harmony between the world and our prior-and-model.
Q: Let me take another tack. What if calibration typically fails, but typically fails just a little — say probabilities are really \( p(1 \pm \epsilon ) \) when we think they're \( p \). Would you be very concerned, if \( \epsilon \) were small enough?
A: Honestly, no, but I have no good reason to think that, in general, approximate calibration or coverage is much more common than exact calibration. Anyway, we know that credible probabilities can be radically, dismally off as coverage probabilities, so it seems like a moot point.
Q: So what sense do you make of the uncertainties which come out of Bayesian procedures?
A: "If we started with a population of guesses distributed like this, and then selectively bred them to match the data, here's the dispersion of the final guesses."
Q: You don't think that sounds both thin and complicated?
A: Of course it's both. (And it only gets more complicated if I explain "selective breeding" and "matching the data".) But it's the best sense I can make, these days, of Bayesian uncertainty quantification as she is computed.
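(Again stepping outside the dialogue: one way to cash out "selective breeding", and I stress it is only one reading, is sampling-importance-resampling. Draw a population of guesses from the prior, then let each guess reproduce in proportion to the likelihood it gives the data. The sketch below assumes a Gaussian prior and Gaussian observations with known variance; the data values and the helper name `breed_guesses` are made up.)

```python
import numpy as np

def breed_guesses(prior_draws, log_likelihood, rng):
    """Resample prior draws with probability proportional to their likelihood."""
    log_w = log_likelihood(prior_draws)
    w = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
    w /= w.sum()
    idx = rng.choice(len(prior_draws), size=len(prior_draws), replace=True, p=w)
    return prior_draws[idx]

rng = np.random.default_rng(0)
tau, sigma = 2.0, 1.0
data = np.array([1.3, 0.7, 1.9, 1.1])   # made-up observations

def log_lik(theta):
    # Gaussian log-likelihood of all the observations, for each candidate theta
    return -0.5 * ((data[None, :] - theta[:, None]) ** 2).sum(axis=1) / sigma**2

guesses = tau * rng.standard_normal(100_000)   # initial population, drawn from N(0, tau^2)
survivors = breed_guesses(guesses, log_lik, rng)
print("mean of the final guesses:", survivors.mean())
print("dispersion (sd) of the final guesses:", survivors.std())
```

(In this conjugate case the final population should approximate the exact posterior, which has mean \( 5.0/4.25 \) and variance \( 1/4.25 \), so the printed dispersion is just the posterior standard deviation, reached by breeding rather than by algebra.)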
Q: And what's your alternative?
A: I want to know about how differently the experiment, the estimate, could have turned out, even if the underlying reality were the same. Standard errors — or median absolute errors, etc. — and confidence sets are about that sort of uncertainty, about re-running the experiment. You might mess up, because your model is wrong, but at least there's a sensible notion of probability in there, referring to things happening in the world. The Bayesian alternative is some sort of sub-genetic-algorithm evolutionary optimization routine you are supposedly running in your mind, while I run a different one in my mind, etc.
Q: But what about all the criticisms of p-values and null hypothesis significance tests and so forth?
A: They all have Bayesian counterparts, as people like Andy Gelman and Christian Robert know very well. The difficulties aren't about not being Bayesian, but about things like testing stupid hypotheses, not accounting for multiple testing or model search, selective reporting, insufficient communication, etc. But now we're in danger of drifting really far from our starting point about uncertainty in estimation.
Q: Would you sum that up then?
A: I don't believe the uncertainties you get from just slapping a prior on something, even if you've chosen your prior so the MAP or the posterior mean matches some reasonable penalized estimator. Give me some reason to think that your posterior probabilities have some contact with reality, or I'll just see them as "quick and dirty confidence" — only often not so quick and very dirty.
Q: Is that what you're going to put in your referee report?
A: I'll be more polite.
Disclaimer: Not a commentary on any specific talk, paper, or statistician. One reason this is a failed attempt at a dialogue is that there is more Q could have said in defense of the Bayesian approach, or at least in objection to A. (I take some comfort in the fact that it's traditional for characters in dialogues to engage in high-class trolling.) Also, the non-existent Robberbaron-Bloodmoney University is not to be confused with the very real Carnegie Mellon University; for instance, the latter lacks a hyphen.
Posted at February 21, 2016 20:23 | permanent link
As always, the talk is free and open to the public.
Posted at February 20, 2016 20:47 | permanent link
Attention conservation notice: Only of interest if you (1) care about evidence on how inequality matters for health, and (2) will be in Pittsburgh on Tuesday.
As always, the talk is free and open to the public.
Posted at February 20, 2016 20:46 | permanent link
Attention conservation notice: Only of interest if you (1) want to do high-dimensional regressions without claiming lots of discoveries which turn out to be false, and (2) will be in Pittsburgh on Monday.
"But Cosma", I hear you asking, "how can you be five talks into the spring seminar series without having had a single talk about false discovery rate control? Is the CMU department feeling quite itself?" I thank you for your concern, and hope this will set your mind at ease:
As always, the talk is free and open to the public.
Posted at February 20, 2016 20:45 | permanent link
Attention conservation notice: A distinguished but elderly scientist philosophizes in public.
As a Judea Pearl fanboy, it is inevitable that I would help promote this:
As always, the talk is free and open to the public.
ObLinkage 1: Clark Glymour, "We believe in freedom of the will so that we can learn".
ObLinkage 2: Mightn't that "illusion of free will" be the only sort worth wanting?
Posted at February 15, 2016 16:49 | permanent link
Attention conservation notice: Only of interest if you (1) care about statistical methods for causal inference, and (2) will be in Pittsburgh on Thursday.
There are whole books on causal inference which make it seem like the subject is exhausted by comparing the effect of The Treatment to the control condition. (cough Imbens and Rubin cough) But any approach to causal inference which can't grasp a dose-response curve might be sound but is not complete. Nor is there any reason, in this day and age, to stick to simple regression. Fortunately, we don't have to:
As always, the talk is free and open to the public.
Posted at February 15, 2016 16:31 | permanent link
Attention conservation notice: Only of interest if you (1) wish you could use the power of modern optimization to allocate an online advertising budget, and (2) will be in Pittsburgh on Tuesday.
I can think of at least two ways an Internet Thought Leader could make a splash by combining Tuesday's seminar with Maciej Ceglowski's "The Advertising Bubble". Fortunately for all concerned, I am not an Internet Thought Leader.
As always, the talk is free and open to the public.
Posted at February 11, 2016 20:11 | permanent link
Attention conservation notice: Only of interest if you (1) care about allocating precise fractions of a whole belief over a set of mathematical models when you know none of them is actually believable, and (2) will be in Pittsburgh on Monday.
As someone who thinks Bayesian inference is only worth considering under mis-specification, next week's first talk is of intense interest.
As always, the talk is free and open to the public.
Posted at February 10, 2016 00:24 | permanent link
Attention conservation notice: Only of interest if you (1) care about factor analysis and Bayesian nonparametrics, and (2) will be in Pittsburgh on Monday.
Constant readers, knowing of my love-hate relationship with both factor analysis and Bayesian methods, will appreciate that the only way I could possibly be more ambivalent about our next seminar would be if it also involved power-law distributions.
As always, the talk is free and open to the public.
Posted at February 03, 2016 22:32 | permanent link