Steve was an inspiration to me long before I came to CMU, when he was just a name on the page. When I did meet him, he was only more impressive. He seemed to know everything and everyone, to be interested in everything, to have thought seriously about it all, and to have boundless and infectious energy for all his projects. He was equally at home discussing the intricacies of algebraic statistics, the influence of Rashevsky on the development of social network analysis, or the history of the US Census's racial classifications. He got involved in a huge range of scientific and scholarly areas, always with exemplary seriousness about really engaging with their substance and their practitioners, not just "consulting". Much of his activity revolved around public service, trying to help make sure policy was more informed, more enlightened, and more just.
Being Steve's colleague was a pleasure and a privilege. We will not see his like again.
Posted at December 14, 2016 13:46 | permanent link
Wikipedia is a tremendous accomplishment and an invaluable resource. It is also highly unreliable. Since I have just spent a bit of time on the second point, let me record it here for posterity.
A reader of my notebook on information theory wanted to know whether I made a mistake there when I said that "self-information" is, in information theory, just an alternative name for the entropy of a random variable. After all, he said, the Wikipedia article on self-information (version of 22 July 2016) says that the self-information of an event (not a random variable) is the negative log probability of that event*. What follows is modified from my reply to my correspondent.
Briefly: (1) my usage is the one I learned from my teachers and textbooks; (2) the Wikipedia page is the first time I have ever seen this other usage; and (3) the references given by the Wikipedia page do not actually support the usage it advocates; only one of them even uses the term "self-information", and that one supports my usage, rather than the page's.
To elaborate on (3), the Wikipedia page cites as references (a) a paper by Meila on comparing clusterings, (b) Cover and Thomas's standard textbook, and (c) Shannon's original paper. (a) is a good paper, but in fact never uses the phrase "self-information" (or "self information", etc.). For (b), the Wikipedia page cites p. 20 of the first edition from 1991, which I no longer have; but in the 2nd edition, "self-information" appears just once, on p. 21, as a synonym for entropy ("This is the reason that entropy is sometimes referred to as self-information"; their italics). As for (c), "self-information" does not appear anywhere in Shannon's paper (nor, more remarkably, does "mutual information"), and in fact Shannon gives no name to the quantity \( -\log{p(x)} \).
There are also three external links on the page: the first ("Examples of surprisal measures") only uses the word "surprisal". The second, " 'Surprisal' entry in a glossary of molecular information theory", again only uses the word "surprisal" (and that glossary has no entry for "self-information"). The third, "Bayesian Theory of Surprise", does not use either word, and in fact defines "surprise" as the KL divergence between a prior and a posterior distribution, not using \( -\log{p(x)} \) at all. The Wikipedia page is right that \( -\log{p(x)} \) is sometimes called "surprisal", though "negative log likelihood" is much more common in statistics, and some more mathematical authors (e.g., R. M. Gray, Entropy and Information Theory [2nd ed., Springer, 2011], p. 176) prefer "entropy density". But, as I said, I have never seen anyone else call it "self-information". I am not sure where this strange usage began, but I suspect it's something some Wikipedian just made up. The error seems to go back to the first version of the page on self-information, from 2004 (which cites no references or sources at all). It has survived all 136 subsequent revisions. None of those revisions, it appears, ever involved checking whether the original claim was right, or indeed even whether the external links and references actually supported it.
I could, of course, try to fix this myself, but it would involve replacing the page with something about one sentence long, saying "In information theory, 'self-information' is a synonym for the entropy of a random variable; it is the expected value of the 'surprisal' of a random event, but is not the same as the surprisal." Leaving aside the debate about whether a topic which can be summed up in a sentence deserves a page of its own, I am pretty certain that if I didn't waste a lot of time defending the edit, it would swiftly be reverted. I have better things to do with my time.**
How many other Wikipedia pages are based on similar mis-understandings and inventions, I couldn't begin to say. Nor could I pretend to guess whether Wikipedia has more such errors than traditional encyclopedias.
*: The (Shannon) entropy of a random variable \( X \), with probability mass function \( p(x) \), is of course just \( H[X] \equiv -\sum_{x}{p(x) \log{p(x)}} \). The entropy of one random variable \( Y \) given a particular value of another is just the entropy of the conditional distribution, \( H[Y|X=x] \equiv -\sum_{y}{p(y|x) \log{p(y|x)}} \). The conditional entropy of \( Y \) given \( X \) is the average of this, \( H[Y|X] \equiv \sum_{x}{p(x) H[Y|X=x] } \). The information \( X \) contains about \( Y \) is the (average) amount by which conditioning on \( X \) reduces the entropy of \( Y \), \( I[X;Y] \equiv H[Y] - H[Y|X] \). It turns out that this is always equal to \( H[X] - H[X|Y] = I[Y;X] \), hence "mutual information". The term "self-information" is sometimes used by contrast for \( H[X] \), which you can convince yourself is also equal to \( I[X;X] \). Wikipedia, by contrast, is claiming that "self-information" refers to the quantity \( -\log{p(x)} \), so it's a property of a particular outcome or event \( x \), rather than of a probability distribution or random variable. ^
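(To make the footnote concrete, here is a minimal numerical check — a Python sketch, with a made-up distribution — that \( I[X;X] = H[X] \), while the surprisal \( -\log{p(x)} \) varies from outcome to outcome:)

```python
import numpy as np

# A made-up pmf for X; any distribution would do.
p = np.array([0.5, 0.25, 0.125, 0.125])

def entropy(pmf):
    """H[X] = -sum_x p(x) log2 p(x), in bits."""
    pmf = pmf[pmf > 0]
    return -np.sum(pmf * np.log2(pmf))

# Surprisal is a property of each outcome x separately:
surprisals = -np.log2(p)     # [1, 2, 3, 3] bits
# Entropy is their expectation, a property of the distribution:
H = entropy(p)               # 1.75 bits

# I[X;X] = H[X] - H[X|X]; given X = x, the conditional distribution
# of X is a point mass, so every H[X|X=x] = 0, hence H[X|X] = 0.
I_self = H - 0.0

print(surprisals, H, I_self)  # H == I[X;X] == 1.75
```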
**: I realize that I may still have enough of an online reputation that by posting this, others will fix the article and keep it fixed. ^
Posted at August 28, 2016 12:16 | permanent link
Attention conservation notice: I have no taste.
Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Tales of Our Ancestors; Minds, Brains, and Neurons; The Beloved Republic; Enigmas of Chance; Writing for Antiquity; The Commonwealth of Letters; Islam
Posted at July 31, 2016 23:59 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Writing for Antiquity; Physics; Biology; The Great Transformation; Psychoceramica; Enigmas of Chance
Posted at June 30, 2016 23:59 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Tales of Our Ancestors; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Writing for Antiquity; Cthulhiana
Posted at May 31, 2016 23:59 | permanent link
Attention conservation notice: An academic promoting his own talk. Even if you can get past that, only of interest if you (1) care about statistical methods for comparing network data sets, and (2) will be in Seattle on Friday.
Since the coin came up heads, I ought to mention I'm giving a talk at the end of the week:
Posted at May 04, 2016 23:59 | permanent link
Attention conservation notice: Only of interest if you (1) care about running large simulations which are actually good for something, and (2) will be in Pittsburgh on Tuesday.
As always, the talk is free and open to the public.
Posted at May 04, 2016 03:00 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Pleasures of Detection, Portraits of Crime; Scientifiction and Fantastica; Tales of Our Ancestors; Writing for Antiquity; The Great Transformation; Heard About Pittsburgh, PA; Commit a Social Science; Biology; Physics; Complexity; The Dismal Science
Posted at April 30, 2016 23:59 | permanent link
Posted at April 20, 2016 09:50 | permanent link
Attention conservation notice: Self-promotion, and irrelevant unless you (1) will be a student at Carnegie Mellon in the fall, or (2) have a morbid curiosity about a field in which the realities of social life are first caricatured into an impoverished formalism of dots and lines, devoid even of visual interest and incapable of distinguishing the real process of making movies from a mere sketch of the nervous system of a worm, and then further and further abstracted into more and more recondite stochastic models, all expounded by someone who has never himself taken a class in either social science or any of the relevant mathematics.
Two, new, half-semester courses for the fall:
720 is targeted at first-year graduate students in statistics and related fields, but is open to everyone, even well-prepared undergrads. Those more familiar with social networks who want to learn about modeling are also welcome, but should probably check with me first. 781 is deliberately going to demand rather more mathematical maturity. Auditors are welcome in both classes.
Posted at April 19, 2016 16:00 | permanent link
My fifth Ph.D. student is defending his thesis towards the end of the month:
Posted at April 15, 2016 12:00 | permanent link
Attention conservation notice: Note the date.
Any intelligent and well-intentioned person should have a huge, even over-riding preference for leaving existing social and political institutions and hierarchies alone, just because they are the existing ones.
Obviously this can't rest on any presumption that existing institutions are very good, or very wise, or embody any particularly precious values, or are even morally indifferent. They are not. It would also be stupid to appeal to some sub-Darwinian notion that our institutions, just because they have come down to us, and so must have survived an extensive process of selection, are therefore adaptive. At best, that would show the institutions were good at reproducing themselves from generation to generation, not that they had any human or ethical merit. In any case the transmission of any tradition by human beings is inevitably partial and re-interpretive, and so we have no reason to defer to tradition as such.
Stare decisis conservatism rests instead on much less cosy grounds: However awful things are now, they could always be worse, and humanity is both too dumb to avoid making things worse, and too mean to want to avoid making things worse even when it could.
The point about stupidity is elemental. If someone complains that an existing institution is unjust (or unfair, oppressive, etc.), their complaint only has force if a more just alternative is possible. (Otherwise, take it up with the Management.) But it only has political force if that more just alternative is not only possible, but we can figure out what it is. This, we are signally unsuited to do. Social science can tell us many interesting things, but on the most crucial questions of "What will happen if we do this?", we get either dogmatic, experimentally-falsified ideology (economics), or everything-is-obvious-once-you-know-the-answers just-so myths (every other branch of social science). "Try it, and see what happens" is the outer limit of social-scientific wisdom. This is no basis on which to erect a reliable social engineering, or even social handicrafts. When we try to deliberately change our institutions, we are, at best, guided by visions, endemic and epidemic superstitions, evidence-based haruspicy, and the academic version of looking at a list of random words and declaring they all relate to motel service. We have no basis to think that our reforms, if we can even implement them, will rectify the injustice that first aroused our ire, our pity, or our ambition, much less that the attempt won't create even worse problems.
Even getting our pet reform implemented is often going to be hopeless, because so much of our collective knowledge about how to get things done, socially, is tacit. That knowledge is not anything which its holders can put into words, or into a computer, much less into a schedule of prices, but is rather buried in their habits and inarticulate skills. Often these are the habits and skills of a very small number of crucially-placed people, who are, not so coincidentally, vested in the existing institutions and complicit in the existing injustices. Even more, these are habits and skills which only work in a particular environment, usually a social environment. The same people, asked to make a modified institution work, will be less effective, even hopeless. Throwing the bums out gets rid of the people who knew how to get things done.
Finally, and most crucially, think about what happens when existing institutions and arrangements are disturbed. Social life is always full of a clash of conflicting interests. (One of the few things the economists have right is that inside every positive-sum interaction, there is a negative-sum struggle over who gets the gains from cooperation.) When an institution seems settled, eternal, it fades from view, and nobody fights over it. Its harsher lines may be softened by compassion (and condescension) on the side of those it advantages, or by local and unofficial accommodations and arrangements, or even just by its being too much trouble to exploit it to the hilt. But question the institution, disturb it, make it obvious that there is something to fight over, and what happens? Those who gain from the injustice won't give it up merely because that would be right. Instead, they will press to keep what they have --- and even to claim more. Since this has become an open conflict of power, what emerges is not going to favor the lowly, the poor, and the weak. Or if that area of social life should, for a time, descend into chaos, well, the tyranny of structurelessness is real, and those who benefit from it are, again, those who are already advantaged, and willing to exploit those advantages. Things might be very different if people were able to agree on justice, and willing to follow it, but they are not.
To recapitulate: People are foolish, selfish and cruel. This means that our institutions are always grossly unjust. But it also means that we don't know how to really make things better. It further means that trying to change anything turns it into a battlefield, where nothing good happens to anybody, least of all the weak and oppressed. Since our current institutions are at least survivable (proof: we've survived them), it's better to leave them alone. They'll change anyway, and that will cause enough grief, without deliberately courting more by ignorant meddling.
Of course, people who actually defend inherited institutions and arrangements just because they're inherited — such people can usually be counted on the fingers of one fist. Corey Robin would argue — and he has a case — that the impulse behind most actually-existing conservatism is a positive liking for hierarchy. This was an attempt to construct a case for conservatism which would employ all three of Hirschman's tropes of reactionary rhetoric, but also wouldn't fall apart at the first skeptical prod. (Readers who point me at Hayek will be ignored; readers who point me at "neo-reactionaries" will be mocked.) What I have written is still an assembly of fallacies, half-truths and hyperboles, but I flatter myself it would still stand a little inspection.
Posted at April 01, 2016 00:01 | permanent link
Attention conservation notice: I have no taste.
The possible advantage of the frequentist approach [over the Bayesian] is that it avoids the need to specify the prior distribution $ p(\theta) $ for the parameters governing the joint distribution of the two potential outcomes. However, this does not come without cost. Nearly always one has to rely on large sample approximations to justify the derived frequentist confidence intervals. But in large samples, by the Bernstein-Von Mises Theorem (e.g., Van Der Vaart, 1998), the practical implications of the choice of prior distribution is limited, and the alleged benefits of the frequentist approach vanish.

I don't see how to unpack everything objectionable in these few sentences without rehearsing the whole of this post, and adding "the bootstrap is a thing, you know".
They didn't appear to use any kind of analytical reasoning to confirm their conjectures, employing instead a crude form of experimental Darwinism, seeding a matrix with algorithms modelling variations of their initial assumptions and letting them run to a halting state, selecting those that most resembled the observed conditions, and running and re-running everything over and over again until they had derived an algorithm that reproduced reality to an agreed level of statistical confidence. The wizards didn't care that this method gave no insights into the problems it attacked, or that they didn't understand how the solutions it yielded were related to the vast edifice of Euclidean mathematical theory. They weren't interested in theory. As far as they were concerned, if an algorithm gave the right answer, then plug it in: it was good to go.
The center was not holding. It was a country of bankruptcy notices and public-auction announcements and commonplace reports of casual killings and misplaced children and abandoned homes and vandals who misspelled even the four-letter words they scrawled. It was a country in which families routinely disappeared, trailing bad checks and repossession papers. Adolescents drifted from city to torn city, sloughing off both the past and the future as snakes shed their skins, children who were never taught and would never now learn the games that had held the society together. People were missing. Children were missing. Parents were missing. Those left behind filed desultory missing-persons reports, then moved on themselves.

Of course, Didion goes on:
It was not a country in open revolution. It was not a country under enemy siege. It was the United States of America in the cold late spring of 1967, and the market was steady and the G.N.P. high and a great many articulate people seemed to have a sense of high social purpose and it might have been a spring of brave hopes and national promise, but it was not, and more and more people had the uneasy apprehension that it was not. All that seemed clear was that at some point we had aborted ourselves and butchered the job...
Books to Read While the Algae Grow in Your Fur;
Scientifiction and Fantastica;
Pleasures of Detection, Portraits of Crime;
Tales of Our Ancestors;
Writing for Antiquity;
The Dismal Science;
The Great Transformation;
Enigmas of Chance;
Constant Conjunction Necessary Connexion;
The Continuing Crises;
The Beloved Republic;
Linkage
Posted at March 31, 2016 23:59 | permanent link
Attention conservation notice: Only of interest if you (1) care about the quantitative history of English novels, and (2) will be in Pittsburgh at the end of the month.
I had nothing to do with making this happen — Scott Weingart did — but when the seminar gods offer me something this relevant to my interests, it behooves me to promote it:
As always, the talk is free and open to the public.
Writing for Antiquity; The Commonwealth of Letters; Enigmas of Chance
Posted at March 19, 2016 20:24 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Tales of Our Ancestors; Writing for Antiquity; The Beloved Republic; Cthulhiana; Mathematics; Automata and Mechanical Amusements; Philosophy;
Posted at February 29, 2016 23:59 | permanent link
Attention conservation notice: Only of interest if you (1) care about large-scale data analysis and/or taxis, and (2) will be in Pittsburgh on Friday.
The last but by no means least seminar talk this week:
As always, the talk is free and open to the public.
Update: Dr. Arnold's talk has been pushed back a day due to travel delays.
Posted at February 25, 2016 11:16 | permanent link
Attention conservation notice: A half-clever dig at one of the more serious and constructive attempts to do something about an important problem that won't go away on its own. It doesn't even explain the idea it tries to undermine.
Jerzy's "cursory overview of differential privacy" post brings back to mind an idea which I doubt is original, but whose source I can't remember. (It's not Baumbauer et al.'s "Fool's Gold: an Illustrated Critique of Differential Privacy" [ssrn/2326746], though they do make a related point about multiple queries.)
The point of differential privacy is to guarantee that adding or removing any one person from the data base can't change the likelihood function by more than a certain factor; that the change in the log-likelihood remains within $\pm \epsilon$. This is achieved by adding noise with a Laplace (double-exponential) distribution to the output of any query from the data base, with the magnitude of the noise being inversely related to the required bound $\epsilon$. (Tighter privacy bounds require more noise.)
The tricky bit is that these $\epsilon$s are additive across queries. If the $i^{\mathrm{th}}$ query can change the log-likelihood by up to $\pm \epsilon_i$, a series of queries can change the log-likelihood by up to $\sum_{i}{\epsilon_i}$. If the data-base owner allows a constant $\epsilon$ per query, we can then break the privacy by making lots of queries. Conversely, if the $\epsilon$ per query is not to be too tight, we can only allow a small number of constant-$\epsilon$ queries. A final option is to gradually ramp down the $\epsilon_i$ so that their sum remains finite, e.g., $\epsilon_i \propto i^{-2}$. This would mean that early queries were subject to little distortion, but later ones were more and more noisy.
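(To make the accounting concrete, here is a minimal sketch — class name, numbers, and interface all invented for illustration — of a Laplace-mechanism server for counting queries, with a fixed total privacy budget:)

```python
import numpy as np

rng = np.random.default_rng()

class PrivateCounter:
    """Toy differentially-private query server (illustration only).

    Answers counting queries (sensitivity 1) via the Laplace mechanism:
    noise of scale 1/epsilon gives an epsilon-DP answer, and the epsilons
    of successive queries add up against a fixed total budget.
    """
    def __init__(self, data, total_epsilon=1.0):
        self.data = data
        self.budget = total_epsilon

    def count(self, predicate, epsilon=0.1):
        if epsilon > self.budget:
            return None          # budget exhausted: the server must refuse
        self.budget -= epsilon
        true_count = sum(predicate(x) for x in self.data)
        return true_count + rng.laplace(scale=1.0 / epsilon)

server = PrivateCounter(data=range(1000), total_epsilon=1.0)
print(server.count(lambda x: x % 2 == 0))   # noisy count of evens
# After ten epsilon = 0.1 queries, every further call returns None.
```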
One side effect of any of these schemes, which is what I want to bring out, is that they offer a way to make the database unusable, or nearly unusable, for everyone else. I make the queries I want (if any), and then flood the server with random, pointless queries about the number of cars driven by left-handed dentists in Albuquerque (or whatever). Either the server has a fixed $\epsilon$ per query, and so a fixed upper limit on the number of queries, or the noise grows after each query. In the first case, the server has to stop answering others' queries; in the second, eventually they get only noise. Or --- more plausibly --- whoever runs the server has to abandon their differential privacy guarantee.
This same attack would also work, by the way, against the "re-usable holdout". That paper (not surprisingly, given the authors) is basically about creating a testing set, and then answering predictive models' queries about it while guaranteeing differential privacy. To keep the distortion from blowing up, only a limited number of queries can be asked of the testing-set server. That is, the server is explicitly allowed to return NA, rather than a proper answer, and it will always do so after enough questions. In the situation they imagine, though, of the server being a "leaderboard" in a competition among models, the simple way to win is to put in a model early (even a decent model, for form's sake), and then keep putting trivial variants of it in, as often as possible, as quickly as possible. This is because each time I submit a model, I deprive all my possible opponents of one use of the testing set, and if I'm fast enough I can keep them from ever having their models tested at all.
Posted at February 25, 2016 11:09 | permanent link
Attention conservation notice: A ponderous elaboration of an acerbic line by Upton Sinclair. Written so long ago I've honestly forgotten what incident provoked it, then left to gather dust and re-discovered by accident.
A common defense of experts consulting for sometimes nefarious characters in legal cases is that the money isn't corrupting, if the expert happens to agree with the position anyway. So, for instance, if someone with relevant expertise has doubts about the link between cigarette smoking and cancer, or between fossil-fuel burning and global warming, what harm does it do if they accept money from Philip Morris or Exxon to defray the costs of advocating this position? By assumption, they're not lying about their expert opinion.
The problem with this excuse is that it pretends people never change their ideas. When we deal with each other as more-or-less honest people — when we treat what others say as communications rather than as manipulations — we do assume that those we're listening to are telling us things more-or-less as they see them. But we are also assuming that if the way they saw things changed, what they said would track that change. If they encountered new evidence, or even just new arguments, they would respond to them, they would evaluate them, and if they found them persuasive, they would not only change their minds, they would admit that they had done so. (Cf.) We know that it can be galling for anyone to admit that they were wrong, but that's part of what we're asking for when we trust experts.
And now the problem with the on-going paid advocacy relationship becomes obvious. It adds material injury to emotional insult as a reason not to admit that one has changed one's mind. The human animal being what it is, this becomes a reason not to change one's mind --- to ignore, or to explain away, new evidence and new argument.
Sometimes the new evidence is ambiguous, the new argument has real weaknesses, and then this desire not to be persuaded by it can perform a real intellectual function, with the two sides sharpening each other. (You could call this "the cunning of reason" if you wanted to be really pretentious.) But how is the non-expert to know whether your objections are really sound, or whether you are desperately BS-ing to preserve your retainer? Maybe they could figure it out, with a lot of work, but they would be right to be suspicious.
Posted at February 24, 2016 00:26 | permanent link
Attention conservation notice: A failed attempt at a dialogue, combining the philosophical sophistication and easy approachability of statistical theory with the mathematical precision and practical application of epistemology, dragged out for 2500+ words (and equations). You have better things to do than read me vent about manuscripts I volunteered to referee.
Scene: We ascend, by a dubious staircase, to the garret loft space of Confectioner-Stevedore Hall, at Robberbaron-Bloodmoney University, where we find two fragments of the author's consciousness, temporarily incarnated as perpetual post-docs from the Department of Statistical Data Science, sharing an unheated office.
Q: Are you unhappy with the manuscript you're reviewing?
A: Yes, but I don't see why you care.
Q: The stabbing motions of your pen are both ostentatious and distracting. If I listen to you rant about it, will you go back to working without the semaphore?
A: I think that just means you find it easier to ignore my words than anything else, but I'm willing to try.
Q: So, what is getting you worked up about the manuscript?
A: They take a perfectly reasonable — though not obviously appropriate-to-the-problem — regularized estimator, and then go through immense effort to Bayesify it. They end up with about seven levels of hierarchical priors. Simple Metropolis-Hastings Monte Carlo would move as slowly as a continental plate, so they put vast efforts into speeding it up, and in a real technical triumph they get something which moves like a glacier.
Q: Isn't that rather fast these days?
A: If they try to scale up, my back-of-the-envelope calculation suggests they really will enter the regime where each data set will take a single Ph.D. thesis to analyze.
Q: So do you think that they're just masochists who're into frequentist pursuit, or do they have some reason for doing all these things that annoy you?
A: Their fondness for tables over figures does give me pause, but no, they claim to have a point. If they do all this work, they say, they can use their posterior distributions to quantify uncertainty in their estimates.
Q: That sounds like something statisticians should want to do. Haven't you been very pious about just that, about how handling uncertainty is what really sets statistics apart from other traditions of data analysis? Haven't I heard you say to students that they don't know anything until they know where the error bars go?
A: I suppose I have, though I don't recall that exact phrase. It's not the goal I object to, it's the way quantification of uncertainty is supposed to follow automatically from using Bayesian updating.
Q: You have to admit, the whole "posterior probability distribution over parameter values" thing certainly looks like a way of expressing uncertainty in quantitative form. In fact, last time we went around about this, didn't you admit that Bayesian agents are uncertain about parameters, though not about the probabilities of observable events?
A: I did, and they are, though that's very different from agreeing that they quantify uncertainty in any useful way — that they handle uncertainty well.
Q: Fine, I'll play the straight man and offer a concrete proposal for you to poke holes in. Shall we keep it simple and just consider parametric inference?
A: By all means.
Q: Alright, then, I start with some prior probability distribution over a finite-dimensional vector-valued parameter \( \theta \), say with density \( \pi(\theta) \). I observe \( x \) and have a model which gives me the likelihood \( L(\theta) = p(x;\theta) \), and then my posterior distribution is fixed by \[ \pi(\theta|X=x) \propto L(\theta) \pi(\theta) \] This is my measure-valued estimate. If I want a set-valued estimate of \( \theta \), I can fix a level \( \alpha \) and choose a region \( C_{\alpha} \) with \[ \int_{C_{\alpha}}{\pi(\theta|X=x) d\theta} = \alpha \] Perhaps I even preferentially grow \( C_{\alpha} \) around the posterior mode, or something like that, so it looks pretty. How is \( C_{\alpha} \) not a reasonable way of quantifying my uncertainty about \( \theta \)?
A: To begin with, I don't know the probability that the true \( \theta \in C_{\alpha} \).
Q: How is it not \( \alpha \), like it says right there on the label?
A: Again, I don't understand what that means.
Q: Are you attacking subjective probability? Is that where this is going? OK: sometimes, when a Bayesian agent and a bookmaker love each other very much, the bookie will offer the Bayesian bets on whether \( \theta \in C_{\alpha} \), and the agent will be indifferent so long as the odds are \( \alpha : 1-\alpha \). And even if the bookie is really a damn dirty Dutch gold-digger, the agent can't be pumped dry of money. What part of this do you not understand?
A: I hardly know where to begin. I will leave aside the color commentary. I will leave aside the internal issues with Dutch book arguments for conditionalization. I will not pursue the fascinating, even revealing idea that something which is supposedly a universal requirement of rationality needs such very historically-specific institutions and ideas as money and making book and betting odds for its expression. The important thing is that you're telling me that \( \alpha \), the level of credibility or confidence, is really about your betting odds.
Q: Yes, and?
A: I do not see why should I care about the odds at which you might bet. It's even worse than that, actually, I do not see why I should care about the odds at which a machine you programmed with the saddle-blanket prior (or, if we were doing nonparametrics, an Afghan jirga process prior) would bet. I fail to see how those odds help me learn anything about the world, or even reasonably-warranted uncertainties in inferences about the world.
Q: May I indulge in mythology for a moment?
A: Keep it clean, students may come by.
Q: That leaves out all the best myths, but very well. Each morning, when woken by rosy-fingered Dawn, the goddess Tyche picks \( \theta \) from (what else?) an urn, according to \( \pi(\theta) \). Tyche then draws \( x \) from \( p(X;\theta) \), and \( x \) is revealed to us by the Sibyl or the whisper of oak leaves or sheep's livers. Then we calculate \( \pi(\theta|X=x) \) and \( C_{\alpha} \). In consequence, the fraction of days on which \( \theta \in C_{\alpha} \) is about \( \alpha \). \( \alpha \) is how often the credible set is right, and \( 1-\alpha \) is one of those error rates you like to go on about. Does this myth satisfy you?
A: Not really. I get that "Bayesian analysis treats the parameters as random". In fact, that myth suggests a very simple yet universal Monte Carlo scheme for sampling from any posterior distribution whatsoever, without any Markov chains or burn-in.
Q: Can you say more?
A: I should actually write it up; there's a sketch of the idea below. But now let's try to de-mythologize. I want to know what happens if we get rid of Tyche, or at least demote her from resetting \( \theta \) every day to just picking \( x \) from \( p(x;\theta) \), with \( \theta \) fixed by Zeus.
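[Aside, outside the dialogue: the scheme the myth suggests is, presumably, plain rejection sampling from the prior — run Tyche's procedure and keep \( \theta \) only on the days when the drawn \( x \) matches the actual observation. A minimal Python sketch, with an invented discrete model:]

```python
import numpy as np

rng = np.random.default_rng()

# Invented setup: theta uniform on {0,...,9}, X | theta ~ Binomial(20, theta/10).
x_obs = 7  # the actual observation

def posterior_draws(n_draws):
    """Tyche's myth as an exact posterior sampler: draw theta from the
    prior, draw x from the likelihood, and keep theta only on days when
    x equals the actual observation. No Markov chains, no burn-in."""
    kept = []
    while len(kept) < n_draws:
        theta = rng.integers(0, 10)        # theta ~ prior
        x = rng.binomial(20, theta / 10)   # x ~ p(.;theta)
        if x == x_obs:
            kept.append(theta)
    return np.array(kept)

# Empirical posterior pmf over theta = 0,...,9:
print(np.bincount(posterior_draws(5000), minlength=10) / 5000)
```

(Of course, the acceptance rate collapses as the data get richer, which is why this is a thought experiment rather than a practical algorithm.)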
Q: I think you mean Ananke, Zeus would meddle with the parameters to cheat on Hera. Anyway, what do you think happens?
A: Well, \( C_{\alpha} \) depends on the data, it's really \( C_{\alpha}(x) \). Since \( x \) is random, \( X \sim p(\cdot;\theta) \), so is \( C_{\alpha} \). It follows a distribution of its own, and we can ask about \( Pr_{\theta}(\theta \in C_{\alpha}(X) ) \).
Q: Haven't we just agreed that that probability is just \( \alpha \)?
A: No, we've seen that \[ \int{Pr_{\theta}(\theta \in C_{\alpha}(X) ) \pi(\theta) d\theta} = \alpha \] but that is a very different thing.
Q: How different could it possibly be?
A: As different as we like, at any particular \( \theta \).
Q: Could the 99% credible sets contain \( \theta \) only, say, 1% of the time?
A: Absolutely. This is the scenario of Larry's playlet, but he wrote that up because it actually happened in a project he was involved in. (A toy simulation of the effect is sketched below.)
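[Aside, outside the dialogue: a minimal simulation of the effect, with all numbers invented — a conjugate normal-normal model whose prior is concentrated far from the fixed true parameter, so the nominal 99% credible intervals essentially never cover it:]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Invented conjugate setup: prior theta ~ N(0, tau2), one observation
# X ~ N(theta, sigma2). Zeus fixes theta; only X is re-drawn each run.
tau2, sigma2 = 0.01, 1.0
theta_true = 5.0
n_reps, level = 10_000, 0.99

covered = 0
for _ in range(n_reps):
    x = rng.normal(theta_true, np.sqrt(sigma2))
    # Exact normal-normal posterior:
    post_mean = x * tau2 / (tau2 + sigma2)
    post_sd = np.sqrt(tau2 * sigma2 / (tau2 + sigma2))
    lo, hi = stats.norm.interval(level, loc=post_mean, scale=post_sd)
    covered += (lo <= theta_true <= hi)

print(covered / n_reps)  # ~0.00: the "99%" sets almost never cover theta
```

(Average the coverage over draws of \( \theta \) from the prior instead, and it comes back to 0.99 — which is exactly the distinction A is drawing.)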
Q: Isn't it a bit artificial to worry about the long-run proportion of the time you're right about parameters?
A: The same argument works if you estimate many parameters at once. When the brain-imaging people do fMRI experiments, they estimate how tens of thousands of little regions in the brain ("voxels") respond to stimuli. That means estimating tens of thousands of parameters. I don't think they'd be happy if their 99% intervals turned out to contain the right answer for only 1% of the voxels. But posterior betting odds don't have to have anything to do with how often bets are right, and usually they don't.
Q: Isn't "usually" very strong there?
A: No, I don't think so. D. A. S. Fraser has a wonderful paper, which should be better known, called "Is Bayes Posterior Just Quick and Dirty Confidence?", and his answer to his own question is basically "Yes. Yes it is." More formally, he shows that the conditions for Bayesian credible sets to have correct coverage, to be confidence sets, are incredibly restrictive.
Q: But what about the Bernstein-von Mises theorem? Doesn't it say we don't have to worry for big samples, that credible sets are asymptotically confidence sets?
A: Not really. It says that if you have a fixed-dimensional model, and the usual regularity conditions for maximum likelihood estimation hold, so that \( \hat{\theta}_{MLE} \rightsquigarrow \mathcal{N}(\theta, n^{-1}I^{-1}(\theta)) \), and some more regularity conditions hold, then the posterior distribution is also asymptotically \( \mathcal{N}(\theta, n^{-1}I^{-1}(\theta)) \).
Q: Wait, so the theorem says that when it applies, if I want to be Bayesian I might as well just skip all the MCMC and maximize the likelihood?
A: You might well think that. You might very well think that. I couldn't possibly comment.
Q: !?!
A: Except to add that the theorem breaks down in the high-dimensional regime where the number of parameters grows with the number of samples, and goes to hell in the non-parametric regime of infinite-dimensional parameters. (In fact, Fraser gives one-dimensional examples where the mis-match between Bayesian credible levels and actual coverage is asymptotically \( O(1) \).) As Freedman said, if you want a confidence set, you need to build a confidence set, not mess around with credible sets.
Q: But surely coverage — "confidence" — isn't all that's needed? Suppose I have only a discrete parameter space, and for each point I flip a coin which comes up heads with probability \( \alpha \). Now my \( C_{\alpha} \) is all the parameter points where the coin came up heads. Its expected coverage is \( \alpha \), as claimed. In fact, if I can come up with a Gygax test, say using the low-significance digits of \( x \), I could invert that to get my confidence set, and get coverage of \( \alpha \) exactly. What then?
A: I never said that coverage was all we needed from a set-valued estimator. It should also be consistent: as we get more data, the set should narrow in on \( \theta \), no matter what \( \theta \) happens to be. Your Gygax sets won't do that. My point is that if you're going to use probabilities, they ought to mean something, not just refer to some imaginary gambling racket going on in your head.
Q: I am not going to let this point go so easily. It seems like you're insisting on calibration for Bayesian credible sets, that the fraction of them covering the truth be (about) the stated probability, right?
A: That seems like a pretty minimal requirement for treating supposed probabilities seriously. If (as on Discworld) "million to one chances turn up nine times out of ten", they're not really million to one.
Q: Fine — but isn't the Bayesian agent calibrated with probability 1?
A: With subjective probability 1. But failure of calibration is actually typical or generic, in the topological sense.
Q: But maybe the world we live in isn't "typical" in that weird sense of the topologists?
A: Maybe! The fact that Bayesian agents put probability 1 on the "meager" set of sample paths where they are calibrated implies that lots of stochastic processes are supported on topologically-atypical sets of paths. But now we're leaning a lot on a pre-established harmony between the world and our prior-and-model.
Q: Let me take another tack. What if calibration typically fails, but typically fails just a little — say probabilities are really \( p(1 \pm \epsilon ) \) when we think they're \( p \). Would you be very concerned, if \( \epsilon \) were small enough?
A: Honestly, no, but I have no good reason to think that, in general, approximate calibration or coverage is much more common than exact calibration. Anyway, we know that credible probabilities can be radically, dismally off as coverage probabilities, so it seems like a moot point.
Q: So what sense do you make of the uncertainties which come out of Bayesian procedures?
A: "If we started with a population of guesses distributed like this, and then selectively bred them to match the data, here's the dispersion of the final guesses."
Q: You don't think that sounds both thin and complicated?
A: Of course it's both. (And it only gets more complicated if I explain "selective breeding" and "matching the data".) But it's the best sense I can make, these days, of Bayesian uncertainty quantification as she is computed.
Q: And what's your alternative?
A: I want to know about how differently the experiment, the estimate, could have turned out, even if the underlying reality were the same. Standard errors — or median absolute errors, etc. — and confidence sets are about that sort of uncertainty, about re-running the experiment. You might mess up, because your model is wrong, but at least there's a sensible notion of probability in there, referring to things happening in the world. The Bayesian alternative is some sort of sub-genetic-algorithm evolutionary optimization routine you are supposedly running in your mind, while I run a different one in my mind, etc.
Q: But what about all the criticisms of p-values and null hypothesis significance tests and so forth?
A: They all have Bayesian counterparts, as people like Andy Gelman and Christian Robert know very well. The difficulties aren't about not being Bayesian, but about things like testing stupid hypotheses, not accounting for multiple testing or model search, selective reporting, insufficient communication, etc. But now we're in danger of drifting really far from our starting point about uncertainty in estimation.
Q: Would you sum that up then?
A: I don't believe the uncertainties you get from just slapping a prior on something, even if you've chosen your prior so the MAP or the posterior mean matches some reasonable penalized estimator. Give me some reason to think that your posterior probabilities have some contact with reality, or I'll just see them as "quick and dirty confidence" — only often not so quick and very dirty.
Q: Is that what you're going to put in your referee report?
A: I'll be more polite.
Disclaimer: Not a commentary on any specific talk, paper, or statistician. One reason this is a failed attempt at a dialogue is that there is more Q could have said in defense of the Bayesian approach, or at least in objection to A. (I take some comfort in the fact that it's traditional for characters in dialogues to engage in high-class trolling.) Also, the non-existent Robberbaron-Bloodmoney University is not to be confused with the very real Carnegie Mellon University; for instance, the latter lacks a hyphen.
Manual trackback: Source-Filter.
Posted at February 21, 2016 20:23 | permanent link
As always, the talk is free and open to the public.
Posted at February 20, 2016 20:47 | permanent link
Attention conservation notice: Only of interest if you (1) care about evidence on how inequality matters for health, and (2) will be in Pittsburgh on Tuesday.
As always, the talk is free and open to the public.
Posted at February 20, 2016 20:46 | permanent link
Attention conservation notice: Only of interest if you (1) want to do high-dimensional regressions without claiming lots of discoveries which turn out to be false, and (2) will be in Pittsburgh on Monday.
"But Cosma", I hear you asking, "how can you be five talks into the spring seminar series without having had a single talk about false discovery rate control? Is the CMU department feeling quite itself?" I think you for your concern, and hope this will set your mind at ease:
As always, the talk is free and open to the public.
Posted at February 20, 2016 20:45 | permanent link
Attention conservation notice: A distinguished but elderly scientist philosophizes in public.
As a Judea Pearl fanboy, it is inevitable that I would help promote this:
As always, the talk is free and open to the public.
ObLinkage 1: Clark Glymour, "We believe in freedom of the will so that we can learn".
ObLinkage 2: Mightn't that "illusion of free will" be the only sort worth wanting?
Posted at February 15, 2016 16:49 | permanent link
Attention conservation notice: Only of interest if you (1) care about statistical methods for causal inference, and (2) will be in Pittsburgh on Thursday.
There are whole books on causal inference which make it seem like the subject is exhausted by comparing the effect of The Treatment to the control condition. (cough Imbens and Rubin cough) But any approach to causal inference which can't grasp a dose-response curve might be sound but is not complete. Nor is there any reason, in this day and age, to stick to simple regression. Fortunately, we don't have to:
As always, the talk is free and open to the public.
Posted at February 15, 2016 16:31 | permanent link
Attention conservation notice: Only of interest if you (1) wish you could use the power of modern optimization to allocate an online advertising budget, and (2) will be in Pittsburgh on Tuesday.
I can think of at least two ways an Internet Thought Leader could make a splash by combining Tuesday's seminar with Maciej Ceglowski's "The Advertising Bubble". Fortunately for all concerned, I am not an Internet Thought Leader.
As always, the talk is free and open to the public.
Posted at February 11, 2016 20:11 | permanent link
Attention conservation notice: Only of interest if you (1) care about allocating precise fractions of a whole belief over a set of mathematical models when you know none of them is actually believable, and (2) will be in Pittsburgh on Monday.
As someone who thinks Bayesian inference is only worth considering under mis-specification, next week's first talk is of intense interest.
As always, the talk is free and open to the public.
Posted at February 10, 2016 00:24 | permanent link
Attention conservation notice: Only of interest if (1) you care about factor analysis and Bayesian nonparametrics, and (2) will be in Pittsburgh on Monday.
Constant readers, knowing of my love-hate relationship with both factor analysis and Bayesian methods, will appreciate that the only way I could possibly be more ambivalent about our next seminar was if it also involved power-law distributions.
As always, the talk is free and open to the public.
Posted at February 03, 2016 22:32 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Writing for Antiquity; Enigmas of Chance; The Dismal Science; The Collective Use and Evolution of Concepts; Pleasures of Detection, Portraits of Crime
Posted at January 31, 2016 23:59 | permanent link
Attention conservation notice: Only of interest if (1) you care about the intersection of high-dimensional statistics with information theory, and (2) will be in Pittsburgh next Wednesday.
It is, perhaps, only appropriate that the first statistics seminar of the semester is about connections between high-dimensional regression, and limits on how fast information can be sent over noisy channels.
As always, the talk is free and open to the public.
Posted at January 27, 2016 18:24 | permanent link
Attention conservation notice: Navel-gazing by an academic.
This was my first time teaching our undergraduate course on linear models ("401"). I've taught the course which follows it (402) four times, and re-designed it once, but I've never had to actually take the students through the pre-req. They come in with courses on probability, on statistical inference, and on linear algebra, but usually no real experience with data analysis. Linear regression is usually their first time trying to connect statistical models to actual data — as well as learning about how linear regression works.
I am OK with how I did, but only about OK. The three big issues I need to work on are (1) connecting theory to practice, (2) getting feedback to students faster, and (3) better assignments.
(1) I feel like I did not strike a good balance, in lecture, between theory, computational examples, and how theory guides practice. The last thing I want to do is turn out people who just (think they) know which commands to run in R, without understanding what's actually going on. (As a student put it to a colleague in a previous semester, "The difference between 401 and econometrics is that in econometrics we have to know how to do all this stuff, and in 401 we also have to know why." This was not, I believe, intended as a compliment.) But based on the student evaluations, and still more the assignments, there're still students who are a bit fuzzy about what "holding all other predictor variables constant" actually means in a linear model. But then again, based on student feedback I persistently have a problem connecting mathematical theory to data-analytic practice; more serious re-thinking of how I teach may be in order.
(2) Students need faster and more consistent feedback on their assignments. We were somewhat constrained on speed this semester by a labor shortage, but I could have done more to ensure consistency across graders.
(3) Too many of the assignments were based on small, old data sets from the textbook. Mea culpa.
This was the first time we had two sections of 401, with two separate professors. I think we did OK at coordinating them, and I take full responsibility for all the failures and glitches. (I should add, because I know some of the students read this, that grades were curved and calculated completely independently across the two sections.)
I am very grateful for the work done on designing the curriculum for this course by my colleagues. Still, I feel like a lot of the course was spent on (to be slightly unfair) special cases which people could work out in closed form in the 1920s, and pretending that they had relevance to actual data analysis. (Cf.) The Kids do need at least a nodding acquaintance with that stuff, because people will expect it of them, but I would rather they be taught it as a nice bonus rather than as a default. This would mean a lot more re-design than I put into the course.
Relatedly, I came to have a thorough, almost personal, dislike of the textbook, but that's another story.
Some things which did go well:
I'll indulge myself by ending on an "achievement unlocked" note. This was (so far as I know) the first class I've taught where a student's response to one of my lectures was to ask Reddit "Is there any truth to this?". There can be few better proofs that I reached at least one of my students and inspired them to think critically about the material. I am being quite serious when I say that I wish something like this happened every week in every course.
Posted at January 09, 2016 22:38 | permanent link
Attention conservation notice: Only relevant if you are a student at Carnegie Mellon University, or have a pathological fondness for reading lecture notes on statistics.
In the so-called spring, I will again be teaching 36-402 / 36-608, undergraduate advanced data analysis:
The goal of this class is to train you in using statistical models to analyze data — as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

During the class, you will do data analyses with existing software, and write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.
Graduate students from other departments wishing to take this course should register for it under the number "36-608". Enrollment for 36-608 is very limited, and by permission of the professors only.
Prerequisites: 36-401, with a grade of C or better. Exceptions are only granted for graduate students in other departments taking 36-608.
This will be my fifth time teaching 402, and the fifth time where the primary text is the draft of Advanced Data Analysis from an Elementary Point of View. (I hope my editor will believe that I don't intend for my revisions to illustrate Zeno's paradox.) It is the first time I will be co-teaching with the lovely and talented Max G'Sell.
Unbecoming whining: 402 will be larger this year than last, just like it has been every year I've been here. This year, in fact, we'll have over 150 students in it, or about 1/50 of all CMU undergrads. (This has nothing to do with my teaching, and everything to do with our student population.) I think it's great that we're teaching what would be masters-level material at most schools to so many juniors and seniors, but I don't think we'll be able to keep doubling every five years without either having a lot of stuff break, or transforming the nature of the course yet again. It's clearly a better problem to have than "class sizes are halving every five years"*, but it's still a problem.
*: As I have said in a number of conversations over recent years, the nightmare scenario for statistics vs. "data science" is that statistics becomes a sort of mathematical analog to classics. People might pay lip-service to our value, especially people who are invested in pretending to intellectual rigor, but few would actually pay attention to anything we have to say.
Posted at January 09, 2016 22:00 | permanent link