## December 14, 2016

### In Memoriam Stephen E. Fienberg (27 November 1942 -- 14 December 2016)

Steve was an inspiration to me long before I came to CMU, when he was just a name on the page. When I did meet him, he was only more impressive. He seemed to know everything and everyone, to be interested in everything, to have thought seriously about it all, and to have boundless and infectious energy for all his projects. He was equally at home discussing the intricacies of algebraic statistics, the influence of Rashevsky on the development of social network analysis, or the history of the US Census's racial classifications. He got involved in a huge range of scientific and scholarly areas, always with the exemplary seriousness of really engaging with their substance and their practitioners, not just "consulting". Much of his activity revolved around public service, trying to help make sure policy was more informed, more enlightened, and more just.

Being Steve's colleague was a pleasure and a privilege. We will not see his like again.

Posted at December 14, 2016 13:46 | permanent link

## August 28, 2016

### Oh, Wikipedia

Wikipedia is a tremendous accomplishment and an invaluable resource. It is also highly unreliable. Since I have just spent a bit of time on the second fork, let me record it here for posterity.

A reader of my notebook on information theory wanted to know whether I made a mistake there when I said that "self-information" is, in information theory, just an alternative name for the entropy of a random variable. After all, he said, the Wikipedia article on self-information (version of 22 July 2016) says that the self-information of an event (not a random variable) is the negative log probability of that event*. What follows is modified from my reply to my correspondent.

(1) my usage is the one I learned from my teachers and textbooks; (2) the Wikipedia page is the first time I have ever seen this other usage; and (3) the references given by the Wikipedia page do not actually support the usage it advocates; only one of them even uses the term "self-information", and that supports my usage, rather than the page's.

To elaborate on (3), the Wikipedia page cites as references (a) a paper by Meila on comparing clusterings, (b) Cover and Thomas's standard textbook, and (c) Shannon's original paper. (a) is a good paper, but in fact never uses the phrase "self-information" (or "self information", etc.). For (b), the Wikipedia page cites p. 20 of the first edition from 1991, which I no longer have; but in the 2nd edition, "self-information" appears just once, on p. 21, as a synonym for entropy ("This is the reason that entropy is sometimes referred to as self-information"; their italics). As for (c), "self-information" does not appear anywhere in Shannon's paper (nor, more remarkably, does "mutual information"), and in fact Shannon gives no name to the quantity $-\log{p(x)}$.

There are also three external links on the page: the first ("Examples of surprisal measures") only uses the word "surprisal". The second, " 'Surprisal' entry in a glossary of molecular information theory", again only uses the word "surprisal" (and that glossary has no entry for "self-information"). The third, "Bayesian Theory of Surprise", does not use either word, and in fact defines "surprise" as the KL divergence between a prior and a posterior distribution, not using $-\log{p(x)}$ at all. The Wikipedia page is right that $-\log{p(x)}$ is sometimes called "surprisal", though "negative log likelihood" is much more common in statistics, and some more mathematical authors (e.g., R. M. Gray, Entropy and Information Theory [2nd ed., Springer, 2011], p. 176) prefer "entropy density". But, as I said, I have never seen anyone else call it "self-information". I am not sure where this strange usage began, but I suspect it's something some Wikipedian just made up. The error seems to go back to the first version of the page on self-information, from 2004 (which cites no references or sources at all). It has survived all 136 subsequent revisions. None of those revisions, it appears, ever involved checking whether the original claim was right, or indeed even whether the external links and references actually supported it.

I could, of course, try to fix this myself, but it would involve replacing the page with something about one sentence long, saying "In information theory, 'self-information' is a synonym for the entropy of a random variable; it is the expected value of the 'surprisal' of a random event, but is not the same as the surprisal." Leaving aside the debate about whether a topic which can be summed up in a sentence deserves a page of its own, I am pretty certain that if I didn't waste a lot of time defending the edit, it would swiftly be reverted. I have better things to do with my time.**

How many other Wikipedia pages are based on similar mis-understandings and inventions, I couldn't begin to say. Nor could I pretend to guess whether Wikipedia has more such errors than traditional encyclopedias.

*: The (Shannon) entropy of a random variable $X$, with probability mass function $p(x)$, is of course just $H[X] \equiv -\sum_{x}{p(x) \log{p(x)}}$. The conditional entropy of one random variable $Y$ given a particular value of another is just the entropy of the conditional distribution, $H[Y|X=x] \equiv -\sum_{y}{p(y|x) \log{p(y|x)}}$. The conditional entropy is the average of this, $H[Y|X] \equiv \sum_{x}{p(x) H[Y|X=x]}$. The information $X$ contains about $Y$ is the (average) amount by which conditioning on $X$ reduces the entropy of $Y$, $I[X;Y] \equiv H[Y] - H[Y|X]$. It turns out that this is always equal to $H[X] - H[X|Y] = I[Y;X]$, hence "mutual information". The term "self-information" is sometimes used by contrast for $H[X]$, which you can convince yourself is also equal to $I[X;X]$. Wikipedia, by contrast, is claiming that "self-information" refers to the quantity $-\log{p(x)}$, so it's a property of a particular outcome or event $x$, rather than of a probability distribution or random variable.

**: I realize that I may still have enough of an online reputation that by posting this, others will fix the article and keep it fixed.
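The definitions in the first footnote are easy to check numerically. What follows is only a minimal sketch: the three-outcome distribution, the function names, and the choice of base-2 logarithms are all assumptions made for illustration. It shows that the entropy $H[X]$ equals the expected surprisal and also equals $I[X;X]$, while the surprisal of any one outcome is a different quantity.

```python
import math

def entropy(pmf):
    """Shannon entropy, in bits, of a pmf given as {outcome: probability}."""
    return -sum(q * math.log2(q) for q in pmf.values() if q > 0)

def surprisal(pmf, x):
    """Negative log probability of the single outcome x, in bits."""
    return -math.log2(pmf[x])

def mutual_information(joint):
    """I[X;Y] = H[Y] - H[Y|X], from a joint pmf {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), q in joint.items():
        px[x] = px.get(x, 0.0) + q
        py[y] = py.get(y, 0.0) + q
    # H[Y|X] is the p(x)-weighted average of the entropies H[Y|X=x]
    h_y_given_x = 0.0
    for x, qx in px.items():
        if qx > 0:
            cond = {y: q / qx for (x2, y), q in joint.items() if x2 == x}
            h_y_given_x += qx * entropy(cond)
    return entropy(py) - h_y_given_x

# Hypothetical example distribution (not from any of the cited sources)
p = {"a": 0.5, "b": 0.25, "c": 0.25}

H = entropy(p)
expected_surprisal = sum(q * surprisal(p, x) for x, q in p.items())

# Joint distribution of (X, X): all the mass sits on the diagonal
joint_xx = {(x, x): q for x, q in p.items()}
I_xx = mutual_information(joint_xx)

print(H, expected_surprisal, I_xx)   # three copies of the same number
print(surprisal(p, "a"))             # a property of one outcome, not of X
```

The point of the design is that `entropy` and `mutual_information` take whole distributions as arguments, while `surprisal` also needs a particular outcome, which is exactly the distinction at issue.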

Posted at August 28, 2016 12:16 | permanent link

## July 31, 2016

### Books to Read While the Algae Grow in Your Fur, July 2016

Attention conservation notice: I have no taste.

Gordon S. Wood, The Radicalism of the American Revolution
Wood's thesis is that the revolution's radicalism wasn't so much in class struggle as it was in over-throwing the idea of a society of orders, of hierarchical chains of dependence and patronage descending from a monarch through aristocrats all the way down. I am not sure I find this fully convincing; Wood shows lots of examples of hierarchical-dependence before the revolution, lots of examples of its dissolution after, and lots of attacks on it during, but how hard would it be to find comparable anecdotes which don't fit his scheme? Similarly, how hard would it be to find anecdotes which fit Wood's scheme, during these periods, for England, or for that matter for the British colonies in the West Indies? If the answer in both cases is "very hard", then I'd be more persuaded; but that's not something I am competent to assess.
Elsa Hart, Jade Dragon Mountain
Mind candy historical mystery: one part imitation of Judge Dee (from the early Qing, rather than the Tang, and from a lower point of view on the social scale) to one part Arabian Nights; more enjoyable than it has any right to be.
Linda Nagata, Deception Well
This is one of Nagata's republished novels from the 1990s, when she was, quite simply, one of the best hard-SF writers going. It shows all her virtues: elegant writing, rigorous large-scale imagination, a story growing naturally from the setting, and a certain emotional detachment from her characters which does not interfere with the narrative drive. (I have my suspicions as to why Nagata's writing career went into hiatus.)
Peter Sis, The Conference of the Birds
There is something incredibly charming about the thought of a Czech-American illustrator adapting a 4500-line medieval Persian Sufi epic poem as, essentially, a comic book. It becomes even more charming when it's pulled off very well, as it is here.
Stephen M. Stigler, Seven Pillars of Statistical Wisdom
Stigler is, it's fair to say, the pre-eminent historian of statistics from an "internal", technical-development-of-the-field perspective. This is him explaining where seven key principles came from. I enjoyed it, but I am going to be a jerk and say that this book narrowly but decisively misses being great. The reason is that Stigler's implicit reader is someone who already knows modern statistics. The text goes along happily explaining (for example) why randomized experiments are such a great idea, and then will make references which are incomprehensible unless you've done analysis of variance, remember what "interaction" means in that context, and recall what kind of experimental designs let you get at interactions. I think that with a little more work Stigler could have produced a book which would have actually explained our ideas to non-statisticians, which would have been a triumph. Instead, this is just one for consumption within the tribe.
Ruth Downie, Vita Brevis
Mind-candy historical mystery novel, in which moving from Britannia to Rome to seek one's fortune results only in plot. Enjoyable separately from the rest of the series.
Charles Stross, The Nightmare Stacks
Mind candy contemporary fantasy, in which England is invaded by the Unseelie Host; also in which a nerdy vampire boy meets a manic pixie dream girl with a very evil step-mother, depicted under the light of Stross's take on common notions of romance... This is a really fun book, on multiple levels, and I endorse it strongly as mind candy. Stross has clearly tried to make it an alternate entry-point to the series, though I don't know (having been enjoying the series since the beginning) how comprehensible a new reader really would find the book, especially Alex's situation.
Tim Shallice and Richard P. Cooper, The Organisation of Mind
This is a learned, judicious, rather-comprehensive attempt to synthesize what we have learned about how human minds work, by studying the human brain, especially by studying evidence from selective brain damage (the theme of Shallice's earlier From Neuropsychology to Mental Structure) and from functional brain imaging.
Shallice and Cooper give two early chapters to the assumptions underlying neuropsychological studies, on the one hand, and imaging, on the other, and make the simple (but neglected) point that we should feel a lot more confidence in conclusions which are supported by evidence from both types of studies. They also, soundly, emphasize that both types of studies overwhelmingly rely on tasks, on asking people to do things, and then observing what happens. Drawing conclusions from such experiments thus relies on psychological theories of how people understand the instructions, the processes involved in carrying out the tasks, and the resources and capacities those processes call on. As the book moves on from verbal semantics and short-term memory (especially sensory memory) to complex forms of action and planning, autobiographical memory and abstract thinking, the case studies they consider become increasingly inconclusive and, if not quite mutually contradictory, then at least confused in aggregate. Shallice and Cooper argue, convincingly to me, that this is largely because investigators are not relying on well-thought-out and widely-agreed-upon task analyses, but rather on ones which are vague, merely-intuitive, or even tacit. They further argue that this is what really needs to be fixed if there is going to be actual scientific progress, rather than a mere assembly-line production of experiments. While this book is from 2010 and so pre-dates projects like NeuroSynth, I think an examination of how that valuable tool gets used (e.g.) would only reinforce their position.
(The next-to-last chapter is about consciousness. While they have some sensible things to say about what a neuropsychological theory of consciousness should try to explain, their account of Dehaene et al.'s "global workspace" theory just left me confused, because the Cartesian theater makes no more sense as a distributed network than a localized nodule. It does, however, make the theory sound interesting enough that it goes on the very long list of things to read.)
Over all, I would strongly recommend this to anyone with a serious interest in cognitive neuroscience, some prior acquaintance with at least one of its constituent fields, and the time to read 500 big, densely-printed pages.
Erratum: Despite p. 9, Norbert Wiener was an American, not a Hungarian, mathematician. (This may be the result of confusion with John von Neumann, who was Hungarian-American, and is also mentioned on p. 9.)

Posted at July 31, 2016 23:59 | permanent link

## June 30, 2016

### Books to Read While the Algae Grow in Your Fur, June 2016

Attention conservation notice: I have no taste.

Oliver Morton, The Planet Remade: How Geoengineering Could Change the World
This is just as impressive, enlightening, gracefully-written and thoughtful as I'd expect from the author of Eaters of the Sun. The subtitle might lead you to expect simple boosterism for geoengineering as a response to climate change; nothing could be further from the truth. Instead this is an informative and nuanced look at what forms of geoengineering might be practical, what forms of it have (arguably) taken place (especially notable: a magnificent chapter on the planetary nitrogen cycle, and how it's been changed by artificial fertilizers), and why geoengineering needs to be investigated, rather than either advocated for or against. Morton is quite passionate, and quite right, that these cannot be technical decisions taken outside of politics, and that they will only do good if the decision-making is at once technical and political, about who will gain and lose what, with what justice and what accountability for the (literally) monstrous acts being contemplated --- or for the failure to undertake them.
If you find what I write at all interesting, I can hardly recommend this too highly.
Yoon Ha Lee, Ninefox Gambit
Mind candy military space opera. The conceit that getting human beings to accept the right sort of mathematically-designed ritual calendar could change the local laws of physics, permitting interestingly-gruesome "exotic" effects, is I think new, and used well; the writing is also quite good, though not spectacular. (The convoluted schemes are however too convoluted, or at least it feels like they should suffer more unforeseeable complications.)
Marisha Pessl, Night Film
Mind candy literary thriller, playing at being a horror story. (Or is it?) It's really quite gripping, and surprised me multiple times. I would happily read a lot more like this, but it seems like the sort of book which was a lot of work to write.
(I will gripe very slightly that the first-person narrator doesn't sound altogether convincing as a middle-aged divorced man. Of course he's much more convincing than most attempts by the narrator [or me] to write people like Pessl.)
(I will also gripe, carefully, that Pessl's characters present a lot of myths about Satanic ritual abuse from the 1980s and early 1990s as simple facts.)
Michelle Goldberg, The Goddess Pose: The Audacious Life of Indra Devi, the Woman Who Helped Bring Yoga to the West
This is awesome. It's very much a popular book, but Goldberg has clearly dived into the scholarly literature about things like the Mughal-era history of yoga, the role of Theosophy, etc. It also helps tremendously that Goldberg doesn't take her subject's self-presentation(s) at face value. This is fascinating stuff even if you don't care about yoga.
(It's also a bit depressing that a woman of such obvious energy, determination, openness to novelty and multi-form abilities found nothing better to do with herself for much of her long life than promote quackery and cults. Whether she, and the world, would have been better off in finance or marketing is, however, a nice question.)
Stephen King, End of Watch
Mind candy. This is the end of a trilogy which began as a mundane (but good!) thriller, and had a second book which was almost entirely free of supernatural elements. This one, however, is just a flat-out horror novel, and its biggest flaw to my mind is that the characters accept the super-natural much too readily. It's still fun, but it's a bit weak.
K. B. Spangler, State Machine
Mind candy science fiction mystery. This is the third Rachel Peng novel; it'll make a lot more sense, and be a lot more enjoyable, if you've read the first two; having done so, I found this outing enjoyable. However, there are certain plot elements whose significance will only be clear to readers of Spangler's web comic, which is rather different in tone than these books. I'll be curious to see how she handles this going forward.
Martin Anthony and Peter L. Bartlett, Neural Network Learning: Theoretical Foundations
In the usual way, my brief remarks grew to a full review. Shorter me: it's really a mathematical introduction to statistical learning theory, and still quite good even if you couldn't care less about neural networks.

Posted at June 30, 2016 23:59 | permanent link

## May 31, 2016

### Books to Read While the Algae Grow in Your Fur, May 2016

Attention conservation notice: I have no taste.

Amitav Ghosh, Sea of Poppies, River of Smoke and Flood of Fire
Collectively, "the Ibis trilogy", three historical novels centered around the First Opium War. They're beautifully written and the viewpoint characters (of which there are many, weaving in and out of the three books) are all very well-drawn. Beyond that, the setting and the protagonists give Ghosh a chance to depict — "comment on" suggests something more heavy-handed — imperialism, cultural diversity and exchange, free trade, multiple identities, enough varieties of love that cannot be acknowledged that I'd have to think to list them all, desires ditto, gardening, memory, the perils of getting what you want, and much, much else. It's really impressive, even if I was not very happy with the ending, and I will be revisiting it at a more leisurely pace.
Elizabeth Bear, Karen Memory
Mind candy. I am normally a big fan of Bear's writing, but just got through this one. The central feature of the book is the voice of the first-person narrator, Karen Memery (sic), and while this was clearly a labor of love on the part of the author, my reaction to that voice ranged from indifference to irritation. (The character wasn't irritating, her style was.) As for the steampunk setting --- as my friend Henry Farrell once put it, "the goggles do nothing", i.e., it seemed like it would have been very easy to tell a very similar, and no worse, story without those props. Clearly, though, lots of people like it very much, so I will just look forward to Bear's future books.
Elizabeth Hand, Generation Loss
Mind candy: a mystery or literary thriller (or both?). The writing is excellent and the protagonist, a failed New York photographer very much out of her element in Maine, is a very well-realized character (and a complete jerk, with impulses which are much, much worse). There are apparently sequels, which I look forward to tracking down. ROT-13'd, for being both a spoiler and catty: Gubhtu V qb ubcr Pnff trgf orggre nobhg fbyivat zlfgrevrf guna whfg orvat yhpxl jura fur qrpvqrf fbzrbar fbhaqf qhovbhf.
(Picked up on the recommendation of Aunt Agatha's in Ann Arbor.)
Ada Palmer, Too Like the Lightning
This is a deeply impressive effort to take seriously the line that "history is the trade secret of science fiction". That is, Palmer has tried to craft a 25th century which is as strange, as familiar, and as both-at-once-because-that's-not-what-we-meant, as our own time would have seemed to someone from the 17th century. This applies not just to the world-building but also to the story-telling (e.g., the way her narrator is simultaneously speaking to his own future and trying to channel [what he thinks of as] an 18th-century voice). This is, to my mind, exactly the sort of thing good science fiction should do. I hope the example of the effort catches on, though I worry that it will merely be specific inventions which get imitated.
Having enthused about setting and narration, I have to admit to being more ambivalent about the plot. Or maybe plots; there are at least two, one revolving around the high politics of the world, and the other around a young boy who seems to have miraculous powers. Both are hard to summarize, or even describe, and both are left very much unresolved at the end of this book. I find it hard to say whether I like the story, though I was certainly eager enough to keep reading, and am frustrated enough by not knowing what happened next that I pre-ordered the sequel.
ObLinkage: Palmer's round-up of her self-presentations and reviews by others.
Robert Jackson Bennett, American Elsewhere
Mind-candy contemporary fantasy, but of truly exceptional quality. This is in many ways a meditation on Lovecraftian themes, transposed to the Southwest and rationalized with "because of quantum" (*). (Spoiler-proofed discussion below.) But it's not just yet another re-hashing of monsters and tropes from Lovecraft, which would only matter for those who are already fans of that micro-genre. Rather it's a work of genuine artistry and originality, as well as a hell of a lot of fun. The only real point at which I see a failure, or at least a lost opportunity, is that if you are going to tell a story which revolves around physicists in northern New Mexico unleashing something monstrous, you really should engage more with the reality of Los Alamos... But, again, as entertainment this is just remarkably good.
Discussion of the Lovecraftian connections, ROT-13'd for spoilers: Oraargg znxrf ab hfr bs Ybirpensg'f fcrpvsvp zbafgref be cebcf. (Gurer ner n srj zragvbaf bs syhgvat, naq n guebj-njnl nobhg "jura gur fgnef nyvta", juvpu frrz yvxr qryvorengr ersreraprf, ohg nera'g pbafrdhragvny.) Gur gehyl Ybirpensgvna ovgf pbzr va jvgu gur crbcyr va Jvax sebz ryfrjurer, v.r., bgure qvzrafvbaf jvgu qvssrerag culfvpny ynjf. (Gur ivfvbaf jr pngpu bs jung gurve jbeyq vf yvxr ner rrevr.) Gurve gehr nccrnenaprf unir gur hfhny pbzcyrzrag bs gragnpyrf naq gur yvxr, ohg zber vzcbegnagyl gurl ner napvrag, vauhzna orvatf bs vaperqvoyr cbjre, jubfr gehr sbezf naq angherf ner zber guna gur beqvanel uhzna zvaq pna fgnaq gb nffvzvyngr. Gurl unir pbadhrerq znal jbeyqf, naq orra nf tbqf gurer, ohg urer gurl ner zber be yrff uvqqra. Naq lrg gurl ner pbzcyrgryl vagrejbira vagb gur uhzna yvsr bs gur vqlyyvp gbja bs Jvax, juvpu fvzcyl jbhyq abg rkvfg jvgubhg gurz. Crbcyr pbzr gb zber be yrff qvfgheovat evghny neenatrzragf jvgu gurfr cbjref, jvgubhg rirel dhvgr orvat ubarfg jvgu gurzfryirf nobhg jung gurl ner qbvat. Naq gur urebvar svaqf guvf jubyr jbeyq obgu ubeevoyr naq snfpvangvat, naq gura qvfpbiref gung fur vf npghnyyl cneg bs vg, zhpu zber vagvzngryl guna nalbar ryfr va Jvax; gung gur ivrjcbvag punenpgre vf va snpg bar bs gur zbafgref vf n irel Ybirpensgvna gbhpu. Gur ybj-yvsr pevzvanyf ba gur obeqref bs Jvax, gur ebyr bs Zban'f onol, naq gur pyvznpgvp qrfgehpgvba bs gur gbja, ba gur bgure unaq, nyy frrz zber yvxr Fgrcura Xvat.
Query: Do Wink and Night Vale communicate?
*: Whereas in Lovecraft they were rationalized with "because of relativity".
Michael Alan Williams, Rethinking "Gnosticism": An Argument for Dismantling a Dubious Category
I first read this by chance in graduate school, when it seemed to me a really good demonstration of how to do properly critical and skeptical, but not nihilistic, intellectual history. (Among other things, this includes admitting when the social component of such history is largely guesswork.)
On re-reading after fifteen (!) years, I still find the main thesis largely persuasive. To attempt my own summary: the ancient sources which modern scholars label "gnostic" are united neither by clear evidence of a shared tradition or organization, nor even by the reports of the orthodox heresiologists; the supposed "anti-cosmic" attitude, forced alternative of either extreme asceticism or licentiousness, etc., are not supported by the texts (and the latter is a bog-standard accusation of the orthodox against everyone), and seem to be largely modern constructions or interpretations; and that, in short, it would be better to chuck the whole category of "gnosticism" in favor of clearer and more empirical ones, like "biblical demiurgical traditions". (Though Williams doesn't harp on this, one could then investigate, rather than pre-judge, questions like "did all texts with biblical demiurgical myths share a common origin?" and "what range of attitudes towards the human body are shown in such texts?".) I do feel like this would have been even stronger had it included an account of how the modern concept of gnosticism had evolved. I also, of course, feel like I really shouldn't pronounce anything until reading the counter-arguments, but naturally I haven't taken the time to track them down...

Posted at May 31, 2016 23:59 | permanent link

## May 04, 2016

### "Nonparametric Estimation and Comparison for Networks" (Friday at U. Washington)

Attention conservation notice: An academic promoting his own talk. Even if you can get past that, only of interest if you (1) care about statistical methods for comparing network data sets, and (2) will be in Seattle on Friday.

Since the coin came up heads, I ought to mention I'm giving a talk at the end of the week:

"Nonparametric Estimation and Comparison for Networks", UW-Seattle statistics dept. seminar
Abstract: Scientific questions about networks are often comparative: we want to know whether the difference between two networks is just noise, and, if not, how their structures differ. I'll describe a general framework for network comparison, based on testing whether the distance between models estimated from separate networks exceeds what we'd expect based on a pooled estimate. This framework is especially useful with nonparametric network models, such as densities of latent node locations, or continuous generalizations of block models ("graphons"); the estimation methods for those models also let us generate surrogate data, predict links, and summarize structure.
(Joint work with Dena Asta, Chris Genovese, Brian Karrer, Andrew Thomas, and Lawrence Wang.)
Time and place: 3:30--4:30 pm on Friday, 6 May 2016, in SMI 211, UW-Seattle
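
The comparison idea in the abstract (is the distance between two separately estimated models larger than a pooled estimate would lead us to expect?) can be illustrated with a deliberately crude toy. This is only a sketch under loud assumptions: the "model" here is a single edge density for an undirected random graph, the distance is the absolute difference in densities, and the function names and example numbers are hypothetical. It is a parametric-bootstrap stand-in, not the nonparametric models or estimators of the talk.

```python
import random

def simulate_density(n_pairs, p, rng):
    """Edge density of a surrogate network: each of n_pairs possible
    edges is present independently with probability p (toy assumption)."""
    return sum(rng.random() < p for _ in range(n_pairs)) / n_pairs

def comparison_pvalue(edges1, pairs1, edges2, pairs2, n_sim=1000, seed=0):
    """Bootstrap p-value for 'the two networks share one edge density'."""
    rng = random.Random(seed)
    # Distance between the models estimated separately from each network
    observed = abs(edges1 / pairs1 - edges2 / pairs2)
    # Pooled estimate: treat both networks as draws from one model
    pooled = (edges1 + edges2) / (pairs1 + pairs2)
    exceed = 0
    for _ in range(n_sim):
        # Surrogate pair of networks generated from the pooled model
        d = abs(simulate_density(pairs1, pooled, rng)
                - simulate_density(pairs2, pooled, rng))
        if d >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive
    return (exceed + 1) / (n_sim + 1)

# Hypothetical data: two 30-node undirected networks, 435 possible edges each
pairs = 30 * 29 // 2
print(comparison_pvalue(100, pairs, 105, pairs))  # similar densities
print(comparison_pvalue(60, pairs, 180, pairs))   # very different densities
```

Swapping in a richer model only changes three pieces: how the model is estimated, how the distance between two estimates is measured, and how surrogate networks are drawn from the pooled estimate.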

Posted at May 04, 2016 23:59 | permanent link

### "Partitioning a Large Simulation as It Runs" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about running large simulations which are actually good for something, and (2) will be in Pittsburgh on Tuesday.
Kary Myers, "Partitioning a Large Simulation as It Runs" (Technometrics forthcoming)
Abstract: As computer simulations continue to grow in size and complexity, they present a particularly challenging class of big data problems. Many application areas are moving toward exascale computing systems, systems that perform $10^{18}$ FLOPS (FLoating-point Operations Per Second) --- a billion billion calculations per second. Simulations at this scale can generate output that exceeds both the storage capacity and the bandwidth available for transfer to storage, making post-processing and analysis challenging. One approach is to embed some analyses in the simulation while the simulation is running --- a strategy often called in situ analysis --- to reduce the need for transfer to storage. Another strategy is to save only a reduced set of time steps rather than the full simulation. Typically the selected time steps are evenly spaced, where the spacing can be defined by the budget for storage and transfer. Our work combines both of these ideas to introduce an online in situ method for identifying a reduced set of time steps of the simulation to save. Our approach significantly reduces the data transfer and storage requirements, and it provides improved fidelity to the simulation to facilitate post-processing and reconstruction. We illustrate the method using a computer simulation that supported NASA's 2009 Lunar Crater Observation and Sensing Satellite mission.
Time and place: 4--5 pm on Tuesday, 10 May 2016, in Baker Hall 235B

As always, the talk is free and open to the public.

Posted at May 04, 2016 03:00 | permanent link

## April 30, 2016

### Books to Read While the Algae Grow in Your Fur, April 2016

Attention conservation notice: I have no taste.

Ruth Downie, Medicus; Terra Incognita; Persona Non Grata; Caveat Emptor; Semper Fidelis; Tabula Rasa
Mind candy: historical mysteries set in early 2nd century Roman Britain (and southern Gaul), following the mis-adventures of a Roman legionary doctor and his British wife. (Well, originally Tilla is his slave, but it's complicated.) They are, for me, absolute catnip, and the perfect thing to binge read while in the stage of recovering from food-poisoning where I can read but can't do anything more useful. (I also can't help thinking that they are exactly the sort of thing my grandmother would have loved.)
Sequel.
Kathleen George, A Measure of Blood
Yet another Pittsburgh-centric mystery, taking place largely in the mind of the murderer. Much of the action happens around the University of Pittsburgh, i.e., just down the street.
John Brunner, The Gaudy Shadows
Mind candy, and not exactly recommended. Brunner was one of the great science fiction writers, the publishers of the ancient paperback edition I have played this up, and there is in fact a very light science-fictional angle to the story. But really it's a mystery novel which is very much a period piece of Swinging London. I enjoyed it, but I also found it funny in ways I doubt Brunner intended. For Brunner completists (in which case, this is, astonishingly, available electronically), or those seeking documents of the milieu.
Scott Hawkins, The Library at Mount Char
Strictly speaking, this is a contemporary fantasy novel set in exurban Virginia, where the main characters are American children who have been selected by a nigh-omniscient teacher to learn the mystic arts at the titular library. What raises it above the level of mind candy is the fact that such a description gives you no idea whatsoever of how strange this story is, either in its content or in its narration. Hawkins is obviously showing off from the very first lines (which hooked me), and makes basically no concessions for weak readers. He also has a pitiless quality towards his characters which I, for one, found very agreeable. The only thing I can begin to compare it to is somebody reading Shadowland, and then saying "That was really good, but Peter Straub's imagination is just too nice and normal". Even that doesn't really convey how impressive a performance this is.
(Picked up on Kameron Hurley's recommendation.)
Jen Williams, The Copper Promise
Mind candy: old-school fantasy, clearly inspired by role-playing games (there are both dungeons, plural, and dragons), but very enjoyably written, delivering the pleasures of light-hearted adventure without being either morally obtuse or wallowing in self-satisfied grimdarkness. It's self-contained, but at least one sequel has come out in the UK already, and both will appear in the US within the year.
I forget where I saw this recommended, but whoever it was, thank you; and additional thanks to a surprisingly-good used English-language bookstore in Amsterdam last summer.
Eric Smith and Harold J. Morowitz, The Origin and Nature of Life on Earth: The Emergence of the Fourth Geosphere
To quote some know-it-all from the dust-jacket, "This is a truly unusual work of scholarship, which offers both novel perspectives on a huge range of disciplines and a model of scientific synthesis. This is a remarkable, and remarkably impressive, book." --- I will try to say more about this book in the coming month.
Disclaimer: Eric is one of the smartest people I've ever met, and, despite that, a friend.
Kelley Armstrong, Forest of Ruin
Mind candy fantasy: a satisfying conclusion to the series, but not quite as satisfying to me as if \SPOILER had not turned out so happily. (On the other hand, I really didn't see that particular twist coming.)
Jack Campbell, The Pirates of Pacta Servanda
Mind candy, continuing the story from previous volumes, and basically incomprehensible without them. In this installment, a group of ~~ideological extremists~~ our heroes ~~establish a safe-haven in a failed state~~ find refuge from ~~the whole of the international community~~ their enemies, ~~running guns to support one warlord over another~~ defending innocent civilians and the last remnants of a traditional monarchy.
Catherine Wilson, Epicureanism at the Origins of Modernity
A gracefully written survey of Epicurean themes in philosophy and science, and to a lesser extent general literary culture, during the 17th century --- as in Bacon, Boyle, Hobbes, Locke, Descartes, Spinoza, various erudite libertines, etc. Wilson considers physical, moral and meta-physical ideas, all at a very qualitative level. (E.g., she says relatively little --- though not nothing --- about the increasing role of mathematics in 17th century physical speculations, which from my perspective is one of the biggest differences between ancient atomism and its early-modern descendant.) Very appropriately, she also covers anti-Epicurean reactions, like that of Leibniz, including discussing what they owed to their opponents. The organization is thematic rather than chronological, but the themes are themselves fairly logically arranged. It definitely presumes a broad familiarity with 17th century thought, but not much knowledge of Epicureanism, and it's very skillfully presented. This is the first book of Wilson's I've read, but lots of her stuff looks interesting and I will certainly be tracking down more.
Dream Street: W. Eugene Smith's Pittsburgh Project
Beautiful, beautiful photographs of the city from 1955--1957.
(Many but not all of them can be seen online through Magnum.) The composition and selection are both incredible. Smith was evidently a real piece of work, but still the story of a multi-year, career-wrecking obsession with capturing the whole of the life of a city feels, except for the technology, as though it were ripped straight from the Romantic period. (My neighborhood seems to have changed remarkably little in its character over the last sixty years.) Patrick Manning, Slavery and African Life: Occidental, Oriental, and African Slave Trades A short but compendious history of the African slave trades --- to the Americas and other European colonies, to north Africa, southwest and south Asia ("oriental"), and within Africa --- their place in world history, their impact on African societies, and their all-too-gradual dissolution. An intriguing feature is the use of a demographic simulation --- what I'd call a "compartmental model" --- to estimate the historical sizes of the populations from which slaves were drawn, and so the impact of the slave trade on population growth and sex ratios within Africa. It would be very interesting to re-do the estimation here. (Thanks to Prof. Manning for lending me a copy of his book.) Tony Cliff, Delilah Dirk and the King's Shilling Comic book mind candy, in which Miss Dirk and Mister Selim find themselves compelled to go to England, and mayhem and social sniping ensue. (Previously) Marie Brennan, In the Labyrinth of Drakes Mind candy, enjoyable fantasy of 19th century natural history and archaeology division. (Previously.) N. K. Jemisin, The Fifth Season Epic fantasy, but I think it rises above the level of mind candy. The approach to story-telling starts out by looking like bog-standard epic fantasy, if well done, but then gets more complicated and interesting (in spoilerish ways). 
Even better is the world-building: a planet where plate tectonics is so active that the dominant ideology is that the Earth is our father, and he hates us. The "Fifth Season" of the title are the irregular geological disasters which make the only known continent nearly uninhabitable; their depiction is at once chilling and clearly a labor of love. (If it is wrong to be charmed by the range and depths of her catastrophes, then I don't want to be right.) Because this is a fantasy novel, there is also a minority group which has the useful ability of being able to quell these disasters. (Jemisin, characteristically, has thought about the thermodynamics.) They are simultaneously valued for their abilities and despised for their different-ness, with a range of plausible racial stereotypes, more or less internalized by the enslaved members of the group. Because Jemisin is a good novelist, none of this maps exactly on to any real-world minority. The sequel is coming later this year, and can hardly arrive too soon.

Posted at April 30, 2016 23:59 | permanent link

## April 20, 2016

### In memoriam Prita Shireen Kumarappa Shalizi

Posted at April 20, 2016 09:50 | permanent link

## April 19, 2016

### Course Announcements: Statistical Network Models, Fall 2016

Attention conservation notice: Self-promotion, and irrelevant unless you (1) will be a student at Carnegie Mellon in the fall, or (2) have a morbid curiosity about a field in which the realities of social life are first caricatured into an impoverished formalism of dots and lines, devoid even of visual interest and incapable of distinguishing the real process of making movies from a mere sketch of the nervous system of a worm, and then further and further abstracted into more and more recondite stochastic models, all expounded by someone who has never himself taken a class in either social science or any of the relevant mathematics.
Two, new, half-semester courses for the fall: 36-720, Statistical Network Models 6 units, mini-semester 1; Mondays and Wednesdays 3:00--4:20 pm, Baker Hall 235A This course is a rapid introduction to the statistical modeling of social, biological and technological networks. Emphasis will be on statistical methodology and subject-matter-agnostic models, rather than on the specifics of different application areas. There are no formal pre-requisites, and no prior experience with networks is expected, but familiarity with statistical modeling is essential. Topics (subject to revision): basic graph theory; data collection and sampling; random graphs; block models and community discovery; latent space models; "small world" and preferential attachment models; exponential-family random graph models; visualization; model validation; dynamic processes on networks. 36-781, Advanced Network Modeling 6 units, mini-semester 2; Tuesdays and Thursdays 1:30--2:50 pm, Wean Hall 5312 Recent work on infinite-dimensional models of networks is based on the related notions of graph limits and of decomposing symmetric network models into mixtures of simpler ones. This course aims to bring students with a working knowledge of network modeling close to the research frontier. Students will be expected to complete projects which could be original research or literature reviews. There are no formal pre-requisites, but the intended audience consists of students who are already familiar with networks, with statistical modeling, and with advanced probability. Others may find it possible to keep up, but you do so at your own risk. Topics (subject to revision): exchangeable networks; the Aldous-Hoover representation theorem for exchangeable network models; limits of dense graph sequences ("graphons"); connection to stochastic block models; non-parametric estimation and comparison; approaches to sparse graphs. 
720 is targeted at first-year graduate students in statistics and related fields, but is open to everyone, even well-prepared undergrads. Those more familiar with social networks who want to learn about modeling are also welcome, but should probably check with me first. 781 is deliberately going to demand rather more mathematical maturity. Auditors are welcome in both classes. Posted at April 19, 2016 16:00 | permanent link ## April 15, 2016 ### "Network Comparisons Using Sample Splitting" My fifth Ph.D. student is defending his thesis towards the end of the month: Lawrence Wang, Network Comparisons Using Sample Splitting Abstract: Many scientific questions about networks are actually network comparison problems: Could two networks have reasonably come from a common source? Are there specific differences? We outline a procedure that tests the hypothesis that multiple networks were drawn from the same probabilistic source. In addition, when the networks are indeed different, our procedure may characterize the differences between the sources. We first address the case where the two networks being compared share the same exact nodes. We wish to use common parametric network models and the standard likelihood ratio test (LRT), but the infeasibility of computing the maximum likelihood estimate in our selected families of models complicates matters. However, we take advantage of the fact that the standard likelihood ratio test has a simple asymptotic distribution under a specific restriction of the model family. In addition, we show that a sample splitting approach is applicable: We can use part of the network data to choose an appropriate model space, and use the remaining network data to compute the LRT statistic and appeal to its asymptotic null distribution to obtain an appropriate p-value. Moreover, we show that while a single sample split results in a random p-value, we can choose to do multiple sample splits and aggregate the resulting individual p-values. 
Sample splitting is a more general framework --- nothing is particularly special about the specific hypothesis we decide to test. We illustrate a couple of extensions of the framework which also provide different ways to characterize differences in network models. We also address the more general case where the two networks being compared no longer share the same set of nodes. The main difficulty in this case is that there might not be an implicit alignment of the nodes in the two networks. Our procedure relies on the graphon model family which can handle networks of any size, but more importantly can be put in an aligned form which makes it comparable. We show that the framework for alignment can be generalized, which allows this method to handle a larger class of models. Time and place: 3:30 pm on Monday, 25 April 2016 in Porter Hall A22 Posted at April 15, 2016 12:00 | permanent link ## April 01, 2016 ### You Think This Is Bad Attention conservation notice: Note the date. Any intelligent and well-intentioned person should have a huge, even over-riding preference for leaving existing social and political institutions and hierarchies alone, just because they are the existing ones. Obviously this can't rest on any presumption that existing institutions are very good, or very wise, or embody any particularly precious values, or are even morally indifferent. They are not. It would also be stupid to appeal to some sub-Darwinian notion that our institutions, just because they have come down to us, and so must have survived an extensive process of selection, are therefore adaptive. At best, that would show the institutions were good at reproducing themselves from generation to generation, not that they had any human or ethical merit. In any case the transmission of any tradition by human beings is inevitably partial and re-interpretive, and so we have no reason to defer to tradition as such. 
Stare decisis conservatism rests instead on much less cosy grounds: However awful things are now, they could always be worse, and humanity is both too dumb to avoid making things worse, and too mean to want to avoid making things worse even when it could.

The point about stupidity is elemental. If someone complains that an existing institution is unjust (or unfair, oppressive, etc.), their complaint only has force if a more just alternative is possible. (Otherwise, take it up with the Management.) But it only has political force if that more just alternative is not only possible, but we can figure out what it is. This, we are signally unsuited to do. Social science can tell us many interesting things, but on the most crucial questions of "What will happen if we do this?", we get either dogmatic, experimentally-falsified ideology (economics), or everything-is-obvious-once-you-know-the-answers just-so myths (every other branch of social science). "Try it, and see what happens" is the outer limit of social-scientific wisdom. This is no basis on which to erect a reliable social engineering, or even social handicrafts. When we try to deliberately change our institutions, we are, at best, guided by visions, endemic and epidemic superstitions, evidence-based haruspicy, and the academic version of looking at a list of random words and declaring they all relate to motel service. We have no basis to think that our reforms, if we can even implement them, will rectify the injustice that first aroused our ire, our pity, or our ambition, much less that the attempt won't create even worse problems.

Even getting our pet reform implemented is often going to be hopeless, because so much of our collective knowledge about how to get things done, socially, is tacit. That knowledge is not anything which its holders can put into words, or into a computer, much less into a schedule of prices, but is rather buried in their habits and inarticulate skills.
Often these are the habits and skills of a very small number of crucially-placed people, who are, not so coincidentally, vested in the existing institutions and complicit in the existing injustices. Even more, these are habits and skills which only work in a particular environment, usually a social environment. The same people, asked to make a modified institution work, will be less effective, even hopeless. Throwing the bums out gets rid of the people who knew how to get things done. Finally, and most crucially, think about what happens when existing institutions and arrangements are disturbed. Social life is always full of a clash of conflicting interests. (One of the few things the economists have right is that inside every positive-sum interaction, there is a negative-sum struggle over who gets the gains from cooperation.) When an institution seems settled, eternal, it fades from view, nobody fights over it. Its harsher lines may be softened by compassion (and condescension) on the side of those it advantages, or local and unofficial accommodations and arrangements, or even just from it being too much trouble to exploit it to the hilt. But question the institution, disturb it, make it obvious that there is something to fight over, and what happens? Those who gain from the injustice won't give it up merely because that would be right. Instead, they will press to keep what they have --- and even to claim more. Since this has become an open conflict of power, what emerges is not going to favor the lowly, poor and the weak. Or if that area of social life should, for a time, descend into chaos, well, the tyranny of structurelessness is real, and those who benefit from it are, again, those who are already advantaged, and willing to exploit those advantages. Things might be very different if people were able to agree on justice, and willing to follow it, but they are not. To recapitulate: People are foolish, selfish and cruel. 
This means that our institutions are always grossly unjust. But it also means that we don't know how to really make things better. It further means that trying to change anything turns it into a battlefield, where nothing good happens to anybody, least of all the weak and oppressed. Since our current institutions are at least survivable (proof: we've survived them), it's better to leave them alone. They'll change anyway, and that will cause enough grief, without deliberately courting more by ignorant meddling.

Of course, people who actually defend inherited institutions and arrangements just because they're inherited --- such people can usually be counted on the fingers of one fist. Corey Robin would argue --- and he has a case --- that the impulse behind most actually-existing conservatism is a positive liking for hierarchy. This was an attempt at trying to construct a case for conservatism which would employ all three of Hirschman's tropes of reactionary rhetoric, but also wouldn't fall apart at the first skeptical prod. (Readers who point me at Hayek will be ignored; readers who point me at "neo-reactionaries" will be mocked.) What I have written is still an assembly of fallacies, half-truths and hyperboles, but I flatter myself it would still stand a little inspection.

Posted at April 01, 2016 00:01 | permanent link

## March 31, 2016

### Books to Read While the Algae Grow in Your Fur, March 2016

Attention conservation notice: I have no taste.

Guido W. Imbens and Donald B. Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction
While I found less to disagree with about the over-all approach than I anticipated, I am genuinely surprised (not "shocked, shocked!" surprised) to find so much sloppiness in the mere data analysis. I can't recommend this book to anyone who isn't already well-trained in applied statistics. To say any more here would preempt my review for JASA, so I'll just link to that when it's out.
--- I will however mention one grumble, which didn't fit in the review. From p. 174:
The possible advantage of the frequentist approach [over the Bayesian] is that it avoids the need to specify the prior distribution $p(\theta)$ for the parameters governing the joint distribution of the two potential outcomes. However, this does not come without cost. Nearly always one has to rely on large sample approximations to justify the derived frequentist confidence intervals. But in large samples, by the Bernstein-Von Mises Theorem (e.g., Van Der Vaart, 1998), the practical implications of the choice of prior distribution is limited, and the alleged benefits of the frequentist approach vanish.
I don't see how to unpack everything objectionable in these few sentences without rehearsing the whole of this post, and adding "the bootstrap is a thing, you know".
Tarquin Hall, The Case of the Love Commandos
Mind candy: the latest in the mystery series, though enjoyable independently; this time, we find Vish Puri unwillingly drawn into the nexus of caste and politics in rural Uttar Pradesh.
Jack Campbell, The Dragons of Dorcastle, The Hidden Masters of Marandur, The Assassins of Altis
Mind candy science fantasy. There are some thematic similarities to Rosemary Kirstein's (much superior) Steerswomen books. Those themes are, as it were, here transcribed into the key of Teen's Own Adventures (Campbell gets points for having the Heroic Engineer with a Destiny be a young woman), with less compelling world-building than Kirstein. Still, I zoomed through these and await the sequels.
ROT-13'd for spoilers: Bar jnl va juvpu Xvefgrva'f obbxf ner fhcrevbe vf gung ure cebgntbavfgf unir gb npghnyyl svther bhg gur uvqqra gehguf bs gurve jbeyq, jurernf Pnzcoryy gnxrf gur ynml snagnfl-jevgre jnl bhg bs univat gurer or uvqqra fntrf jub pna whfg gryy gur urebrf rirelguvat. Nyfb, V nz abg fher V unir rire frra "orpnhfr bs dhnaghz!" hfrq fb funzryrffyl ol nal jevgre jub jnfa'g n zrqvpny dhnpx.
Paul McAuley, Into Everywhere Further into the future of his (excellent) Something Coming Through, in which finding that we are only the latest in a galaxy full of the remains of much older, much more powerful, and much weirder alien civilizations is not very good for humanity. For instance, the scientific method seems to atrophy as we move up the time-line, in much the way Chomsky fears will result from cheap computing [*]. There is a reason for this. ROT-13'd for spoilers: Gur eriryngvba ng gur raq, gung gur gehr nvz bs nyy guvf nyvra zrqqyvat vf abg gb qb fbzrguvat gb uhznavgl ohg gb trg hf gb cebqhpr NVf, orpnhfr gur shgher bs nal vagryyvtrag yvarntr vf hygvzngryl znpuvarf, vf bs pbhefr fgenvtug bhg bs Pynexr'f 2001. Guvf yrnqf zr gb jbaqre jurgure gurfr abiry'f nera'g ZpNhyrl va qvnybthr jvgu Pynexr, rfcrpvnyyl jvgu 2001 rg frd. naq gur Guveq Ynj, va zhpu gur jnl gung, fnl, Pbasyhrapr jnf ZpNhyrl va qvnybthr jvgu Jbysr naq gur Obbx bs gur Arj Fha. *: From Chapter 59, "Synchronicity": They didn't appear to use any kind of analytical reasoning to confirm their conjectures, employing instead a crude form of experimental Darwinism, seeding a matrix with algorithms modelling variations of their initial assumptions and letting them run to a halting state, selecting those that most resembled the observed conditions, and running and re-running everything over and over again until they had derived an algorithm that reproduced reality to an agreed level of statistical confidence. The wizards didn't care that this method gave no insights into the problems it attacked, or that they didn't understand how the solutions it yielded were related to the vast edifice of Euclidean mathematical theory. They weren't interested in theory. As far as they were concerned, if an algorithm gave the right answer, then plug it in: it was good to go. 
Sven Beckert, Empire of Cotton This is a really good global history of the development of the world's cotton industry from the opening of trans-Atlantic navigation down through about 1950. (An epilogue considers later events, but very cursorily.) The central incident is of course the industrial revolution that began in England in the late 18th century, which could only attain the scale it did because there were other parts of the world, notably the Americas, which could supply cotton on the requisite industrial scale; they did so through slavery. After abolition, the Americas also provided the pattern for making sure formally-free rural cultivators produced cotton for the market, rather than farming for subsistence, a pattern eagerly and often explicitly copied by imperial powers across the globe. Cotton was not just the first truly modern industry, it was for a long time the most important, and is arguably still one of the most important on a global scale, and so its story is, in large part, the story of how we got here. I have, as a supremely unqualified but opinionated non-historian, some quibbles. Stylistically, he over-uses pet phrases like "the empire of cotton" and "the white gold", and keeps reminding readers that they are probably wearing cotton. Analytically, and more seriously, Beckert makes much more of this world-wide division of labor than of machinery, which is a mistake. Industrialism within one country (say, the American south) would have been quite feasible; a worldwide capitalism limited to animal power and manual labor would be at best a flexible and adaptive poverty. His account of the decline of cotton manufacturing in Europe and North America in the 20th century refers only to the difference in wages between those countries and places like China or India, ignoring differences in productivity. On a different plane, this is possibly the only genuinely crypto-Marxist book to ever win the Bancroft prize. 
The over-lap in themes just with Capital is very striking: the violence of capitalist primitive accumulation, the division of labor on a world scale, the struggle over the working day in the Lancashire mills, the deep importance attached to the American Civil War, the praise of capitalism for developing the productive forces to the point where something better becomes feasible and necessary. And also some post-Marx Marxist themes: late 19th century imperialism as driven by rivalry among capitalists, an autonomous role for the state (as something more than just an executive committee for managing the affairs of the bourgeoisie, though it is that too), very odd statements about the Soviet Union and Maoist China. That Marx is mentioned only once, and that in passing, is surely no coincidence.
Emily Horne and Joey Comeau, Anatomy of Melancholy: The Best of A Softer World
Calling A Softer World one of the best web-comics gives no idea whatsoever of its merits. I was deeply saddened to learn it would end in 2015, and only partially consoled by the prospect of this book. I commend it to anyone who reads this blog with pleasure.
Elliott Kay, Dead Man's Debt
Mind candy: sequel to Poor Man's War and Rich Man's Fight, bringing the series to a satisfying stopping point. Probably not enjoyable without the previous books.
Salla Simukka, As Red as Blood, As White as Snow, As Black as Ebony
Mind candy, aptly described by James Davis Nicoll as "what would happen if a plucky girl detective like Nancy Drew wandered into a Kurt Wallander novel?"
Hilary Mantel, Bring Up the Bodies
Mind candy: Further literary, historical competence porn.
Christopher Hayes, Twilight of the Elites
First, go read the review/precis by Aaron Swartz (peace be upon him). I started this four years ago, then set it aside when I got busy, and only now took it back up (for obvious reasons) and finished it. Part of me reads this going "preach, brother, preach!"
In particular, the Iron Law of Meritocracy seems like a real contribution. (Pedantically, priority goes to James Flynn, though.) As a product of the meritocracy, whose parents are also products of the meritocracy, and who makes his living teaching at an elite school, this is not happy news, but there we are. Other parts would like to see more: allowing that we have a self-serving and dysfunctional elite now, were previous elites really any more functional or less self-serving? This is hardly an obvious point or one Hayes establishes. Generally, Hayes seems strongest when he's documenting the ways things are bad now, but he also needs to say that they're worse than before, or at least are bad in new ways, and that's lacking. [*] Of course this amounts to asking that he have written a different and much more academic book. As something at the border between a political tract and popular social science by a working journalist, it's astonishingly good.
*: Reading the bits about how the country feels like it's falling apart, I couldn't help thinking of the incredible opening to Joan Didion's "Slouching Towards Bethlehem":
The center was not holding. It was a country of bankruptcy notices and public-auction announcements and commonplace reports of casual killings and misplaced children and abandoned homes and vandals who misspelled even the four-letter words they scrawled. It was a country in which families routinely disappeared, trailing bad checks and repossession papers. Adolescents drifted from city to torn city, sloughing off both the past and the future as snakes shed their skins, children who were never taught and would never now learn the games that had held the society together. People were missing. Children were missing. Parents were missing. Those left behind filed desultory missing-persons reports, then moved on themselves.
Of course, Didion goes on:
It was not a country in open revolution. It was not a country under enemy siege.
It was the United States of America in the cold late spring of 1967, and the market was steady and the G.N.P. high and a great many articulate people seemed to have a sense of high social purpose and it might have been a spring of brave hopes and national promise, but it was not, and more and more people had the uneasy apprehension that it was not. All that seemed clear was that at some point we had aborted ourselves and butchered the job... Seanan McGuire, Indexing: Reflections Mind candy: more contemporary fantasy about fairy tales trying to escape from the dungeon dimensions and/or collective unconscious. Posted at March 31, 2016 23:59 | permanent link ## March 19, 2016 ### "Reassembling the History of the Novel" Attention conservation notice: Only of interest if you (1) care about the quantitative history of English novels, and (2) will be in Pittsburgh at the end of the month. I had nothing to do with making this happen — Scott Weingart did — but when the seminar gods offer me something this relevant to my interests, it behooves me to promote it: Allen Riddell, "Reassembling the History of the Novel" Abstract: How might the 19th century novel be studied and taught if all (surviving) novels were readily available to students and researchers? While many have lamented the fact that literary historians tend to ignore works outside the "canonical fraction" of the ~25,000 novels published in the British Isles during the 19th century, there have been few concrete proposals addressing the question of how surviving novels might productively enter research and teaching and participate in our thinking about the nexus of literature and society. This presentation describes the prospects for a data-intensive and sociologically-inclined history of the novel focused on the population of published novels, the novels' writers, and the writers' penumbra. (A group's penumbra is the set of individuals acquainted with members of the group.) 
Marshalling evidence from a range of sources and aided by probabilistic models of text data, I will demonstrate how this approach yields insights into two significant developments in the history of the English novel: (1) the rapid influx of male writers after 1815, and (2) the dramatic increase in the rate of publication of novels after 1830. The presentation also features a discussion of Franco Moretti's call, echoing Karl Popper, that literary historians should advance risky---and, in some cases, "testable"---hypotheses. Time and place: 4:30--5:30 pm on Wednesday, 30 March 2016 in Studio A, Hunt Library (first floor) As always, the talk is free and open to the public. Posted at March 19, 2016 20:24 | permanent link ## February 29, 2016 ### Books to Read While the Algae Grow in Your Fur, February 2016 Attention conservation notice: I have no taste. Douglas A. Blackmon, Slavery by Another Name: The Re-Enslavement of Black Americans from the Civil War to World War II The story told here is just as appalling as the sub-title promises. Blackmon focuses on Alabama, but makes it clear that stuff like this happened all over the South. Since this is popular rather than professional history, there is a bit more of you-are-there detail than I am completely comfortable with, and I wish there had been more about things like the Great Migration and the impact of agricultural mechanization. But it's still very well written, and the story it tells deserves to be much better known. Two comparatively minor points: 1. There were actually cases under Theodore Roosevelt of white men in the south being brought to court for holding black men as slaves. (The legal defense was that while amendments to the Constitution had banned slavery, there were no actual laws against it, so no crime.) 
This has all the elements which a big strand of our popular mythology looks for: a courtroom drama in which a fearless prosecutor and dedicated investigators, with the support of a reforming president, uncover a vast criminal enterprise, persuade reluctant witnesses to testify, bring the case before the public eye and an honest judge --- and the whole thing failed to do the slightest bit of good. I think Blackmon has to be aware of how this part of his narrative fits with these motifs, but fails to have the expected ending; it's probably all the more effective for his not being explicit about it.
2. It is probably irrational to feel more of a shameful connection to these injustices because U.S. Steel (and so Andrew Carnegie, and so Carnegie Tech) was one of the beneficiaries Blackmon highlights, but I do.
Matt Ruff, Lovecraft Country
Victor LaValle, The Ballad of Black Tom
Mind candy. Ruff's book follows the mis-adventures of an African-American family of science fiction fans in 1950s Chicago, in a world where it's not clear whether eldritch abominations or ordinary life is more soul-destroying. It's a bit episodic, but still well done. LaValle's novella is a re-imagining of one of Lovecraft's most racist stories, "The Horror at Red Hook", from the perspective of a black Harlemite who would have been, at best, a nameless minion in the original. It's an interesting choice of a work to re-imagine, because even drawing a veil over the bigotry, it's not one of Lovecraft's better stories. Why respond to an ugly piece of bad fiction from almost a century ago? The only good reason is that there is, underneath all the purple prose and the all-too-transparent fears, something of real imaginative power and value in Lovecraft's work, and that value should be available even to those whom he cast as monsters. The fact that LaValle is much better at cosmic horror than Lovecraft was in "Red Hook" is just icing on the cake.
If this is intriguing, it's worth reading LaValle and Ruff in conversation. (Previously for Ruff; previously for LaValle.)

Jo Walton, The Just City

This is, obviously, exactly what would happen if Athena and Apollo conspired to realize The Republic with a population of time-traveling Platonists, 10,800 child slaves bought in antiquity, and robots. Exactly what would happen, down to Socrates trolling everyone so hard that, well --- read it. Genre note: I thought the chapters from Simmea's viewpoint did a very good job of both sounding plausible, and playing off the now-well-worn conventions of young adult dystopias. Because, of course, from a certain angle that's what the Republic would be. (Shoved to the top of the pile by the outstanding Crooked Timber symposium on this book and its sequel [which is on its way to me].)

Robert Jackson Bennett, City of Blades

Mind-candy fantasy, sequel to City of Stairs, continuing the story of how the first technological power in a fantasy world deals with the consequences of having killed all the gods. It is as awesome as its predecessor, though I should perhaps say that Bennett is quite prepared to deal brutally with sympathetic characters. (There was a moment near the end where I thought he was going to reprise the cyclical metaphysics of Mr. Shivers, but fortunately I was wrong.)

J. H. Conway, Regular Algebra and Finite Machines

I liked the first half or so. In particular, the notion of the derivative of one regular event with respect to another is neat in itself, and the corresponding Taylor series gives a very direct way of translating a regular expression into a finite machine. But then Conway zoomed off into the algebraic stratosphere, and if there was any tether connecting him back to actual problems with formal languages or automata, I completely lost track of it, and didn't see the point.
(This is formally self-contained as far as automata and language theory goes, but definitely presumes a strong grasp of abstract algebra. Its full appreciation also evidently presumes more mathematical maturity than I possess.)

Posted at February 29, 2016 23:59 | permanent link

## February 25, 2016

### "Analyzing large-scale data: Taxi Tipping behavior in NYC" (This Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about large-scale data analysis and/or taxis, and (2) will be in Pittsburgh on Friday.

The last but by no means least seminar talk this week:

Taylor Arnold, "Analyzing large-scale data: Taxi Tipping behavior in NYC"

Abstract: Statisticians are increasingly tasked with providing insights from large streaming data sources, which can quickly grow to be terabytes or petabytes in size. In this talk, I explore novel approaches for applying classical and emerging techniques to large-scale datasets. Specifically, I discuss methodologies for expressing estimators in terms of the (weighted) Gramian matrix and other easily distributed summary statistics. I then present an abstraction layer for implementing chunk-wise algorithms that are interoperable over many parallel and distributed software frameworks. The utility and insights garnered from these methods are shown through an application to an event based dataset provided by the New York City Taxi and Limousine Commission. I have joined these observations, which detail every registered taxicab trip from 2009 to the present, with external sources such as weather conditions and demographics. I use the aforementioned techniques to explore factors associated with taxi demand and the tipping behavior of riders. My focus is on developing novel techniques to facilitate interactive exploratory data analysis and to construct interpretable models at scale.
Time and place: 4:30--5:30 pm on Friday, 26 February 2016, in Baker Hall A51

As always, the talk is free and open to the public.

Update: Dr. Arnold's talk has been pushed back a day due to travel delays.

Posted at February 25, 2016 11:16 | permanent link

### Denying the Service of a Differentially Private Database

Attention conservation notice: A half-clever dig at one of the more serious and constructive attempts to do something about an important problem that won't go away on its own. It doesn't even explain the idea it tries to undermine.

Jerzy's "cursory overview of differential privacy" post brings back to mind an idea which I doubt is original, but whose source I can't remember. (It's not Bambauer et al.'s "Fool's Gold: an Illustrated Critique of Differential Privacy" [ssrn/2326746], though they do make a related point about multiple queries.)

The point of differential privacy is to guarantee that adding or removing any one person from the data base can't change the likelihood function by more than a certain factor; that the log-likelihood remains within $\pm \epsilon$. This is achieved by adding noise with a Laplace (double-exponential) distribution to the output of any query from the data base, with the magnitude of the noise being inversely related to the required bound $\epsilon$. (Tighter privacy bounds require more noise.) The tricky bit is that these $\epsilon$s are additive across queries. If the $i^{\mathrm{th}}$ query can change the log-likelihood by up to $\pm \epsilon_i$, a series of queries can change the log-likelihood by up to $\sum_{i}{\epsilon_i}$. If the data-base owner allows a constant $\epsilon$ per query, we can then break the privacy by making lots of queries. Conversely, if the $\epsilon$ per query is not to be too tight, we can only allow a small number of constant-$\epsilon$ queries.
A final option is to gradually ramp down the $\epsilon_i$ so that their sum remains finite, e.g., $\epsilon_i \propto i^{-2}$. This would mean that early queries were subject to little distortion, but later ones were more and more noisy.

One side effect of any of these schemes, which is what I want to bring out, is that they offer a way to make the database unusable, or nearly unusable, for everyone else. I make the queries I want (if any), and then flood the server with random, pointless queries about the number of cars driven by left-handed dentists in Albuquerque (or whatever). Either the server has a fixed $\epsilon$ per query, and so a fixed upper limit on the number of queries, or the per-query $\epsilon$ shrinks (and so the noise grows) after each query. In the first case, the server has to stop answering others' queries; in the second, eventually they get only noise. Or --- more plausibly --- whoever runs the server has to abandon their differential privacy guarantee.

This same attack would also work, by the way, against the "re-usable holdout". That paper (not surprisingly, given the authors) is basically about creating a testing set, and then answering predictive models' queries about it while guaranteeing differential privacy. To keep the distortion from blowing up, only a limited number of queries can be asked of the testing-set server. That is, the server is explicitly allowed to return NA, rather than a proper answer, and it will always do so after enough questions. In the situation they imagine, though, of the server being a "leaderboard" in a competition among models, the simple way to win is to put in a model early (even a decent model, for form's sake), and then keep putting trivial variants of it in, as often as possible, as quickly as possible. This is because each time I submit a model, I deprive all my possible opponents of one use of the testing set, and if I'm fast enough I can keep them from ever having their models tested at all.
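To make the flooding attack concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than any real system's API: the `BudgetedCountServer` class, its `count` method, the budget of 1.0 with 0.25 charged per query, and the `ramped_epsilon` helper are all assumptions chosen to keep the arithmetic exact. It handles only counting queries (sensitivity 1), for which Laplace noise of scale $1/\epsilon$ suffices.

```python
import random


class BudgetedCountServer:
    """Toy differentially private server for counting queries (illustrative only)."""

    def __init__(self, records, total_epsilon, epsilon_per_query, seed=0):
        self.records = list(records)
        self.remaining = total_epsilon      # privacy budget still unspent
        self.eps = epsilon_per_query        # fixed cost charged per query
        self.rng = random.Random(seed)

    def count(self, predicate):
        """Return a noisy count, or None once the budget cannot cover a query."""
        if self.remaining < self.eps:
            return None
        self.remaining -= self.eps
        true_count = sum(1 for r in self.records if predicate(r))
        # A counting query has sensitivity 1, so the Laplace noise has scale 1/eps.
        # The difference of two iid exponentials with rate eps is Laplace(0, 1/eps).
        noise = self.rng.expovariate(self.eps) - self.rng.expovariate(self.eps)
        return true_count + noise


def ramped_epsilon(i, c=1.0):
    """Per-query epsilon proportional to i**-2 for query i = 1, 2, ..."""
    return c / i**2


# Fixed-epsilon case: a budget of 1.0 at 0.25 per query allows four answers in total.
server = BudgetedCountServer(range(100), total_epsilon=1.0, epsilon_per_query=0.25)
my_answer = server.count(lambda r: r < 50)              # the attacker's one real query

junk_answered = 0
while server.count(lambda r: r % 7 == 0) is not None:   # pointless flood
    junk_answered += 1

victim_answer = server.count(lambda r: r >= 90)         # later users are refused

# Ramp-down case: the summed epsilon stays bounded, but the noise scale
# 1/epsilon_i grows like i**2, so late queries return essentially noise.
total_ramped = sum(ramped_epsilon(i) for i in range(1, 10001))
noise_scale_100th = 1.0 / ramped_epsilon(100)

print(junk_answered, victim_answer, round(total_ramped, 4), noise_scale_100th)
```

In the fixed-$\epsilon$ branch the flood uses up whatever budget the attacker left, so the later user gets `None`; in the ramp-down branch nobody is refused, but anyone arriving after the flood is served at a noise scale that has grown quadratically in the number of earlier queries.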
Posted at February 25, 2016 11:09 | permanent link

## February 24, 2016

### On the Ethics of Expert Advocacy Consulting

Attention conservation notice: A ponderous elaboration of an acerbic line by Upton Sinclair. Written so long ago I've honestly forgotten what incident provoked it, then left to gather dust and re-discovered by accident.

A common defense of experts consulting for sometimes nefarious characters in legal cases is that the money isn't corrupting, if the expert happens to agree with the position already. So, for instance, if someone with relevant expertise has doubts about the link between cigarette smoking and cancer, or between fossil-fuel burning and global warming, what harm does it do if they accept money from Philip Morris or Exxon, to defray the cost of advocating this? By assumption, they're not lying about their expert opinion.

The problem with this excuse is that it pretends people never change their ideas. When we deal with each other as more-or-less honest people --- when we treat what others say as communications rather than as manipulations --- we do assume that those we're listening to are telling us things more-or-less as they see them. But we are also assuming that if the way they saw things changed, what they said would track that change. If they encountered new evidence, or even just new arguments, they would respond to them, they would evaluate them, and if they found them persuasive, they would not only change their minds, they would admit that they had done so. (Cf.) We know that it can be galling for anyone to admit that they were wrong, but that's part of what we're asking for when we trust experts.

And now the problem with the on-going paid advocacy relationship becomes obvious. It adds material injury to emotional insult as a reason not to admit that one has changed one's mind. The human animal being what it is, this becomes a reason not to change one's mind --- to ignore, or to explain away, new evidence and new argument.
Sometimes the new evidence is ambiguous, the new argument has real weaknesses, and then this desire not to be persuaded by it can perform a real intellectual function, with each side sharpening the other. (You could call this "the cunning of reason" if you wanted to be really pretentious.) But how is the non-expert to know whether your objections are really sound, or whether you are desperately BS-ing to preserve your retainer? Maybe they could figure it out, with a lot of work, but they would be right to be suspicious.

Posted at February 24, 2016 00:26 | permanent link

## February 21, 2016

### On the Uncertainty of the Bayesian Estimator

Attention conservation notice: A failed attempt at a dialogue, combining the philosophical sophistication and easy approachability of statistical theory with the mathematical precision and practical application of epistemology, dragged out for 2500+ words (and equations). You have better things to do than read me vent about manuscripts I volunteered to referee.

Scene: We ascend, by a dubious staircase, to the garret loft space of Confectioner-Stevedore Hall, at Robberbaron-Bloodmoney University, where we find two fragments of the author's consciousness, temporarily incarnated as perpetual post-docs from the Department of Statistical Data Science, sharing an unheated office.

Q: Are you unhappy with the manuscript you're reviewing?

A: Yes, but I don't see why you care.

Q: The stabbing motions of your pen are both ostentatious and distracting. If I listen to you rant about it, will you go back to working without the semaphore?

A: I think that just means you find it easier to ignore my words than anything else, but I'm willing to try.

Q: So, what is getting you worked up about the manuscript?

A: They take a perfectly reasonable --- though not obviously appropriate-to-the-problem --- regularized estimator, and then go through immense effort to Bayesify it. They end up with about seven levels of hierarchical priors.
Simple Metropolis-Hastings Monte Carlo would move as slowly as a continental plate, so they put vast efforts into speeding it up, and in a real technical triumph they get something which moves like a glacier.

Q: Isn't that rather fast these days?

A: If they try to scale up, my back-of-the-envelope calculation suggests they really will enter the regime where each data set will take a single Ph.D. thesis to analyze.

Q: So do you think that they're just masochists who're into frequentist pursuit, or do they have some reason for doing all these things that annoy you?

A: Their fondness for tables over figures does give me pause, but no, they claim to have a point. If they do all this work, they say, they can use their posterior distributions to quantify uncertainty in their estimates.

Q: That sounds like something statisticians should want to do. Haven't you been very pious about just that, about how handling uncertainty is what really sets statistics apart from other traditions of data analysis? Haven't I heard you say to students that they don't know anything until they know where the error bars go?

A: I suppose I have, though I don't recall that exact phrase. It's not the goal I object to, it's the way quantification of uncertainty is supposed to follow automatically from using Bayesian updating.

Q: You have to admit, the whole "posterior probability distribution over parameter values" thing certainly looks like a way of expressing uncertainty in quantitative form. In fact, last time we went around about this, didn't you admit that Bayesian agents are uncertain about parameters, though not about the probabilities of observable events?

A: I did, and they are, though that's very different from agreeing that they quantify uncertainty in any useful way --- that they handle uncertainty well.

Q: Fine, I'll play the straight man and offer a concrete proposal for you to poke holes in. Shall we keep it simple and just consider parametric inference?

A: By all means.
Q: Alright, then, I start with some prior probability distribution over a finite-dimensional vector-valued parameter $\theta$, say with density $\pi(\theta)$. I observe $x$ and have a model which gives me the likelihood $L(\theta) = p(x;\theta)$, and then my posterior distribution is fixed by

$$\pi(\theta|X=x) \propto L(\theta) \pi(\theta)$$

This is my measure-valued estimate. If I want a set-valued estimate of $\theta$, I can fix a level $\alpha$ and choose a region $C_{\alpha}$ with

$$\int_{C_{\alpha}}{\pi(\theta|X=x) d\theta} = \alpha$$

Perhaps I even preferentially grow $C_{\alpha}$ around the posterior mode, or something like that, so it looks pretty. How is $C_{\alpha}$ not a reasonable way of quantifying my uncertainty about $\theta$?

A: To begin with, I don't know the probability that the true $\theta \in C_{\alpha}$.

Q: How is it not $\alpha$, like it says right there on the label?

A: Again, I don't understand what that means.

Q: Are you attacking subjective probability? Is that where this is going? OK: sometimes, when a Bayesian agent and a bookmaker love each other very much, the bookie will offer the Bayesian bets on whether $\theta \in C_{\alpha}$, and the agent will be indifferent so long as the odds are $\alpha : 1-\alpha$. And even if the bookie is really a damn dirty Dutch gold-digger, the agent can't be pumped dry of money. What part of this do you not understand?

A: I hardly know where to begin. I will leave aside the color commentary. I will leave aside the internal issues with Dutch book arguments for conditionalization. I will not pursue the fascinating, even revealing idea that something which is supposedly a universal requirement of rationality needs such very historically-specific institutions and ideas as money and making book and betting odds for its expression. The important thing is that you're telling me that $\alpha$, the level of credibility or confidence, is really about your betting odds.

Q: Yes, and?
A: I do not see why I should care about the odds at which you might bet. It's even worse than that, actually, I do not see why I should care about the odds at which a machine you programmed with the saddle-blanket prior (or, if we were doing nonparametrics, an Afghan jirga process prior) would bet. I fail to see how those odds help me learn anything about the world, or even reasonably-warranted uncertainties in inferences about the world.

Q: May I indulge in mythology for a moment?

A: Keep it clean, students may come by.

Q: That leaves out all the best myths, but very well. Each morning, when woken by rosy-fingered Dawn, the goddess Tyche picks $\theta$ from (what else?) an urn, according to $\pi(\theta)$. Tyche then draws $x$ from $p(X;\theta)$, and $x$ is revealed to us by the Sibyl or the whisper of oak leaves or sheep's livers. Then we calculate $\pi(\theta|X=x)$ and $C_{\alpha}$. In consequence, the fraction of days on which $\theta \in C_{\alpha}$ is about $\alpha$. $\alpha$ is how often the credible set is right, and $1-\alpha$ is one of those error rates you like to go on about. Does this myth satisfy you?

A: Not really. I get that "Bayesian analysis treats the parameters as random". In fact, that myth suggests a very simple yet universal Monte Carlo scheme for sampling from any posterior distribution whatsoever, without any Markov chains or burn-in.

Q: Can you say more?

A: I should actually write it up. But now let's try to de-mythologize. I want to know what happens if we get rid of Tyche, or at least demote her from resetting $\theta$ every day to just picking $x$ from $p(x;\theta)$, with $\theta$ fixed by Zeus.

Q: I think you mean Ananke, Zeus would meddle with the parameters to cheat on Hera. Anyway, what do you think happens?

A: Well, $C_{\alpha}$ depends on the data, it's really $C_{\alpha}(x)$. Since $x$ is random, $X \sim p(\cdot;\theta)$, so is $C_{\alpha}$.
It follows a distribution of its own, and we can ask about $Pr_{\theta}(\theta \in C_{\alpha}(X) )$.

Q: Haven't we just agreed that that probability is just $\alpha$?

A: No, we've seen that

$$\int{Pr_{\theta}(\theta \in C_{\alpha}(X) ) \pi(\theta) d\theta} = \alpha$$

but that is a very different thing.

Q: How different could it possibly be?

A: As different as we like, at any particular $\theta$.

Q: Could the 99% credible sets contain $\theta$ only, say, 1% of the time?

A: Absolutely. This is the scenario of Larry's playlet, but he wrote that up because it actually happened in a project he was involved in.

Q: Isn't it a bit artificial to worry about the long-run proportion of the time you're right about parameters?

A: The same argument works if you estimate many parameters at once. When the brain-imaging people do fMRI experiments, they estimate how tens of thousands of little regions in the brain ("voxels") respond to stimuli. That means estimating tens of thousands of parameters. I don't think they'd be happy if their 99% intervals turned out to contain the right answer for only 1% of the voxels. But posterior betting odds don't have to have anything to do with how often bets are right, and usually they don't.

Q: Isn't "usually" very strong there?

A: No, I don't think so. D. A. S. Fraser has a wonderful paper, which should be better known, called "Is Bayes Posterior Just Quick and Dirty Confidence", and his answer to his own question is basically "Yes. Yes it is." More formally, he shows that the conditions for Bayesian credible sets to have correct coverage, to be confidence sets, are incredibly restrictive.

Q: But what about the Bernstein-von Mises theorem? Doesn't it say we don't have to worry for big samples, that credible sets are asymptotically confidence sets?

A: Not really.
It says that if you have a fixed-dimensional model, and the usual regularity conditions for maximum likelihood estimation hold, so that $\hat{\theta}_{MLE} \rightsquigarrow \mathcal{N}(\theta, n^{-1}I^{-1}(\theta))$, and some more regularity conditions hold, then the posterior distribution is also asymptotically $\mathcal{N}(\theta, n^{-1}I^{-1}(\theta))$.

Q: Wait, so the theorem says that when it applies, if I want to be Bayesian I might as well just skip all the MCMC and maximize the likelihood?

A: You might well think that. You might very well think that. I couldn't possibly comment.

Q: !?!

A: Except to add that the theorem breaks down in the high-dimensional regime where the number of parameters grows with the number of samples, and goes to hell in the non-parametric regime of infinite-dimensional parameters. (In fact, Fraser gives one-dimensional examples where the mis-match between Bayesian credible levels and actual coverage is asymptotically $O(1)$.) As Freedman said, if you want a confidence set, you need to build a confidence set, not mess around with credible sets.

Q: But surely coverage --- "confidence" --- isn't all that's needed? Suppose I have only a discrete parameter space, and for each point I flip a coin which comes up heads with probability $\alpha$. Now my $C_{\alpha}$ is all the parameter points where the coin came up heads. Its expected coverage is $\alpha$, as claimed. In fact, if I can come up with a Gygax test, say using the low-significance digits of $x$, I could invert that to get my confidence set, and get coverage of $\alpha$ exactly. What then?

A: I never said that coverage was all we needed from a set-valued estimator. It should also be consistent: as we get more data, the set should narrow in on $\theta$, no matter what $\theta$ happens to be. Your Gygax sets won't do that. My point is that if you're going to use probabilities, they ought to mean something, not just refer to some imaginary gambling racket going on in your head.
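The gap between prior-averaged coverage and coverage at a fixed $\theta$ can be checked numerically. The following is a minimal sketch under assumptions that are not in the dialogue: a conjugate toy model with $X_1, \ldots, X_n$ iid $\mathcal{N}(\theta, 1)$, a $\mathcal{N}(0, \tau^2)$ prior, and the central credible interval at level 0.95. The function name `coverage_at` and the particular values of $n$, $\tau$ and $\theta$ are illustrative choices. In this model the posterior mean is a shrunken sample mean, so $Pr_{\theta}(\theta \in C_{\alpha}(X))$ has a closed form.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def coverage_at(theta, n=10, tau=1.0, level=0.95):
    """Exact Pr_theta(theta in C(X)) for the central credible interval.

    Assumed toy model: X_1..X_n iid N(theta, 1), prior theta ~ N(0, tau^2).
    """
    v = 1.0 / (n + 1.0 / tau**2)        # posterior variance
    shrink = v * n                       # posterior mean = shrink * sample mean
    half_width = STD_NORMAL.inv_cdf(0.5 + level / 2) * v**0.5
    sd_mean = shrink / n**0.5            # sampling sd of the posterior mean
    center = shrink * theta              # sampling mean of the posterior mean
    # theta is covered when the posterior mean lands within half_width of theta.
    upper = (theta + half_width - center) / sd_mean
    lower = (theta - half_width - center) / sd_mean
    return STD_NORMAL.cdf(upper) - STD_NORMAL.cdf(lower)


# Average the pointwise coverage over draws of theta from the prior (tau = 1).
rng = random.Random(0)
prior_draws = [rng.gauss(0.0, 1.0) for _ in range(20000)]
prior_averaged = sum(coverage_at(t) for t in prior_draws) / len(prior_draws)

print(f"prior-averaged coverage: {prior_averaged:.3f}")
for theta in (0.0, 5.0, 20.0):
    print(f"coverage at theta = {theta}: {coverage_at(theta):.4f}")
```

Averaged over the prior, the coverage comes out close to the nominal 0.95, as the integral identity requires. At $\theta = 0$ it is slightly above nominal, and at values of $\theta$ the prior regards as wildly implausible it falls toward zero, because the interval is centered on a mean shrunk toward the prior. This is only the simplest conjugate case; it illustrates the distinction rather than reproducing any of the examples cited in the dialogue.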
Q: I am not going to let this point go so easily. It seems like you're insisting on calibration for Bayesian credible sets, that the fraction of them covering the truth be (about) the stated probability, right?

A: That seems like a pretty minimal requirement for treating supposed probabilities seriously. If (as on Discworld) "million to one chances turn up nine times out of ten", they're not really million to one.

Q: Fine --- but isn't the Bayesian agent calibrated with probability 1?

A: With subjective probability 1. But failure of calibration is actually typical or generic, in the topological sense.

Q: But maybe the world we live in isn't "typical" in that weird sense of the topologists?

A: Maybe! The fact that Bayesian agents put probability 1 on the "meager" set of sample paths where they are calibrated implies that lots of stochastic processes are supported on topologically-atypical sets of paths. But now we're leaning a lot on a pre-established harmony between the world and our prior-and-model.

Q: Let me take another tack. What if calibration typically fails, but typically fails just a little --- say probabilities are really $p(1 \pm \epsilon )$ when we think they're $p$. Would you be very concerned, if $\epsilon$ were small enough?

A: Honestly, no, but I have no good reason to think that, in general, approximate calibration or coverage is much more common than exact calibration. Anyway, we know that credible probabilities can be radically, dismally off as coverage probabilities, so it seems like a moot point.

Q: So what sense do you make of the uncertainties which come out of Bayesian procedures?

A: "If we started with a population of guesses distributed like this, and then selectively bred them to match the data, here's the dispersion of the final guesses."

Q: You don't think that sounds both thin and complicated?

A: Of course it's both. (And it only gets more complicated if I explain "selective breeding" and "matching the data".)
But it's the best sense I can make, these days, of Bayesian uncertainty quantification as she is computed.

Q: And what's your alternative?

A: I want to know about how differently the experiment, the estimate, could have turned out, even if the underlying reality were the same. Standard errors --- or median absolute errors, etc. --- and confidence sets are about that sort of uncertainty, about re-running the experiment. You might mess up, because your model is wrong, but at least there's a sensible notion of probability in there, referring to things happening in the world. The Bayesian alternative is some sort of sub-genetic-algorithm evolutionary optimization routine you are supposedly running in your mind, while I run a different one in my mind, etc.

Q: But what about all the criticisms of p-values and null hypothesis significance tests and so forth?

A: They all have Bayesian counterparts, as people like Andy Gelman and Christian Robert know very well. The difficulties aren't about not being Bayesian, but about things like testing stupid hypotheses, not accounting for multiple testing or model search, selective reporting, insufficient communication, etc. But now we're in danger of drifting really far from our starting point about uncertainty in estimation.

Q: Would you sum that up then?

A: I don't believe the uncertainties you get from just slapping a prior on something, even if you've chosen your prior so the MAP or the posterior mean matches some reasonable penalized estimator. Give me some reason to think that your posterior probabilities have some contact with reality, or I'll just see them as "quick and dirty confidence" --- only often not so quick and very dirty.

Q: Is that what you're going to put in your referee report?

A: I'll be more polite.

Disclaimer: Not a commentary on any specific talk, paper, or statistician.
One reason this is a failed attempt at a dialogue is that there is more Q could have said in defense of the Bayesian approach, or at least in objection to A. (I take some comfort in the fact that it's traditional for characters in dialogues to engage in high-class trolling.) Also, the non-existent Robberbaron-Bloodmoney University is not to be confused with the very real Carnegie Mellon University; for instance, the latter lacks a hyphen.

Manual trackback: Source-Filter.

Posted at February 21, 2016 20:23 | permanent link

## February 20, 2016

### "Learning Dynamics of Complex Systems from High-Dimensional Datasets" (This Week at the Statistics Seminar)

Attention conservation notice: Only of interest if (1) you care about statistics and complex systems, and (2) will be in Pittsburgh on Wednesday.

Sumanta Basu, "Learning Dynamics of Complex Systems from High-Dimensional Datasets"

Abstract: The problem of learning interrelationships among the components of large, complex systems from noisy, high-dimensional datasets is common in many areas of modern economic and biological sciences. Examples include macroeconomic policy making, financial risk management, gene regulatory network reconstruction and elucidating functional roles of epigenetic regulators driving cellular mechanisms. In addition to their inherent computational challenges, principled statistical analyses of these big data problems often face unique challenges emerging from temporal and cross-sectional dependence in the data and complex dynamics (heterogeneity, nonlinear and high-order interactions) among the system components. In this talk, I will start with network Granger causality --- a framework for structure learning and forecasting of large dynamic systems from multivariate time series and panel datasets using regularized estimation of high-dimensional vector autoregressive models.
I will discuss theoretical properties of the proposed estimates and demonstrate their advantages on a motivating application from financial econometrics --- system-wide risk monitoring of the U.S. financial sector before, during and after the crisis of 2007--2009. I will conclude with some of my ongoing works on learning nonlinear and potentially high-order interactions in high-dimensional, heterogeneous settings. I will introduce iterative Random Forest (iRF), a supervised learning algorithm based on randomized decision tree ensembles, that achieves predictive accuracy comparable to state-of-the-art learning machines and provides insight into high-order interaction relationships among features. I will demonstrate the usefulness of iRF on a motivating application from systems biology --- learning epigenetic landscape of enhancer elements in Drosophila melanogaster from next generation sequencing datasets.

Time and place: 4--5 pm on Wednesday, 24 February 2016, place TBA

As always, the talk is free and open to the public.

Posted at February 20, 2016 20:47 | permanent link

### "Likelihood-Based Methods of Mediation Analysis in the Context of Health Disparities" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about evidence on how inequality matters for health, and (2) will be in Pittsburgh on Tuesday.

Therri Usher, "Likelihood-Based Methods of Mediation Analysis in the Context of Health Disparities"

Abstract: African-Americans experience higher incidences of death and disability compared to non-Hispanic whites. Much of the existing research has focused on identifying the existence of health disparities, as methodological issues have hampered the development of health disparities research. In order to create solutions to eliminate health disparities, research must understand the mechanisms powering their existence.
Existing causal inference tools are not suitable for studying racial health disparities, as race cannot be manipulated or changed. For the same reason, mediators stand to be useful in creating avenues to intervene on existing health disparities. Structural equation modeling (SEM) may be a more promising tool for quantifying the causal framework of health disparities. One of the most widely-used tests for assessing mediation is the Sobel test (Sobel, 1982; MacKinnon et al, 2007). However, it has disadvantages, including lower power at smaller sample sizes. Therefore, this work focuses on three varying methods for assessing mediation and compares their performance to the Sobel test. The first method is an adjustment of the Sobel test that utilizes variance estimation using random covariates. The second method utilizes the joint distribution of the mediator and the outcome to determine profile likelihoods for the estimands of interest in order to derive distributions for their estimates. Finally, the third method utilizes Bayesian modeling techniques to fit the structural equation models and estimate the probability of mediation through quantile estimation. Simulations provided evidence that all three methods demonstrated comparable estimated statistical power compared to the Sobel test, often showcasing superior power at smaller sample sizes while providing more tools of inference into the presence of mediation. The methods were applied to assess whether diet mediates the relationship between race and blood pressure in non-Hispanic black and white subjects in the National Health and Nutrition Examination Survey (NHANES) from 1999-2004.

Time and place: 4:30--5:30 pm on Tuesday, 23 February 2016, in Baker Hall A51

As always, the talk is free and open to the public.
Posted at February 20, 2016 20:46 | permanent link

### "Multiple Testing and Adaptive Estimation via the Sorted L-One Norm" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) want to do high-dimensional regressions without claiming lots of discoveries which turn out to be false, and (2) will be in Pittsburgh on Monday.

"But Cosma", I hear you asking, "how can you be five talks into the spring seminar series without having had a single talk about false discovery rate control? Is the CMU department feeling quite itself?" I thank you for your concern, and hope this will set your mind at ease:

Weijie Su, "Multiple Testing and Adaptive Estimation via the Sorted L-One Norm"

Abstract: In many real-world statistical problems, we observe a large number of potentially explanatory variables of which a majority may be irrelevant. For this type of problem, controlling the false discovery rate (FDR) guarantees that most of the discoveries are truly explanatory and thus replicable. In this talk, we propose a new method named SLOPE to control the FDR in sparse high-dimensional linear regression. This computationally efficient procedure works by regularizing the fitted coefficients according to their ranks: the higher the rank, the larger the penalty. This is analogous to the Benjamini-Hochberg procedure, which compares more significant p-values with more stringent thresholds. Whenever the columns of the design matrix are not strongly correlated, we show empirically that SLOPE obtains FDR control at a reasonable level while offering substantial power. Although SLOPE is developed from a multiple testing viewpoint, we show the surprising result that it achieves optimal squared errors under Gaussian random designs over a wide range of sparsity classes. An appealing feature is that SLOPE does not require any knowledge of the degree of sparsity.
This adaptivity to unknown sparsity has to do with the FDR control, which strikes the right balance between bias and variance. The proof of this result presents several elements not found in the high-dimensional statistics literature.
Time and place: 4--5 pm on Monday, 22 February 2016, in Scaife Hall 125.

As always, the talk is free and open to the public.

Posted at February 20, 2016 20:45 | permanent link

## February 15, 2016

### "Science, Counterfactuals and Free Will" (In Two Weeks, Not at the Statistics Seminar)

Attention conservation notice: A distinguished but elderly scientist philosophizes in public.

As a Judea Pearl fanboy, it is inevitable that I would help promote this:

Judea Pearl, "Science, Counterfactuals and Free Will" (Dickson Prize Lecture)
Abstract: Counterfactuals, or fictitious changes, are the building blocks of scientific thought and the oxygen of moral behavior. The ability to reflect back on one's past actions and envision alternative scenarios is the basis of learning, free will, responsibility and social adaptation. Recent progress in the algorithmization of counterfactuals has advanced our understanding of this mode of reasoning and has brought us a step closer toward equipping machines with similar capabilities. Dr. Pearl will first describe a computational model of counterfactual reasoning, and then pose some of the more difficult problems that counterfactuals present: why evolution has endowed humans with the illusion of free will, and how it manages to keep that illusion so vivid in our brain.
Time and place: noon--1 pm on Monday, 29 February 2016, in McConomy Auditorium, University Center

As always, the talk is free and open to the public.

ObLinkage 1: Clark Glymour, "We believe in freedom of the will so that we can learn".

ObLinkage 2: Mightn't that "illusion of free will" be the only sort worth wanting?

Posted at February 15, 2016 16:49 | permanent link

### "Robust Causal Inference with Continuous Exposures" (This Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about statistical methods for causal inference, and (2) will be in Pittsburgh on Thursday.

There are whole books on causal inference which make it seem like the subject is exhausted by comparing the effect of The Treatment to the control condition. (cough Imbens and Rubin cough) But any approach to causal inference which can't grasp a dose-response curve might be sound but is not complete. Nor is there any reason, in this day and age, to stick to simple regression. Fortunately, we don't have to:

Edward Kennedy, "Robust Causal Inference with Continuous Exposures" (arxiv:1507.00747)
Abstract: Continuous treatments (e.g., doses) arise often in practice, but standard causal effect estimators are limited: they either employ parametric models for the effect curve, or else do not allow for doubly robust covariate adjustment. Double robustness allows one of two nuisance estimators to be misspecified, and is important for protecting against model misspecification as well as reducing sensitivity to the curse of dimensionality. In this work we develop a novel approach for causal dose-response curve estimation that is doubly robust without requiring any parametric assumptions, and which naturally incorporates general off-the-shelf machine learning. We derive asymptotic properties for a kernel-based version of our approach and propose a method for data-driven bandwidth selection. The methods are illustrated via simulation and in a study of the effect of hospital nurse staffing on excess readmissions penalties.
Time and place: 4:30--5:30 pm on Thursday, 18 February 2016, in Baker Hall A51

As always, the talk is free and open to the public.

Posted at February 15, 2016 16:31 | permanent link

## February 11, 2016

### "Optimal Large-Scale Internet Media Selection" (Also Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) wish you could use the power of modern optimization to allocate an online advertising budget, and (2) will be in Pittsburgh on Tuesday.

I can think of at least two ways an Internet Thought Leader could make a splash by combining Tuesday's seminar with Maciej Ceglowski's "The Advertising Bubble". Fortunately for all concerned, I am not an Internet Thought Leader.

Courtney Paulson, "Optimal Large-Scale Internet Media Selection" [PDF preprint]
Abstract: Although Internet advertising is vital in today's business world, research on optimal Internet media selection has been sparse. Firms face considerable challenges in their budget allocation decisions, including the large number of websites they may potentially choose, the vast variation in traffic and costs across websites, and the inevitable correlations in viewership among these sites. Due to these unique features, Internet advertising problems are actually a subset of a more diverse, general class of problems: penalized and constrained optimization. Generally, attempting to select the optimal subset of websites among all possible combinations is an NP-hard problem; as such, existing non-approaches can only handle Internet media selection in settings on the order of ten websites. Further, these approaches are not generalizable. Although generalizable penalized methodology exists to handle large-scale problems, this methodology cannot incorporate natural advertising constraints, such as budget allocation to particular websites or demographic weighting. We propose an optimization method that is computationally feasible to allocate advertising budgets among thousands of websites while also incorporating these common constraints.
The method performs similarly to extant approaches in settings scalable to prior methods, but the method is also flexible enough to accommodate practical Internet advertising considerations such as targeted consumer demographics, mandatory media coverage to matched content websites, and target frequency of ad exposure.
Time and place: 4:30--5:30 pm on Tuesday, 16 February 2016, in Baker Hall A51
Due to winter travel delays, the talk will take place at 4:30 on Wednesday the 17th, room TBA

As always, the talk is free and open to the public.

Posted at February 11, 2016 20:11 | permanent link

## February 10, 2016

### "Robust Bayesian inference via coarsening" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about allocating precise fractions of a whole belief over a set of mathematical models when you know none of them is actually believable, and (2) will be in Pittsburgh on Monday.

As someone who thinks Bayesian inference is only worth considering under mis-specification, next week's first talk is of intense interest.

Jeff Miller, "Robust Bayesian inference via coarsening" (arxiv:1506.06101)
Abstract: The standard approach to Bayesian inference is based on the assumption that the distribution of the data belongs to the chosen model class. However, even a small violation of this assumption can have a large impact on the outcome of a Bayesian procedure, particularly when the data set is large. We introduce a simple, coherent approach to Bayesian inference that improves robustness to small departures from the model: rather than conditioning on the observed data exactly, one conditions on the event that the model generates data close to the observed data, with respect to a given statistical distance.
When closeness is defined in terms of relative entropy, the resulting "coarsened posterior" can be approximated by simply raising the likelihood to a certain fractional power, making the method computationally efficient and easy to implement in practice. We illustrate with real and simulated data, and provide theoretical results.
Time and place: 4 pm on Monday, 15 February 2016, in 125 Scaife Hall

As always, the talk is free and open to the public.

Posted at February 10, 2016 00:24 | permanent link

## February 03, 2016

### "Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if (1) you care about factor analysis and Bayesian nonparametrics, and (2) will be in Pittsburgh on Monday.

Constant readers, knowing of my love-hate relationship with both factor analysis and with Bayesian methods, will appreciate that the only way I could possibly be more ambivalent about our next seminar was if it also involved power-law distributions.

Veronika Ročková, "Fast Bayesian Factor Analysis via Automatic Rotations to Sparsity" [preprint, preprint supplement]
Abstract: Rotational post-hoc transformations have traditionally played a key role in enhancing the interpretability of factor analysis. Regularization methods also serve to achieve this goal by prioritizing sparse loading matrices. In this work, we bridge these two paradigms with a unifying Bayesian framework. Our approach deploys intermediate factor rotations throughout the learning process, greatly enhancing the effectiveness of sparsity inducing priors. These automatic rotations to sparsity are embedded within a PXL-EM algorithm, a Bayesian variant of parameter-expanded EM for posterior mode detection.
By iterating between soft-thresholding of small factor loadings and transformations of the factor basis, we obtain (a) dramatic accelerations, (b) robustness against poor initializations and (c) better oriented sparse solutions. To avoid the pre-specification of the factor cardinality, we extend the loading matrix to have infinitely many columns with the Indian Buffet Process (IBP) prior. The factor dimensionality is learned from the posterior, which is shown to concentrate on sparse matrices. Our deployment of PXL-EM performs a dynamic posterior exploration, outputting a solution path indexed by a sequence of spike-and-slab priors. For accurate recovery of the factor loadings, we deploy the Spike-and-Slab LASSO prior, a two-component refinement of the Laplace prior (Rockova 2015). A companion criterion, motivated as an integral lower bound, is provided to effectively select the best recovery. The potential of the proposed procedure is demonstrated on both simulated and real high-dimensional gene expression data, which would render posterior simulation impractical.
Time and place: 4 pm on Monday, 8 February 2016, in 125 Scaife Hall

As always, the talk is free and open to the public.

Posted at February 03, 2016 22:32 | permanent link

## January 31, 2016

### Books to Read While the Algae Grow in Your Fur, January 2016

Attention conservation notice: I have no taste.

Mark Thompson, The White War: Life and Death on the Italian Front, 1915--1919
A well-told narrative history of the war, mostly from the Italian side. He covers all aspects, from the back-and-forth of the twelve (!) battles of the Isonzo and diplomatic machinations to war literature and the cults of vitalism and "mystical sadism". One of my great-grandfathers was an engineer in the Italian army during this, and a vague tradition of a grossly incompetent, futile conflict had come down to me, but before reading this I had no idea of just how bad it was.
Or just how much the Italian state's conduct of the war helped set the stage for Fascism.
Tremontaine
Mind candy: a "fantasy of manners", combining spherical trigonometry, the chocolate trade, aristocratic intrigue, and the authors toying with the characters' affections. It's a prequel, of sorts, to Ellen Kushner's Swordspoint, which I read long enough ago that I remember only a vague atmosphere.
Apsley Cherry-Garrard, The Worst Journey in the World
A deservedly-classic memoir of the British Antarctic expedition of 1910--1913. The writing is vivid, the conditions described are alternately wondrous and appalling (admittedly, much more appalling than wondrous), and the feats of physical endurance and stoicism remarkable. What's even more astonishing, now, is the sheer futility of it all. "We were primarily a great scientific expedition, with the Pole as our bait for public support, though it was not more important than any other acre of the plateau": but that publicity stunt killed five people, and thoroughly set the agenda for all the rest of the expedition. Or take Cherry-Garrard's titular "worst journey in the world", over a month on foot through the darkness of an Antarctic winter, at temperatures up to a hundred degrees Fahrenheit below freezing; by rights it should have killed the three people who attempted it, and nearly did many times over. It had more of a scientific purpose, namely to collect embryos of the Emperor penguin, but that goal was itself based on a thoroughly bad theory, that the penguins are the most "primitive" of birds, and "If penguins are primitive, it is rational to infer that the most primitive penguin is farthest south". When we talk about science advancing funeral by funeral, this is not what we have in mind.
Near the end of the book, Cherry-Garrard makes a rousing call for creating, and funding, a proper scientific presence in Antarctica; whether this helped lead to the modern British Antarctic Survey I don't know, but I'd like to think so.
There is an essay to be written about the anxieties about British masculinity and national degeneration on display in the opening and concluding chapters. There's another essay to be written about this as a source text for At the Mountains of Madness, for everything from the Antarctic crinoids to the fusion of doomed science and masculinity (*). Probably both of these essays have been written. They'd be worth writing because this is a great book.
ObLinkage: Maciej "Idle Words" Ceglowski, Scott and Scurvy
*: "Poor devils! After all, they were not evil things of their kind. They were the men of another age and another order of being. Nature had played a hellish jest on them .... [P]oor Old Ones! Scientists to the last --- what had they done that we would not have done in their place? God, what intelligence and persistence! What a facing of the incredible, just as those carven kinsmen and forbears had faced things only a little less incredible! Radiates, vegetables, monstrosities, star-spawn --- whatever they had been, they were men!" (At the Mountains of Madness, Chapter 11)
Warren Ellis and Gianluca Pagliarani, Ignition City
Warren Ellis, Declan Shalvey and Jordie Bellaire, Injection, vol. 1
Comic book mind candy. Ignition City is Ellis playing around with space opera of the old Flash Gordon / Buck Rogers mold; it's fun but not much more. Injection goes deeper, to some place where worries that the future is coming at us too fast and that we've somehow used up our ability as a culture to come up with anything new and just recycle old fads meet weird little bits of British folklore, and set off explosions. Also, it's gorgeously drawn.
Veronika Meduna, Secrets of the Ice: Antarctica's Clues to Climate, the Universe, and the Limits of Life (a.k.a. Science on Ice: Discovering the Secrets of Antarctica)
Well-written and serious (but not solemn) popular book about scientific research in Antarctica, especially by scientists from or working in New Zealand, accompanied by tons of beautiful photos. My one complaint is that I kept wanting to know more.
A. J. Lee, $U$-Statistics: Theory and Practice
Suppose we want to estimate some attribute $\theta$ of a probability distribution $F$, and we have available samples $X_1, X_2, \ldots X_n$ drawn iidly from $F$. A fundamental theorem of Halmos's says that $\theta(F)$ has an unbiased estimator iff $\theta(F) = \mathbb{E}_{F}[\psi(X_1, X_2, \ldots X_k)]$ for some function $\psi$ of $k$ variables. A natural estimator would then be $\psi(X_1, X_2, \ldots X_k)$. But another unbiased estimator would be $\psi(X_{n-k+1}, X_{n-k+2}, \ldots X_n)$, and so forth; a natural impulse is to reduce the variance by averaging such estimates together. Furthermore, since the $X_i$ are IID, it shouldn't matter what order we take them in, so a good estimator should be symmetric in its arguments. (Said differently, the order statistics are always sufficient statistics for an IID sample.) The $U$ statistic corresponding to a symmetric kernel function $\psi$ of order $k$ is
$$ U_n \equiv {n \choose k}^{-1} \sum_{i \in (n,k)} \psi(X_{i_1}, \ldots X_{i_k}) $$
where $(n,k)$ runs over all ways of picking $k$ distinct indices from $1:n$. If the space of distributions we're working with is not too small, then $U_n$ is the unique estimator of the corresponding $\theta(F)$ that is both symmetric and unbiased. Moreover, $U_n$ has the minimum variance among all unbiased estimators. (If the original $\psi$ was not symmetric, we can always replace it with a symmetrized version which gives the same $\theta(F)$.) Thus the basic sort of $U$ statistic.
Variants include not summing over all possible $k$-tuples ("incomplete" $U$-statistics), multi-sample $U$-statistics, dependent observations, etc. Unbiasedness is not, in itself, a terribly interesting property; unbiased estimators might, for instance, fail to converge. What matters more is that lots of very natural parameters or functionals can be cast in this form. (I was led to pick this book up because, for a paper, I needed to know about how closely the actual number of edges between two kinds of node in a network would approximate its expectation.)
The terms in the average in a $U$ statistic are dependent on each other, because they share arguments, e.g., $\psi(X_1, X_2)$ will be statistically dependent on $\psi(X_1, X_3)$ and $\psi(X_2, X_{14})$. But this dependence has a nice combinatorial structure, which lets us re-write the $U$ statistic as a sum of uncorrelated terms (the "$H$-decomposition" or "$H$-projection"). The 0th order term in this decomposition is just $\theta$; the first-order corrections are functions of the individual $X_i$ (and so IID); the 2nd order corrections are symmetric functions of pairs of $X_i$s, and so forth. The higher-order terms in this expansion are generally of smaller order than the earlier ones. This in turn lets us give systematic formulas for things like the variance of a $U$ statistic, and in general to port over much of the ordinary IID limit theory without too much trouble.
This book is a good tour of the state of the statistical theory as of 1990. The first chapter covers the most basic facts about $U$ statistics, rather as I've done above. The second chapter deals with variations (including, beyond those I've mentioned, independent but not identically distributed data, sampling from a finite population, and weighting terms in the sum). Chapter 3 covers asymptotics, emphasizing situations where IID methods and results carry over. Chapter 4 covers further generalizations, such as symmetric statistics which are not $U$ statistics.
Chapter 5 is about getting standard errors using the jackknife and the bootstrap. Finally, chapter 6 covers applications beyond those already given as examples in earlier chapters, such as testing distributions for symmetry, and testing pairs of random variables for statistical independence. The writing is clear, the organization is logical, complicated or lengthy proofs get preliminary sketches, and the references are extensive.
Lee's book is a generation old; as such it looks mostly at the classical part of the theory, from its origins in the 1940s to when Lee was writing, which, like most statistical theory of the period, emphasized asymptotics. (All I needed were those asymptotics.) Since then, people in statistical learning have gotten very interested in $U$ statistics because of their relationship to ranking problems. This recent work, however, has emphasized non-asymptotic, finite-$n$ concentration results which simply weren't on Lee's horizon. I don't know that literature well enough to say whether there's a more comprehensive replacement for this book.
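
To make the definition of a complete $U$ statistic concrete, here is a minimal Python sketch (mine, not taken from Lee's book) which just averages a symmetric kernel over every $k$-subset of the sample by brute force. The function names and the toy data are illustrative assumptions. With the order-2 kernel $\psi(x_1, x_2) = (x_1 - x_2)^2/2$, whose expectation is the variance of $F$, the $U$ statistic should coincide with the usual unbiased sample variance, and the last lines compare the two.

```python
from itertools import combinations
from math import comb
from statistics import variance


def u_statistic(sample, kernel, k):
    """Average a symmetric kernel of order k over all k-subsets of the sample."""
    total = sum(kernel(*subset) for subset in combinations(sample, k))
    return total / comb(len(sample), k)


def half_squared_difference(x1, x2):
    """Order-2 kernel whose expectation under F is the variance of F."""
    return 0.5 * (x1 - x2) ** 2


# Toy data, made up purely for illustration.
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]

u_var = u_statistic(data, half_squared_difference, k=2)
u_mean = u_statistic(data, lambda x: x, k=1)  # order-1 kernel: the sample mean

print("U-statistic estimate of the variance:", u_var)
print("statistics.variance (n - 1 divisor): ", variance(data))
```

Brute-force enumeration costs on the order of ${n \choose k}$ kernel evaluations, so this is only sensible for small $n$ and $k$; that cost is one practical motivation for the "incomplete" variants.
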
Karl Marx, Capital: A Critique of Political Economy, vol. I: The Process of Capitalist Production
I read this as a teenager, but the other day, one of the occasional used-book dealers who comes by campus had, in addition to the usual collection of novels from the 1970s and Dover books on mathematics, a stout little ex-library hardback of Capital, volume I. It was (perhaps appropriately) virtually free, and so I found myself moved to buy it, and then to re-read it. My reading notes have grown to over 7000 words, so I'll make them their own post when they're done.
I will say three things: (1) This could have been much shorter, and much clearer, with the benefit of ideas like "equivalence class" --- which Marx couldn't've known about. (2) The labor theory of value seems even less plausible to me now than it did as a teenager. (Back then, I suspected there were arguments for it which I was missing; now I see there aren't any.) (3) I hereby apologize to those of my humanities and social-studies teachers who I nonetheless trolled with labor-theory arguments.
C. J. Lyons, Blood Stained; Kill Zone; Hard Fall; Fight Dirty
Sequels to Snake Skin; mind candy, or perhaps mind popcorn. I confess I might not have kept up so far were it not for the perverse pleasure of seeing Lyons (and her characters) wreak havoc on familiar Pittsburgh neighborhoods. (*)
*: Spoiler-y nit-picking about Kill Zone and sequels: I am not going to complain about the Abominable Afghan Antagonist in Kill Zone; if anything, I regard our graduating from faceless henchmen to the ranks of active villains as a mark of progress. (Though I note this is the second novel in which I've encountered an Afghan immigrant to Pittsburgh, and the second in which they're a plotting master-mind --- is there some local news story I missed?) I also don't object to the fact that the body count in the third book is explicitly over 60 deaths in one night, and by my rough count could easily have been 120. This would put the death toll in range of the Oklahoma City bombing, and indeed last year the whole of Allegheny County had only 108 homicides. So this would be a whole year's worth of killing in one night, and, per the story, not a year's worth of, so to speak, ordinary personal and criminal killing, but a targeted attack on the institutions and personnel of government (like Oklahoma City). To imagine that this wouldn't have cataclysmic political consequences for the whole nation is absurd --- hell, Lyons's characters talk about how it will have such consequences. And then in the subsequent books, none of those consequences follow. I still read, and enjoyed, those books, but I am a bit offended by the shoddy world-building. (Cf. Timothy Burke on lack of consequences in comic books.)
Keith Sawyer, Group Genius: The Creative Power of Collaboration
For the most part, this is pretty good popular social science about the social psychology and sociology of group creativity and problem solving. It appears, however, to be pitched as a business-advice book, which leads to a certain amount of "we are told by Science! to do X", when what science actually says is a lot more ambiguous. (E.g., group creativity is almost certainly not maximized by a particular mean degree in social networks.) Also, it leads him to take the perspective of employers and corporations, as opposed to individual research workers (*). So I guess I'm making the usual reviewer's complaint of wishing that Sawyer had written a different, more academic book, as opposed to the one he wanted to write. But I learned some interesting things from it, and if I was a newcomer to this area I'd have learned quite a bit.
*: E.g., when companies use websites where they throw out problems to freelance researchers and pay for one successful solution, they are shifting the risk and uncertainty of the research process on to all the individuals who try to find solutions. This is obviously good for the corporations, but bad for the researchers. (It also offers the company more bargaining power against their in-house researchers, since it improves the company's disagreement payoff.)

Posted at January 31, 2016 23:59 | permanent link

## January 27, 2016

### "Application of High-dimensional Linear Regression with Gaussian Design to Communication" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if (1) you care about the intersection of high-dimensional statistics with information theory, and (2) will be in Pittsburgh next Wednesday.

It is, perhaps, only appropriate that the first statistics seminar of the semester is about connections between high-dimensional regression, and limits on how fast information can be sent over noisy channels.

Cynthia Rush, "Application of High-dimensional Linear Regression with Gaussian Design to Communication"
Abstract: The use of smart devices and wireless networks is ubiquitous, creating a pressing need for low-complexity communications schemes that reliably deliver high data rates. In this talk, I demonstrate how I analyze the task of communicating over a noisy channel through the statistical framework of high-dimensional linear regression with Gaussian design and sparse coefficient vectors. Through this analysis, I show that theoretical bounds on the rate at which information can be communicated across a channel inform us about the minimum sample size necessary for successful support recovery, and I introduce my work on the use of computationally efficient iterative algorithms to solve such high-dimensional regression tasks.
Time and place: 4--5 pm on Wednesday, 3 February 2016, in 125 Scaife Hall

As always, the talk is free and open to the public.

Posted at January 27, 2016 18:24 | permanent link

## January 09, 2016

### 36-401, Modern Regression, Fall 2015: Reflections and Lessons Learned

Attention conservation notice: Navel-gazing by an academic.

This was my first time teaching our undergraduate course on linear models ("401"). I've taught the course which follows it (402) four times, and re-designed it once, but I've never had to actually take the students through the pre-req. They come in with courses on probability, on statistical inference, and on linear algebra, but usually no real experience with data analysis. Linear regression is usually their first time trying to connect statistical models to actual data — as well as learning about how linear regression works.

I am OK with how I did, but only about OK. The three big issues I need to work on are (1) connecting theory to practice, (2) getting feedback to students faster, and (3) better assignments.

(1) I feel like I did not strike a good balance, in lecture, between theory, computational examples, and how theory guides practice. The last thing I want to do is turn out people who just (think they) know which commands to run in R, without understanding what's actually going on. (As a student put it to a colleague in a previous semester, "The difference between 401 and econometrics is that in econometrics we have to know how to do all this stuff, and in 401 we also have to know why." This was not, I believe, intended as a compliment.) But based on the student evaluations, and still more the assignments, there're still students who are a bit fuzzy about what "holding all other predictor variables constant" actually means in a linear model. But then again, based on student feedback I persistently have a problem connecting mathematical theory to data-analytic practice; more serious re-thinking of how I teach may be in order.

(2) Students need faster and more consistent feedback on their assignments. We were somewhat constrained on speed this semester by a labor shortage, but I could have done more to ensure consistency across graders.

(3) Too many of the assignments were based on small, old data sets from the textbook. Mea culpa.

This was the first time we had two sections of 401, with two separate professors. I think we did OK at coordinating them, and I take full responsibility for all the failures and glitches. (I should add, because I know some of the students read this, that grades were curved and calculated completely independently across the two sections.)

I am very grateful for the work done on designing the curriculum for this course by my colleagues. Still, I feel like a lot of the course was spent on (to be slightly unfair) special cases which people could work out in closed form in the 1920s, and pretending that they had relevance to actual data analysis. (Cf.) The Kids do need at least a nodding acquaintance with that stuff, because people will expect it of them, but I would rather they be taught it as a nice bonus rather than a default. This would mean a lot more re-design than I put into the course.

Relatedly, I came to have a thorough, almost personal, dislike of the textbook, but that's another story.

Some things which did go well:

• Using Piazza for question-answering. (Thanks to Brendan O'Connor for pushing it on me.) Students were allowed to be anonymous to each other, but not to me or to the TAs, and this seemed to make sure there were no issues with trolling or general viciousness. (My plan of assigning them names of fossil animals as persistent pseudonyms proved too cumbersome to try.) Since when one student had a question, others usually had the same question, and they were good about reading what was posted, this drastically cut down on the amount of time I spent answering e-mail. (Concretely: I wrote under 600 e-mails for this class, compared to over 1000 last semester.)
Of course, since Piazza isn't charging me or my students or CMU, I am sure they have some Cunning Plan from which we will not benefit. But I will keep using the service until they break it, or their nefarious schemes come to hideous fruition.
• Encouraging the use of R Markdown. (I will make it mandatory for 402, and mandatory if I ever teach 401 again.)
• Insisting on exploratory data analysis. (Teaching them to be selective about which parts of their EDA they report needs more practice on my part.)
• I think it got through to everyone that the usual significance tests are only appropriate if the model is well-specified, so that parametric inference comes after model checking. Indeed, the former is meaningless if the model fits badly. (I also think most of them, but not quite all, got that p-values tell you nothing about which variables are important.)
• Most of them got cross-validation reasonably well, especially leave-one-out.
• The assignments based on genuine data sets seem to have gone pretty well.
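
An aside on the leave-one-out point above, for readers who have not met it: for ordinary least squares, the leave-one-out errors can be had without refitting the model $n$ times, because each held-out residual equals the ordinary residual divided by one minus that point's leverage. What follows is a minimal sketch in plain Python for a single predictor, comparing the shortcut to brute-force refitting. The data are made up and the function names are hypothetical; this is not code from the course.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1


def loocv_brute_force(xs, ys):
    """Refit n times, each time predicting the single held-out point."""
    squared_errors = []
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        squared_errors.append((ys[i] - (b0 + b1 * xs[i])) ** 2)
    return sum(squared_errors) / len(squared_errors)


def loocv_shortcut(xs, ys):
    """One fit: held-out residual = residual / (1 - leverage)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b0, b1 = fit_line(xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - x_bar) ** 2 / sxx
        total += ((y - (b0 + b1 * x)) / (1.0 - leverage)) ** 2
    return total / n


# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9]

print("LOOCV MSE, brute force:", loocv_brute_force(xs, ys))
print("LOOCV MSE, shortcut:   ", loocv_shortcut(xs, ys))
```

The two numbers should agree up to floating-point error. The shortcut is specific to linear smoothers such as least squares, so it says nothing about models that have to be refit from scratch.
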

I'll indulge myself by ending on an "achievement unlocked" note. This was (so far as I know) the first class I've taught where a student's response to one of my lectures was to ask Reddit "Is there any truth to this?". There can be few better proofs that I reached at least one of my students and inspired them to think critically about the material. I am being quite serious when I say that I wish something like this happened every week in every course.

Posted at January 09, 2016 22:38 | permanent link

### Course Announcement: 36-402, Advanced Data Analysis, Spring 2016

Attention conservation notice: Only relevant if you are a student at Carnegie Mellon University, or have a pathological fondness for reading lecture notes on statistics.

In the so-called spring, I will again be teaching 36-402 / 36-608, undergraduate advanced data analysis:

The goal of this class is to train you in using statistical models to analyze data — as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

During the class, you will do data analyses with existing software, and write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.

Graduate students from other departments wishing to take this course should register for it under the number "36-608". Enrollment for 36-608 is very limited, and by permission of the professors only.

Prerequisites: 36-401, with a grade of C or better. Exceptions are only granted for graduate students in other departments taking 36-608.

This will be my fifth time teaching 402, and the fifth time where the primary text is the draft of Advanced Data Analysis from an Elementary Point of View. (I hope my editor will believe that I don't intend for my revisions to illustrate Zeno's paradox.) It is the first time I will be co-teaching with the lovely and talented Max G'Sell.

Unbecoming whining: 402 will be larger this year than last, just like it has been every year I've been here. This year, in fact, we'll have over 150 students in it, or about 1/50 of all CMU undergrads. (This has nothing to do with my teaching, and everything to do with our student population.) I think it's great that we're teaching what would be masters-level material at most schools to so many juniors and seniors, but I don't think we'll be able to keep doubling every five years without either having a lot of stuff break, or transforming the nature of the course yet again. It's clearly a better problem to have than "class sizes are halving every five years"*, but it's still a problem.

*: As I have said in a number of conversations over recent years, the nightmare scenario for statistics vs. "data science" is that statistics becomes a sort of mathematical analog to classics. People might pay lip-service to our value, especially people who are invested in pretending to intellectual rigor, but few would actually pay attention to anything we have to say.

Posted at January 09, 2016 22:00 | permanent link