## January 31, 2016

### Books to Read While the Algae Grow in Your Fur, January 2016

Attention conservation notice: I have no taste.

Mark Thompson, The White War: Life and Death on the Italian Front, 1915--1919
A well-told narrative history of the war, mostly from the Italian side. He covers all aspects, from the back-and-forth of the twelve (!) battles of the Isonzo and diplomatic machinations to war literature and the cults of vitalism and "mystical sadism". One of my great-grandfathers was an engineer in the Italian army during this, and a vague tradition of a grossly incompetent, futile conflict had come down to me, but before reading this I had no idea of just how bad it was. Or just how much the war helped set the stage for Fascism.
Tremontaine
Mind candy: a "fantasy of manners", combining spherical trigonometry, the chocolate trade, aristocratic intrigue, and the authors toying with the characters' affections. It's a prequel, of sorts, to Ellen Kushner's Swordspoint, which I read long enough ago that I remember only a vague atmosphere.
Apsley Cherry-Garrard, The Worst Journey in the World
A deservedly-classic memoir of the British Antarctic expedition of 1910--1913. The writing is vivid, the conditions described are alternately wondrous and appalling (admittedly, much more appalling than wondrous), and the feats of physical endurance and stoicism remarkable. What's even more astonishing, now, is the sheer futility of it all. "We were primarily a great scientific expedition, with the Pole as our bait for public support, though it was not more important than any other acre of the plateau": but that publicity stunt killed five people, and thoroughly set the agenda for all the rest of the expedition. Or take Cherry-Garrard's titular "worst journey in the world", over a month on foot through the darkness of an Antarctic winter, at temperatures up to a hundred degrees Fahrenheit below freezing; by rights it should have killed the three people who attempted it, and nearly did many times over. It had more of a scientific purpose, namely to collect embryos of the Emperor penguin, but that goal was itself based on a thoroughly bad theory, that the penguins are the most "primitive" of birds, and "If penguins are primitive, it is rational to infer that the most primitive penguin is farthest south". When we talk about science advancing funeral by funeral, this is not what we have in mind.
Near the end of the book, Cherry-Garrard makes a rousing call for creating, and funding, a proper scientific presence in Antarctica; whether this helped lead to the modern British Antarctic Survey I don't know, but I'd like to think so.
There is an essay to be written about the anxieties about British masculinity and national degeneration on display in the opening and concluding chapters. There's another essay to be written about this as a source text for At the Mountains of Madness, for everything from the Antarctic crinoids to the fusion of doomed science and masculinity (*). Probably both of these essays have been written. They'd be worth writing because this is a great book.
ObLinkage: Maciej "Idle Words" Ceglowski, Scott and Scurvy
*: "Poor devils! After all, they were not evil things of their kind. They were the men of another age and another order of being. Nature had played a hellish jest on them .... [P]oor Old Ones! Scientists to the last — what had they done that we would not have done in their place? God, what intelligence and persistence! What a facing of the incredible, just as those carven kinsmen and forbears had faced things only a little less incredible! Radiates, vegetables, monstrosities, star-spawn — whatever they had been, they were men!" (At the Mountains of Madness, Chapter 11)
Warren Ellis and Gianluca Pagliarani, Ignition City
Warren Ellis, Declan Shalvey and Jordie Bellaire, Injection, vol. 1
Comic book mind candy. Ignition City is Ellis playing around with space opera of the old Flash Gordon / Buck Rogers mold; it's fun but not much more. Injection goes deeper, to some place where worries that the future is coming at us too fast and that we've somehow used up our ability as a culture to come up with anything new and just recycle old fads meets weird little bits of British folklore, and sets off explosions. Also, it's gorgeously drawn.
Veronika Meduna, Secrets of the Ice: Antarctica's Clues to Climate, the Universe, and the Limits of Life (a.k.a. Science on Ice: Discovering the Secrets of Antarctica)
Well-written and serious (but not solemn) popular book about scientific research in Antarctica, especially by scientists from or working in New Zealand, accompanied by tons of beautiful photos. My one complaint is that I kept wanting to know more.
A. J. Lee, $U$-Statistics: Theory and Practice
Suppose we want to estimate some attribute $\theta$ of a probability distribution $F$, and we have available samples $X_1, X_2, \ldots X_n$ drawn iidly from $F$. A fundamental theorem of Halmos's says that $\theta(F)$ has an unbiased estimator iff $\theta(F) = \mathbb{E}_{F}[\psi(X_1, X_2, \ldots X_k)]$ for some function $\psi$ of $k$ variables. A natural estimator would then be $\psi(X_1, X_2, \ldots X_k)$. But another unbiased estimator would be $\psi(X_{n-k+1}, X_{n-k+2}, \ldots X_n)$, and so forth; a natural impulse is to reduce the variance by averaging such estimates together. Furthermore, since the $X_i$ are IID, it shouldn't matter what order we take them in, so a good estimator should be symmetric in its arguments. (Said differently, the order statistics are always sufficient statistics for an IID sample.)
The $U$ statistic corresponding to a symmetric kernel function $\psi$ of order $k$ is $U_n \equiv {n \choose k}^{-1} \sum_{i \in (n,k)} {\psi(X_{i_1}, \ldots X_{i_k})}$ where $(n,k)$ runs over all ways of picking $k$ distinct indices from $1:n$. If the space of distributions we're working with is not too small, then $U_n$ is the unique estimator of the corresponding $\theta(F)$ that is both symmetric and unbiased. Moreover, $U_n$ has the minimum variance among all unbiased estimators. (If the original $\psi$ was not symmetric, we can always replace it with a symmetrized version which gives the same $\theta(F)$.) Thus the basic sort of $U$ statistic. Variants include not summing over all possible $k$-tuples ("incomplete" $U$-statistics), multi-sample $U$-statistics, dependent observations, etc.
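To make the definition concrete, here is a minimal sketch (mine, not from Lee's book) which averages a symmetric kernel over all $k$-subsets of a sample by brute force. The function names `u_statistic` and `variance_kernel` are made up for illustration, and the simulated Gaussian data are just a stand-in. The kernel $\psi(x,y) = (x-y)^2/2$ has expectation $\mathrm{Var}[X]$, and its $U$ statistic should coincide with the usual sample variance with the $n-1$ denominator.

```python
from itertools import combinations
from math import comb
import random
import statistics


def u_statistic(sample, kernel, k):
    """Average a symmetric kernel of order k over all k-subsets of the sample.

    Brute force: costs (n choose k) kernel evaluations, so only for small n, k.
    """
    n = len(sample)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    total = sum(kernel(*subset) for subset in combinations(sample, k))
    return total / comb(n, k)


def variance_kernel(x, y):
    # Symmetric kernel of order 2 whose expectation is Var(X)
    return 0.5 * (x - y) ** 2


# Illustrative data only: 30 draws from a Gaussian with standard deviation 2
random.seed(42)
xs = [random.gauss(0.0, 2.0) for _ in range(30)]

u_var = u_statistic(xs, variance_kernel, 2)
s2 = statistics.variance(xs)  # sample variance with the n-1 denominator
u_mean = u_statistic(xs, lambda x: x, 1)  # order-1 kernel: the sample mean

print(u_var, s2)
```

The two printed numbers should agree up to floating-point error. For large $n$ or $k$ the full sum becomes expensive, which is one motivation for the "incomplete" variants mentioned above.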
Unbiasedness is not, in itself, a terribly interesting property; unbiased estimators might, for instance, fail to converge. What matters more is that lots of very natural parameters or functionals can be cast in this form. (I was led to pick this book up because, for a paper, I needed to know about how closely the actual number of edges between two kinds of node in a network would approximate its expectation.) The terms in the average in a $U$ statistic are dependent on each other, because they share arguments, e.g., $\psi(X_1, X_2)$ will be statistically dependent on $\psi(X_1, X_3)$ and $\psi(X_2, X_{14})$. But this dependence has a nice combinatorial structure, which lets us re-write the $U$ statistic as a sum of uncorrelated terms (the "$H$-decomposition" or "$H$-projection"). The 0th order term in this decomposition is just $\theta$; the first-order corrections are functions of the individual $X_i$ (and so IID); the 2nd order corrections are symmetric functions of pairs of $X_i$s, and so forth. The higher-order terms in this expansion are generally of smaller order than the earlier ones. This in turn lets us give systematic formulas for things like the variance of a $U$ statistic, and in general to port over much of the ordinary IID limit theory without too much trouble.
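For concreteness, here is the standard form of the variance formula this leads to, which goes back to Hoeffding. The notation $\psi_c$ and $\sigma_c^2$ is my own shorthand, and may not match Lee's. Define the conditional expectations of the kernel given its first $c$ arguments, and their variances:

$$\psi_c(x_1, \ldots x_c) \equiv \mathbb{E}_{F}[\psi(x_1, \ldots x_c, X_{c+1}, \ldots X_k)], \qquad \sigma_c^2 \equiv \mathrm{Var}[\psi_c(X_1, \ldots X_c)]$$

Then, assuming $\psi$ has a finite second moment,

$$\mathrm{Var}[U_n] = {n \choose k}^{-1} \sum_{c=1}^{k} {k \choose c} {n-k \choose k-c} \sigma_c^2 = \frac{k^2 \sigma_1^2}{n} + O(n^{-2})$$

The leading term is exactly the variance of the first-order part of the decomposition, $\frac{k}{n}\sum_{i=1}^{n}{(\psi_1(X_i) - \theta)}$, which is an average of IID terms; this is why, when $\sigma_1^2 > 0$, the ordinary central limit theorem carries over.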
This book is a good tour of the state of the statistical theory as of 1990. The first chapter covers the most basic facts about $U$ statistics, rather as I've done above. The second chapter deals with variations (including, beyond those I've mentioned, independent but not identically distributed data, sampling from a finite population, and weighting terms in the sum). Chapter 3 covers asymptotics, emphasizing situations where IID methods and results carry over. Chapter 4 covers further generalizations, such as symmetric statistics which are not $U$ statistics. Chapter 5 is about getting standard errors using the jackknife and the bootstrap. Finally, chapter 6 covers applications beyond those already given as examples in earlier chapters, such as testing distributions for symmetry, and testing pairs of random variables for statistical independence. The writing is clear, the organization is logical, complicated or lengthy proofs get preliminary sketches, and the references are extensive.
Lee's book is a generation old; as such it looks mostly at the classical part of the theory, from its origins in the 1940s to when Lee was writing, which, like most statistical theory of the period, emphasized asymptotics. (All I needed were those asymptotics.) Since then, people in statistical learning have gotten very interested in $U$ statistics because of their relationship to ranking problems. This recent work, however, has emphasized non-asymptotic, finite-$n$ concentration results which simply weren't on Lee's horizon. I don't know that literature well enough to say whether there's a more comprehensive replacement for this book.
Karl Marx, Capital: A Critique of Political Economy, vol. I: The Process of Capitalist Production
I read this as a teenager, but the other day, one of the occasional used-book dealers who comes by campus had, in addition to the usual collection of novels from the 1970s and Dover books on mathematics, a stout little ex-library hardback of Capital, volume I. It was (perhaps appropriately) virtually free, and so I found myself moved to buy it, and then to re-read it. My reading notes have grown to over 7000 words, so I'll make them their own post when they're done.
I will say three things: (1) This could have been much shorter, and much clearer, with the benefit of ideas like "equivalence class" --- which Marx couldn't've known about. (2) The labor theory of value seems even less plausible to me now than it did as a teenager. (Back then, I suspected there were arguments for it which I was missing; now I see there aren't any.) (3) I hereby apologize to those of my humanities and social-studies teachers who I nonetheless trolled with labor-theory arguments.
C. J. Lyons, Blood Stained; Kill Zone; Hard Fall; Fight Dirty
Sequels to Snake Skin; mind candy, or perhaps mind popcorn. I confess I might not have kept up so far were it not for the perverse pleasure of seeing Lyons (and her characters) wreak havoc on familiar Pittsburgh neighborhoods. (*)
Spoiler-y nit-picking about Kill Zone and sequels: I am not going to complain about the Abominable Afghan Antagonist in Kill Zone; if anything, I regard our graduating from faceless henchmen to the ranks of active villains as a mark of progress. (Though I note this is the second novel in which I've encountered an Afghan immigrant to Pittsburgh, and the second in which they're a plotting master-mind --- is there some local news story I missed?) I also don't object to the fact that the body count in the third book is explicitly over 60 deaths in one night, and by my rough count could easily have been 120. This would put the death toll in range of the Oklahoma City bombing, and indeed last year the whole of Allegheny County had only 108 homicides. So this would be a whole year's worth of killing in one night, and, per the story, not a year's worth of, so to speak, ordinary personal and criminal killing, but a targeted attack on the institutions and personnel of government (like Oklahoma City). To imagine that this wouldn't have cataclysmic political consequences for the whole nation is absurd --- hell, Lyons's characters talk about how it will have such consequences. And then in the subsequent books, none of those consequences follow. I still read, and enjoyed, those books, but I am a bit offended by the shoddy world-building. (Cf. Timothy Burke on lack of consequences in comic books.)
Keith Sawyer, Group Genius: The Creative Power of Collaboration
For the most part, this is pretty good popular social science about the social psychology and sociology of group creativity and problem solving. It appears, however, to be pitched as a business-advice book, which leads to a certain amount of "we are told by Science! to do X", when what science actually says is a lot more ambiguous. (E.g., group creativity is almost certainly not maximized by a particular mean degree in social networks.) Also, it leads him to take the perspective of employers and corporations, as opposed to individual research workers (*). So I guess I'm making the usual reviewer's complaint of wishing that Sawyer had written a different, more academic book, as opposed to the one he wanted to write. But I learned some interesting things from it, and if I were a newcomer to this area I'd have learned quite a bit.
*: E.g., when companies use websites where they throw out problems to freelance researchers and pay for one successful solution, they are shifting the risk and uncertainty of the research process on to all the individuals who try to find solutions. This is obviously good for the corporations, but bad for the researchers. (It also offers the company more bargaining power against their in-house researchers, since it improves the company's disagreement payoff.)

Posted at January 31, 2016 23:59 | permanent link

## January 27, 2016

### "Application of High-dimensional Linear Regression with Gaussian Design to Communication" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if (1) you care about the intersection of high-dimensional statistics with information theory, and (2) will be in Pittsburgh next Wednesday.

It is, perhaps, only appropriate that the first statistics seminar of the semester is about connections between high-dimensional regression, and limits on how fast information can be sent over noisy channels.

Cynthia Rush, "Application of High-dimensional Linear Regression with Gaussian Design to Communication"
Abstract: The use of smart devices and wireless networks is ubiquitous, creating a pressing need for low-complexity communications schemes that reliably deliver high data rates. In this talk, I demonstrate how I analyze the task of communicating over a noisy channel through the statistical framework of high-dimensional linear regression with Gaussian design and sparse coefficient vectors. Through this analysis, I show that theoretical bounds on the rate at which information can be communicated across a channel inform us about the minimum sample size necessary for successful support recovery, and I introduce my work on the use of computationally efficient iterative algorithms to solve such high-dimensional regression tasks.
Time and place: 4--5 pm on Wednesday, 3 February 2016, in 125 Scaife Hall

As always, the talk is free and open to the public.

Posted at January 27, 2016 18:24 | permanent link

## January 09, 2016

### 36-401, Modern Regression, Fall 2015: Reflections and Lessons Learned

Attention conservation notice: Navel-gazing by an academic.

This was my first time teaching our undergraduate course on linear models ("401"). I've taught the course which follows it (402) four times, and re-designed it once, but I've never had to actually take the students through the pre-req. They come in with courses on probability, on statistical inference, and on linear algebra, but usually no real experience with data analysis. Linear regression is usually their first time trying to connect statistical models to actual data — as well as learning about how linear regression works.

I am OK with how I did, but only about OK. The three big issues I need to work on are (1) connecting theory to practice, (2) getting feedback to students faster, and (3) better assignments.

(1) I feel like I did not strike a good balance, in lecture, between theory, computational examples, and how theory guides practice. The last thing I want to do is turn out people who just (think they) know which commands to run in R, without understanding what's actually going on. (As a student put it to a colleague in a previous semester, "The difference between 401 and econometrics is that in econometrics we have to know how to do all this stuff, and in 401 we also have to know why." This was not, I believe, intended as a compliment.) But based on the student evaluations, and still more the assignments, there're still students who are a bit fuzzy about what "holding all other predictor variables constant" actually means in a linear model. But then again, based on student feedback I persistently have a problem connecting mathematical theory to data-analytic practice; more serious re-thinking of how I teach may be in order.

(2) Students need faster and more consistent feedback on their assignments. We were somewhat constrained on speed this semester by a labor shortage, but I could have done more to ensure consistency across graders.

(3) Too many of the assignments were based on small, old data sets from the textbook. Mea culpa.

This was the first time we had two sections of 401, with two separate professors. I think we did OK at coordinating them, and I take full responsibility for all the failures and glitches. (I should add, because I know some of the students read this, that grades were curved and calculated completely independently across the two sections.)

I am very grateful for the work done on designing the curriculum for this course by my colleagues. Still, I feel like a lot of the course was spent on (to be slightly unfair) special cases which people could work out in closed form in the 1920s, and pretending that they had relevance to actual data analysis. (Cf.) The Kids do need at least a nodding acquaintance with that stuff, because people will expect it of them, but I would rather they be taught it as a nice bonus rather than a default. This would mean a lot more re-design than I put into the course.

Relatedly, I came to have a thorough, almost personal, dislike of the textbook, but that's another story.

Some things which did go well:

• Using Piazza for question-answering. (Thanks to Brendan O'Connor for pushing it on me.) Students were allowed to be anonymous to each other, but not to me or to the TAs, and this seemed to make sure there were no issues with trolling or general viciousness. (My plan of assigning them names of fossil animals as persistent pseudonyms proved too cumbersome to try.) Since when one student had a question, others usually had the same question, and they were good about reading what was posted, this drastically cut down on the amount of time I spent answering e-mail. (Concretely: I wrote under 600 e-mails for this class, compared to over 1000 last semester.)
Of course, since Piazza isn't charging me or my students or CMU, I am sure they have some Cunning Plan from which we will not benefit. But I will keep using the service until they break it, or their nefarious schemes come to hideous fruition.
• Encouraging the use of R Markdown. (I will make it mandatory for 402, and mandatory if I ever teach 401 again.)
• Insisting on exploratory data analysis. (Teaching them to be selective about which parts of their EDA they report needs more practice on my part.)
• I think it got through to everyone that the usual significance tests are only appropriate if the model is well-specified, so that parametric inference comes after model checking. Indeed, the former is meaningless if the model fits badly. (I also think most of them, but not quite all, got that p-values tell you nothing about which variables are important.)
• Most of them got cross-validation reasonably well, especially leave-one-out.
• The assignments based on genuine data sets seem to have gone pretty well.

I'll indulge myself by ending on an "achievement unlocked" note. This was (so far as I know) the first class I've taught where a student's response to one of my lectures was to ask Reddit "Is there any truth to this?". There can be few better proofs that I reached at least one of my students and inspired them to think critically about the material. I am being quite serious when I say that I wish something like this happened every week in every course.

Posted at January 09, 2016 22:38 | permanent link

### Course Announcement: 36-402, Advanced Data Analysis, Spring 2016

Attention conservation notice: Only relevant if you are a student at Carnegie Mellon University, or have a pathological fondness for reading lecture notes on statistics.

In the so-called spring, I will again be teaching 36-402 / 36-608, undergraduate advanced data analysis:

The goal of this class is to train you in using statistical models to analyze data — as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

During the class, you will do data analyses with existing software, and write your own simple programs to implement and extend key techniques. You will also have to write reports about your analyses.

Graduate students from other departments wishing to take this course should register for it under the number "36-608". Enrollment for 36-608 is very limited, and by permission of the professors only.

Prerequisites: 36-401, with a grade of C or better. Exceptions are only granted for graduate students in other departments taking 36-608.

This will be my fifth time teaching 402, and the fifth time where the primary text is the draft of Advanced Data Analysis from an Elementary Point of View. (I hope my editor will believe that I don't intend for my revisions to illustrate Zeno's paradox.) It is the first time I will be co-teaching with the lovely and talented Max G'Sell.

Unbecoming whining: 402 will be larger this year than last, just like it has been every year I've been here. This year, in fact, we'll have over 150 students in it, or about 1/50 of all CMU undergrads. (This has nothing to do with my teaching, and everything to do with our student population.) I think it's great that we're teaching what would be masters-level material at most schools to so many juniors and seniors, but I don't think we'll be able to keep doubling every five years without either having a lot of stuff break, or transforming the nature of the course yet again. It's clearly a better problem to have than "class sizes are halving every five years"*, but it's still a problem.

*: As I have said in a number of conversations over recent years, the nightmare scenario for statistics vs. "data science" is that statistics becomes a sort of mathematical analog to classics. People might pay lip-service to our value, especially people who are invested in pretending to intellectual rigor, but few would actually pay attention to anything we have to say.

Posted at January 09, 2016 22:00 | permanent link