May 31, 2015

Books to Read While the Algae Grow in Your Fur, May 2015

Attention conservation notice: I have no taste.

Cixin Liu, The Three-Body Problem (translated by Ken Liu [no relation])
A really remarkably engrossing novel of first contact. (I will refer you to James Nicoll for plot summary.) As a novel of first contact, I think it bears comparison to some of the classics, like War of the Worlds and His Master's Voice: it realizes that aliens will be alien, and that however transformative contact might be, people will continue to be human, and to react in human ways.
— It has a lot more affinities with Wolf Totem than I would have guessed --- both a recognizably similar mode of narration, and, oddly, some of the content — educated youths rusticated to Inner Mongolia during the Cultural Revolution, environmental degradation there, and nascent environmentalism. Three-Body Problem works these into something less immediately moving, but perhaps ultimately much grimmer, than Wolf Totem. I say "perhaps" because there are sequels, coming out in translations, which I very eagerly look forward to.
Elif Shafak, The Architect's Apprentice
Historical fiction, centered on the great Ottoman architect Sinan, but told from the viewpoint of one of his apprentices. I am sure that I missed a lot of subtleties, and I half-suspect that there are allusions to current Turkish concerns which are completely over my head. (E.g., the recurrence of squatters crowding into Istanbul from the country-side seems like it might mean something...) Nonetheless, I enjoyed it a lot as high-class mind candy, and will look for more from Shafak.
ROT-13'd for spoilers: Ohg jung ba Rnegu jnf hc jvgu gur fhqqra irre vagb snagnfl --- pbagntvbhf phefrf bs vzzbegnyvgl, ab yrff! --- ng gur raq?
Barry Eichengreen, Hall of Mirrors: The Great Depression, The Great Recession, and the Uses — and Misuses — of History [Author's book site]
What it says on the label: a parallel history of the Great Depression and the Great Recession, especially in the US, and of how historical memories (including historical memories recounted as economic theories) of the former shaped the response to the latter.
If anyone actually believed in conservatism, a conservative paraphrase of Eichengreen would run something as follows: back in the day, when our ancestors came face to face with the consequences of market economies run amok, our forefathers (and foremothers) created, through a process of pragmatic trial and error, a set of institutions which allowed for an unprecedented period of stable and shared prosperity. Eventually, however, there arose an improvident generation (mine, and my parents') with no respect for the wisdom of its ancestors, enthralled by abstract theories, a priori ideologies, and Utopian social engineering, which systematically dismantled or subverted those institutions. In the fullness of time, they reaped what they had sown, namely a crisis, and a series of self-inflicted economic wounds, which had no precedent for fully eighty years. Enough of the ancestors' works remained intact that the results were merely awful, however, rather than the sort of utter disaster which could lead to substantial reform, or reconsideration of ideas. And here we are.
(Thanks to IB and ZMS for a copy of this.)
David Danks, Unifying the Mind: Cognitive Representations as Graphical Models
This book may have the most Carnegie Mellon-ish title ever.
Danks's program in this book is to argue that large chunks of cognitive psychology might be unified not by employing a common mental process, or kind of process, but because they use the same representations, which take the form of (mostly) directed acyclic graphical models, a.k.a. graphical causal models. In particular, he suggests that representations of this form (i) give a natural solution to the "frame problem" and other problems of determining relevance, (ii) could be shared across very different sorts of processes, and (iii) make many otherwise puzzling isolated results into natural consequences. The three domains he looks at in detail are causal cognition (*), concept formation and application, and decision-making, with hopes that this sort of representation might apply elsewhere. Danks does not attempt any very direct mapping of the relevant graphical models on to the aspects of neural activity we can currently record; this strikes me as wise, given how little we know about psychology today, and how crude our measurements of brain activity are.
Disclaimer: Danks is a faculty colleague at CMU, I know him slightly, and he has worked closely with several friends of mine (e.g.). It would have been rather awkward for me to write a very negative review of his book, but not awkward at all to have not reviewed it in the first place.
*: Interestingly to me, Danks takes it for granted that we (a) have immediate perceptions of causal relations, which (b) are highly fallible, and (c) in any case conform so poorly to the rules of proper causal models that we shouldn't try to account for them with graphical models. I wish the book had elaborated on this, or at least on (a) and (c).
F. Gregory Ashby, Statistical Analysis of fMRI Data
This is another textbook introduction, like Poldrack, Mumford and Nichols, so I'll describe it by contrast. Ashby gives very little space to actual data acquisition and pre-processing; he's mostly about what you do once you've got your data loaded into Matlab. (To be fair, this book apparently began as the text for one of two linked classes, and the other covered the earlier parts of the pipeline.) The implied reader is, evidently, a psychologist, who knows linear regression and ANOVA (and remembers there's some sort of link between them), and has a truly unholy obsession with testing whether particular coefficients are exactly zero. (I cannot recall a single confidence interval, or even a standard error, in the whole book.) Naturally enough, this makes voxel-wise linear models the main pillars of Ashby's intellectual structure. This also explains why he justifies removing artifacts, cleaning out systematic noise, etc., not as avoiding substantive errors, but as making one's results "more significant". (I suspect this is a sound reflection of the incentives facing his readers.) To be fair, he does give very detailed presentations of the multiple-testing problem, and even ventures into Fourier analysis to look at "coherence" (roughly, the correlation between two time series at particular frequencies), Granger causality, and principal and independent component analysis [1].
This implied reader is OK with algebra and some algebraic manipulations, but needs to have their hand held a lot. Which is fine. What is less fine are the definite errors which Ashby makes. Two particularly bugged me:
  1. "The Sidak and Bonferroni corrections are useful only if the tests are all statistically independent" (p. 130): This is true of the Sidak correction but not of the Bonferroni, which allows arbitrary dependency between the tests. This mistake was not a passing glitch on that one page, but appears throughout the chapter on multiple testing, and I believe elsewhere.
  2. Chapter 10 repeatedly asserts that PCA assumes a multivariate normal distribution for the data. (This shows up again in chapter 11, by way of a contrast with ICA.) This is quite wrong; PCA can be applied so long as covariances exist. The key proposition 10.1 on p. 248 is true as stated, but it would still be true if all instances of "multivariate normal" were struck out, and all instances of "independent" were replaced with "uncorrelated". This is related to the key, distribution-free result, not even hinted at by Ashby, that the first $k$ principal components give the $k$-dimensional linear space which comes closest on average to the data points. Further, if one does assume the data came from a multivariate normal distribution, then the principal components are estimates of the eigenvectors of the distribution's covariance matrix, and so one is doing statistical inference after all, contrary to the assertion that PCA involves no statistical inference. (More than you'd ever want to know about all this.) [2]
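To make the first complaint concrete, here is a minimal simulation sketch, not anything from Ashby's book. It assumes (my choice, purely for illustration) that the $m$ test statistics are standard Gaussians sharing a common factor, so that any two have correlation `rho`, and that every null hypothesis is true. The function name `bonferroni_fwer` and all default values are hypothetical. Bonferroni only needs the union bound, so the estimated family-wise error rate should stay at or below $\alpha$ however strong the dependence.

```python
import numpy as np
from statistics import NormalDist


def bonferroni_fwer(m=20, rho=0.9, alpha=0.05, n_sims=20000, seed=0):
    """Estimate the family-wise error rate of two-sided Bonferroni tests
    when all m nulls are true and the z-statistics are equicorrelated."""
    rng = np.random.default_rng(seed)
    common = rng.standard_normal((n_sims, 1))   # factor shared by all m tests
    own = rng.standard_normal((n_sims, m))      # test-specific noise
    # Each z-statistic is N(0,1); any two have correlation rho
    z = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * own
    # Bonferroni: run each two-sided test at level alpha/m
    crit = NormalDist().inv_cdf(1.0 - alpha / (2.0 * m))
    any_false_alarm = (np.abs(z) > crit).any(axis=1)
    return float(any_false_alarm.mean())


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(f"rho={rho}: estimated FWER = {bonferroni_fwer(rho=rho):.4f}")
```

With strong positive dependence the correction becomes conservative rather than invalid, which is the opposite of being "useful only if the tests are all statistically independent".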
The discussion of Granger causality is more conceptually confused than mathematically wrong. It's perfectly possible, contra p. 228, that activity in region $i$ causes activity in region $j$ and vice versa, even with "a definition of causality that includes direction"; they just need to both do so with a delay. How this would show up given the slow measurement resolution of fMRI is a tricky question, which Ashby doesn't notice. There is an even deeper logical flaw: if $i$ and $j$ are both being driven by a third source, which we haven't included, then $i$ might well help predict ("Granger cause") $j$. In fact, even if we include this third source $k$, but we measure it imperfectly, $i$ could still help us predict $j$, just because two noisy measurements are better than one [3]. Indeed, if $i$ causes $j$ but only through $k$, and the first two variables are measured noisily, we may easily get non-zero values for the "conditional Granger causality", as in Ashby's Figure 9.4. Astonishingly, Ashby actually gets this for his second worked example (p. 242), but it doesn't lead him to reconsider what, if anything, Granger causality tells us about actual causality.
While I cannot wholeheartedly recommend a book with such flaws, Ashby has obviously tried really hard to explain the customary practices of his tribe to its youth, in the simplest and most accessible possible terms. If you are part of the target audience, it's probably worth consulting, albeit with caution.
[1] Like everyone else, Ashby introduces ICA with the cocktail-party problem, but then makes it about separating speakers rather than conversations: "Speech signals produced by different people should be independent of each other" (p. 258). To be fair, I think we've all been to parties where people talk past each other without listening to a thing anyone else says, but I hope they're not typical of Ashby's own experiences.
[2] Of course, Ashby introduces PCA with a made-up example of two test scores being correlated and wanting to know if they measure the same general ability. Of course, Ashby concludes the example by saying that we can tell both tests do tap in to a common ability by their both being positively correlated with the first principal component. You can imagine my feelings.
[3] For the first case, say $X_i(t) = X_k(t) + \epsilon_i(t)$, $X_j(t) = X_k(t) + \epsilon_j(t)$, with the two noise terms $\epsilon_i, \epsilon_j$ independent, and $X_k(t)$ following some non-trivial dynamics, perhaps a moving average process. Then predicting $X_i(t+1)$ is essentially predicting $X_k(t+1)$ (and adding a little noise), and the history of $X_i$, $X_i(1:t)$, will generally contain strictly less information about $X_k(t+1)$ than will the combination of $X_i(1:t)$ and $X_j(1:t)$. For the second case, suppose we don't observe the $X$ variables, but $B=X+\eta$, with extra observational noise $\eta_t$ independent across $i$, $j$ and $k$. Then, again, conditioning on the history of $B_j$ will add information about $X_k(t+1)$, after conditioning on the history of $B_i$ and even the history of $B_k$.
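A quick numerical sketch of the first case, under assumptions of my own choosing: the hidden driver $X_k$ is a first-order autoregression (rather than a moving average) with coefficient `phi`, both noises are standard Gaussian, and "prediction" means one-lag least-squares regression. The helper names `one_step_mse` and `common_driver_demo` are hypothetical. Since in-sample error can never rise when a regressor is added, the thing to look for is a substantial drop, not just any drop.

```python
import numpy as np


def one_step_mse(target, predictors):
    """In-sample mean squared error of a least-squares linear prediction."""
    design = np.column_stack([np.ones(len(target))] + predictors)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(np.mean(resid ** 2))


def common_driver_demo(n=20000, phi=0.9, seed=1):
    """X_i and X_j are noisy copies of an unobserved AR(1) process X_k.
    Neither causes the other, yet X_j's past helps predict X_i."""
    rng = np.random.default_rng(seed)
    x_k = np.zeros(n)
    for t in range(1, n):
        x_k[t] = phi * x_k[t - 1] + rng.standard_normal()
    x_i = x_k + rng.standard_normal(n)   # epsilon_i, independent of epsilon_j
    x_j = x_k + rng.standard_normal(n)   # epsilon_j
    target = x_i[1:]
    mse_own = one_step_mse(target, [x_i[:-1]])             # own history only
    mse_both = one_step_mse(target, [x_i[:-1], x_j[:-1]])  # add X_j's history
    return mse_own, mse_both


if __name__ == "__main__":
    mse_own, mse_both = common_driver_demo()
    print(f"MSE from X_i's past alone:     {mse_own:.3f}")
    print(f"MSE from X_i's and X_j's past: {mse_both:.3f}")
```

A Granger test run on these two series would report that $j$ "causes" $i$ (and, by symmetry, vice versa), though all the dependence runs through the omitted $k$.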
Lauren Beukes, The Shining Girls
A time-traveling psycho killer (a literal murder hobo) and his haunted house versus talented and energetic ("shining") women of Chicago throughout the 20th century. I cannot decide if this is just a creepy, mildly feminist horror novel with good characterization and writing, or if Beukes is trying to say something very dark about how men suppress female ability (and, if so, whether she's wrong about us).

Books to Read While the Algae Grow in Your Fur; Minds, Brains, and Neurons; Enigmas of Chance; Scientifiction and Fantastica; Tales of Our Ancestors; The Dismal Science; The Continuing Crises; Constant Conjunction Necessary Connexion; Writing for Antiquity

Posted at May 31, 2015 23:59

May 22, 2015

36-402, Advanced Data Analysis, Spring 2015: Self-Evaluation and Lessons Learned

Attention conservation notice: 2000+ words of academic navel-gazing about teaching a weird class in an obscure subject at an unrepresentative school; also, no doubt, more complacent than it ought to be.

Once again, it's the brief period between submitting all the grades for 402 and the university releasing the student evaluations (for whatever they're worth), so time to think about what I did, what worked, what didn't, and what to do better.

My self-evaluation was that the class went decently, but very far from perfectly, and needs improvement in important areas. I think the subject matter is good, the arrangement is at least OK, and the textbook a good value for the price. Most importantly, the vast majority of the students appear to have learned a lot about stuff they would not have picked up without the class. Since my goal is not for the students to have fun [0] but to challenge them to learn as much as possible, and assist them in doing so, I think the main objective was achieved, though not in ways which will make me beloved or even popular.

All that is much as it was in previous iterations of the class; the big changes from the last time I taught this were the assignments, using R Markdown, and the size of the class.

Writing (almost) all new assignments --- ten homeworks and three exams --- was good; it reduced cheating [1] to negligible proportions [2] and kept me interested in the material. It was also a lot more work, but I think it was worth it. Basing them on real papers, mostly but not exclusively from economics, seems to have gone over well, especially considering how many students were in the joint major in economics and statistics. (It also led to a gratifying number of students reporting crises of faith about what they were being taught in their classes in other departments.) Relatedly, having the technical content of each homework only add up to 90 points, with the remaining 10 being allocated for following a writing rubric [3], seems to have led to better writing, easier grading, and I think more perception of fairness in the grading.

Encouraging the use of R Markdown so that the students' data analyses were executable and replicable was a very good call. (I have to thank Jerzy Wieczorek for over-coming my skepticism by showing me R Markdown.) In fact, I think it worked well enough that in the future I will make it mandatory, with a teaching session at the beginning of the semester (and exceptions, with permission in advance, for those who want to use knitr and LaTeX). However, I may have to reconsider my use of the np package for kernel regression, since it is very aggressive about printing out progress messages which are not useful in a report.

The big challenge of the class was sheer size. The first time I taught this class, in 2011, it had 63 students; we hit 120 this year. (And the department expects about 50% more next year.) This, of course, made it impossible to get to know most of the students --- at best I got a sense of the ones who were regular at my office hours or spoke up in lecture, and those who sent me e-mail frequently. (Linking the faces of the former to the names of the latter remains one of my weak points.) It also means I would have gone crazy if it weren't for the very good TAs (Dena Asta, Collin Eubanks, Sangwon "Justin" Hyun and Natalie Klein), and the assistance of Xizhen Cai, acting as my (as it were) understudy --- but coordinating six people for teaching is also not one of my strengths. Over the four months of the semester I sent over a thousand e-mails about the class, roughly three quarters to students and a quarter among the six of us; I feel strongly that there have to be more efficient ways of doing this part of my job.

The "quality control" samples — select six students at random every week, have them in for fifteen minutes or so to talk about what they did on the last assignment and anything that leads to, with a promise that their answers will not hurt their grades — continue to be really informative. In particular, I made a point of asking every student how long they spent on that assignment and on previous ones, and most (though not all) were within the university's norms for a nine-credit class. Some students resisted participation, perhaps because they didn't trust the wouldn't-hurt-their-grades bit; if so, I failed at "drive out fear". Also, it needs a better name, since the students keep thinking it's their quality that's being controlled, rather than that of the teaching and grading.

Things that did not work so well:

Things I am considering trying next time:

— Naturally, while proofing this before posting, the university e-mailed me the course evaluations. They were unsurprisingly bimodal.

[0] I have no objection to fun, or to fun classes, or even to students having fun in my classes; it's just not what I'm aiming at here.

[1] I am sorry to have to say that there are some students who have tried to cheat, by re-using old solutions. This is why I no longer put solutions on the public web, and part of why I made sure to write new assignments this time, or, if I did re-cycle, make substantial changes.

[2] At least, cheating that we caught. (I will not describe how we caught anyone.)

[3] This evolved a little over the semester; here's the final version.

The text is laid out cleanly, with clear divisions between problems and sub-problems. The writing itself is well-organized, free of grammatical and other mechanical errors, and easy to follow. Figures and tables are easy to read, with informative captions, axis labels and legends, and are placed near the text of the corresponding problems. All quantitative and mathematical claims are supported by appropriate derivations, included in the text, or calculations in code. Numerical results are reported to appropriate precision. Code is either properly integrated with a tool like R Markdown or knitr, or included as a separate R file. In the former case, both the knitted and the source file are included. In the latter case, the code is clearly divided into sections referring to particular problems. In either case, the code is indented, commented, and uses meaningful names. All code is relevant to the text; there are no dangling or useless commands. All parts of all problems are answered with actual coherent sentences, and never with raw computer code or its output. For full credit, all code runs, and the Markdown file knits (if applicable).

[4] The North American Mammals Paleofauna Database for homework 5 has about two thousand entries, so my thought would be to assign each student a random extinct species as their pseudonym. These should be socially neutral, and more memorable than numbers, but no doubt I'll discover that some students have profound feelings about the amphicyonidae.

Advanced Data Analysis from an Elementary Point of View

Posted at May 22, 2015 19:34

May 16, 2015

Any P-Value Distinguishable from Zero is Insufficiently Informative

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Probwrt}[2]{\mathbb{P}_{#1}\left( #2 \right)} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \]

Attention conservation notice: 4900+ words, plus two (ugly) pictures and many equations, on a common mis-understanding in statistics. Veers wildly between baby stats. and advanced probability theory, without explaining either. Its efficacy at remedying the confusion it attacks has not been evaluated by a randomized controlled trial.

After ten years of teaching statistics, I feel pretty confident in saying that one of the hardest points to get through to undergrads is what "statistically significant" actually means. (The word doesn't help; "statistically detectable" or "statistically discernible" might've been better.) They have a persistent tendency to think that parameters which are significantly different from 0 matter, that ones which are insignificantly different from 0 don't matter, and that the smaller the p-value, the more important the parameter. Similarly, if one parameter is "significantly" larger than another, then they'll say the difference between them matters, but if not, not. If this was just about undergrads, I'd grumble over a beer with my colleagues and otherwise suck it up, but reading and refereeing for non-statistics journals shows me that many scientists in many fields are subject to exactly the same confusions as The Kids, and talking with friends in industry makes it plain that the same thing happens outside academia, even to "data scientists". (For example: an A/B test is just testing the difference in average response between condition A and condition B; this is a difference in parameters, usually a difference in means, and so it's subject to all the issues of hypothesis testing.) To be fair, one meets some statisticians who succumb to these confusions.

One reason for this, I think, is that we fail to teach well how, with enough data, any non-zero parameter or difference becomes statistically significant at arbitrarily small levels. The proverbial expression of this, due I believe to Andy Gelman, is that "the p-value is a measure of sample size". More exactly, a p-value generally runs together the size of the parameter, how well we can estimate the parameter, and the sample size. The p-value reflects how much information the data has about the parameter, and we can think of "information" as the product of sample size and precision (in the sense of inverse variance) of estimation, say $n/\sigma^2$. In some cases, this heuristic is actually exactly right, and what I just called "information" really is the Fisher information.

As a public service, I've written up some notes on this. Throughout, I'm assuming that we're testing the hypothesis that a parameter, or vector of parameters, $\theta$ is exactly zero, since that's overwhelmingly what people calculate p-values for --- sometimes, I think, by a spinal reflex not involving the frontal lobes. Testing $\theta=\theta_0$ for any other fixed $\theta_0$ would work much the same way. Also, $\langle x, y \rangle$ will mean the inner product between the two vectors.

1. Any Non-Zero Mean Will Become Arbitrarily Significant

Let's start with a very simple example. Suppose we're testing whether some mean parameter $\mu$ is equal to zero or not. Being straightforward folk, who follow the lessons we were taught in our research methods class, we'll use the sample mean $\hat{\mu}$ as our estimator, and take as our test statistic $\frac{\hat{\mu}}{\hat{\sigma}/\sqrt{n}}$; that denominator is the standard error of the mean. If we're really into old-fashioned recipes, we'll calculate our p-value by comparing this to a table of the $t$ distribution with $n-1$ degrees of freedom, remembering that it's $n-1$ because we use up one degree of freedom estimating the mean ($\hat{\mu}$) before we estimate the standard deviation ($\hat{\sigma}$). (If we're a bit more open to new-fangled notions, we bootstrap.) Now what happens as $n$ grows?

Well, we remember the central limit theorem: $\sqrt{n}(\hat{\mu} - \mu) \rightarrow \mathcal{N}(0,\sigma^2)$. With a little manipulation, and some abuse of notation, this becomes \[ \hat{\mu} \rightarrow \mu + \frac{\sigma}{\sqrt{n}}\mathcal{N}(0,1) \] The important point is that $\hat{\mu} = \mu + O(n^{-1/2})$. Similarly, albeit with more algebra, $\hat{\sigma} = \sigma + O(n^{-1/2})$. Now plug these in to our formula for the test statistic: \[ \begin{eqnarray*} \frac{\hat{\mu}}{\hat{\sigma}/\sqrt{n}} & = & \sqrt{n}\frac{\hat{\mu}}{\hat{\sigma}}\\ & = & \sqrt{n}\frac{\mu + O(n^{-1/2})}{\sigma + O(n^{-1/2})}\\ & = & \sqrt{n}\left(\frac{\mu}{\sigma} + O(n^{-1/2})\right)\\ & = & \sqrt{n}\frac{\mu}{\sigma} + O(1) \end{eqnarray*} \] So, as $n$ grows, the test statistic will go to either $+\infty$ or $-\infty$, at a rate of $\sqrt{n}$, unless $\mu=0$ exactly. If $\mu \neq 0$, then the test statistic eventually becomes arbitrarily large, while the distribution we use to calculate p-values stabilizes at a standard Gaussian distribution (since that's a $t$ distribution with infinitely many degrees of freedom). Hence the p-value will go to zero as $n\rightarrow \infty$, for any $\mu\neq 0$. The rate at which it does so depends on the true $\mu$, the true $\sigma$, and the number of samples. The p-value reflects how big the mean is ($\mu$), how precisely we can estimate it ($\sigma$), and our sample size ($n$).

T-statistics calculated for five independent runs of Gaussian random variables with the specified parameters, plotted against sample size. Successive t-statistics along the same run are linked; the dashed lines are the asymptotic formulas, $\sqrt{n}\mu/\sigma$. Note that both axes are on a logarithmic scale.
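For readers who want to watch the $\sqrt{n}$ growth themselves, here is a minimal sketch in Python. It is not the source code behind the figure; the function names `t_statistic` and `t_path`, the seed, and the parameter values are all my own illustrative assumptions. It draws one Gaussian run and compares the t-statistic on growing prefixes of it with the asymptotic formula $\sqrt{n}\mu/\sigma$.

```python
import numpy as np


def t_statistic(x):
    """Sample mean divided by its estimated standard error."""
    n = len(x)
    return float(np.mean(x) / (np.std(x, ddof=1) / np.sqrt(n)))


def t_path(mu, sigma, sample_sizes, seed=0):
    """t-statistics along one run, paired with sqrt(n)*mu/sigma."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=max(sample_sizes))
    return [(n, t_statistic(x[:n]), float(np.sqrt(n) * mu / sigma))
            for n in sample_sizes]


if __name__ == "__main__":
    for n, t, asymptote in t_path(0.1, 1.0, [10, 100, 1000, 10000, 100000]):
        print(f"n={n:>6}  t={t:8.2f}  sqrt(n)*mu/sigma={asymptote:8.2f}")
```

The gap between the two columns should stay $O(1)$ while both grow without bound, which is all the argument above needs.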

2. Any Non-Zero Regression Coefficient Will Become Arbitrarily Significant

Matters are much the same if instead of estimating a mean we're estimating a difference in means, or regression coefficients, or linear combinations of regression coefficients ("contrasts"). The p-value we get runs together the size of the parameter, the precision with which we can estimate the parameter, and the sample size. Unless the parameter is exactly zero, as $n\rightarrow\infty$, the p-value will converge stochastically to zero.

Even if two parameters are estimated from the same number of samples, the one with a smaller p-value is not necessarily larger; it may just have been estimated more precisely. Let's suppose we're in the land of good, old-fashioned linear regression, where $Y = \langle X, \beta \rangle + \epsilon$, where all the random variables have mean 0 (to simplify book-keeping), where $\epsilon$ is uncorrelated with $X$. Estimating $\beta$ with ordinary least squares, we get of course \[ \hat{\beta} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y} ~, \] with $\mathbf{x}$ being the $n\times 2$ matrix of $X$ values and $\mathbf{y}$ the $n\times 1$ matrix of $Y$ values. Since $\mathbf{y} = \mathbf{x} \beta + \mathbf{\epsilon}$, \[ \hat{\beta} = \beta + (\mathbf{x}^T \mathbf{x})^{-1}\mathbf{x}^T \mathbf{\epsilon} ~. \] Assuming the $\epsilon$ terms are uncorrelated with each other and have constant variance $\sigma^2_{\epsilon}$, we get \[ \Var{\hat{\beta}} = \sigma^2_{\epsilon} (\mathbf{x}^T \mathbf{x})^{-1} ~. \] To understand what's really going on here, notice that $\frac{1}{n} \mathbf{x}^T \mathbf{x}$ is the sample variance-covariance matrix of $X$; call it $\hat{\mathbf{v}}$. (I give it a hat because it's an estimate of the population covariance matrix.) So \[ \Var{\hat{\beta}} = \frac{\sigma^2_{\epsilon}}{n}\hat{\mathbf{v}}^{-1} \] The standard errors for the different components of $\hat{\beta}$ are thus going to be the square roots of the diagonal entries of $\Var{\hat{\beta}}$. We will therefore estimate different regression coefficients to different precisions. To make a regression coefficient precise, the predictor variable it belongs to should have a lot of variance, and it should have little correlation with other predictor variables. (If we used an orthogonal design, $\hat{\mathbf{v}}^{-1/2}$ will be a diagonal matrix whose entries are the reciprocals of the regressors' standard deviations.) 
Even if we think that the size of entries in $\beta$ is telling us something about how important different $X$ variables are, one of them having a bigger variance than the other doesn't make it more important in any interesting sense.
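A small worked calculation may make this vivid. The sketch below plugs made-up numbers into $\Var{\hat{\beta}} = \frac{\sigma^2_{\epsilon}}{n}\mathbf{v}^{-1}$, using a hypothetical population covariance matrix in place of the sample one; the function name `expected_z_scores` and every numerical value are assumptions for illustration only. The first coefficient is twice the size of the second, but its predictor barely varies.

```python
import numpy as np


def expected_z_scores(beta, cov_x, sigma_eps, n):
    """Each coefficient divided by its standard error, where
    Var[beta_hat] = (sigma_eps^2 / n) * inverse(cov_x)."""
    var_beta_hat = (sigma_eps ** 2 / n) * np.linalg.inv(cov_x)
    std_errors = np.sqrt(np.diag(var_beta_hat))
    return np.asarray(beta) / std_errors


beta = [1.0, 0.5]                # the first coefficient is twice as large...
cov_x = np.array([[0.01, 0.0],   # ...but its predictor has variance 0.01,
                  [0.0, 4.0]])   # while the second predictor has variance 4
z = expected_z_scores(beta, cov_x, sigma_eps=1.0, n=100)
print(z)  # approximately [1, 10]
```

The smaller coefficient sits ten standard errors from zero and the larger one only a single standard error, purely because of how much each predictor varies.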

3. Consistent Hypothesis Tests Imply Everything Will Become Arbitrarily Significant

So far, I've talked about particular cases --- about estimating means or linear regression coefficients, and even using particular estimators. But the point can be made much more generally, though at some cost in abstraction. Recall that a hypothesis test can make two kinds of error: it can declare that there's some signal when it really looks at noise (a "false alarm" or "type I" error), or it can ignore the presence of a signal and mistake it for noise (a "miss" or "type II" error). The probability of a false alarm, when looking at noise, is called the size of a test. The probability of noticing a signal when it is present is called the power to detect the signal. A hypothesis test is consistent if its size goes to 0 and its power goes to 1 as the number of data points grows. (Purists would call this a consistent sequence of hypothesis tests, but I'm trying to speak like a human being.)

Suppose that a consistent hypothesis test exists. Then at each sample size $n$, there's a range of p-values $[0,a_n]$ where we reject the noise hypothesis and claim there's a signal, and another $(a_n,1]$ where we say there's noise. Since the p-value is uniformly distributed under the noise hypothesis, the size of the test is just $a_n$, so consistency means $a_n$ must go to 0. The power of the test is the probability, in the presence of signal, that the p-value is in the rejection region, i.e., $\Probwrt{\mathrm{signal}}{P \leq a_n}$. Since, by consistency, the power is going to 1, the probability (in the presence of signal) that the p-value is less than any given value eventually goes to 1. Hence the p-value converges stochastically to 0 (again, when there's a signal). Thus, if there is a consistent hypothesis test, and there is any signal to be detected at all, the p-value must shrink towards 0.

I bring this up because, of course, the situations where people usually want to calculate p-values are in fact the ones where there usually are consistent hypothesis tests. These are situations where we have an estimator $\hat{\theta}$ of the parameter $\theta$ which is itself "consistent", i.e., $\hat{\theta} \rightarrow \theta$ in probability as $n \rightarrow \infty$. This means that with enough data, the estimate $\hat{\theta}$ will come arbitrarily close to the truth, with as much probability as we might desire. It's not hard to believe that this will mean there's a consistent hypothesis test --- just reject the null when $\hat{\theta}$ is too far from 0 --- but the next two paragraphs sketch a proof, for the sake of skeptics and quibblers.

Consistency of estimation means that for any level of approximation $\epsilon > 0$ and any level of confidence $\delta > 0$, for all $n \geq$ some $N(\epsilon,\delta,\theta)$, \[ \Probwrt{\theta}{\left|\hat{\theta}_n-\theta\right|>\epsilon} \leq \delta ~. \] This can be inverted: for any $n$ and any $\delta$, for any $\eta \geq \epsilon(n,\delta,\theta)$, \[ \Probwrt{\theta}{\left|\hat{\theta}_n-\theta\right|>\eta} \leq \Probwrt{\theta}{\left|\hat{\theta}_n-\theta\right|>\epsilon(n,\delta,\theta)} \leq \delta ~. \] Moreover, as $n\rightarrow\infty$ with $\delta$ and $\theta$ held constant, $\epsilon(n,\delta,\theta) \rightarrow 0$.

Pick any $\theta^* \neq 0$, and any $\alpha$ and $\beta > 0$ that you like. For each $n$, set $\epsilon = \epsilon(n,\alpha,0)$; abbreviate this sequence as $\epsilon_n$. I will use $\hat{\theta}_n$ as my test statistic, retaining the null hypothesis $\theta=0$ when $\left|\hat{\theta}_n\right| \leq \epsilon_n$, and rejecting it otherwise. By construction, my false alarm rate is at most $\alpha$. What's my miss rate? Well, again by consistency of the estimator, for any sufficiently small but fixed $\eta > 0$, if $n \geq N(|\theta^*| - \eta, \beta, \theta^*)$, then \[ \Probwrt{\theta^*}{\left|\hat{\theta}_n\right| < \eta} \leq \Probwrt{\theta^*}{\left|\hat{\theta}_n - \theta^*\right|\geq |\theta^*| - \eta} \leq \beta ~. \] (To be very close to 0, $\hat{\theta}$ has to be far from $\theta^*$.) So, if I wait until $n$ is large enough that $n \geq N(|\theta^*| - \eta, \beta, \theta^*)$ and that $\epsilon_n \leq \eta$, my power against $\theta=\theta^*$ is at least $1-\beta$ (and my false-positive rate is still at most $\alpha$). Since you got to pick $\alpha$ and $\beta$ arbitrarily, you can make them as close to 0 as you like, and I can still get arbitrarily high power against any alternative while still controlling the false-positive rate. In fact, you can pick a sequence of error rate pairs $(\alpha_k, \beta_k)$, with both rates going to zero, and for $n$ sufficiently large, I will, eventually, have a size less than $\alpha_k$, and a power against $\theta=\theta^*$ greater than $1-\beta_k$. Hence, a consistent estimator implies the existence of a consistent hypothesis test. (Pedantically, we have built a universally consistent test, i.e., consistent whatever the true value of $\theta$ might be, but not necessarily a uniformly consistent one, where the error rates can be bounded independent of the true $\theta$. 
The real difficulty there is that there are parameter values in the alternative hypothesis $\theta \neq 0$ which come arbitrarily close to the null hypothesis $\theta=0$, and so an arbitrarily large amount of information may be needed to separate them with the desired reliability.)
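
To make the construction concrete, here is a minimal sketch for one special case: IID Gaussian data with known unit variance, with the sample mean as $\hat{\theta}_n$, so that $\epsilon(n,\alpha,0)$ has a closed form. The Gaussian model and the function names `threshold` and `miss_rate` are assumptions made for illustration; the argument above does not depend on them.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian, mean 0 and sd 1


def threshold(n, alpha):
    """epsilon(n, alpha, 0): under theta = 0, P(|sample mean| > eps) = alpha."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2) / n ** 0.5


def miss_rate(n, alpha, theta_star):
    """P(|sample mean| <= epsilon_n) when the true mean is theta_star."""
    eps = threshold(n, alpha)
    root_n = n ** 0.5
    # Under theta_star, sqrt(n) * (sample mean - theta_star) is standard Gaussian.
    upper = STD_NORMAL.cdf(root_n * (eps - theta_star))
    lower = STD_NORMAL.cdf(root_n * (-eps - theta_star))
    return upper - lower


if __name__ == "__main__":
    # Fixed false-alarm rate, fixed alternative, growing sample size.
    for n in (10, 100, 1000, 10000):
        print(n, threshold(n, 0.01), miss_rate(n, 0.01, 0.1))
```

With the false-alarm rate held at $\alpha$, the threshold shrinks like $1/\sqrt{n}$ and the miss rate against any fixed $\theta^* \neq 0$ goes to zero, which is the consistency claimed above.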

4. $p$-Values for Means Should Shrink Exponentially Fast

So far, I've been arguing that the p-value should always go stochastically to zero as the sample size grows. In many situations, it's possible to be a bit more precise about how quickly it goes to zero. Again, start with the simple case of testing whether a mean is equal to zero. We saw that our test statistic $\hat{\mu}/(\hat{\sigma}/\sqrt{n}) \rightarrow \sqrt{n}\mu/\sigma + O(1)$, and that the distribution we compare this to approaches $\mathcal{N}(0,1)$. Since for a standard Gaussian $Z$ and any $t > 0$ the probability that $Z > t$ is at most $\frac{\exp{\left\{-t^2/2\right\}}}{t\sqrt{2\pi}}$, the p-value in a two-sided test goes to zero exponentially fast in $n$, with the asymptotic exponential rate being $\frac{1}{2}\mu^2/\sigma^2$. Let's abbreviate the p-value after $n$ samples as $P_n$: \[ \begin{eqnarray*} P_n & = & \Prob{|Z| \geq \left|\frac{\hat{\mu}}{\hat{\sigma}/\sqrt{n}}\right|}\\ & = & 2 \Prob{Z \geq \left|\frac{\hat{\mu}}{\hat{\sigma}/\sqrt{n}}\right|}\\ & \leq & 2\frac{\exp{\left\{-n\hat{\mu}^2/2\hat{\sigma}^2\right\}}}{\sqrt{n}\left|\hat{\mu}\right|\sqrt{2\pi}/\hat{\sigma}}\\ \frac{1}{n}\log{P_n} & \leq & \frac{\log{2}}{n} -\frac{\hat{\mu}^2}{2\hat{\sigma}^2} - \frac{\log{n}}{2n} - \frac{1}{n}\log{\frac{\left|\hat{\mu}\right|}{\hat{\sigma}}} - \frac{\log{2\pi}}{2n}\\ \lim_{n\rightarrow\infty}{\frac{1}{n}\log{P_n}} & \leq & -\frac{\mu^2}{2\sigma^2} \end{eqnarray*} \] Since $\Prob{Z > t}$ is also at least $\frac{t\exp{\left\{-t^2/2\right\}}}{(t^2+1)\sqrt{2\pi}}$, a parallel argument gives a matching lower bound, $\lim_{n\rightarrow\infty}{n^{-1}\log{P_n}} \geq -\frac{1}{2}\mu^2/\sigma^2$.

P-value versus sample size, color coded as in the previous figure. Notice that even in the runs where $\mu$, and $\mu/\sigma$, are very small (in green), the p-value is declining exponentially. Again, click for a larger PDF, source code here.
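
The exponential rate can also be checked numerically. The following is a small sketch, not the code behind the figure: it replaces $\hat{\mu}$ and $\hat{\sigma}$ by their limits $\mu$ and $\sigma$, so the test statistic is exactly $\sqrt{n}\mu/\sigma$, and it evaluates the Gaussian tail exactly via the complementary error function. The function names are assumptions made for illustration.

```python
import math


def log_two_sided_p(z):
    """log P(|Z| >= |z|) for a standard Gaussian Z."""
    # P(|Z| >= t) = erfc(t / sqrt(2)). math.erfc underflows to 0 for very
    # large arguments, so keep |z| below roughly 35.
    return math.log(math.erfc(abs(z) / math.sqrt(2)))


def rate(n, mu, sigma):
    """(1/n) log P_n with the sample statistics replaced by their limits."""
    z = math.sqrt(n) * mu / sigma
    return log_two_sided_p(z) / n


if __name__ == "__main__":
    mu, sigma = 0.2, 1.0
    print("limit:", -mu ** 2 / (2 * sigma ** 2))
    for n in (100, 1000, 10000, 30000):
        print(n, rate(n, mu, sigma))
```

The printed values approach $-\mu^2/2\sigma^2$ from below, with the gap shrinking like $(\log{n})/n$, as the bound above suggests.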

5. $p$-Values in General Will Often Shrink Exponentially Fast

This is not just a cute trick with Gaussian approximations; it generalizes through the magic of large deviations theory. Glossing over some technicalities, a sequence of random variables $X_1, X_2, \ldots X_n$ obeys a large deviations principle when \[ \lim_{n\rightarrow\infty}{\frac{1}{n}\log{\Probwrt{}{X_n \in B}}} = -\inf_{x\in B}{D(x)} \] where $D(x) \geq 0$ is the "rate function". If the set $B$ doesn't include a point where $D(x)=0$, the probability of $B$ goes to zero, exponentially in $n$,* with the exact rate depending on the smallest attainable value of the rate function $D$ over $B$. ("Improbable events tend to happen in the most probable way possible.") Very roughly speaking, then, $\Probwrt{}{X_n \in B} \approx \exp{\left\{ - n \inf_{x\in B}{D(x)}\right\}}$. Suppose that $X_n$ is really some estimator of the parameter $\theta$, and it obeys a large deviations principle for every $\theta$. Then the rate function $D$ is really $D_{\theta}$. For consistent estimators, $D_{\theta}(x)$ would have a unique minimum at $x=\theta$. The usual estimators based on sample means, correlations, sample distributions, maximum likelihood, etc., all obey large deviations principles, at least under most of the conditions where we'd want to apply them.
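
A standard example where everything can be computed exactly is the sample mean of IID coin tosses. For Bernoulli($p$) variables, the rate function of the sample mean is the relative entropy $D(a \| p) = a\log{\frac{a}{p}} + (1-a)\log{\frac{1-a}{1-p}}$. The sketch below compares the exact binomial tail probability to that rate; the particular values $p = 0.5$, $a = 0.75$ and the function names are assumptions chosen for illustration.

```python
import math


def kl_bernoulli(a, p):
    """Relative entropy D(a || p) between Bernoulli(a) and Bernoulli(p)."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))


def log_binom_pmf(k, n, p):
    """log of the Binomial(n, p) probability of exactly k successes."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)


def log_upper_tail(n, p, a):
    """log P(sample mean >= a) for n IID Bernoulli(p) draws, via log-sum-exp."""
    k_min = math.ceil(n * a)
    logs = [log_binom_pmf(k, n, p) for k in range(k_min, n + 1)]
    top = max(logs)
    return top + math.log(sum(math.exp(x - top) for x in logs))


if __name__ == "__main__":
    p, a = 0.5, 0.75
    print("rate function:", kl_bernoulli(a, p))
    for n in (100, 400, 2000):
        print(n, -log_upper_tail(n, p, a) / n)
```

As $n$ grows, $-\frac{1}{n}\log$ of the tail probability settles down to the rate function, with a correction of order $(\log{n})/n$.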

Suppose we make a test based on this estimator. Under $\theta=\theta^*$, $X_n$ will eventually be within any arbitrarily small open ball $B_{\rho}$ of radius $\rho$ around $\theta^*$ we care to name; the probability of its lying outside $B_{\rho}$ will be going to zero exponentially fast, with the rate being $\inf_{x\in B^c_{\rho}}{D_{\theta^*}(x)} > 0$. For small $\rho$ and smooth $D_{\theta^*}$, Taylor-expanding $D_{\theta^*}$ about its minimum suggests that the rate will be $\inf_{\eta: \|\eta\| > \rho}{\frac{1}{2}\langle \eta, J_{\theta^*} \eta\rangle}$, $J_{\theta^*}$ being the matrix of $D$'s second derivatives at $\theta^*$. This, clearly, is $O(\rho^2)$.

The probability under $\theta = 0$ of seeing results $X_n$ lying inside $B_{\rho}$ is very different. If we've made $\rho$ small enough that $B_{\rho}$ doesn't include 0, $\Probwrt{0}{X_n \in B_{\rho}} \rightarrow 0$ exponentially fast, with rate $\inf_{x \in B_{\rho}}{D_0(x)}$. Again, if $\rho$ is small enough and $D_0$ is smooth enough, the value of the rate function should be essentially $D_0(\theta^*) + O(\rho^2)$. If $\theta^*$ in turn is close enough to 0 for a Taylor expansion, we'd get a rate of $\frac{1}{2}\langle \theta^*, J_0 \theta^*\rangle$. To repeat, this is the exponential rate at which the p-value is going to zero when we test $\theta=0$ vs. $\theta\neq 0$, and the alternative value $\theta^*$ is true. It is no accident that this is the same sort of rate we got for the simple Gaussian-mean problem.

Relating the matrix I'm calling $J$ to the Fisher information matrix $F$ needs a longer argument, which I'll present even more sketchily. The empirical distribution obeys a large deviations principle whose rate function is the Kullback-Leibler divergence, a.k.a. the relative entropy; this result is called "Sanov's theorem". For small perturbations of the parameter $\theta$, the divergence between a distribution at $\theta+\eta$ and that at $\theta$ is, after yet another Taylor expansion and a little algebra, $\frac{1}{2}\langle \eta, F_{\theta} \eta \rangle$ to leading order. A general result in large deviations theory, the "contraction principle", says that if the $X_n$ obey an LDP with rate function $D$, then $Y_n = h(X_n)$ obeys an LDP with rate function $D^{\prime}(y) = \inf_{x : h(x) = y}{D(x)}$. Thus an estimator which is a function of the empirical distribution, which is most of them, will have a decay rate which is at most $\frac{1}{2}\langle \eta, F_{\theta} \eta \rangle$, and possibly less, if the estimator is crude enough. (The maximum likelihood estimator in an exponential family will, however, preserve large deviation rates, because it's a sufficient statistic.)
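
The quadratic approximation to the divergence is easy to check in a one-parameter family. For Bernoulli($\theta$), the Fisher information is $1/\theta(1-\theta)$, so the divergence between Bernoulli($\theta+\eta$) and Bernoulli($\theta$) should be close to $\frac{1}{2}\eta^2/\theta(1-\theta)$ for small $\eta$. (For a Gaussian mean with known variance the quadratic form is exact: the divergence between $\mathcal{N}(\mu,\sigma^2)$ and $\mathcal{N}(0,\sigma^2)$ is $\frac{1}{2}\mu^2/\sigma^2$, the rate from the Gaussian-mean problem.) The Bernoulli family and the function names below are assumptions made for illustration.

```python
import math


def kl_bernoulli(a, p):
    """Relative entropy D(a || p) between Bernoulli(a) and Bernoulli(p)."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))


def fisher_bernoulli(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))


def quadratic_approx(theta, eta):
    """Second-order Taylor approximation to D(theta + eta || theta)."""
    return 0.5 * eta ** 2 * fisher_bernoulli(theta)


if __name__ == "__main__":
    theta = 0.3
    for eta in (0.1, 0.01, 0.001):
        exact = kl_bernoulli(theta + eta, theta)
        print(eta, exact, quadratic_approx(theta, eta), exact / quadratic_approx(theta, eta))
```

The ratio of the exact divergence to the quadratic form tends to 1 as $\eta$ shrinks.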

6. What's the Use of $p$-Values Then?

Much more limited than the bad old sort of research methods class (or third referee) would have you believe. If you find a small p-value, yay; you've got enough data, with precise enough measurement, to detect the effect you're looking for, or you're really unlucky. If your p-value is large, you're either really unlucky, or you don't have enough information (too few samples or too little precision), or the parameter is really close to zero. Getting a big p-value is not, by itself, very informative; even getting a small p-value has uncomfortable ambiguity. My advice would be to always supplement a p-value with a confidence set, which would help you tell apart "I can measure this parameter very precisely, and if it's not exactly 0 then it's at least very small" from "I have no idea what this parameter might be". Even if you've found a small p-value, I'd recommend looking at the confidence interval, since there's a difference between "this parameter is tiny, but really unlikely to be zero" and "I have no idea what this parameter might be, but can just barely rule out zero", and so on and so forth. Whether there are any scientific inferences you can draw from the p-value which you couldn't just as easily draw from the confidence set, I leave between you and your referees. What you definitely should not do is use the p-value as any kind of proxy for how important a parameter is.

If you want to know how much some variable matters for predictions of another variable, you are much better off just perturbing the first variable, plugging in to your model, and seeing how much the outcome changes. If you need a formal version of this, and don't have any particular size or distribution of perturbations in mind, then I strongly suggest using Gelman and Pardoe's "average predictive comparisons". If you want to know how much manipulating one variable will change another, then you're dealing with causal inference, but once you have a tolerable causal model, again you look at what happens when you perturb it. If what you really want to know is which variables you should include in your predictive model, the answer is the ones which actually help you predict, and this is why we have cross-validation (and have had it for as long as I've been alive), and, for the really cautious, completely separate validation sets. To get a sense of just how mis-leading p-values can be as a guide to which variables actually carry predictive information, I can hardly do better than Ward et al.'s "The Perils of Policy by p-Value", so I won't.
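
Here is a toy sketch of the contrast, on simulated data. The true slope (0.05), the sample size, the seed, and all function names are assumptions made for illustration, and the perturbation step is a crude stand-in for, not an implementation of, average predictive comparisons. It fits a simple linear regression, computes the usual Gaussian-approximation p-value for the slope, then asks the two questions that matter more: how much the prediction moves when the predictor is shifted by one unit, and how much the predictor helps on a held-out set.

```python
import math
import random
from statistics import NormalDist


def simulate(n, slope, rng):
    """Draw n pairs (x, y) with y = slope * x + standard Gaussian noise."""
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [slope * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b x; returns (a, b, z-statistic of b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return a, b, b / se_b


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def compare(n=40000, slope=0.05, seed=1):
    rng = random.Random(seed)
    train_x, train_y = simulate(n, slope, rng)
    test_x, test_y = simulate(n, slope, rng)  # separate validation set
    a, b, z = fit_line(train_x, train_y)
    p_value = 2 * NormalDist().cdf(-abs(z))
    mean_y = sum(train_y) / n
    mse_without = mse(lambda x: mean_y, test_x, test_y)
    mse_with = mse(lambda x: a + b * x, test_x, test_y)
    # Perturb the predictor by one unit and see how far the prediction moves.
    shift_effect = (a + b * 1.0) - (a + b * 0.0)
    return p_value, shift_effect, mse_without, mse_with


if __name__ == "__main__":
    print(compare())
```

With this much data the p-value for the slope is minuscule, yet a one-unit change in the predictor moves the prediction by only about 0.05 noise standard deviations, and including it barely changes the held-out error.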

(I actually have a lot more use for p-values when doing goodness-of-fit testing, rather than as part of parametric estimation, though even there one has to carefully examine how the model fails to fit. But that's another story for another time.)

Nearly fifty years ago, R. R. Bahadur defined the efficiency of a test as the "rate at which it makes the null hypothesis more and more incredible as the sample size increases when a non-null distribution obtains", and gave a version of the large deviations argument to say that these rates should typically be exponential. The reason he could do so was that it was clear the p-value will always go to zero as we get more information, and so the issue is whether we're using that information effectively. In another fifty years, I presume that students will still have difficulties grasping this, but I piously hope that professionals will have absorbed the point.

References:

*: For the sake of completeness, I should add that sometimes we need to replace the $1/n$ scaling by $1/r(n)$ for some increasing function $r$, e.g., for dense graphs where $n$ counts the number of nodes, $r(n)$ would typically be $O(n^2)$.

(Thanks to KLK for discussions, and feedback on a draft.)

Update, 17 May 2015: Fixed typos (backwards inequality sign, errant $\theta$ for $\rho$) in large deviations section.

Manual trackback: Economist's View

Enigmas of Chance

Posted at May 16, 2015 12:39 | permanent link

May 12, 2015

"The free development of each is the condition of the war of all against all": Some Paths to the True Knowledge

Attention conservation notice: A 5000+ word attempt to provide real ancestors and support for an imaginary ideology I don't actually accept, drawing on fields in which I am in no way an expert. Contains long quotations from even-longer-dead writers, reckless extrapolation from arcane scientific theories, and an unwarranted tone of patiently explaining harsh, basic truths. Altogether, academic in one of the worst senses. Also, spoilers for several of MacLeod's novels, notably but not just The Cassini Division. Written for, and cross-posted to, Crooked Timber's seminar on MacLeod, where I will not be reading the comments.

I'll let Ellen May Ngwethu, late of the Cassini Division, open things up:

The true knowledge... the phrase is an English translation of a Korean expression meaning "modern enlightenment". Its originators, a group of Japanese and Korean "contract employees" (inaccurate Korean translation, this time, of the English term "bonded laborers") had acquired their modern enlightenment from battered, ancient editions of the works of Stirner, Nietzsche, Marx, Engels, Dietzgen, Darwin, and Spencer, which made up the entire philosophical content of their labor-camp library. (Twentieth-century philosophy and science had been excluded by their employers as decadent or subversive — I forget which.) With staggering diligence, they had taken these works — which they ironically treated as the last word in modern thought — and synthesized from them, and from their own bitter experiences, the first socialist philosophy based on totally pessimistic and cynical conclusions about human nature. Life is a process of breaking down and using other matter, and if need be, other life. Therefore, life is aggression, and successful life is successful aggression. Life is the scum of matter, and people are the scum of life. There is nothing but matter, forces, space and time, which together make power. Nothing matters, except what matters to you. Might makes right, and power makes freedom. You are free to do whatever is in your power, and if you want to survive and thrive you had better do whatever is in your interests. If your interests conflict with those of others, let the others pit their power against yours, everyone for theirselves. If your interests coincide with those of others, let them work together with you, and against the rest. We are what we eat, and we eat everything. All that you really value, and the goodness and truth and beauty of life, have their roots in this apparently barren soil. This is the true knowledge. 
We had founded our idealism on the most nihilistic implications of science, our socialism on crass self-interest, our peace on our capacity for mutual destruction, and our liberty on determinism. We had replaced morality with convention, bravery with safety, frugality with plenty, philosophy with science, stoicism with anaesthetics and piety with immortality. The universal acid of the true knowledge had burned away a world of words, and exposed a universe of things. Things we could use.1

What I want to consider here is how people who aren't inmates of a privatized gulag could come to the true knowledge, or something very like it; how they might use it; and some of how MacLeod makes it come alive.

Their Morals and Ours

One route, of course, would be through the Marxist and especially the Trotskyist tradition; I suspect this was MacLeod's. In "Their Morals and Ours", Trotsky laid out a famous formulation of what really matters:

A means can be justified only by its end. But the end in its turn needs to be justified. From the Marxist point of view, which expresses the historical interests of the proletariat, the end is justified if it leads to increasing the power of man over nature and to the abolition of the power of man over man.

Other2 moral ideas are really expressions of self- or, especially, class- interest, indeed tools in the class struggle:

Morality is one of the ideological functions in this struggle. The ruling class forces its ends upon society and habituates it into considering all those means which contradict its ends as immoral. That is the chief function of official morality. It pursues the idea of the "greatest possible happiness" not for the majority but for a small and ever diminishing minority. Such a regime could not have endured for even a week through force alone. It needs the cement of morality. The mixing of this cement constitutes the profession of the petty-bourgeois theoreticians, and moralists. They dabble in all colors of the rainbow but in the final instance remain apostles of slavery and submission.

But if you really want to know whether something is good or bad, Trotsky says, you ask whether it really conduces to "the liberation of mankind", to "increasing the power of man over nature and to the abolition of the power of man over man". Intentions don't matter, nor do formal similarities; what matters is whether means and acts really help advance this over-riding end. Thus, explicitly, even terrorism can be justified under conditions where it will be effective (as when Trotsky practiced it during the Civil War).

Trotsky did not, of course, have occasion to contemplate eliminating an extra-terrestrial civilization, but I think his position would have been clear.

The Historic Route

The good-means-good-for-me, might-is-right theme is also one with a long history in western philosophy, often as the dreadful fate from which philosophy will save us, but sometimes as the liberating truth which philosophy reveals. This means that something like the true knowledge could, paradoxically enough, be developed out of the classical western tradition.

The obvious way to do this would be to start from figures like Nietzsche who have said pretty similar things. Most of these 19th and 20th century figures would of course have looked on the Solar Union with utter horror, but even so there is, I think, a way there. Many of these philosophers simultaneously celebrate power and bemoan the way in which the great and powerful are dragged down or confined by the weak. This creates a tension, if not an outright contradiction. Who is really more powerful? Clearly, if the mediocre masses can collectively dominate and overwhelm the individually magnificent few, the masses have more power. As Hume said, albeit in a somewhat different context, "force is always on the side of the governed". (Or again: "Such a regime could not have endured for even a week through force alone".) Someone who was willing to combine Nietzsche's celebration of power with a frank assessment of both their own power as an isolated individual and of the potential power of different groups could well end up at the true knowledge.

Even less work would be to go further back into the past, to the great figures of the 17th century, like Hobbes and, most especially, Spinoza. Here we find thinkers willing to found, if not socialism, then at least social and political life on "pessimistic and cynical conclusions about human nature". The latter's Political Treatise is quite explicit about the pessimism and the cynicism:

[M]en are of necessity liable to passions, and so constituted as to pity those who are ill, and envy those who are well off; and to be prone to vengeance more than to mercy: and moreover, that every individual wishes the rest to live after his own mind, and to approve what he approves, and reject what he rejects. And so it comes to pass, that, as all are equally eager to be first, they fall to strife, and do their utmost mutually to oppress one another; and he who comes out conqueror is more proud of the harm he has done to the other, than of the good he has done to himself. [Elwes edition, I.5]

Spinoza is equally clear that one's rights extend exactly as far as one's power3, and that the reason people band together is to increase their power4. It is precisely on this basis that Spinoza came to advocate democracy, as uniting more of the power of the people in the commonwealth, especially their powers of reasoning. Of course Spinoza's political views were not the true knowledge, but he actually provides a surprisingly close starting point, and reasoning from his premises and the stand-point of someone who knows they are not going to be at the top of the heap unless they level it all would get you most of the rest of the way there. This would include Spinoza's idea that obedience, allegiance, even solidarity are all dissolved when they are no longer advantageous.

I want to mention one more pseudo-ancestor for the true knowledge. I said before that the themes that might is right, and "good" means "good for me", are ancient ones in the history of philosophy, but they were introduced as the awful dangers which ethics is supposed to save us from. All the way back in The Republic, we find clear statements of the idea that might is right, that the alternative to pursuing self-interest is sheer stupidity, and that cooperation emerges from alignment of interests. We are supposed to recoil from these ideas in horror, but they can only arouse horror if it seems like there's something to them5. The danger with this tactic is that the initial presentation of the amoralist ideas may end up seeming more convincing than their later refutation. (I think that's the case even in The Republic.) And then one is reduced to talking about how refusing to accept that some transcendental, unverifiable ideas are true will lead to bad-for-you consequences in this world, and the game is over.

Evolutionary Game Theory as the True Knowledge

No doubt some scholars in the Solar Union will, as I have done above, play the game of trying to find retrospective anticipations of some idea in the words of people who were really saying something else. On the other hand, at some point the true knowledge leaves its bonded-labor camps, joins up with the Sino-Soviet army, and starts expanding "from Vladivostok to Lisbon, from sea to shining sea". As it moves into the wider world, it encounters scientific knowledge considerably more up to date than Darwin and Engels. Does this set the stage for another shameful and self-defeating episode of an ideology trying desperately to hold on to a bit of fossilized science?

I actually don't see why it should. There are scientific theories nowadays which try to address the sort of questions that the true knowledge claims to answer, and I don't think the answers are really that different, though they are not usually presented so starkly.

Biologically, life is a process of assimilating matter and energy, of appropriating parts of the world to sustain itself. Nothing with a stomach is innocent of preying on other living things, and even plants survive, grow, and reproduce only by consuming their environment and re-shaping it to their convenience. The organisms which are better at appropriating and changing the world to suit themselves will live and expand at the expense of those which are worse at it. Those organisms whose acts serve their own good will do better for themselves than those which don't — whether or not that might in some extra-mundane sense be right or just. Abstract goods keep nothing alive, help nothing to grow; self-seeking is what will persist, and everything else will perish. And then when we throw these creatures together, they will inevitably compete, they will rival and oppose. Of course they can aid each other, but this aid will take the form of more effective exploitation of resources, including other life.

There is now a whole sub-field of biology devoted precisely to understanding when organisms will cooperate and assist each other, namely evolutionary game theory. It teaches us conditions for the selection of forms of reciprocity and even of solidarity, even among organisms without shared genetic interests. But those are, precisely, conditions under which the reciprocity and solidarity advance self-interest; it's cooperation in the service of selfishness.

Take the paradigm of the prisoners' dilemma, but tell it a bit differently. Alice and Babur are two bandits, who can either cooperate with each other in robbing villages and caravans, or defect by turning on each other. If they both cooperate, each will take $1,000; if they both defect, neither can steal effectively and they'll get $0. If Alice cooperates and Babur defects by turning on her, he will get $2,000 and she will lose $500, and vice versa. This has exactly the structure of the usual presentations of the dilemma, but makes it plain that "cooperation" is cooperation between Alice and Babur, and can perfectly well be cooperation in preying upon others. It's a famous finding of evolutionary game theory that a strategy of conditional cooperation, of Alice cooperating with Babur until he stops cooperating with her and vice versa, is better for those players than the treacherous, uncooperative one of their turning on each other, and that a population of conditional cooperators will resist invasion by non-cooperators6. Such strategies of cooperation in exploiting others are what the field calls "pro-social behavior"[^nbandits].
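
The bandits' payoffs are simple enough to play out directly. The sketch below is one illustrative reading of "conditional cooperation": cooperate until the other player first defects, then defect forever. The strategy functions, the number of rounds, and the absence of discounting are all assumptions made for illustration, not anything taken from the evolutionary game theory literature cited here.

```python
# Payoffs from the text: (my move, other's move) -> my take in dollars.
PAYOFF = {
    ("C", "C"): 1000,
    ("C", "D"): -500,
    ("D", "C"): 2000,
    ("D", "D"): 0,
}


def conditional_cooperator(own_history, other_history):
    """Cooperate at first, and keep cooperating until the other defects."""
    return "D" if "D" in other_history else "C"


def defector(own_history, other_history):
    """Always turn on the other bandit."""
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Total takes of two strategies over repeated rounds of the game."""
    hist_a, hist_b = [], []
    total_a = total_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        total_a += PAYOFF[(move_a, move_b)]
        total_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return total_a, total_b


if __name__ == "__main__":
    rounds = 10
    print("cooperator vs cooperator:", play(conditional_cooperator, conditional_cooperator, rounds))
    print("cooperator vs defector:  ", play(conditional_cooperator, defector, rounds))
    print("defector vs defector:    ", play(defector, defector, rounds))
```

Over ten rounds, two conditional cooperators take $10,000 each, while a defector meeting a conditional cooperator gets the one-off $2,000 and nothing afterwards. So long as the bandits expect to meet at least three times, a lone defector does worse among conditional cooperators than they do with each other.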

Since evolutionary game theorists are for the most part well-adjusted members of bourgeois society, neither psychopaths nor revolutionaries, they do not usually frame their conclusions with the starkness which their own theories would really justify; in this respect, there has been a decline since the glory days when von Neumann could pronounce that "It is just as foolish to complain that people are selfish and treacherous as it is to complain that the magnetic field does not increase unless the electric field has a curl." If we could revive some of that von Neumann spirit, a fair synthesis of works like The Evolution of Cooperation, The Calculus of Selfishness, A Cooperative Species, Individual Strategy and Social Structure, etc., would go something like this: "Cooperation evolves just to the extent that it both advances the self-interests of the cooperators, and each of them has enough power to make the other hurt if betrayed. Everything else is self-defeated, is 'dominated'. Typically, the gains from cooperation arise from more effectively exploiting others. Also, inside every positive-sum story about gains from cooperation, there is a negative-sum struggle over dividing those gains, a struggle where the advantage lies with the already-stronger party." A somewhat more speculative addendum would be the following: "We have evolved to like hurting those who have wronged us, or who have flouted rules we want them to follow, because our ancestors have had to rely for so many millions of years on selfish, treacherous fellow creatures, and 'pro-social punishment' is how we've kept each other in line enough to take over the world."

There is little need to elaborate on how neatly this dovetails with the true knowledge, so I won't7. This alignment is, I suspect, no coincidence.

Given these points, how do we think about choices between who to cooperate with, or even whether to cooperate at all? Look for those whose interests are aligned with yours, and where cooperation will do the most to advance your interests — to those with the most power, most closely aligned with you. To neglect to ally oneself when it would be helpful is not wicked — what has wickedness to do with any of this? — but it is stupid, because it leads to needless weakness.

At this point, or somewhere near it, the Sheenisov must have made a leap which seems plausible but not absolutely compelling. The united working class is more powerful than the other forces in capitalism, the last of the "tool-making cultures of the Upper Pleistocene". To throw in with that is to get with the strength. Why solidarity? Because it's the source of power. At the same time, it's a source of strength which can hardly tolerate other, rival powers — organized non-cooperators, capitalist and statist remnants, since they threaten it, and it them.

These arguments would apply to any sort of organism — including Jovian post-humans as well as us, and so Ellen May seems to me to have very much the worse of her argument with Mary-Lou Radiation Nation Smith:

"They're not monsters, you know. Why should you expect beings more powerful and intelligent than ourselves to be worse than ourselves? Wouldn't it be more reasonable to expect them to be better? Why should more power mean less good?" I could hardly believe I was hearing this. ... I searched for my most basic understanding, and dragged it out: "Because good means good for us!" Mary-Lou smiled encouragingly and spoke gently, as though talking someone down from a high ledge. "Yes, Ellen. But who is us? We're all — human, post-human, non-human — machines with minds in a mindless universe, and it behoves those of us with minds to work together if we can in the face of that mindless universe. It's the possibility of working together that forges an us, and only its impossibility that forces a them. That is the true knowledge as a whole — the union, and the division."8

(The worse of the argument, that is, unless Ellen May can destroy the fast folk, in which case there is no power to either unite with or to fear. "No Jovian superintelligences, no problem", as it were.)

But What If It Should Come to Be Generally Known?

As I said earlier, contemporary scientists studying the evolution of cooperation do not usually put their conclusions in such frank terms as the true knowledge. I don't even think that this is because they're reluctant to do so; I think it genuinely doesn't occur to them. (And this despite things like one of the founders of evolutionary game theory, John Maynard Smith, being an outright Marxist and ex-Communist.) Even when people like Bowles and Gintis — not Marxists, but no strangers to the leftist tradition — try to draw lessons from their work, they end up with very moderate social democracy, not the true knowledge. Since I know Bowles and Gintis, I am pretty sure that they are not holding back...

Why so few people are willing to push these ideas to (one) logical conclusion is an interesting question I cannot pretend to answer. I suspect that part of the answer has to do with people not having grown up with these ideas, so that the theories are used more to reconstruct pre-existing notions than as guides in their own right. If that's so, then a few more (academic) generations of their articulation, especially if some of the articulators should happen to have the right bullet-swallowing tendencies, could get us all the way to the true knowledge being worked out, not by bonded laborers but by biologists and economists.

This presents points where, I think, the true knowledge might not lead to the attractive-to-me Solar Union, but rather somewhere much darker. If I am a member of one of the subordinate classes, well, the strongest power locally is probably the one dominating me. Maybe solidarity with others would let me overthrow them and escape, but if that united front doesn't form, or fails, things get much, much worse for me. The true knowledge could actually justify obedience to the powers that be, if they're powerful enough, and not enough of us are united in opposition to them.

The other point of failure is this. If I am a member of an oppressing or privileged class, what lesson do I take from the true knowledge? Well, I might try to throw in my lot with the power that will win, but that means abandoning my current goods, the things which presently make me strong and enhance my life. My interest is served by allying with those who are also beneficiaries of inequality, and making sure the institutions which benefit me remain in place, or, if they change, alter to be even more in my favor. Members of a privileged class in the grip of moralizing superstition might sometimes be moved by pity, sympathy, or benevolence. Rulers who have themselves accepted the true knowledge will concede nothing except out of calculation that it's better for themselves than the alternative. Voltaire once said something to the effect that whether or not God existed, he hoped his valet believed in Him; it might have been much more correct for Voltaire's valet to hope that his master, and still more rulers like Frederick the Great, feared an avenging God.

My somewhat depressing prospect is that our ruling classes are a lot more likely to talk themselves into the true knowledge by the evolutionary route than the rest of us are to discover revolutionary solidarity — though whether the occasional fits of benevolence on the part of rulers really make things much better than a frank embrace of their self-interest would is certainly a debatable proposition.

Clicking and Giving Offense

If anyone does want to start propagating the true knowledge, I think it would actually have pretty good prospects. A number of sociologists (Gellner, Boudon) have pointed out that really successful ideologies tend to combine two features. One is that they have a core good idea, one which makes lightbulbs go on for people. Since I can't put this better than Gellner did, I'll quote him:

The general precondition of a compelling, aura-endowed belief systems is that, at some one point at least, it should carry overwhelming, dramatic conviction. In other words, it is not enough that there should be a plague in the land, that many should be in acute distress and in fear and trembling, and that some practitioners be available who offer cure and solace, linked plausibly to the background beliefs of the society in question. All that may be necessary but it is not sufficient. Over and above the need, and over and above mere background plausibility (minimal conceptual eligibility), there must also be something that clicks, something which throws light on a pervasive and insistent and disturbing experience, something which at long last gives it a local habitation and a name, which turns a sense of malaise into an insight: something which recognizes and places an experience or awareness, and which other belief systems seem to have passed by.9

I think MacLeod gets this: look at how Ellen May says the true knowledge "struck home with the force of a revelation" (ch. 5, p. 89). But the click for the true knowledge is how it evades the common pitfall of attempts to work out materialist or naturalist ethics. After grounding everything in self-interest and self-assertion, there is a very strong tendency to get into mere self-assertion; "good" means "good for me, and for me alone". The true knowledge avoids this; it gives you a way of accepting that you are a transient, selfish mind in a mindless, indifferent universe, and sloughing off thousands of years of accumulated superstitious rubbish (from outright taboos and threats of the Supreme Fascist to incomprehensible commands from nowhere): you can face the light, and escape the bullshit, and yet not be altogether a monster.

(Boudon would add something to Gellner's requirement that an ideology click: the idea should also be capable of "hyperbolic" use, of being over-applied through neglecting necessary qualifications and conditions. Arguably, the whole plot of The Cassini Division is driven by Ellen May's hyperbolization of part of the true knowledge.)

Clicking is one condition for an ideology to take off; but there's another.

Though belief systems need to be anchored in the background assumptions, in the pervasive obviousness of an intellectual climate, yet they cannot consist entirely of obvious, uncontentious elements. There are many ideas which are plainly true, or which appear to be such to those who have soaked up a given intellectual atmosphere: but their very cogency, obviousness, acceptability, makes them ineligible for serving as the distinguishing mark of membership of a charismatic community of believers. Demonstrable or obvious truths do not distinguish the believer from the infidel, and they do not excite the faithful. Only difficult beliefs can do that. And what makes a belief difficult? There must be an element both of menace and of risk. The belief must present itself in such a way that the person encountering, weighing the claim that is being made on him, can neither ignore it nor hedge his bets. His situation is such that, encountering the claim, he cannot but make a decision, and it will be a weighty one, whichever way he decides. He is obliged, by the very nature of the claim, to commit himself, one way or the other.10

The true knowledge would have this quality, which Gellner (following Kierkegaard) calls "offense", in spades.11

I'll close with two observations about this combination of click and offense. One is that it is of course very common for a certain sort of fiction, and science fiction often indulges in it. Heinlein, in particular, was very good at it, and in some ways The Cassini Division is, the color of Ellen May's hair notwithstanding, a very Heinleinian book, and Ellen May explaining the true knowledge to us is not that different from being on the receiving end of one of Heinlein's in-story lectures. (I know someone else made these points before me, but I can't remember who.) One of the things which makes me like MacLeod's books better than Heinlein's, beyond the content of the lectures appealing more to my prejudices, is that even in the story world, the ideas get opposed, and there is real argument.

The other observation is that MacLeod of course comes out of the Trotskyist tradition, part of the broader family of Communisms. During its glory days, when it was the "tragic hero of the 20th century", Communism quite certainly combined the ability to make things click with the ability to give offense. This must have been one of MacLeod's models for the true knowledge. MacLeod is not any longer any sort of Communist ("the actual effect" of Communism "was to complete the bourgeois revolution ... and to clear the ground for capitalism") or even Marxist, but there is a recurring theme in his work of some form of the "philosophy of praxis" re-appearing. One of the core Marxist ideas, going all the way back to the beginning, is that socialism isn't just an arbitrary body of ideas, but an adaptive response to the objective situation of the proletariat. Even if the very memory of the socialist movement were to vanish, it is (so the claim goes) something which life under capitalism will spontaneously regenerate. One symbol of this in MacLeod's fiction is the scene at the very end of Engine City, where a hybrid creature formed from the remains of three executed revolutionaries crawls from a mass grave. The formation of the true knowledge is another.

I don't, of course, actually believe in the true knowledge, but I find it hard to say why I shouldn't; this makes it, for me, one of MacLeod's more compelling creations. I have kept coming back to it for more than fifteen years now, and I doubt I'm done with it.


  1. The Cassini Division, ch. 5, pp. 89--90 of the 1999 Tor edition; ellipses and italics in the original.^

  2. Notice how Trotsky says the "interests of the proletariat" lie in "increasing the power of man over nature", not increasing the power of the proletariat over nature, and in "the abolition of the power of man over man", not abolishing the power of others over the proletariat (either as a whole or over its individual members). Thus he can reconcile saying that all moral ideas express a class standpoint with saying that his goals are for the benefit of all humanity. There is an implicit appeal here to an idea which goes back to Marx and Engels, that, because of the proletariat's particular class position, the only way it can pursue its interest is through universal liberation of humanity. What can one say but "how convenient"?^

  3. "every natural thing has by nature as much right, as it has power to exist and operate" (II.3); "And so the natural right of universal nature, and consequently of every individual thing, extends as far as its power: and accordingly, whatever any man does after the laws of his nature, he does by the highest natural right, and he has as much right over nature as he has power" (II.4); "whatever anyone, be he learned or ignorant, attempts and does, he attempts and does by supreme natural right. From which it follows that the law and ordinance of nature, under which all men are born, and for the most part live, forbids nothing but what no one wishes or is able to do, and is not opposed to strifes, hatred, anger, treachery, or, in general, anything that appetite suggests" (II.8); "Besides, it follows that everyone is so far rightfully dependent on another, as he is under that other's authority, and so far independent, as he is able to repel all violence, and avenge to his heart's content all damage done to him, and in general to live after his own mind. He has another under his authority, who holds him bound, or has taken from him arms and means of defence or escape, or inspired him with fear, or so attached him to himself by past favour, that the man obliged would rather please his benefactor than himself, and live after his mind than after his own" (II.9--10).^

  4. "If two come together and unite their strength, they have jointly more power, and consequently more right over nature than both of them separately, and the more there are that have so joined in alliance, the more right they all collectively will possess." (II.13).^

  5. It would be horrifying if everyone were followed around by a drooling slimy befanged monster, careful to hide itself out of our sight, which might devour any one of us without warning at any moment. A philosophy which offered to re-assure us that lurking monsters do not follow us around would arouse little interest.^

  6. The basic tit-for-tat strategy is not evolutionarily stable against invasion by more forgiving conditional cooperators, which leads to a lot of technically interesting wrinkles, which you can read about in, say, Karl Sigmund's great Games of Life. But various attempts to dethrone "strong reciprocity" (e.g., "Southampton" strategies, "zero-determinant" strategies) have all, so far as I know, proved unsuccessful.^

  7. If I were going to elaborate, I'd have a lot to say about this bit from The Cassini Division (ch. 7, p. 144): "Without power, respect is dead. But our power needn't be the capacity to destroy them — our own infants, and many lower animals, have power over us because our interests are bound up with theirs. Because we value them, and because natural selection has built that valuing into our nervous systems, to the point where we cannot even wish to change it, though no doubt if we wanted to we could. This is elementary: the second iteration of the true knowledge."^

  8. The Cassini Division, ch. 10, p. 216, my ellipses.^

  9. The Psychoanalytic Movement: The Cunning of Unreason, first edition (Evanston, Illinois: Northwestern University Press, 1996), p. 39.^

  10. The Psychoanalytic Movement, pp. 40--41.^

  11. The Cassini Division, ch. 5, pp. 93--94: "I think about being evil. To them, I realize, we are indeed bad and harmful, but — and the thought catches my breath — we are not bad and harmful to ourselves, and that is all that matters, to us. So as long as we are actually achieving our own good, it doesn't matter how evil we are to our enemies. Our Federation will be, to them, the evil empire, the domain of dark lords; and I will be a dark lady in it. Humanity is indeed evil, from any non-human point of view. I hug my human wickedness in a shiver of delight."^

Manual Trackback: Adam Kotsko; MetaFilter

The Progressive Forces; Scientifiction and Fantastica

Posted at May 12, 2015 13:53 | permanent link

May 05, 2015

Random Linkage, May 2015

Attention conservation notice: If you'd care about these links, you've probably seen them already.

I have been very much distracted from blogging by teaching undergraduates (last semester; this semester), by supervising graduate students, and by Life. Thus even this link round-up is something I literally began years ago, and am only now posting for lack of time to do real blogging.

Linkage

Posted at May 05, 2015 22:28 | permanent link

Three-Toed Sloth