December 31, 2009

Output Summary

After long, long journeys, in one case going back to 2003, some papers have come out. Alphabetically by distinguished co-authors:

Aaron Clauset, CRS, and M. E. J. Newman, "Power-law distributions in empirical data", arxiv:0706.1062 = SIAM Review 51 (2009): 661--703
I wrote about this when we first submitted it. In the intervening two and a half years, many people have continued to make the baby Gauss cry by publishing, and publicizing, supposed power laws based on completely inadequate and unreliable methods. Because their methods are unsound, one has no idea whether they're right or not, short of re-analyzing the data properly. I sometimes imagine these authors singing
I could be right
I could be wrong
I feel nice when I sing this song
but many of them at least pretend to care about the truth of their claims, so I piously hope that in the fullness of time the community of inquirers will come around to using reliable methods. In which regard I am gratified, but also astonished, to see that this is already the most-cited paper I've contributed to, by such a large margin that it's unlikely anything else I do will ever rival it.
See also: Aaron.
Rob Haslinger, Kristina Klinkner and CRS, "The Computational Structure of Spike Trains", Neural Computation 22 (2010): 121--157
I haven't written about this one before, though I feel free to do so now that we're published. (There'll be an arxiv version right after the new year, tentatively arxiv:1001.0036; I'll update when this is definite.) This was a fun venture into applying state-reconstruction ideas, specifically CSSR, to neural spike trains, specifically the barrel cortex of the rat, which represents sensory input from the whiskers. (The experimentalists build special whisker-vibrating machines, which are actually quite impressive.) We do, I think, a pretty good job of predicting the spike trains in an entirely non-parametric way, and showing how their complexity is modulated by sensory stimuli — how much tweaking the whisker drives the cortical neuron.
CRS, "Dynamics of Bayesian Updating with Dependent Data and Misspecified Models", arxiv:0901.1342 = Electronic Journal of Statistics 3 (2009): 1039--1074
I also wrote about this when I first submitted it. I'm particularly grateful to one of the reviewers, who read the paper very carefully, totally got it, and provided many helpful suggestions, one of which grew into a new theorem on rates of convergence. Thank you, benevolent and thoughtful anonymous referee person! Also, the publication process at EJS was extremely fast and utterly painless.

Other output: my first hemi-demi-semi-co-supervised student graduating with his doctorate (a fine piece of work I wish I could link to); a paper draft finished and sitting on a collaborator's desk (no pressure!); the homophily paper is almost finished (I need to speed up some simulations and cut out most of the jokes); half-a-dozen referee reports of my own (a deliberate new low; made easier by boycotting Elsevier); five papers edited for Annals of Applied Statistics (a new high); nine lectures newly written or massively revised for 36-350; all the problem sets for 350 re-worked and much better; three books reviewed for American Scientist (and a whole bunch of mini-reviews for nowhere in particular).

On the other hand, no chapters finished for Statistical Analysis of Complex Systems; three very patient collaborators in different parts of Manhattan waiting for me to turn things around; and one project which has been accreting since 2007 really needs to be cut and polished into some papers. Resolution for next year: more papers.

Self-Centered; Enigmas of Chance; Complexity; Power Laws; Minds, Brains, and Neurons

Posted by crshalizi at December 31, 2009 18:45 | permanent link

December 28, 2009

Significance, Power, and the Will to Believe

Attention conservation notice: 2100 words on parallels between statistical hypothesis testing and Jamesian pragmatism; an idea I've been toying with for a decade without producing anything decisive or practical. Contains algebraic symbols and long quotations from ancient academic papers. Also some history-of-ideas speculation by someone who is not a historian.

When last we saw the Neyman-Pearson lemma, we were looking at how to tell whether a data set x was signal or noise, assuming that we know the statistical distribution of noise (call it p) and the distribution of signals (q). There are two kinds of mistake we can make here: a false alarm, saying "signal" when x is really noise, and a miss, saying "noise" when x is really signal. What Neyman and Pearson showed is that if we fix on a false alarm rate we can live with (a probability of mistaking noise for signal; the "significance level"), there is a unique optimal test which minimizes the probability of misses --- which maximizes the power to detect signal when it is present. This is the likelihood ratio test, where we say "signal" if and only if q(x)/p(x) exceeds a certain threshold picked to control the false alarm rate.
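To make the recipe concrete, here is a minimal sketch in R, with made-up Gaussian choices for the noise and signal distributions (standard-normal noise, mean-one signal); the distributions and the 5% false alarm rate are illustrative assumptions of mine, not anything from Neyman and Pearson:

    ## Noise p = N(0,1), signal q = N(1,1): purely illustrative choices.
    lr <- function(x) dnorm(x, mean = 1) / dnorm(x, mean = 0)  # likelihood ratio q(x)/p(x)
    alpha <- 0.05                                 # tolerable false-alarm (significance) rate
    noise <- rnorm(1e6, mean = 0)                 # Monte Carlo draws from the noise distribution
    threshold <- quantile(lr(noise), 1 - alpha)   # cut-off giving Pr(say "signal" | noise) = alpha
    signal <- rnorm(1e6, mean = 1)                # draws from the signal distribution
    power <- mean(lr(signal) > threshold)         # chance of detecting signal when it is present
    c(false.alarm = mean(lr(noise) > threshold), power = power)

Any other test with the same false alarm rate has no more power than this one; that is the content of the lemma.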

The Neyman-Pearson lemma comes from their 1933 paper; but the distinction between the two kinds of errors is clearly more fundamental. Where does it come from?

The first place Neyman and/or Pearson use it, that I can see, is their 1928 paper (in two parts), where it's introduced early and without any fanfare. I'll quote it, but with some violence to their notation, and omitting footnoted asides (from p. 177 of part I; "Hypothesis A" is what I'm calling "noise"):

Setting aside the possibility that the sampling has not been random or that the population has changed during its course, x must either have been drawn randomly from p or from q, where the latter is some other population which may have any one of an infinite variety of forms differing only slightly or very greatly from p. The nature of the problem is such that it is impossible to find criteria which will distinguish exactly between these alternatives, and whatever method we adopt two sources of error must arise:
  1. Sometimes, when Hypothesis A is rejected, x will in fact have been drawn from p.
  2. More often, in accepting Hypothesis A, x will really have been drawn from q.

In the long run of statistical experience the frequency of the first source of error (or in a single instance its probability) can be controlled by choosing as a discriminating contour, one outside which the frequency of occurrence of samples from p is very small — say, 5 in 100 or 5 in 1000. In the density space such a contour will include almost the whole weight of the field. Clearly there will be an infinite variety of systems from which it is possible to choose a contour satisfying such a condition....

The second source of error is more difficult to control, but if wrong judgments cannot be avoided, their seriousness will at any rate be diminished if on the whole Hypothesis A is wrongly accepted only in cases where the true sampled population, q, differs but slightly from p.

The 1928 paper goes on to say that, intuitively, it stands to reason that the likelihood ratio is the right way to accomplish this. The point of the 1933 paper is to more rigorously justify the use of the likelihood ratio (hence the famous "lemma", which is really not set off as a separate lemma...). Before unleashing the calculus of variations, however, they warm up with some more justification (pp. 295--296 of their 1933):
Let us now for a moment consider the form in which judgments are made in practical experience. We may accept or we may reject a hypothesis with varying degrees of confidence; or we may decide to remain in doubt. But whatever conclusion is reached the following position must be recognized. If we reject H0, we may reject it when it is true; if we accept H0, we may be accepting it when it is false, that is to say, when really some alternative Ht is true. These two sources of error can rarely be eliminated completely; in some cases it will be more important to avoid the first, in others the second. We are reminded of the old problem considered by LAPLACE of the number of votes in a court of judges that should be needed to convict a prisoner. Is it more serious to convict an innocent man or to acquit a guilty? That will depend upon the consequences of the error; is the punishment death or fine; what is the danger to the community of released criminals; what are the current ethical views on punishment? From the point of view of mathematical theory all that we can do is to show how the risk of the errors may be controlled and minimised. The use of these statistical tools in any given case, in determining just how the balance should be struck, must be left to the investigator.
(Neither Laplace nor LAPLACE is mentioned in their 1928 paper.)

Let's step back a little bit to consider the broader picture here. We have a question about what the world is like --- which of several conceivable hypotheses is true. Some hypotheses are ruled out on a priori grounds, others because they are incompatible with evidence, but that still leaves more than one admissible hypothesis, and the evidence we have does not conclusively favor any of them. Nonetheless, we must choose one hypothesis for purposes of action; at the very least we will act as though one of them is true. But we may err just as much through rejecting a truth as through accepting a falsehood. The two errors are symmetric, but they are not the same error. In this situation, we are advised to pick a hypothesis based, in part, on which error has graver consequences.

This is precisely the set-up of William James's "The Will to Believe". (It's easily accessible online, as are summaries and interpretations; for instance, an application to current controversies by Jessa Crispin.) In particular, James lays great stress on the fact that what statisticians now call Type I and Type II errors are both errors:

There are two ways of looking at our duty in the matter of opinion, — ways entirely different, and yet ways about whose difference the theory of knowledge seems hitherto to have shown very little concern. We must know the truth; and we must avoid error, — these are our first and great commandments as would-be knowers; but they are not two ways of stating an identical commandment, they are two separable laws. Although it may indeed happen that when we believe the truth A, we escape as an incidental consequence from believing the falsehood B, it hardly ever happens that by merely disbelieving B we necessarily believe A. We may in escaping B fall into believing other falsehoods, C or D, just as bad as B; or we may escape B by not believing anything at all, not even A.

Believe truth! Shun error! — these, we see, are two materially different laws; and by choosing between them we may end by coloring differently our whole intellectual life. We may regard the chase for truth as paramount, and the avoidance of error as secondary; or we may, on the other hand, treat the avoidance of error as more imperative, and let truth take its chance. Clifford ... exhorts us to the latter course. Believe nothing, he tells us, keep your mind in suspense forever, rather than by closing it on insufficient evidence incur the awful risk of believing lies. You, on the other hand, may think that the risk of being in error is a very small matter when compared with the blessings of real knowledge, and be ready to be duped many times in your investigation rather than postpone indefinitely the chance of guessing true. I myself find it impossible to go with Clifford. We must remember that these feelings of our duty about either truth or error are in any case only expressions of our passional life. Biologically considered, our minds are as ready to grind out falsehood as veracity, and he who says, "Better go without belief forever than believe a lie!" merely shows his own preponderant private horror of becoming a dupe. He may be critical of many of his desires and fears, but this fear he slavishly obeys. He cannot imagine any one questioning its binding force. For my own part, I have also a horror of being duped; but I can believe that worse things than being duped may happen to a man in this world: so Clifford's exhortation has to my ears a thoroughly fantastic sound. It is like a general informing his soldiers that it is better to keep out of battle forever than to risk a single wound. Not so are victories either over enemies or over nature gained. Our errors are surely not such awfully solemn things. In a world where we are so certain to incur them in spite of all our caution, a certain lightness of heart seems healthier than this excessive nervousness on their behalf. At any rate, it seems the fittest thing for the empiricist philosopher.

From here the path to James's will to believe is pretty clear, at least in the form he advocated it, which is that of picking among hypotheses which are all "live"*, and where some choice must be made among them. What I am interested in, however, is not the use James made of this distinction, but simply the fact that he made it.

So far as I have been able to learn, no one drew this distinction between seeking truth and avoiding error before James, or if they did, they didn't make anything of it. (Even for Pascal in his wager, the idea that believing in Catholicism if it is false might be bad doesn't register.) Yet this is just what Neyman and Pearson were getting at, thirty-odd years later. There is no mention of James in these papers, or indeed of any other source. They present the distinction as though it were obvious, though eight decades of subsequent teaching experience shows it's anything but. Neyman and Pearson were very interested in the foundations of statistics, but seem to have paid no attention to earlier philosophers, except for the arguable case of Pearson's father Karl and his Grammar of Science (which does not seem to mention James). Yet there it is. It really looks like two independent inventions of the whole scheme for judging hypotheses.

My prejudices being what they are, I am much less inclined to think that James illuminates Neyman and Pearson than the other way around. James was, so to speak, arguing that we should trade significance — the risk of mistaking noise for signal — for power, finding some meaningful signal in the "blind molecular chaos" of the physical universe. Granting that there is a trade-off here, however, one has to wonder about how stark it really is (cf.), and whether his will-to-believe is really the best way to handle it. Neyman and Pearson suggest we should look for a procedure for resolving metaphysical questions which maximizes the ability to detect larger meanings for a given risk of seeing faces in clouds — and would let James and Clifford set their tolerance for that risk to their own satisfaction. Of course, any such procedure would have to squarely confront the fact that there may be no way of maximizing power against multiple alternatives simultaneously...

The extension to confidence sets, consisting of all hypotheses not rejected by suitably powerful tests (per Neyman 1937) is left as an exercise to the reader.

*: As an example of a "dead" hypothesis, James gives believing in "the Mahdi", presumably Muhammad Ahmad ibn as-Sayyid Abd Allah. I'm not a Muslim, and those of my ancestors who were certainly weren't Mahdists, but this was still a "What do you mean 'we', white man?" moment in my first reading of the essay. To be fair, James gives me many fewer such moments than most of his contemporaries.

Manual trackback: Brad DeLong.

Enigmas of Chance; Philosophy; Modest Proposals

Posted by crshalizi at December 28, 2009 00:08 | permanent link

December 08, 2009

Uniform Probability on Infinite Spaces Considered Harmful

Attention conservation notice: 1000 words on a short probability problem. Several hobby-horses get to take turns around the yard.

Wolfgang at Mostly Harmless poses a problem (I've lightly tweaked the notation):

Consider a random process X(t) which generates a series of 0s and 1s, but many more 0s because the probability for X(t) = 1 decreases with t as 2^(-t).

Now assume that we encounter this process not knowing 'how far we are already', in other words we don't know the value of t. The question is: "What is the probability to get a 1?"

Unfortunately there are two ways to answer this question. The first calculates the 'expectation value', as a physicist would call it, or 'the mean' as a statistician would put it, which is zero.

In other words, we sum over all possible t with equal weight and have to consider s = sum(2^(-t)) with t = 1, 2, ... N; it is not difficult to see that s = 1/2 + 1/4 + ... equals 1.

The answer is therefore Pr(X=1) = s/N = 1/N and because N is infinite (the process never ends) we get Pr(X=1) = 0.

The second answer simply looks at the definition of the process and points out that Pr(X=1) = 2^(-T), where T is the current value of t. Although we don't know T it must have some finite value and it is obvious that Pr(X=1) > 0.

So which one is it, Pr(X=1) = 0 or Pr(X=1) > 0?

Fatwa: The second answer is correct, and the first is wrong.

Discussion: This is a cute example of the weirdness which results when we attempt to put uniform distributions on infinite spaces, even in the simplest possible case of the positive integers. The first way of proceeding assumes that the notion of a uniform probability distribution on the natural numbers makes sense, and that it obeys the same rules as an ordinary probability distribution. Unfortunately, these two requirements are incompatible. This is because ordinary probability distributions are countably additive. We are all familiar with the fact that probability adds across disjoint events: Pr(X = 0 or 1) = Pr(X=0) + Pr(X=1). Moreover, we are all comfortable with the idea that this holds for more than two events. The probability that X first equals 1 at or before time 3 is the sum of the probabilities of the first 1 coming at time t=1, at t=2, and at t=3. Carrying this out to any finite collection of disjoint events is called finite additivity. However, as I said, probability measures are ordinarily required to be countably additive, meaning that this holds even for a countable infinity of disjoint events.
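As a worked micro-example of the finite-additivity step, sticking with Wolfgang's process and assuming (what the problem statement doesn't quite spell out) that the X(t) are independent with Pr(X(t)=1) = 2^(-t):

    ## Probability that the first 1 arrives exactly at time t, under independence:
    p.first.one.at <- function(t) 2^(-t) * prod(1 - 2^(-seq_len(t - 1)))
    sum(sapply(1:3, p.first.one.at))   # add up the three disjoint events
    1 - prod(1 - 2^(-(1:3)))           # Pr(at least one 1 by t = 3), computed directly

Both numbers come out to 0.671875: the probability of a union of disjoint events is the sum of their probabilities.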

And here we have trouble. The natural numbers are (by definition!) countable, so the probability of all integers is the sum of the probability of each integer,

Pr(T an integer) = sum(Pr(T=t))

The left-hand side must be 1. For a uniform distribution, we expect that all the terms in the sum on the right-hand side must be equal, otherwise it's not "uniform". But either all the terms are equal and positive, in which case the right-hand side is infinite, or all the terms are equal and zero, in which case the right-hand side is zero. Hence, there is no countably-additive uniform probability measure on the integers, and the first approach, which leads to the conclusion that Pr(X(T)=1)=0, is mathematically incoherent.

Now, there are such things as finitely-additive probability measures, but they are rather weird beasts. To specify one of them on the integers, for example, it's not enough to give the probability of each integer (as it is for a countably-additive measure); that only pins down the probability of finite sets, and sets whose complements are finite. It does not, for example, specify the probability of the even numbers. There turn out to be several different ways of defining uniform distributions on the natural numbers, which are not equivalent. Under all of them, however, any finite set must have probability zero, and so at a random time T, it is almost certain that Pr(X(T)=1) is less than any positive number you care to name. Hence, the expectation value of this random probability is indeed zero.

(Notice, however, that if I try to calculate the expectation value of any function f(t) by taking a probability-weighted sum over values of t, as the first answer does, I will get the answer 0 when T follows a uniform finitely-additive measure, even if f(t)=1 for all t. The weighted-sum-of-arguments definition of expectation — the one reminiscent of Riemann integrals — does not work for these measures. Instead one must use a Lebesgue-style definition, where one takes a weighted sum of the values of the function, the weights being the measures of the sets giving those values. [More exactly, one partitions the range of f and takes the limit as the partition becomes finer and finer.] The equivalence of the summing over the domain and summing over the range turns on, precisely, countable additivity. The argument in the previous paragraph shows that here this expectation value must be less than any positive number, yet not negative, hence zero.)

Finitely-additive probability measures are profoundly weird beasts, though some of my colleagues have what I can only consider a perverse affection for them. On the other hand, attempts to construct a natural countably-additive analog of a uniform distribution on infinite sets have been universally unsuccessful; this very much includes the maximum entropy approach. The project of giving ignorance a unique representation as a probability measure is, IMSAO, a failure. If one picks some countably-additive prior distribution over the integers, however, then at least one value of t must have strictly positive probability, and the expectation value of Pr(X(T)=1) is positive, though how big it is will depend on the prior distribution. (As usual, the role of a Bayesian prior distribution is to introduce bias so as to reduce variance.) Alternately, one simply follows the second line of reasoning and concludes that, no matter what t might be, the probability is positive.
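To put a number on the countably-additive-prior option: here is a minimal sketch with one arbitrary choice of prior, a geometric distribution with parameter 1/2 on t = 1, 2, 3, ..., which is my pick purely for illustration and not anything suggested by Wolfgang:

    prior <- function(t) (1/2)^t          # Pr(T = t) for t = 1, 2, ...; sums to 1
    t.max <- 60                           # truncate the infinite sum; the tail is negligible
    sum(prior(1:t.max) * 2^(-(1:t.max)))  # E[Pr(X(T)=1)] = sum_t (1/2)^t * 2^(-t) = 1/3

Positive, as promised; the particular value 1/3 is an artifact of the illustrative prior, not of the problem.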

Enigmas of Chance

Posted by crshalizi at December 08, 2009 10:55 | permanent link

December 05, 2009

36-350, Data Mining: Course Materials (Fall 2009)

My lesson-plan having survived first contact with the enemy students, it's time to start posting the lecture handouts & c. This page will be updated as the semester goes on; the RSS feed for it should be here. The class homepage has more information.

  1. Introduction to the course (24 August) What is data mining? how is it used? where did it come from? Some themes.
  2. Information retrieval and similarity searching I (26 August) Finding the data you are looking for. Ideas we will avoid: meta-data and cataloging; meanings. Textual features. The bag-of-words representation; its vector form. Measuring similarity and distance for vectors. Example with the New York Times Annotated Corpus.
  3. IR continued (28 August). The trick to searching: queries are documents. Search evaluation: precision, recall, precision-recall curves; error rates. Classification: nearest neighbors and prototypes; classifier evaluation by mis-classification rate and by confusion matrices. Inverse document frequency weighting. Visualizing high-dimensional data by multi-dimensional scaling. Miscellaneous topics: stemming, incorporating user feedback.

    Homework 1, due 4 September: assignment, R, data; solutions

  4. Page Rank (31 August). Links as pre-existing feedback. How to exploit link information? The random walk on the graph; using the ergodic theorem. Eigenvector formulation of page-rank. Combining page-rank with textual features. Other applications. Further reading on information retrieval.
  5. Image Search, Abstraction and Invariance (2 September). Similarity search for images. Back to representation design. The advantages of abstraction: simplification, recycling. The bag-of-colors representation. Examples. Invariants. Searching for images by searching text. An example in practice. Slides for this lecture.
  6. Information Theory I (4 September). Good features help us guess what we can't represent. Good features discriminate between different values of unobserved variables. Quantifying uncertainty with entropy. Quantifying reduction in uncertainty/ discrimination with mutual information. Ranking features based on mutual information. Examples, with code, of informative words for the Times. Code.
    Supplementary reading: David P. Feldman, Brief Tutorial on Information Theory, chapter 1

    Homework 2, due 11 September: assignment; solutions text and R code

  7. Information Theory II (9 September). Dealing with multiple features. Joint entropy, the chain rule for entropy. Information in multiple features. Conditional information, chain rule for information, conditional independence. Interactions, positive and negative, and redundancy. Greedy feature selection with low redundancy. Example, with code, of selecting words for the Times. Sufficient statistics and the information bottleneck. Code.
    Supplementary reading: Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", arxiv:cs.AI/0308002
  8. Categorization; Clustering I (11 September). Dividing the world up into categories. Classification: known categories with labeled examples. Taxonomy of learning problems (supervised, unsupervised, semi-supervised, feedback, ...). Clustering: discovering unknown categories from unlabeled data. Benefits of clustering, with a digression on where official classes come from. Basic criterion for good clusters: lots of information about features from little information about cluster. Practical considerations: compactness, separation, parsimony, balance. Doubts about parsimony and balance. The k-means clustering algorithm, or unlabeled prototype classification: analysis, geometry, search. Appendix: geometric aspects of the prototype and nearest-neighbor method.

    Homework 3, due 18 September: assignment; solutions

  9. Clustering II (14 September). Distances between partitions; variation-of-information distance. Hierarchical clustering by agglomeration and its varieties. Picking the number of clusters by merging costs. Performance of different clustering methods on various doodles. Why we would like to pick the number of clusters by predictive performance, and why it is hard to do at this stage. Reifying clusters.
  10. Transformations: Rescaling and Low-Dimensional Summaries (16 September). Improving on our original features. Re-scaling, standardization, taking logs, etc., of individual features. Forcing things to be Gaussian considered harmful. Low-dimensional summaries by combining features. Exploiting geometry to eliminate redundancy. Projections on to linear subspaces. Searching for structure-preserving projections.
  11. Principal Components I (18 September). Principal components are the directions of maximum variance. Derivation of principal components as the best approximation to the data in a linear subspace. Equivalence to variance maximization. Avoiding explicit optimization by finding eigenvalues and eigenvectors of the covariance matrix. Example of principal components with cars; how to tell a sports car from a minivan. The standard recipe for doing PCA. Cautions in interpreting PCA. Data-set used in the notes. (A miniature R sketch of the standard recipe appears after this list.)

    Homework 4, due 25 September: assignment; solutions

  12. Principal Components II (21 September). PCA + information retrieval = latent semantic indexing; why LSI is a Good Idea. PCA and multidimensional scaling.
  13. Factor Analysis (23 and 25 September). From PCA to factor analysis by adding noise. Roots of factor analysis in causal discovery: Spearman's general factor model and the tetrad equations. Problems with estimating factor models: number of equations does not equal number of unknowns. Solution 1, "principal factors", a.k.a. estimation through heroic feats of linear algebra. Solution 2, maximum likelihood, a.k.a. estimation through imposing distributional assumptions. The rotation problem: the factor model is unidentifiable; the number of factors may be meaningful, but the individual factors are not.
  14. The Truth about PCA and Factor Analysis (28 September) PCA is data reduction without any probabilistic assumptions about where the data came from. Picking number of components. Faking predictions from PCA. Factor analysis makes stronger, probabilistic assumptions, and delivers stronger, predictive conclusions --- which could be wrong. Using probabilistic assumptions and/or predictions to pick how many factors. Factor analysis as a first, toy instance of a graphical causal model. The rotation problem once more with feeling. Factor models and mixture models. Factor models and Thomson's sampling model: an outstanding fit to a model with a few factors is actually evidence of a huge number of badly measured latent variables. Final advice: it all depends, but if you can only do one, try PCA. R code for the Thomson sampling model.
  15. Nonlinear Dimensionality Reduction I: Locally Linear Embedding (5 October). Failure of PCA and all other linear methods for nonlinear structures in data; spirals, for example. Approximate success of linear methods on small parts of nonlinear structures. Manifolds: smoothly curved surfaces embedded in higher-dimensional Euclidean spaces. Every manifold looks like a linear subspace on a sufficiently small scale, so we should be able to patch together many small local linear approximations into a global manifold. Local linear embedding: approximate each vector in the data as a weighted linear combination of its k nearest neighbors, then find the low-dimensional vectors best reconstructed by these weights. Solving the optimization problems by linear algebra. Coding up LLE. A spiral rainbow. R.
  16. Nonlinear Dimensionality Reduction II: Diffusion Maps (9 October). Making a graph from the data; random walks on this graph. The diffusion operator, a.k.a. Laplacian. How the Laplacian encodes the shape of the data. Eigenvectors of the Laplacian as coordinates. Connection to page-rank. Advantages when data are not actually on a manifold. Example.

    Pre-midterm review (12 October): highlights of the course to date; no handout.
    MIDTERM (14 October): exam, solutions

    Homework 5, due 23 October: assignment; solutions

  17. Regression I: Basics. Guessing a real-valued random variable; why expectation values are mean-square optimal point forecasts. The regression function; why its estimation must involve assumptions beyond the data. The bias-variance decomposition and the bias-variance trade-off. First example of improving prediction by introducing variance. Ordinary least squares linear regression as smoothing. Other linear smoothers: k-nearest-neighbors and kernel regression. How much should we smooth? R, data for running example
  18. Regression II: The Truth About Linear Regression (21 October). Linear regression is optimal linear (mean-square) prediction; we do this because we hope a linear approximation will work well enough over a small range. What linear regression does: decorrelate the input features, then correlate them separately with the response and add up. The extreme weakness of the probabilistic assumptions needed for this to make sense. Difficulties of linear regression: collinearity, errors in variables, shifting distributions of inputs, omitted variables. The usual extra probabilistic assumptions and their implications. Why you should always look at residuals. Why you generally shouldn't use regression for causal inference. How to torment angels. Likelihood-ratio tests for restrictions of nice models.
  19. Regression III: Extending Linear Regression (23 October). Weighted least squares. Heteroskedasticity: variance is not the same everywhere. Going to consult the oracle. Weighted least squares as a solution to heteroskedasticity. Nonparametric estimation of the variance function. Local polynomial regression: local constants (= kernel regression), local linear regression, higher-order local polynomials. Lowess = locally-linear smoothing for scatter plots. The oracles fall silent.

    Homework 6, due Friday, 30 October: assignment, data set; solutions

  20. Evaluating Predictive Models (26 and 28 October). In-sample, out-of-sample and generalization loss or error; risk as expected loss on new data. Under-fitting, over-fitting, and examples with polynomials. Methods of model selection and controlling over-fitting: empirical risk minimization, penalization, constraints/sieves, formal learning theory, cross-validation. Limits of generalization. R for creating figures.
  21. Smoothing Methods in Regression (30 October). How much smoothing should we do? Approximation by local averaging. How much smoothing we should do to find the unknown curve depends on how smooth the curve really is, which is unknown. Adaptation as a partial substitute for actual knowledge. Cross-validation for adapting to unknown smoothness. Application: testing parametric regression models by comparing them to nonparametric fits. The bootstrap principle. Why ever bother with parametric regressions? R code for some of the examples.

    Homework 7, due Friday, 6 November: assignment; solutions: text and code

  22. Additive Models (2 November). A nice feature of linear models: partial responses, partial residuals, and backfitting estimations. Additive models: regression curve is a sum of partial response functions; partial residuals and the backfitting trick generalize. Parametric and non-parametric rates of convergence. The curse of dimensionality for unstructured nonparametric models. Additive models as a compromise, introducing bias to reduce variance. Example with the data from homework 6.
  23. Classification and Regression Trees (4 and 6 November). Prediction trees. A classification tree we can believe in. Prediction trees combine simple local models with recursive partitioning; adaptive nearest neighbors. Regression trees: example; a little math; pruning by cross-validation; more R mechanics. Classification trees: basics; measuring error by mis-classification; weighted errors; likelihood; Neyman-Pearson classifiers. Uncertainty for trees.

    Homework 8, due 5 pm on Monday, 16 November: assignment; solutions; R for solutions

  24. Combining Models 1: Bagging and Model Averaging (9 November)
  25. Combining Models 2: Diversity and Boosting (11 November)
  26. Linear Classifiers (16 November). Geometry of linear classifiers. The perceptron algorithm for learning linear classifiers. The idea of "margin".
  27. Logistic Regression (18 November). Attaching probabilities to linear classifiers: why would we want to? why would we use the logistic transform to do so? More-than-binary logistic regression. Maximizing the likelihood; Newton's method for optimization. Generalized linear models and generalized additive models; testing GLM specifications with GAMs.
  28. Support Vector Machines (20 November). Turning nonlinear problems into linear ones by expanding into high-dimensional feature spaces. The dual representation of linear classifiers: weight training points, not features. Observation: in the dual representation, only inner products of vectors matter. The kernel trick: kernel functions let us compute inner products in feature spaces without computing the features. Some bounds on the generalization error of linear classifiers based on "margin" and the number of training points with non-zero weight ("support vectors"). Learning support vector machines by trading in-sample performance against bounds on over-fitting.

    Homework 9, due at 5 pm on Monday, 30 November: assignment

  29. Density Estimation (23 November). Histograms as distribution estimates. Glivenko-Cantelli, "the fundamental theorem of statistics". Histograms as density estimates; selecting density estimates by cross-validation. Kernel density estimates. Why kernels are better than histograms. Curse of dimensionality again. Hint at alternatives to kernel density estimates.
  30. Mixture Models, Latent Variables and the EM Algorithm (30 November). Compressing and restricting density estimates. Mixtures of limited numbers of distributions. Mixture models as probabilistic clustering; finally an answer to "how many clusters?" The EM algorithm as an iterative way of maximizing likelihood with latent variables. Analogy to k-means. More theory of the EM algorithm. Applications: density mixtures, signal processing/state estimation, mixtures of regressions, mixtures of experts; topic models and probabilistic latent semantic analysis. A glance at non-parametric mixture models.
  31. Graphical Causal Models (2 December). Distinction between causation and association, and between causal and probabilistic prediction. Some examples. Directed acyclic graphs and causal models. The Markov property. Conditional independence via separation. Faithfulness.
  32. Causal Inference (4 December). Estimating causal effects; control for confounding. Discovering causal structure: the SGS algorithm and its variants. Limitations.

    Take-home final exam, due 15 December: assignment; data sets: expressdb_cleaned (20 Mb), HuIyer_TFKO_expression (20 Mb). With great thanks to Dr. Timothy Danford.
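Since lecture 11 above walks through the standard recipe for PCA, here is that recipe in miniature as an R sketch; the three-column toy data are made up on the spot and have nothing to do with the course's data sets:

    ## Standard PCA recipe: center the features, eigendecompose the covariance
    ## matrix, and project the data onto the leading eigenvectors.
    set.seed(1)
    x <- matrix(rnorm(200 * 3), ncol = 3)
    x[, 3] <- x[, 1] + 0.1 * x[, 3]               # build in some low-dimensional structure
    x.centered <- scale(x, center = TRUE, scale = FALSE)
    eig <- eigen(cov(x.centered))                 # eigenvalues = variances along the components
    scores <- x.centered %*% eig$vectors[, 1:2]   # projections onto the first two components
    ## Sanity check against R's built-in implementation (components match up to sign):
    all.equal(abs(scores), abs(prcomp(x)$x[, 1:2]), check.attributes = FALSE)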

Corrupting the Young; Enigmas of Chance

Posted by crshalizi at December 05, 2009 14:39 | permanent link

November 30, 2009

Books to Read While the Algae Grow in Your Fur, November 2009

Jen Van Meter, Christine Norrie and Chynna Clugston-Major, Hopeless Savages
Incredibly sweet and charming; whether it's really punk rock I couldn't say. (I completely forget where I saw this recommended, but thanks to whoever it was.)
Mike Mignola and Christopher Golden, Baltimore, or, the Steadfast Tin Soldier and the Vampire
Stories within stories, framed by the Great War unleashing not the influenza pandemic of 1918, but a vampire-zombie apocalypse. Many, many nods to prior horror fiction (most obviously Dracula, but also "The Masque of the Red Death", etc.), and a lot of folkloric elements used to nicely creepy effect. (But isn't "Mircea" a masculine name?) Mignola's drawings are decorative and atmospheric, but not integral.
Cat Rambo and Jeff VanderMeer, The Surgeon's Tale, and Other Stories
The highlight is the title story, which occupies about half this little book, and breathes new life — you should forgive the expression — into the ancient trope of the Resurrection Gone Awry. Of the rest, Rambo's "The Dead Girl's Wedding March" and "A Key Decides Its Destiny" are the best, followed by VanderMeer's "The Farmer's Cat". About the last story, an extended joke about a Lovecraftian menu, the kindest thing I can say is that the authors must've had fun writing it.
F. T. Marinetti, The Untameables
Not actually recommended, unless you want a violent Futurist words-in-liberty fantasy full of orientalism, racism, and (most poisonously) formulaic decadence. On this evidence, Marinetti was much better at writing manifestoes (and cookbooks) than fiction. — I have had this on my shelf since, so help me, 1994, when I first started reading about Futurism; I should've gotten rid of it long ago.
Jason Aaron, R. M. Guéra, Davide Furnò and Francesco Francavilla, Scalped, vol. 5: High Lonesome
Noir blacker than coal-dust. Earlier installments: 1, 2--4.
Phil and Kaja Foglio, Agatha Heterodyne and the: Golden Trilobite; Voice of the Castle; Chapel of Bones
Vols. 6--8 of Girl Genius; in which the lost heir reclaims the ancestral castle, through the power of Science! (As well as perfecting the coffee-maker.)
Lev Vygotsky, Mind in Society: Development of Higher Psychological Processes
A fairly clear and cohesive statement of Vygotsky's key ideas, which were a species of pre-cognitive Marxist psychology. Here he is concerned with what happens to children's cognitive development when they bring together their practical abilities to manipulate their bodies and tools, with their communicative abilities to use words (and other signs) — specifically their learning to use speech to guide behavior, especially their own behavior. (This is very much about the unity of theory and praxis; but Dewey said similar things, from a background of American pragmatism. [Then again, Marx himself was pretty pragmatist already in 1845.] Vygotsky mentions Dewey once here, without much understanding.) Specifically, Vygotsky claims that speech and discursive thought come to guide behavior through children learning to talk to themselves about what they need to do to solve practical tasks, which they come to from previously learning to talk to others about what to do, or trying to get others to do it for them.
This sets up three big themes of Vygotsky's. First, he thinks that all of the characteristically human ("higher") mental processes originate as social interactions, which we then learn to internalize and carry out independently. The Marxist themes (especially out of Engels) here are obvious; he does not, needless to say, demonstrate his contention, and in any case seems to overlook the point that an organism needs a lot of specialized structure and capacity to engage in those social interactions in the first place, let alone internalize them! But he deserves, I think, considerable credit for raising the problem, and the related one of how we use tools and signs in our environment to extend our own cognition. (Here some of the experiments he reports on what's needed for children to make effective use of memory aids are fascinating; but this approaches stigmergy rather than social labor.) Secondly, he emphasizes, when assessing children's development, that looking at what they can do on their own is just picking out (at best) what they have finished learning. If instead one assesses what they can do "with assistance, collaboratively or in groups", what he calls the "zone of proximal development", one gets a sense of what they are learning and could learn. One might argue, though he doesn't pursue this, that even in adults, activities like scientific investigation never really leave the zone of proximal development, especially not at the highest levels of accomplishment. (Skilled scientists can do their students' homework problems on their own, but not their research.) Thirdly, he is emphatic that if you want to investigate cognitive development, you need to probe the developmental process, and not just its end-products in over-learned habits or polished skills. He suggests that the ideal experiments would actually evoke cognitive development under laboratory conditions, and that his students carried out such experiments; he does not provide enough details to assess such a claim.
The book concludes with some chapters on play, imagination and make-believe, and on the "pre-history" of writing (i.e., things which come before writing in individual development but have some kinship to it).
Despite the way I've written, this isn't actually a book Vygotsky planned and wrote out. It was assembled by its translators out of distinct Russian manuscripts and preliminary translations provided by A. R. Luria, and then considerably revised by the translators. (I presume this is where anachronisms like "World War I" came from.) How much of the result is really due to Vygotsky, or to Vygotsky-and-Luria, and how much to the Americans, I can't say. The latter do however provide an introduction, a brief biographical note and an extensive afterword; all of these are probably most useful to readers previously unacquainted with Vygotsky or his school.
Michael Bérubé, The Left at War
The first half of this is Bérubé arguing with (as he says) the "Manichean Left" over Afghanistan, Iraq, the Balkan wars of the 1990s, and their general orientation and understanding of how the capitalist democracies work, or don't. I find myself in complete agreement with this, including Bérubé's positive vision, and unable to add anything of value. (Read it.)
The second half is an argument about the theory of ideology, the notion of hegemony, and the not-exactly-a-discipline of cultural studies. More specifically, it's a plea to get beyond the dualism of "everything with any shred of mass appeal is a tool of the System" on the one hand, and pretending that fans reading their own meaning into music videos (or whatever) has anything to do with smashing said system, on the other. His plea is for a more nuanced approach to ideology, which recognizes that the leading ideas of any society are never all of a block, that political power always comes from coalitions bringing together many divergent interests and ideas, etc., etc. He is particularly fond of a version of these ideas articulated by Stuart Hall, which seem, to judge by his quotations, quite reasonable but not at all special, unless one is starting out from the most benighted precincts of Marxism (e.g., that old fraud Althusser). Linked to this, Bérubé is quite strenuous about the importance of issues other than economic justice for any left that wants to be serious about making sure that everyone isn't just formally free, but can actually use and enjoy their freedom.
Again, I find it hard to disagree with most of this; I just fail to see what the two halves of the book have to do with each other. The best argument I can reconstruct on his behalf would be something like this: lots of people on the left have drifted, or been pushed, into Manichean positions because it seems to follow from the way they understand ideology. If a better account had been widely disseminated, fewer of them would have pursued that dead end. Some parts of cultural studies had articulated that better account, but they failed to make themselves heard; thus, if only more attention had been paid to debates about how best to make use of Gramsci to understand British politics in the early 1980s, the left would not have backed itself into such a corner in 2001--2003.
I find myself drifting into snark in that last sentence, which is unfair. I agree that the very crude counter-cultural thinking and functionalism which seem very common on the left are Not Helpful. I'm just skeptical that (1) giving more weight to cultural studies would have made this better, and that (2) the particular cultural studies sub-tradition Bérubé points to is really the best available way of thinking about these matters. I have my own favorite candidate, but mostly it's that what he takes from this sub-tradition just doesn't seem that distinctive, except for its Marxist background. (For instance, compare what Bérubé, following Hall, says about first asking what's right or true about an ideology with what Boudon said in his excellent book on The Analysis of Ideology [= L'origine des idées reçues]. And no, I'm not trying to play "My esoteric French theorist trumps your merely-obscure British theorist"; Boudon is a distinguished sociologist who just happens to have made this the center of his theory of ideology.) Which is not to say that we shouldn't work to develop and disseminate better ideas about these matters!
Bérubé tends to avoid advancing his case directly, but rather to get his point across by discussing some other writer's work, or in some cases some other writer's discussion of a third author, etc. (I suspect this is a professional deformation of literary critics.) This is a manner of writing which drives some people crazy, and one which I find very tiresome in other hands, but he pulls it off. Assuming this won't put you off, I recommend this very strongly if you have any interest at all in progressive politics.
Disclaimer: Bérubé blogs at Crooked Timber, where I've guest-blogged, etc., but we've never met or corresponded, and I have no stake in the success of his book.
Manual trackback: Michael Bérubé.
Rick Geary, Trotsky: A Graphic Biography
Well-told and well-drawn — though nothing in the rest of the art matches the level of the hero/monster panels of the opening pages (and cover!).
Sarah Graves, Unhinged; Mallets Aforethought; Tool and Die; Nail Biter; Trap Door
Honestly, I'm a bit surprised series fatigue hasn't set in yet; but I continue to enjoy these. (And Eastport continues to suffer levels of violent death comparable to post-invasion Iraq. The fact that this homicide rate has not yet attracted official attention suggests that the serial killer is Bob Arnold, the police chief, rather than Jake Tiptree or Ellie White.) Previous installments: 1--4, 5; sequel: 11.
Jonathan Israel, A Revolution of the Mind: Radical Enlightenment and the Intellectual Origins of Modern Democracy
The story, or part of the story, of how the outlandish and unprecedented ideology of a network of radical, subversive scribblers became what we all at least pay lip-service to. Really deserves a detailed discussion; I'll just say that there's a lot of fascinating material in here, but also many places where I felt he didn't really prove his point, even or especially when I was very sympathetic to what he was saying.
Stephen Budiansky, The Bloody Shirt: Terror After the Civil War
Once upon a time, the US Army attempted to bring democracy to a backward part of the world which had long been wracked by ethnic conflict. There were some promising beginnings, but the defeated, formerly dominant faction refused to accept their relative demotion, and engaged in a vicious, well-organized campaign of terrorism, which ultimately proved to be entirely successful. Those who had trusted enough in the power and benevolence of the United States to participate in the governments ultimately overthrown by "violence and fraud" (in the words of one of the over-throwers) were lucky to escape with their lives (as many did not). Minimal democratic norms were not re-established for ninety years or more.
This is of course the story of the failed Reconstruction of the South after the civil war, which Budiansky tells by recounting the inter-cut, and occasionally overlapping, lives of a number of individuals on the Reconstruction side of the conflict. One of his more effective tactics is to quote extensively from their letters and journals, as well as from contemporary books and newspapers. Caveat lector: many of these — especially the newspapers — are full of vicious racist bile, as well as the astonishing lies elite white Southerners told to portray themselves as oppressed victims. (This begins with the story of the "bloody shirt" that opens the book.) This stuff was hard for me to stomach, and might be too much for some.
My biggest complaint with the book is that I wish Budiansky had done more to tell the stories of black Americans, the way he did with his white subjects — not that there are none, I hasten to add. I can guess at reasons why it would be harder to find materials (all of them ultimately having to do with the fact that Southern blacks were an oppressed people who emerged from slavery for a few years before being crushed back down to serfdom), but still... That said, Budiansky's story of crushed hopes, futile bravery and murderous hatred is wonderfully written and incredibly depressing. I hope that it fills many Americans with the sort of patriotic shame which helps us be better.
Luc Devroye and Gabor Lugosi, Combinatorial Methods in Density Estimation
The fundamental theorem of statistics, says Pitman, is the Glivenko-Cantelli theorem: the empirical distribution function F_n of a large sample of independent, identically-distributed random variables comes arbitrarily close to their true distribution function F: as n goes to infinity, max_x |F_n(x) - F(x)| goes to 0 almost surely. This means that we can learn any underlying probability distribution to arbitrary accuracy just by collecting enough data.* Unfortunately the empirical distribution function is always discrete, so it doesn't have a density, even if the underlying distribution does. Or, if you like, it has a density, but it's a mixture of Dirac delta functions. (The convergence is in the sense of "weak convergence" or "convergence in distribution".) Density estimation is basically about taking the empirical distribution function and smoothing it so that it has a well-behaved density. The oldest way of doing this is to build a histogram, which gives constant densities to intervals; other methods include fitting function series (Fourier or wavelet expansions) to the data, or using kernels (replacing each of the delta function spikes with a smooth density, say a Gaussian bell-shaped curve). The art here is to pick the manner of smoothing, and the amount of smoothing, so that (1) the convergence promised by Glivenko-Cantelli for the unsmoothed distribution is not just maintained but is (2) strengthened to convergence of the estimated density on the true density, and ideally (3) the latter convergence happens rapidly.
Devroye and Lugosi's book is devoted to establishing conditions under which common density estimators have these three desirable properties (or, more rarely, when they do not). Throughout, they focus on the "total variation" or L1 distance between densities: d_TV(f,g) is the integral of |f(x) - g(x)| over all x. They mention, but generally avoid, other common distances or pseudo-distances such as L2 (integral of |f(x) - g(x)|^2), Hellinger distance (too ugly to write in HTML), or relative entropy (Kullback-Leibler divergence, expected log-likelihood ratio). The total variation distance has a very natural probabilistic interpretation (the maximum amount by which the estimated probability of any event differs from its true probability), and they can get very nice finite-sample bounds by minimizing it over various classes of possible estimates, so this choice is eminently defensible; it does however cut them off from using a lot of existing theory. (For instance, the optimal coefficients in a Fourier series, from an L1 point of view, are not just the empirical Fourier coefficients, since the latter are L2 optimal.)
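As a toy illustration of both points (the Glivenko-Cantelli guarantee for the empirical distribution function, and measuring a density estimate's error in the L1 sense), here is a minimal R sketch; the standard-Gaussian truth and the default kernel bandwidth are my assumptions for illustration, not examples from the book:

    set.seed(1)
    x <- rnorm(500)                            # sample from a known truth, here N(0,1)
    kde <- density(x, kernel = "gaussian")     # kernel density estimate; default bandwidth rule
    ## Glivenko-Cantelli in action: the max gap between empirical and true CDFs is small
    max(abs(ecdf(x)(kde$x) - pnorm(kde$x)))
    ## L1 (the review's total-variation) error of the smoothed density estimate,
    ## approximated on the equally-spaced grid that density() evaluates on:
    sum(abs(kde$y - dnorm(kde$x))) * diff(kde$x)[1]

Re-running with larger n shows both errors shrinking, which is properties (1) and (2) in miniature.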
Their general goal is to prove finite-sample upper bounds on the L1 error of their density estimates; if these go to zero as n grows, we get (1) and (2) above, and the rate of convergence tells us how close we are to obtaining (3). Their route to this goal is almost always through VC theory, and empirical process theory more generally. As always, this has two parts: one is deviation inequalities (e.g., Hoeffding's) which bound the probability that any one candidate density will look much better in sample than it will look out of sample. The other part is combinatorial arguments that the behavior of an entire space of functions can be approximated by that of a finite number of key functions. Meshed together by a union bound, these give uniform concentration bounds, with rates of convergence depending on the complexity of the combinatorial construction needed to achieve a given degree of approximation (i.e., the VC dimension). Devroye and Lugosi's key theorems bound the error of their density estimates in terms of the VC dimensions of the sets formed by comparing two densities in the class. (Specifically, they are interested in the sets where one estimate is higher than another by a given amount; this is, as they note, extremely similar to the threshold procedure used to apply VC theory to regression problems.) Finite VC dimension for such sets implies convergence to within a constant factor of the best available approximation to the true density. They extend such results to ones where the amount of smoothing is determined by data-set splitting, i.e., dividing the data into a training and a testing set, and picking the degree of smoothing which best generalizes from the training set to the testing set. (They do not consider any other form of cross-validation, which is a shame because they're a lot more common than simple data-splitting, but understandable because they're very ugly to analyze.) They give a lot of attention to kernel density estimates, including bounds for continuous kernels in terms of how hard it is to approximate them by simple step-functions, for which the combinatorics are easy.
Strictly speaking, the book presupposes measure-theoretic probability, but readers uncomfortable with sigma-fields and Radon-Nikodym derivatives could mostly get away with ignoring the former and reading "probability density functions" for the latter. Similarly, the actual combinatorics are either elementary, or can be taken on trust. This book is probably not the best way to first encounter density estimation — I suspect a less theoretical introduction would not only make the ideas clearer, but also make readers want theoretical guidance — but no experience on that score is, strictly, necessary. Neither, really, is prior knowledge of learning theory or VC theory, though again it would probably help. The ideal situation for the book is, I'd guess, a second-year graduate-level course on density estimation (there are many excellent problems), or self-study.
*: Well, we have to pretend the data are IID, but let that slide. Or: assume sufficiently rapid strong mixing and argue, as in Vidyasagar, that VC results then hold with tolerable corrections. Kernel density estimates for stochastic processes are treated at length in Bosq's Nonparametric Statistics for Stochastic Processes: Estimation and Prediction, but the starting point there is ergodic theory, not learning theory.
George Clark, Science and Social Welfare in the Age of Newton
Connections between the scientific revolution, economic development and economic policy (such as it was) in late-17th and early-18th century England, and to a lesser extent France and the Netherlands. Interesting stuff on the connections between the activities of scientists and technological development, including the shrewd observation, contra Marxists claiming that scientific progress was basically directed to solving the capitalists' problems, that there were plenty of lucrative problems where scientists got nowhere, or didn't even try to get anywhere, because it was just not scientifically feasible. Also some interesting material on the early history of statistics. The first edition was published in 1937, and shows both that it was written during the Depression, and that respectable economists had no idea what was going on. (This does not much harm the book.)
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; The Progressive Forces; The Great Transformation; The Beloved Republic; Minds, Brains, and Neurons; Enigmas of Chance; The Continuing Crises; Cthulhiana

Posted by crshalizi at November 30, 2009 23:59 | permanent link

November 24, 2009

(I Don't Give a Damn About My) Bad Reputation

Attention conservation notice: Unskillful nattering about pop-culture ephemera.
For the sake of my own sanity, I prefer to remain ignorant of the occult processes by which the direct mail gods decide which catalogues to send to which people. (There's too much dynamic programming involved.) Today, for instance, they decided to inflict upon me the official Barbie doll spring 2010 collection preview, and like a fool I couldn't resist looking through it. Thus my life is made that much worse by learning that there is a Joan Jett Barbie doll. (I thought about embedding an image, but in this case pain shared is not pain eased.) I think I finally grasp what people mean when they talk about later cultural products assaulting parts of their childhood, in this case one I didn't even realize I valued.

Manual trackback: Mostly Harmless

Linkage

Posted by crshalizi at November 24, 2009 19:27 | permanent link

November 21, 2009

"Homophily, Contagion, Confounding: Pick Any Three"

A number of people have asked for my slides from the MERSIH conference the other week. So, here they are. (Anyone who was at my talk at SFI about a year ago will recognize the title, and much of the content.) I'm presently turning this into a proper manuscript, so comments are welcome. Please don't rip it off; I'll become very cross and may even hold my breath until I turn blue and pass out, and won't you be sorry then?

Manual trackback: Cognition and Culture

Networks; Enigmas of Chance; Complexity; Self-Centered

Posted by crshalizi at November 21, 2009 18:32 | permanent link

November 19, 2009

"Statistical Analysis of Stellar Evolution" (Next Week at the Statistics Seminar)

In which the starry heavens above submit to statistical analysis:

David van Dyk, "Statistical Analysis of Stellar Evolution"
Abstract: Color-Magnitude Diagrams (CMDs) are plots that compare the magnitudes (luminosities) of stars in different wavelengths of light (colors). High non-linear correlations among the mass, color and surface temperature of newly formed stars induce a long narrow curved point cloud in a CMD known as the main sequence. Aging stars form new CMD groups of red giants and white dwarfs. The physical processes that govern this evolution can be described with mathematical models and explored using complex computer models. These calculations are designed to predict the plotted magnitudes as a function of parameters of scientific interest such as stellar age, mass, and metallicity. Here, we describe how we use the computer models as a component of a complex likelihood function in a Bayesian analysis that requires sophisticated computing, corrects for contamination of the data by field stars, accounts for complications caused by unresolved binary-star systems, and aims to compare competing physics-based computer models of stellar evolution.
This is joint work with Steven DeGennaro, Nathan Stein, William H. Jefferys, Ted von Hippel, and Elizabeth Jeffery.
Place and time: Doherty Hall A310, Monday, 23 November, 4--5 pm.

Enigmas of Chance; The Eternal Silence of These Infinite Spaces; Physics

Posted by crshalizi at November 19, 2009 12:02 | permanent link

November 13, 2009

"Some Things Statisticians Do at Google" (Next Week at the Statistics Seminar)

Attention conservation notice: Of no use to you unless (1) you want to know what statisticians do at search-engine companies and (2) you are in Pittsburgh.
Mike Meyer, "Some Things Statisticians Do at Google"
Abstract: I'll talk about a number of projects at Google where statisticians have made a large contribution. There will not be a lot of technical details. In some cases I will just describe the problem.
The major example will be a description of the statistical and engineering infrastructure to support live traffic experiments at Google.
A common theme of the problems is the importance of understanding basic statistical principles that can be applied and modified to handle new data and new circumstances.
Place and time: Monday, 16 November at 4 pm, in Doherty Hall A310

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at November 13, 2009 15:09 | permanent link

November 08, 2009

The Shadow Price of Power

Attention conservation notice: Quasi-teaching note giving an economic interpretation of the Neyman-Pearson lemma on statistical hypothesis testing.

Suppose we want to pick out some sort of signal from a background of noise. As every schoolchild knows, any procedure for doing this, or test, divides the data space into two parts, the one where it says "noise" and the one where it says "signal".* Tests will make two kinds of mistakes: they can take noise to be signal, a false alarm, or can ignore a genuine signal as noise, a miss. Both the signal and the noise are stochastic, or we can treat them as such anyway. (Any determinism distinguishable from chance is just insufficiently complicated.) We want tests where the probabilities of both types of errors are small. The probability of a false alarm is called the size of the test; it is the measure of the "say 'signal'" region under the noise distribution. The probability of a miss, as opposed to a false alarm, has no short name in the jargon, but one minus the probability of a miss — the probability of detecting a signal when it's present — is called power.

Suppose we know that the probability density of the noise is p and that of the signal is q. The Neyman-Pearson lemma, as many though not all schoolchildren know, says that then, among all tests of a given size s, the one with the smallest miss probability, or highest power, has the form "say 'signal' if q(x)/p(x) > t(s), otherwise say 'noise'," and that the threshold t varies inversely with s. The quantity q(x)/p(x) is the likelihood ratio; the Neyman-Pearson lemma says that to maximize power, we should say "signal" if it's sufficiently more likely than noise.

The likelihood ratio indicates how different the two distributions — the two hypotheses — are at x, the data-point we observed. It makes sense that the outcome of the hypothesis test should depend on this sort of discrepancy between the hypotheses. But why the ratio, rather than, say, the difference q(x) - p(x), or a signed squared difference, etc.? Can we make this intuitive?

Start with the fact that we have an optimization problem under a constraint. Call the region where we proclaim "signal" R. We want to maximize its probability when we are seeing a signal, Q(R), while constraining the false-alarm probability, P(R) = s. Lagrange tells us that the way to do this is to maximize Q(R) - t[P(R) - s] over R, with the multiplier t adjusted so that the constraint actually holds. So far the usual story; the next turn is usually "as you remember from the calculus of variations..."

Rather than actually doing math, let's think like economists. Picking the set R gives us a certain benefit, in the form of the power Q(R), and a cost, tP(R). (The t·s term is the same for all R.) Economists, of course, tell us to equate marginal costs and benefits. What is the marginal benefit of expanding R to include a small neighborhood around the point x? Just, by the definition of "probability density", q(x). The marginal cost is likewise tp(x). We should include x in R if q(x) > tp(x), or q(x)/p(x) > t. The boundary of R is where marginal benefit equals marginal cost, and that is why we need the likelihood ratio and not the likelihood difference, or anything else. (Except for a monotone transformation of the ratio, e.g. the log ratio.) The likelihood ratio threshold t is, in fact, the shadow price of statistical power.
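A toy numerical check of this (my own example, not part of the original argument): build a "say signal" region of fixed size by adding points in decreasing order of the likelihood ratio q/p, and compare its power with a region of the same size built by thresholding the difference q - p. The particular densities and size are arbitrary choices.

```python
# Toy check (my own example): at a fixed false-alarm probability, the region
# built by thresholding the likelihood ratio q/p has at least as much power as
# one built by thresholding the difference q - p. Densities are arbitrary.
import numpy as np

grid = np.linspace(-8, 8, 20001)
dx = grid[1] - grid[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

p = gauss(grid, 0.0, 1.0) * dx      # noise probability mass in each grid cell
q = gauss(grid, 1.0, 1.5) * dx      # signal probability mass in each grid cell

def power_at_size(order, s=0.05):
    """Add cells to the 'say signal' region in the given order until its noise
    mass (the size) reaches s; return the accumulated signal mass (the power)."""
    cum_size = np.cumsum(p[order])
    return q[order[cum_size <= s]].sum()

by_ratio = np.argsort(-(q / p))     # descending likelihood ratio
by_diff = np.argsort(-(q - p))      # descending difference
print("power of the ratio rule:     ", round(power_at_size(by_ratio), 3))
print("power of the difference rule:", round(power_at_size(by_diff), 3))
```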

I am pretty sure I have not seen or heard the Neyman-Pearson lemma explained marginally before, but in retrospect it seems too simple to be new, so pointers would be appreciated.

Manual trackback: John Barrdear

Updates: Thanks to David Kane for spotting a typo.

*: Yes, you could have a randomized test procedure, but the situations where those actually help pretty much define "boring, merely-technical complications."

Enigmas of Chance

Posted by crshalizi at November 08, 2009 03:06 | permanent link

November 04, 2009

Blosxom Fading in November

My old Blosxom installation (v. 2.0.2), after several years of working nicely, is growing increasingly cranky, and mulishly refusing to generate or update posts as the whim takes it. (I am not sure how much kicking and shoving it will need to produce this.) I'd appreciate a pointer to something which works similarly, but does work: I write posts in plain HTML in Emacs and drop them in a directory; it makes them look nice. If it handles tags and/or LaTeX nicely, so much the better.

Self-Centered

Posted by crshalizi at November 04, 2009 19:34 | permanent link

October 31, 2009

Books to Read While the Algae Grow in Your Fur, October 2009

Rosemary Kirstein, The Lost Steersman
Sequel to Steerswoman's Road (below); excellent and perfectly continuous, despite a long gap in the writing. The trick of celebrating intelligence while maintaining the tone and color of a good fantasy novel is not something I have encountered elsewhere, and one I find deeply addictive. — Sequel
Everything else I have to say is a spoiler: This owes a massive debt to Lovecraft's At the Mountains of Madness. The plot-hinge mystery here has to do with "demons", amphibious barrel-shaped creatures with quadrilateral symmetry, very like (though not exactly the same as) Lovecraft's Antarctic Old Ones. There are scenes of dissecting demons under the impression that they are just animals, and realizing they belong to some radically different division of life than familiar terrestrial organisms; an exploring expedition to an unknown part of the world where the demons are found; explorations of demons' cities and observations of their customs, including subterranean chambers used for their rituals, etc.; and the dawning realization that the creatures are in fact sapient. (HPL: "Radiates, vegetables, monstrosities, star spawn — whatever they had been, they were men!" RK does not put such florid outbursts in her characters' mouths; she just has Rowan come to see that the demons are "people".) Kirstein does a better job, in my view, of making the creatures actually alien, in particular starting from giving them a very inhuman sensorium (continual sonar, without any vision) and means of communication (excreting specially shaped lumps of organic material, reminiscent of the pieces of carved soapstone Lovecraft associated with his Old Ones), and building out logically from there. Needless to say, this complicates the ethics of terraforming the steerswomen's planet considerably. — Janus's speeches (pp. 43--44 and p. 356) about the dangers of learning too much about the world also seem drawn from Lovecraft, though they bring to mind the opening of "The Call of Cthulhu" more than Mountains.
Jeffrey D. Hart, Nonparametric Smoothing and Lack-of-Fit Tests
A sound, friendly but reasonably theoretical introduction to nonparametric regression, giving about equal attention to kernel-based methods and to series expansions (Fourier series, orthogonal polynomials, etc.). The first half of the book, through ch. 4, introduces these methods, considers their ability to predict new data (emphasizing, naturally, the bias-variance trade-off), and looks at methods for selecting how much smoothing to do based on the data being smoothed, with a fondness for leave-one-out cross-validation and its variants. (I can't recall if k-fold CV is even mentioned.) The second half is about testing parametric regression specifications. Chapter 5 reviews some classical tests for fully-specified and especially for linear-in-the-parameters parametric models, including Neyman's smooth tests: the latter involve, roughly speaking, fitting an orthogonal series to the deviations from the null model, and checking that all the coefficients are small, and so form a bridge to the smoothing-based tests used in the rest of the book. Basically, one can either smooth the parametric residuals, which should have mean zero and constant variance under the null hypothesis, or compare the parametric estimate to the nonparametric smooth. Hart prefers the former approach, and develops tests for regression functions being constants in chapter 7, which in chapter 8 are turned into tests for departures from arbitrary parametric regression models. The distribution of these test statistics is too complicated for anything except bootstrapping, which needs to be done carefully to preserve power. To simplify the math, up to this point Hart assumes that the input variable takes values at a deterministic set of points on the unit interval ("fixed-design univariate regression"); chapter 9 generalizes to random-design and multivariate regressions, as well as lifting some other restrictions. Chapter 10 contains some case studies of real data.
This book should be accessible to anyone who understands parametric inference at the level of, say, All of Statistics; no prior exposure to smoothing methods is really needed. The series-expansion methods will probably go down more easily with some prior exposure to Fourier analysis. People who are serious about using parametric regression models in the real world (cough econometricians cough) owe it to themselves to test them with these methods.
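For concreteness, here is a crude sketch of the residual-smoothing idea, in Hart's spirit but not one of his actual statistics: fit the parametric (here linear) model, kernel-smooth the residuals, take the size of the smooth as the test statistic, and calibrate it with a residual bootstrap under the null. The data-generating process below is invented.

```python
# A crude smoothing-based lack-of-fit test in Hart's spirit (not one of his
# actual statistics): fit the parametric (here linear) model, kernel-smooth the
# residuals, use the size of the smooth as the test statistic, and calibrate it
# by a residual bootstrap under the null. The data-generating process is made up.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = np.linspace(0, 1, n)                                             # fixed design on [0, 1]
y = 1 + 2 * x + 0.5 * np.sin(6 * np.pi * x) + rng.normal(0, 0.5, n)  # mild nonlinearity

X = np.column_stack([np.ones(n), x])

def fitted_linear(y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def smooth(r, h=0.1):
    """Nadaraya-Watson kernel smooth of the residuals at the design points."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h)**2)
    return (w * r[None, :]).sum(axis=1) / w.sum(axis=1)

def statistic(y):
    return np.mean(smooth(y - fitted_linear(y))**2)

t_obs = statistic(y)
fit = fitted_linear(y)
resid = y - fit - np.mean(y - fit)                    # centered residuals for the bootstrap
boot = np.array([statistic(fit + rng.choice(resid, n, replace=True))
                 for _ in range(500)])
print("bootstrap p-value:", np.mean(boot >= t_obs))
```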
Rosemary Kirstein, Steerswoman's Road
First two books in an epic fantasy series about the scientific method, reprinted in one volume. There are more books, which I now covet powerfully, but the series is not finished.
Spoilers: "Epic fantasy" here is, I am pretty sure, totally misleading. Initially, and from most of the characters' perspectives, the world looks like a bog-standard medieval fantasyland, only with the addition of an itinerant semi-monastic order of geographers and natural philosophers, the eponymous steerswomen. By the end of this volume, however, I am pretty sure that the setting is actually another planet in this universe, with no magic at all. The steerswomen's world is being terraformed; the Guidestars are satellites in geosynchronous orbit. The native ecology (based on "blackgrass" and "redgrass") is being systematically destroyed (by microwave heating from the orbiting Guidestars ("the spell Routine Bioform Clearance"), and by the Outskirters' goats [which may be genetically modified?]) and replaced by terrestrial flora ("greengrass"), microbes and fauna. Wizards are simply the inhabitants of the planet who retain the old technology, such as electricity and explosives. Why most of the colonists have regressed to medieval technology, and why, having done so, they have an institution like the steerswomen, I couldn't tell you. (I can tell you that sailors and steerswomen are immune to some "curses" because they wear rubber-soled, i.e. electrically insulating, boots.) But I am dying to find out.
J. K. Ghosh and R. V. Ramamoorthi, Bayesian Nonparametrics
I have written extensively about the general subject of Bayesian nonparametrics and especially of its consistency elsewhere (here, here, or, indeed, here), so I'll just plunge in. This 2003 monograph is the best overview of Bayesian nonparametrics from the viewpoint of theoretical statistics which I've found, though there has been a great deal of work since it was written, and I know that a number of new books are coming out soon.
The authors begin (ch. 1) by reviewing* results on the consistency of Bayesian learning on finite sample spaces and Dirichlet prior distributions. They then carefully (ch. 2) consider the measure-theoretic issues involved in constructing prior probability distributions over infinite-dimensional spaces, especially priors over all probability measures (or all probability densities) on the real line. Chapter 3 describes in detail the properties of Dirichlet-process priors and of Polya tree priors. Chapter 4 is concerned with consistency for Bayesian updating with IID data, emphasizing the "Kullback-Leibler property" (the prior must put sufficient weight on distributions with small relative entropy from the truth) and the exponentially-consistent-testing conditions which go back to Schwartz. Chapter 5 specializes to inferring probability densities; this is the only place they use Gaussian process priors. Chapter 6 considers inferring the location parameter of distributions of unknown shape, and outlines (without full detail) the notorious examples, due to Freedman and Diaconis, of how Bayesian learning can fail to be consistent. Chapter 7 considers linear regression with an unknown noise distribution; this is the only departure from assuming IID data made here. The remaining chapters try to construct uniform distributions on infinite-dimensional spaces, look at some issues in survival analysis, and take up technical aspects of "neutral to the right" priors, ones whose cumulative hazard functions have independent increments. It is assumed throughout that the true, data-generating distribution lies within the support of the prior.
Ghosh and Ramamoorthi focus on mathematical issues, to the exclusion of computational and statistical considerations. (There are no applications to data, or even to elaborate simulations.) The writing is adequate for a work of "theorem-proof, theorem-proof" math, but no more. Those proofs, however, are really clear and clean, without tricks or complications. I recommend the book for those who want to understand, in depth, the technicalities of constructing priors on infinite-dimensional spaces, and of establishing their consistency when updated with IID data. There are a handful of exercises at the end of the book, but I do not think it would be suitable as a classroom textbook. It could work as the first part of an advanced graduate seminar, or for self-study for motivated and mathematically mature readers.
*: Actually, it's my impression that lots of introductions to Bayesian statistics, even at the graduate level, do not cover these results. This is, I think, something of a scandal for the profession. That goes double if it's due to the attitude which Ghosh and Ramamoorthi (p. 122) paraphrase as "the prior and the posterior given by Bayes theorem [sic] are imperatives arising out of axioms of rational behavior --- and since we are already rational why worry about one more" criterion, namely convergence to the truth. One does indeed find pernicious relativism and epistemic nihilism everywhere these days!
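Since the Dirichlet process does so much of the work in the book, here is a minimal stick-breaking sketch of drawing (a truncated approximation to) a random distribution from a DP prior; the concentration parameter and Gaussian base measure are arbitrary choices for illustration.

```python
# Minimal stick-breaking draw from a (truncated) Dirichlet process prior; the
# concentration alpha and the standard-Gaussian base measure are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
alpha, n_sticks = 2.0, 500

v = rng.beta(1.0, alpha, size=n_sticks)                          # stick-breaking fractions
weights = v * np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])  # w_k = v_k * prod_{j<k}(1 - v_j)
atoms = rng.normal(size=n_sticks)                                # atom locations from the base measure

# The random measure puts mass weights[k] on atoms[k]; for example, draw from it:
draws = rng.choice(atoms, size=10, p=weights / weights.sum())
print(np.round(np.sort(draws), 2))
```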
Nadia Gordon, Lethal Vintage
Continuing amateur-sleuthing adventures of a Napa Valley restaurant-owner and her foodie (and wine-y) friends. No prior acquaintance with the series [1, 2, 3] needed.
Larry Gonick, The Cartoon History of the Modern World, Part II: From the Bastille to Baghdad
My parents got the first part of the Cartoon History of the Universe (in its original, shorter edition) for my brother and me in 1981, when I was seven. We loved it so much they ended up having to get us two copies. I have thus been reading the History, as it came out, all my conscious life. (And re-reading it, without any visitations from the Suck Fairy so far.) This latest volume is, as always, a delight, but not a pure one, because it's also the last. I can understand wanting to be finished with the work of a lifetime, especially one which in the nature of things could be spun out indefinitely; but I can't help wishing for more.
G. Willow Wilson and M. K. Perker, Air, vol. 2: Flying Machine
James Sethna, Statistical Mechanics: Entropy, Order Parameters, and Complexity
The best introductory statistical mechanics book I have ever seen. (Meaning: advanced undergraduates, not the graduate level of Landau and Lifshitz.) The reader is supposed to have some familiarity with classical and quantum mechanics, a little electromagnetism, and the very barest rudiments of thermodynamics, the latter not going beyond what's in a good first-year physics course. Beyond the basics of differential equations and linear algebra, the only real pieces of math used here are Fourier transforms and elementary probability (such as one sees in undergraduate quantum mechanics). On this basis Sethna erects classical (and, in one chapter, quantum) statistical mechanics, emphasizing the modern applications of the theory and physical intuition.
The exposition begins with random walks, including diffusion and the central limit theorem. The micro-canonical ensemble comes next, along with a very nice chapter on its ergodic basis and failures of ergodicity (such as KAM theory). The other ensembles are derived from imposing the micro-canonical ensemble on the whole system, and looking at the marginal distributions of sub-systems. The elaborate axiomatic structure of pure thermodynamics is touched on only briefly; thermodynamic quantities are seen, quite properly, as derivative of statistical-mechanical ones. The question of what macroscopic variables need to be included in the free energy leads naturally to a superb chapter on the meaning and identification of order parameters. This in turn is followed by a really lucid treatment of the connections between spontaneous fluctuations, the decay of correlations, response to external forces, and the dissipative approach to equilibrium. The whole is capped off by chapters on abrupt (e.g., ice-water, water-steam) and continuous (e.g., magnetic) phase transitions, including a nice hand-waving discussion of the renormalization group. In addition to the main thread of exposition, each chapter has a large collection of problems, ranging from mathematical proofs through calculations to simulation challenges, which contain a lot of neat applications and special topics, and should at least be read if not attempted.
There are a few places where I would quibble — per Lebowitz, surely the Boltzmann entropy is more useful out of equilibrium than the Gibbs?; couldn't he have been more explicit about the probabilistic foundations of renormalization? — but mostly I just wish this book had been written sixteen years ago when I was taking stat. mech.
Disclaimer: Friends of mine used to work for Sethna, and he's lectured at the SFI summer school (the chapter on order parameters began as a lecture there in 1991), but I've never met him, and have no stake in the success of the book.
Update: Thanks to T. A. Abinandanan for alerting me to the fact that there's a free PDF of the whole book!
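In the spirit of Sethna's simulation exercises (though not an example taken from the book), a small Metropolis simulation of the 2D Ising model shows the continuous magnetic transition the later chapters discuss; the lattice size and sweep counts below are kept small for speed.

```python
# A generic toy in the spirit of Sethna's simulation exercises (not from the
# book): Metropolis sampling of the 2D Ising model, whose magnetization shows
# the continuous transition near T_c ~ 2.269. Lattice and sweep counts are small.
import numpy as np

rng = np.random.default_rng(3)
L = 32

def sweep(s, T):
    """One Metropolis sweep: attempt L*L single-spin flips."""
    for _ in range(L * L):
        i, j = rng.integers(L, size=2)
        nb = s[(i + 1) % L, j] + s[(i - 1) % L, j] + s[i, (j + 1) % L] + s[i, (j - 1) % L]
        dE = 2 * s[i, j] * nb                      # energy change from flipping spin (i, j)
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            s[i, j] *= -1

for T in (1.5, 2.27, 3.5):                         # below, near, and above T_c
    s = np.ones((L, L), dtype=int)                 # start from the ordered state
    for _ in range(300):
        sweep(s, T)
    print(f"T = {T}: |magnetization per spin| ~ {abs(s.mean()):.2f}")
```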
Laura E. Reeve, Vigilante
Sequel to Peacekeeper, with an even more awful and totally-misleading cover. (The synopses at the link are accurate, however.) Tasty mind-candy.
C. L. Anderson, Bitter Angels
Space opera about active struggles to prevent war, and other morally-compromising endeavors; military science fiction that lets me respect myself in the morning. The climax, where it becomes clear what is going on, and why and how, and what the peace-keepers will do about it, with what consequences, was very fine indeed. Picked up after reading the author's self-advertisement on Scalzi's blog, which has more.
Madeleine E. Robins, Petty Treason
More alternate-history Regency England private-eye detection (romance-free this time). Very enjoyable; I wish there were more.

Books to Read While the Algae Grow in Your Fur; Enigmas of Chance; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Writing for Antiquity; The Great Transformation; Physics; Complexity; Cthulhiana

Posted by crshalizi at October 31, 2009 23:59 | permanent link

October 13, 2009

The Professions Considered as Pitchers of Icy Refreshing Lemonade

Attention conservation notice: Idle economic musings of a non-economist. Sparked by recent developments, but if you're interested in that you'd be better off elsewhere.

The usual libertarian story about professional licensing requirements — e.g., requiring someone who wants to practice medicine to go to medical school and pass exams, on pain of fines or jail — is that these are simply professionals conspiring in restraint of trade. Licensing simply erects a barrier to entry into the market for medical services, restricting supply and driving up price. Eliminate it, they say, and supply will expand and prices fall.

This presumes, however, that the demand for unlicensed professionals will be equal to the demand for licensed ones. It seems to me very easy to tell a "market for lemons" story here: someone in the market for professional services generally knows very little about how skilled various potential providers actually are. The sellers, however, generally know a lot about their own skill level, or at least more than the potential clients do. (There are no doubt exceptions, such as sincere quacks and the Dunning-Kruger effect, but I don't think that matters for the story.) This is the classic asymmetric information problem from Akerlof, with the usual result: the skilled providers demand more, but the clients have no way of telling them from the unskilled ones, so the only equilibrium is for only unskilled providers to be on the market and for trade to be depressed, or indeed absent. By putting a floor under the competence of professionals, licensing requirements stop the unraveling of the market and increase demand. They get us out of the market for lemons.
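A back-of-the-envelope version of the unraveling, with all numbers invented: provider skill is uniform, clients value skill at a premium, but they can only see the average skill of whoever is willing to sell at the going price.

```python
# Back-of-the-envelope market-for-lemons unraveling (all numbers invented):
# provider skill is uniform on [0, 1], a provider sells only if the price covers
# their skill, and clients will pay only 1.5 times the average skill of whoever
# is still willing to sell at the going price.
price = 1.0
for _ in range(20):
    avg_skill_on_offer = price / 2.0          # only providers with skill <= price remain
    willingness_to_pay = 1.5 * avg_skill_on_offer
    if willingness_to_pay >= price:           # never happens here: 0.75 * price < price
        break
    price = willingness_to_pay                # clients bid the price down...
print("price after unraveling:", round(price, 4))   # ...towards zero, and the market dies
```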

This occurred to me the other day, but it's obvious enough that I'm sure someone wrote it up long ago; where? (And did I read it and forget about it?)

(After-notes: 1. Of course, having told the story I have no idea if it's true of actual markets for professional services; learning that would require rather delicate empirical investigations. Checking the restraint-of-trade fable from Milton Friedman would, naturally, require those same investigations. 2. This doesn't rationalize why professions should be so largely self-governing, nor does it rule out the idea that some licensing requirements are counter-productive barriers to entry. 3. Replacing professional certification with some sort of market-based entity telling consumers about the quality of professional service-sellers won't work, for all the usual reasons that competitive markets are incapable of adequately providing information — to say nothing of the difficulty of telling whether the raters know what they're talking about. 4. Universities are accredited because students and parents would otherwise be in a market for lemons. Universities themselves, however, can tell how skilled those selling academic services are — or at least they're supposed to have that ability. 5. I should re-read Phil Agre on the professionalization of everything and see if it holds up.)

The Dismal Science

Posted by crshalizi at October 13, 2009 22:26 | permanent link

October 09, 2009

Twilight of the Market Gods

My review of Justin Fox's Myth of the Rational Market in American Scientist is out. (Shorter me: read the book.) Sometime soon I'll put up a version with links, which alas don't work in print.

Manual trackback: 3 Quarks Daily

The Dismal Science

Posted by crshalizi at October 09, 2009 00:54 | permanent link

October 08, 2009

Wit and Wisdom of Pittsburgh Bar Patrons (Part 1)

"They [= the Steelers] are like this utterly adorable, totally hot girl next door, who you suddenly realize is everything you've ever wanted in a football team — I mean, girlfriend."

Heard About Pittsburgh PA

Posted by crshalizi at October 08, 2009 23:00 | permanent link

"Completely Random Measures for Bayesian Nonparametrics" (This Year at the DeGroot Lecture)

Attention conservation notice: Only of interest if you (1) care about specifying probability distributions on infinite-dimensional spaces for use in nonparametric Bayesian inference, and (2) are in Pittsburgh.

The CMU statistics department sponsors an annual distinguished lecture series in memory of our sainted founder, Morris H. DeGroot. This year, the lecturer is Michael Jordan. (I realize that's a common name; I mean the one my peers and I wanted to be when we grew up.)

"Completely Random Measures for Bayesian Nonparametrics"
Abstract: Bayesian nonparametric modeling and inference are based on using general stochastic processes as prior distributions. Despite the great generality of this definition, the great majority of the work in Bayesian nonparametrics is based on only two stochastic processes: the Gaussian process and the Dirichlet process. Motivated by the needs of applications, I present a broader approach to Bayesian nonparametrics in which priors are obtained from a class of stochastic processes known as "completely random measures" (Kingman, 1967). In particular I will present models based on the beta process and the Bernoulli process, and will discuss an application of these models to the analysis of motion capture data in computational vision.
(Joint work with Emily Fox, Erik Sudderth and Romain Thibaux.)
Time and place: 4:15 pm on Friday, 16 October 2009, in the Giant Eagle Auditorium in Baker Hall (room A51)

Update: I counted over 210 people in the audience.

Enigmas of Chance

Posted by crshalizi at October 08, 2009 15:02 | permanent link

"High Dimensional Nonlinear Learning using Local Coordinate Coding" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about statistical learning in high-dimensional spaces and (2) are in Pittsburgh.

Since manifold learning has been on my mind this week, owing to trying to teach it in data-mining, I am extra pleased by the scheduling of this talk:

"High Dimensional Nonlinear Learning using Local Coordinate Coding"
Prof. Tong Zhang, Rutgers University
Abstract: We present a new method for learning nonlinear functions in high dimension using semisupervised learning. Our method includes a phase of unsupervised basis learning and a phase of supervised function learning. The learned bases provide a set of anchor points to form a local coordinate system, such that each data point on a high dimensional manifold can be locally approximated by a linear combination of its nearby anchor points, with the linear weights offering its local-coordinate coding. We show that a high dimensional nonlinear function can be approximated by a global linear function with respect to this coding scheme, and the approximation quality is ensured by the locality of such coding. The method turns a difficult nonlinear learning problem into a simple global linear learning problem, which overcomes some drawbacks of traditional local learning methods. The empirical success of our approach has been demonstrated in a recent pascal image classification competition, where the top performance was achieved by an NEC system using this idea.
(Joint work with Kai Yu at NEC Lab America.)
Time and place: 4 pm on Monday, 12 October 2009, in Doherty Hall 310

As always, the seminar is free and open to the public.

Enigmas of Chance

Posted by crshalizi at October 08, 2009 15:01 | permanent link

October 05, 2009

In re John Holland

Having vowed two weeks ago to post something positive at least once a week, I missed last week, with the excuse of being back in Ann Arbor for the celebration of John Holland's 80th birthday at the Center for the Study of Complex Systems. There was no time to post, or even to see everyone I wanted to, but I did actually start writing something about Holland's scientific work, only to realize yesterday I was merely engaged in self-plagiarism, from this, this and this, and probably other things I'd written too, because reading Holland has quite profoundly shaped my thinking. So I'll just point you to the back-catalogue, as it were, and get back to revising a paper I'd never have written if I hadn't read Adaptation in Natural and Artificial Systems.

(So long as I'm talking about the workshop, and without any slight to the other presentations, the neatest work was that by Stephanie Forrest et al. on using genetic programming to evolve bug fixes.)

Complexity; Minds, Brains, and Neurons; Enigmas of Chance

Posted by crshalizi at October 05, 2009 14:30 | permanent link

October 02, 2009

"Analyzing Networks and Learning with Graphs"

See you in Whistler?

Analyzing Networks and Learning with Graphs


a workshop in conjunction with

23rd Annual Conference on Neural Information Processing Systems (NIPS 2009)


December 11 or 12, 2009 (exact date TBD), Whistler, BC, Canada

Deadline for Submissions: Friday, October 30, 2009
Notification of Decision: Monday, November 9, 2009

Overview:

Recent research in machine learning and statistics has seen the proliferation of computational methods for analyzing networks and learning with graphs. These methods support progress in many application areas, including the social sciences, biology, medicine, neuroscience, physics, finance, and economics.

The primary goal of the workshop is to actively promote a concerted effort to address statistical, methodological and computational issues that arise when modeling and analyzing large collections of data that are largely represented as static and/or dynamic graphs. To this end, we aim to bring researchers from applied disciplines such as sociology, economics, medicine and biology together with researchers from more theoretical disciplines such as mathematics and physics, within our community of statisticians and computer scientists. Different communities use diverse ideas and mathematical tools; our goal is to foster cross-disciplinary collaborations and intellectual exchange.

Presentations will include novel graph models, the application of established models to new domains, theoretical and computational issues, limitations of current graph methods and directions for future research.

Online Submissions

We welcome the following types of papers:
  1. Research papers that introduce new models or apply established models to novel domains,
  2. Research papers that explore theoretical and computational issues, or
  3. Position papers that discuss shortcomings and desiderata of current approaches, or propose new directions for future research.
All submissions will be peer-reviewed; exceptional work will be considered for oral presentation. We encourage authors to emphasize the role of learning and its relevance to the application domains at hand. In addition, we hope to identify current successes in the area, and will therefore consider papers that apply previously proposed models to novel domains and data sets.

Submissions should be 4-to-8 pages long, and adhere to NIPS format. Please email your submissions to: nipsgraphs2009 [at] gmail [dot] com

Workshop Format

This is a one-day workshop. The program will feature invited talks, poster sessions, poster spotlights, and a panel discussion. All submissions will be peer-reviewed; exceptional work will be considered for oral presentation. More details about the program will be announced soon.

Organizers

Networks; Enigmas of Chance; Incestuous Amplification

Posted by crshalizi at October 02, 2009 10:24 | permanent link

September 30, 2009

Books to Read While the Algae Grow in Your Fur, September 2009

Alexander Rosenberg, Economics: Mathematical Politics or Science of Diminishing Returns?
Rosenberg is a philosopher of science focused on biology and neo-classical economics. This is his second attempt at assessing the latter. (I haven't read his first go-round and it doesn't seem to be necessary.) Ordinarily, he says, the goal of science is to make increasingly accurate and precise predictions about the world. (This includes the historical sciences like geology and paleontology, which predict new evidence rather than new events at definite future times. [Different branches of astronomy actually make both kinds of predictions.]) Economics, however, is incredibly bad at prediction — certainly at the improving-precision part. Rosenberg does not argue this point very strongly, taking it to be more or less obvious to anyone familiar with the state of economics, and I'm inclined to agree. This picture might be complicated by a detailed consideration of applied econometric models, but even when those work, they are very poorly grounded in economic theory*. (Incidentally, one of the pleasures of reading this was seeing Rosenberg assault Friedman's "Methodology of Positive Economics" essay, whose influence has been profound and utterly malign.) Rosenberg then has two questions: (1) if economics does share the usual goal of science, what are its prospects for achieving it? (2) if it does not have that goal, what is it trying to achieve — or, perhaps better, what is the kind of thing economists do and want to keep doing fitted to achieve?
As to (1) he is intensely skeptical, because he sees microeconomic explanations as grounded in intentional explanations, a not-too-compelling formalization of folk psychology (desires mapping to utility functions and beliefs to subjective probability distributions). He is extremely skeptical, on strictly philosophical grounds, about intentional explanations being made much better in the future than they have been through recorded history. He's also skeptical that we could ever have something like a cognitive-scientific or neuro-scientific theory which explains behavior and recovers folk psychology as a useful approximation in certain domains; I completely fail to follow his argument here. What I seem to understand would imply that he thinks thermostats and self-guided missiles are impossible, so if I am right he should really listen to Uncle Norbert, especially before the inevitable robot uprising (I can just see his last words being, a la pp. 142--143, "this robot can't really be trying to kill me, because if that intention were represented in one part of its computational system, l_1, who is the interpreter who treats the configuration of memory registers in l_1 as expressing this goal? Surely it must be some other sub-system, call it l_2, which reads l_1, but then we face the same question all over again for l_2 — urk!"); but no doubt I am wrong and he has some more reasonable idea which he does not, however, convey. The resounding experimental failure of maximizing-subjectively-expected-utility theory is not addressed. (Perhaps this was less clear in 1994 than it is now, but I doubt it.) I suspect that he would feel any of the models of choice proposed in behavioral economics are subject to the same critique, mutatis mutandis, that he makes of conventional microeconomics, because they're basically intentional.
As to (2), Rosenberg argues as follows. There is a three-way relationship between a discipline's goals, its theories, and its methods: given the goals (say, maximizing predictive accuracy), the theories tell us something about how well different methods will meet the goals. Likewise if you fix the goals and methods, only certain kinds of theories will be acceptable or reachable. And if you fix the theories and methods, you constrain the goals you can attain. (Rosenberg's argument here is very close to that of Larry Laudan in his great book Science and Values, and I think it's correct.) If we take neo-classical methods and theories as given, what might economics be successfully aiming at? Clearly not, by the previous argument, scientific prediction. Rather, Rosenberg offers two possibilities, not mutually exclusive. On the one hand, maybe it's really a species of hyper-formalized social-contract theory from political philosophy, with (as he says) the Walrasian auctioneer in the role of Hobbes's sovereign. Or: maybe it's a species of applied mathematics, interested in the implications of interacting transitive preference orderings. As he says, applied mathematicians are rarely interested in whether their math can, in fact, be applied to the real world — that's not their department.
Excusing economics's poor track-record as an empirical science by saying it's really political philosophy and/or applied math may be a defense worse than the original accusation. As Rosenberg notes, it makes the idea of attending to what economists have to say about policy matters rather odd; at best one should listen to them as much as to any other sect of political philosophers. I would suspect that Rosenberg was proposing this maliciously, but he seems to be sincere and not just good at writing with a straight face. I don't think economics is in quite such a plight as he does, but having just put the book down I admit I'm hard-pressed to articulate why.
*: For instance, when real business cycle theorists and their kin fit dynamic stochastic "general equilibrium"** models to empirical time-series of macro-economic quantities, these time series are first de-trended, i.e., made stationary. This has no justification in the representative-agent story underlying the models, but seems, at present, to be essential to actually getting estimates. Typically the de-trending is done through the "Hodrick-Prescott" filter***, again with no theoretical justification, and the business cycle is operationally defined as "the residuals of the filter". I suspect that most of the predictive ability of DSGEs comes from the filter, plus implicitly doing a moving-average smoothing of the residuals. It would be interesting to pit them against economically-naive nonparametric forecasting (along say these lines).
**: I use the scare-quotes because I don't agree that representative agent models are general equilibrium models.
***: Known in statistics decades before Hodrick and Prescott as a "smoothing spline". (The word "spline" does not appear in their paper, and they are entirely innocent of the vast literature on how much smoothing to do.)
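For the record, the filter itself is just penalized least squares, i.e., a discrete smoothing spline: the trend minimizes the squared deviations from the series plus lambda times the squared second differences of the trend, so it is the solution of a single linear system. A few lines of code implement it; lambda = 1600 is the conventional choice for quarterly data, and the series below is made up.

```python
# The Hodrick-Prescott filter as penalized least squares: the trend tau solves
# (I + lam * D'D) tau = y, where D takes second differences; lam = 1600 is the
# conventional choice for quarterly data. The series below is made up.
import numpy as np

def hp_filter(y, lam=1600.0):
    T = len(y)
    D = np.zeros((T - 2, T))
    for t in range(T - 2):
        D[t, t:t + 3] = [1.0, -2.0, 1.0]      # D @ tau = second differences of tau
    trend = np.linalg.solve(np.eye(T) + lam * (D.T @ D), y)
    return trend, y - trend                   # trend and the "cyclical" residual

rng = np.random.default_rng(4)
y = np.cumsum(rng.normal(size=120)) + 0.1 * np.arange(120)
trend, cycle = hp_filter(y)
print("sd of the 'cycle':", round(cycle.std(), 3))
```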
Sarah Graves, Wreck the Halls
More cozy comfort-reading about sordid multiple homicide. (But whatever happened to Sam's girlfriend from the previous book?) — Sequels.
John Billheimer, Highway Robbery
Well-written, amusing and absorbing mystery novel about a family of highway engineers in West Virginia. The only thing keeping it from being perfect-for-me mind-candy is that part of the plot turns on making fun of environmentalists; but you can't have everything. This is the second book in a series; I've not read the others but will look them up.
Bent Jesper Christensen and Nicholas M. Kiefer, Economic Modeling and Inference
Review: An Optimal Path to a Dead End.
Joe Hill and Gabriel Rodriguez, Locke and Key, vol. 2: Head Games
High-grade comic book mind-candy. Definitely needs the earlier book.
Chelsea Cain, Evil at Heart
Great, if somewhat stomach-turning, mind-candy. (Probably needs the earlier books.) I actually wish Cain did more with the media-frenzy angle, however.
Possible continuity error: Isn't Susan awfully unconcerned about leaving her car alone in really dodgy neighborhoods, after she broke out one of its windows?
Thomas Levenson, Newton and the Counterfeiter: The Unknown Detective Career of the World's Greatest Scientist
Or: Newton demands the noose. — A wonderfully readable little biography of Newton, with the hook of looking at how he tackled his second career as Warden of the Mint, in charge of actually producing the English currency, and of catching and punishing counterfeiters. In particular, Levenson focuses on Newton's pursuit of a counterfeiter of particular skill and temerity, one William Chaloner, providing a great opportunity to explain the criminal underworld in which such figures lived, and the vast opportunities opening up for them as a result of the social transformations of which Newton was at once symbol, beneficiary and further driver. (Any idiot understands stealing a hunk of metal, and almost any idiot can grasp substituting pewter for silver, but the higher reaches of monetary crime require numeracy and comfort with sophisticated abstractions.) This is, in short, a portrait of the foundations of our world being laid, from the intellectual system of rational scientific explanation, to states powerful enough to enforce written laws on millions and raise the funds needed to wage war across the world, to global commerce and flows of money, and stock-market Ponzi schemes in which geniuses lose fortunes. Enthusiastically recommended if any of this sounds the least bit appealing.
Kat Richardson, Vanished
Mind-candy: An American shaman in London. Ends in medias res, though not with a cliff-hanger.
Halbert White, Estimation, Inference, and Specification Analysis
Review: How to Tell That Your Model Is Wrong; and, What Happens Afterwards.
House of Mystery: Love Stories for Dead People
More tales from the bar, plus a really unfortunate basement.
Tiziano Sclavi et al., The Dylan Dog Case Files
No purchase link because I actually dis-recommend it: predictable, tedious, implausible, not scary, excruciating when it tries to be funny, ultimately tiresome. (The drawing is I admit pretty good, but nowhere near the covers Mignola provides for the translated edition.) Is this really that popular in Italy? If so, does the original have virtues which did not survive translation, or does the old country simply have no taste at all in comics?
Phil and Kaja Foglio, Agatha Heterodyne and the Circus of Dreams and Agatha Heterodyne and the Clockwork Princess
Volumes 4 and 5 of Girl Genius. Go read.
I. J. Parker, The Convict's Sword
Converging murder cases in Heian-era Japan. Stands alone, but I enjoyed it more for knowing the back-story. (Previous volumes in the series: 1 and 2, 3, 4, 5.)
Madeleine E. Robins, Point of Honour
Your basic hard-boiled female private-eye detective novel, which also happens to be a historical mystery and a Regency romance; the charming love-child of Jane Austen, or perhaps Georgette Heyer, and Dashiell Hammett. I read it in one sitting from the beginning — "It is a truth universally acknowledged that a Fallen Woman of good family must, soon or late, descend to whoredom" — to the end, and really want the sequel.
(Read following up on an old review by Kate Nepveu.)
Update: The sequel is as good.

Books to Read While the Algae Grow in Your Fur; Enigmas of Chance; The Dismal Science; Pleasures of Detection, Portraits of Crime; Scientifiction and Fantastica; Philosophy; Writing for Antiquity; The Great Transformation

Posted by crshalizi at September 30, 2009 23:59 | permanent link

September 25, 2009

Miniature Pearl

It would be wrong to say that Judea Pearl knows more about causal inference than anyone else — I can think of some rivals very close to where I'm writing this — but he certainly knows a lot, and has worked tirelessly to formulate and spread the modern way of thinking about the subject, centered around graphical models and their associated structural equations. I remember spending many happy hours with his book Causality when it came out in 2000, and look forward to spending more with the new edition, which is making its way to me through the mail now. In the meanwhile, however, there is what he describes as "A new survey paper, gently summarizing everything I know about causation (in only 43 pages)":

"Causal Inference in Statistics: An Overview", forthcoming in Statistics Surveys 3 (2009): 96--146 [Free PDF]
Abstract: This review presents empirical researchers with recent advances in causal inference, and stresses the paradigmatic shifts that must be undertaken in moving from traditional statistical analysis to causal analysis of multivariate data. Special emphasis is placed on the assumptions that underly all causal inferences, the languages used in formulating those assumptions, the conditional nature of all causal and counterfactual claims, and the methods that have been developed for the assessment of such claims. These advances are illustrated using a general theory of causation based on the Structural Causal Model (SCM) described in Pearl (2000a), which subsumes and unifies other approaches to causation, and provides a coherent mathematical foundation for the analysis of causes and counterfactuals. In particular, the paper surveys the development of mathematical tools for inferring (from a combination of data and assumptions) answers to three types of causal queries: (1) queries about the effects of potential interventions, (also called "causal effects" or "policy evaluation") (2) queries about probabilities of counterfactuals, (including assessment of "regret," "attribution" or "causes of effects") and (3) queries about direct and indirect effects (also known as "mediation"). Finally, the paper defines the formal and conceptual relationships between the structural and potential-outcome frameworks and presents tools for a symbiotic analysis that uses the strong features of both.

The paper assumes a reader who's reasonably well-grounded in statistics, though not necessarily in the causal-inference literature. (Of such readers, I imagine applied economists might have more unlearning to do than most, because they will keep asking "but when do I start estimating beta?") It's not ideally calibrated for a reader coming from, say, machine learning.

One theme running through the paper is the futility of trying to define causality in purely probabilistic terms, and the fact that cases where it looks like one can do so are really cases where causal assumptions have been smuggled in. Another is that once you realize counterfactual or mechanistic assumptions are needed, the graphical-models/structural equation framework makes it immensely easier to reason about them than does the rival "potential outcomes" framework. In fact, the objects which the potential outcomes framework takes as its primitives can be constructed within the structural framework, so the correct part of the former is a subset of the latter. And by reasoning on graphical models it is easy to see that confounding can be introduced by "controlling for" the wrong variables, something explicitly denied by leading members of the potential-outcomes school. (Pearl quotes them making this mistake, and manages to pull off a more-in-sorrow-than-in-glee tone while doing so.) Mostly, however, the paper is about showing off what can be done within the new framework, which is really pretty impressive, and ought to be part of the standard tool-kit of data analysis. If you are not already familiar with it, this is an excellent place to begin, and if you are you will enjoy the elegant and comprehensive presentation.
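The point about "controlling for" the wrong variables is easy to see by simulation (a generic toy example, not Pearl's own): two independent causes of a common effect become correlated as soon as you condition on that effect.

```python
# Generic illustration (not Pearl's own example) of how "controlling for" the
# wrong variable creates confounding: X and Y are independent causes of the
# collider Z, but among cases selected on Z they are strongly correlated.
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = rng.normal(size=100_000)
z = x + y + rng.normal(size=100_000)           # Z is a common effect of X and Y

print("corr(X, Y), unconditionally: %+.3f" % np.corrcoef(x, y)[0, 1])
sel = z > 1.0                                  # "controlling for" Z by selecting on it
print("corr(X, Y), given Z > 1:     %+.3f" % np.corrcoef(x[sel], y[sel])[0, 1])
```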


Looking back over what I write in this blog, I feel like, on the one hand, there's too little of it lately, and on the other hand, it's too tilted towards negative, critical stuff. While not regretting at all being negative and critical about stupid ideas that need to be criticized (or, really, pulverized), I will try to expand and balance my output by posting at least once a week on some good science. We'll see how this goes.


Enigmas of Chance

Posted by crshalizi at September 25, 2009 10:12 | permanent link

September 23, 2009

Next Week at the Statistics Seminar: Selecting Demanding Models

Attention conservation notice: Only of interest if you (1) care about statistical model selection and (2) are in Pittsburgh on Monday afternoon.
"Composite Likelihood Bayesian Information Criteria for Model Selection in High-Dimensional Data"
Prof. Peter Song, Dept. of Biostatistics, University of Michigan
Abstract: For high-dimensional data with complicated dependency structures, the full likelihood approach often renders to intractable computational complexity. This imposes difficulty on model selection as most of the traditionally used information criteria require the evaluation of the full likelihood. We propose a composite likelihood version of the Bayesian information criterion (BIC) and establish its consistency property for the selection of the true underlying model. Under some mild regularity conditions, the proposed BIC is shown to be selection consistent, where the number of potential model parameters is allowed to increase to infinity at a certain rate of the sample size. Simulation studies demonstrate the empirical performance of this new BIC criterion, especially for the scenario that the number of parameters increases with the sample size.
Place and Time: 4--5 pm on Monday, 28 September 2009, in Doherty Hall A310

As always, the seminar is free and open to the public.

Enigmas of Chance

Posted by crshalizi at September 23, 2009 16:53 | permanent link

September 09, 2009

My Combinatorics Problem, Let Me Show You It

Because I am too lazy to solve it myself, and too dumb to find it in the literature: How many ways are there to file N letters among k folders, such that no folder is empty? (What I am really interested in doing is counting the number of non-isomorphic functions of a discrete random variable.) The first reader to provide a solution gets a free year's subscription to the blog.

Update, 2 hours later: Thanks to reader M.L. for pointing me to Stirling numbers of the second kind, and to readers L.Z. and Jake Hofman for reading the sentence in parentheses and directing me to Bell numbers and counting surjections. Also to the four (!) others who wrote in after them with solutions. — Now that I see the answer, I suspect I did this problem in one of David Griffeath's classes...
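For the record, the inclusion-exclusion answer the readers pointed me to: with distinguishable folders, the number of ways to file N letters among k folders with none empty is the number of surjections, the sum over j from 0 to k of (-1)^j C(k,j) (k-j)^N, which equals k! S(N,k) with S(N,k) the Stirling number of the second kind; divide by k! (i.e., take S(N,k) itself) if the folders are interchangeable. A few lines check this against brute force.

```python
# The count by inclusion-exclusion, checked against brute force for small cases:
# the number of ways to file N letters into k distinguishable folders with none
# empty (i.e., the number of onto maps) is sum_j (-1)^j C(k, j) (k - j)^N = k! S(N, k).
from math import comb, factorial
from itertools import product

def surjections(N, k):
    return sum((-1)**j * comb(k, j) * (k - j)**N for j in range(k + 1))

def brute_force(N, k):
    return sum(1 for f in product(range(k), repeat=N) if len(set(f)) == k)

for N, k in [(4, 2), (5, 3), (6, 4)]:
    assert surjections(N, k) == brute_force(N, k)
    # labeled-folder count, then the unlabeled (Stirling) count
    print(N, k, surjections(N, k), surjections(N, k) // factorial(k))
```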

Mathematics

Posted by crshalizi at September 09, 2009 14:33 | permanent link

"Two Problems on Stirring Processes --- and their Solutions" (This Week at the Statistics Seminar)

Attention conservation notice: Irrelevant unless you are (a) interested in interacting particle systems and/or stochastic processes on graphs, and (b) in Pittsburgh.

This week (in fact, tomorrow!) at the statistics seminar:

Thomas M. Liggett, "Two Problems on Stirring Processes — and their Solutions"
Abstract: Consider a finite or infinite graph G = (V, E), together with an assignment of nonnegative rates c_e to the edges e in E. For each edge e, there is a Poisson process Π_e of rate c_e. A Markov process is defined on this structure by placing labels on V, and then interchanging the labels at the two vertices joined by e at the event times of Π_e. If one considers the motion of just one label, this is a random walk on V. If all labels are 0 or 1, it is the symmetric exclusion process on V. If all labels are distinct, it is a random walk on the set of permutations of V. In this talk, I will describe recent solutions to two problems about these processes:
1. Suppose G = Z^1 and c_e = 1 for each edge. Consider the symmetric exclusion process starting with the configuration ... 1 1 1 0 0 0 ..., and let N_t be the number of 1's to the right of the origin at time t. This is a sum of negatively correlated Bernoulli random variables. In 2000, R. Pemantle asked whether N_t satisfies a central limit theorem. I will explain how a new negative dependence concept leads to a positive answer to this question. This is partly based on joint work with J. Borcea and P. Brändén.
2. Suppose G is the complete graph on n vertices, and consider both the random walk on V and the random walk on the set of permutations of V. Each is a reversible, finite-state Markov chain, with n and n! states respectively. The exponential rate of convergence to equilibrium (which is the uniform distribution) for such a chain is determined by the smallest non-zero eigenvalue of -Q, where Q is the transition rate matrix of the chain. Let l_1 and l_2 be these values for the two processes. It is elementary that l_2 ≤ l_1. Based on explicit computations in some special cases, D. Aldous conjectured in 1992 that l_1 = l_2. I will describe some elements of the approach that leads to a proof of this conjecture. This is joint work with P. Caputo and T. Richthammer.
Time and Place: Thursday, 10 September 2009, 4:30--5:30 pm, Giant Eagle Auditorium (Baker Hall A53)

The seminar is free and open to the public.
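As a small numerical aside to the second problem in the abstract (my own toy check, not anything from the talk): for the complete graph on a few vertices, with every edge rate c_e = 1, one can write down both generators explicitly and see that the two spectral gaps agree, as the now-proved conjecture says they must. A sketch in R, with n = 4, where both gaps come out equal to n:

perms <- function(v) {                        # all permutations of a vector
  if (length(v) == 1) return(matrix(v, 1, 1))
  out <- NULL
  for (i in seq_along(v))
    out <- rbind(out, cbind(v[i], perms(v[-i])))
  out
}
gap <- function(Q) {                          # smallest non-zero eigenvalue of -Q
  ev <- sort(Re(eigen(-Q)$values))
  min(ev[ev > 1e-8])
}
n <- 4
Q1 <- matrix(1, n, n); diag(Q1) <- -(n - 1)   # one label: rate-1 jumps to every other vertex
P <- perms(1:n)                               # states of the walk on permutations
key <- apply(P, 1, paste, collapse = " ")
Q2 <- matrix(0, nrow(P), nrow(P))
for (s in seq_len(nrow(P)))
  for (i in 1:(n - 1)) for (j in (i + 1):n) { # swap the labels at i and j, rate 1 per edge
    sigma <- P[s, ]; sigma[c(i, j)] <- sigma[c(j, i)]
    Q2[s, match(paste(sigma, collapse = " "), key)] <- 1
  }
diag(Q2) <- -rowSums(Q2)
c(lambda_1 = gap(Q1), lambda_2 = gap(Q2))     # both equal n, here 4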

Enigmas of Chance; Mathematics

Posted by crshalizi at September 09, 2009 14:24 | permanent link

August 31, 2009

Books to Read While the Algae Grow in Your Fur, August 2009

Nadia Gordon, Murder Alfresco
Bright and amusing murder mystery among Napa Valley foodies. Sequel to Death by the Glass, followed by Lethal Vintage.
Roberto Bolaño, Nazi Literature in the Americas
Capsule literary biographies of thirty imaginary fascist writers, from the US to (of course) Chile and Argentina. It gains from two remarkable achievements on Bolaño's part: first, while everything is made up, nothing is exaggerated (I feel certain I have read books by both J. M. S. Hill and Zach Sodenstern, and as for the Colonia Renacer in "Willy Schürholz", well...); and second, his literary Nazis are not just caricatures, and in some cases (e.g., "Irma Carrasco") actually affecting. Emphatically recommended if you are in the mood for black hilarity.
Thomas Harlan, Land of the Dead
Third volume of his series of Lovecraftian alternate-history space operas. (On which, see here.) More stuff-blowing-up than I remember from the earlier books, but still plenty of ancient extraterrestrial secrets man was not meant to know.
Greg Rucka and Steve Lieber, Whiteout: Melt
More Antarctic crime-fiction, this time in thriller rather than mystery mode.
The Middleman
I think I am demographically compelled to be charmed by this; and I am.
G. Willow Wilson and M. K. Perker, Air: Letters from Lost Countries
Shivers, a.k.a. They Came From Within
Still scary and disturbing, despite decades of intervening horror movies about zombies, parasites crawling inside people, etc. (Ebert's contemporary review is informative without having much spoilerage.) The crisis was effective, and (in a twisted way) romantic.
Observation: lots of attitudes have certainly changed ("she was twelve"); and while gadgets, clothes, and cars look out-of-date, the kind of life depicted is still very much ours. You could remake this today with hardly a change to the plot at all, except that you'd need to keep everybody's cell-phones from working.
Query: why did Romero's zombies (i.e., the re-animated dead) take over the world, rather than Cronenberg's parasite hosts? Did the latter involve too much sex to fly commercially?
Alexandra Sokoloff, The Harrowing
Pleasingly creepy ghost-story about emotionally-scarred undergrads who really should not have played around with Ouija boards. First novel; good enough that I'll look for her others.
Criminal Minds
When did the networks start broadcasting Shadow Unit fan-fiction? (And is it too late for me to change the grade of the student last year who "accidentally" referred to me as "Dr. Reid"?) — Series fatigue set in for me about half-way through the third season.
ObLink: Gladwell on profiling.
Charles Manski, Identification for Prediction and Decision
Review: Better Roughly Right than Exactly Wrong.
Jacob Kogan, Introduction to Clustering Large and High-Dimensional Data
What it says on the label. A short book (160 pp., excluding math review appendix and problem solutions) exclusively devoted to "hard" or "crisp" non-hierarchical clustering, mostly ignoring statistical issues in favor of computational ones, and emphasizing methods that scale to large problems. This includes ingenious tricks for replacing the difficult optimization problems implicit in most clustering algorithms with tractable smooth approximations. The old standby k-means algorithm plays a bigger role than I'd have guessed. Mostly pretty clear, though not an outstanding piece of exposition. (I think the BIRCH algorithm on pp. 42--43 is wrong as written, since step 2 seems to be redundant!) More useful for those in the field, I think, than for those who just want to cluster some data and get on with their lives.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Enigmas of Chance; Pleasures of Detection; Cthulhiana; The Running Dogs of Reaction; The Commonwealth of Letters; The Dismal Science

Posted by crshalizi at August 31, 2009 23:59 | permanent link

August 20, 2009

"Econometric Shrinkage and Model Averaging" (Week-after-next at the Statistics Seminar)

Attention conservation notice: Irrelevant unless you are (a) interested in combining statistical models and (b) in Pittsburgh.

Week after next at the statistics seminar:

Bruce E. Hansen, "Econometric Shrinkage and Model Averaging"
Abstract: Model uncertainty is pervasive in applied econometrics. The traditional solution of model selection is being supplanted by the concept of model averaging. When there are two nested models, model averaging is equivalent with shrinkage estimation. In econometrics, shrinkage theory has been confined to the linear Gaussian regression model, precluding application to most econometric contexts. In this talk, I show that we can apply the modern theory of statistical shrinkage to parametric econometric estimators, including GMM and MLE. The result is that we can construct shrinkage estimators which globally dominate conventional unrestricted GMM and MLE estimators. I extend the classic theory by allowing for arbitrary estimators and weight matrices, and I show how the methods can be used to separate parameters of interest from nuisance parameters. The reduction in risk from shrinkage can be substantial.
Model averaging generalizes shrinkage to the case where the number of models exceeds two. Non-Bayesian model averaging methods have been developed by the author in previous work. The talk will discuss the development of non-Bayesian model averaging methods for general econometric estimators.
Time and Place: Monday, 31 August 2009, 4--5 pm, Doherty Hall 310

The seminar is free and open to the public. Contact me if you would like to meet with Prof. Hansen during his visit to CMU.
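(For anyone who wants a toy feel for what "globally dominating" the MLE can look like, here is a small R simulation of the classic James-Stein shrinkage of a vector of Gaussian means; the numbers are arbitrary, and this is the textbook warm-up example, not Prof. Hansen's construction.)

set.seed(1)
p <- 10
theta <- rnorm(p)                    # an arbitrary "true" mean vector
one.draw <- function() {
  x  <- theta + rnorm(p)             # X ~ N(theta, I); the MLE is just X
  js <- (1 - (p - 2) / sum(x^2)) * x # James-Stein: shrink X towards the origin
  c(mle = sum((x - theta)^2), js = sum((js - theta)^2))
}
rowMeans(replicate(1e4, one.draw())) # MLE risk is p = 10; the shrunken risk is smaller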

Enigmas of Chance; The Dismal Science

Posted by crshalizi at August 20, 2009 14:55 | permanent link

August 17, 2009

Course Announcement: 36-350, Data Mining, Fall 2009

Since the semester starts in a lamentably small number of days:

Title: 36-350, Statistical Data Mining
Prereqs: One of 36-226, 36-310, 36-625, or consent of instructor. In addition, familiarity with vectors and matrices, and comfort with programming, will be very helpful.
Lectures: MWF 10:30--11:20, Porter Hall 226B. (The on-line class schedule thinks the Friday lecture is a lab; it's wrong.)
Course description:
Data mining is the art of extracting useful patterns from large bodies of data; finding seams of actionable knowledge in the raw ore of information. The rapid growth of computerized data, and the computer power available to analyze it, creates great opportunities for data mining in business, medicine, science, government, etc. The aim of this course is to help you take advantage of these opportunities in a responsible way. After taking the class, when you're faced with a new problem, you should be able to (1) select appropriate methods, and justify their choice, (2) use and program statistical software (i.e., R) to implement them, and (3) critically evaluate the results and communicate them to colleagues in business, science, etc.
Data mining is related to statistics and to machine learning, but has its own aims and scope. Statistics is a mathematical science, studying how reliable inferences can be drawn from imperfect data. Machine learning is a branch of engineering, developing a technology of automated induction. We will freely use tools from statistics and from machine learning, but we will use them as tools, not things to study in their own right. We will do a lot of calculations, but will not prove many theorems, and we will do even more experiments than calculations.

The current topic outline, the grading policy, etc., can all be found on the class webpage. This will mostly be very similar to the 2008 iteration of the class, since it seemed to work, with some modifications in light of that experience. Podcast lectures are probably not going to happen, owing to technical incompetence on my part.

(Oh, and in case you're wondering: I'm behind on answering everyone else's email too, not just yours.)

Corrupting the Young; Enigmas of Chance; Self-Centered

Posted by crshalizi at August 17, 2009 14:17 | permanent link

July 31, 2009

Books to Read While the Algae Grow in Your Fur, July 2009

Greg Rucka and Steve Lieber, Whiteout
Antarctic murder-mystery comic book.
Ed Brubaker, Greg Rucka and Michael Lark, Gotham Central: In the Line of Duty
It's Homicide in comic-book form. How can I not enjoy it? (Even the appearances of superheroes are OK.)
Fall of Cthulhu: Apocalypse
Thoroughly incomprehensible without the earlier installments in the saga; with them, a satisfactory conclusion.
Katha Pollitt, The Mind-Body Problem: Poems
It's hard to pick just one to quote, they're all so good; how about "Atlantis"?
Dreaming of our golden boulevards and temples
our painted palaces set in torchlit gardens,
our spires and minarets, our emerald harbor,
you won't want to hear about the city we knew:

the narrow neighborhoods of low white houses
where workmen come home for lunch and an afternoon nap,
old women in sweat-stained penitential black
ease their backaches gratefully against doorways

and the widow who keeps the corner grocery
anxiously watches her child dragging his toy
who was sickly from birth and everyone knows must die soon.
You won't want to know how we lived,

the hot sun, the horse traders cheating each other out of boredom,
in the brothels the prostitutes curling each other's hair
while the madam limps upstairs to feed the canary,
the young louts smoking in bare cafés

where old men play dominoes for glasses of cognac—
and how can we blame you?
We too were in love with something we never could name.
We never could let ourselves say

that the way the harbor flashes like bronze at sunset
or the hill towns swam in the twilight like green stars
were only tricks of the light and meant nothing.
We too believed that a moment would surely come

when our lives would stand hard and pure, like marble statues.
And because we were, after all, only a poor city,
a city like others, of sailors' bars and sunflowers,
we gave ourselves up to be only a name,

an image of temples and spires and jeweled gardens—
for which reasons we are envied of all peoples,
and even now could not say
what life would have to be, for us to have chosen it.

Pollitt has a pretty big emotional range as a poet, but some of the things you see here — the mundane details, the melancholy strain, the yearning for some half-sensed thing beyond — are recurring and effective. (There, now I have propitiated the gods of fair use.)
Adam-Troy Castro, Emissaries from the Dead
Mind-candy: scientifictional mystery story. I had to read it once I knew it featured murderous giant alien sloths, and I have a weakness for deliberately abrasive detectives who get things done. I'll look up the sequel, but hope there are fewer moments of personal growth. Spoilers: Castro does not seem to have thought through the Nigh-All-knowing Alien AIs business. In the first place, wouldn't any government be intensely suspicious about letting them tamper with human bodies for medicine? And NAAAIs finding human beings interesting because they're so unpredictable and creative, but only if the neurological mind-control powers aren't applied too forcefully? That sort of thinking was out of date a long time ago.
Philip P. Wiener, Evolution and the Founders of Pragmatism
I read this book in 2002, when it was called Louis Menand's The Metaphysical Club; oddly, Wiener's was published in 1949. — Seriously, there are many parallels: the biographical approach, the emphasis on the impact of Darwin and the discussions at the "metaphysical club" in Cambridge in the 1870s, the inclusion of Holmes as a major figure, the summing-up of the usable elements of classical pragmatism... Menand indulges in a lot more guesswork about his subjects' inner lives than Wiener, and is a much more engaging writer than the latter, who is frankly rather dry. In all, it seems a bit un-generous of Menand to cite Wiener's book only once, as one of "two works of intellectual history that were especially germane" (p. 448).
Karin Slaughter, Undone
Slaughter restores my faith in thrillers where astoundingly gruesome crimes are solved by seriously messed up detectives. She also somehow manages to be funny.
Diana Rowland, Mark of the Demon
Loren R. Graham and Jean-Michel Kantor, Naming Infinity: A True Story of Religious Mysticism and Mathematical Creativity
Review: Proving Nothing.
Brandon Sanderson, The Final Empire
What happens when the Dark Lord of an epic fantasy triumphs, and rules with terror and magic for a thousand years? More specifically, this might perhaps best be understood as exploring the question "What if Aragorn had seized the One Ring?" (this is not a spoiler), unlike, say, Abercrombie's First Law trilogy, whose point of departure is discussed behind that link. — The first volume of (what else?) a trilogy; I'll be reading the others, but hope they say less about how magic works.
Frank Partnoy, F.I.A.S.C.O.: Blood in the Water on Wall Street
Hilarious, if ultimately very sad, memoir of being a derivative salesman at Morgan Stanley in the early 1990s. Some of the products (i.e., scams) were truly remarkable; Partnoy is good at explaining them.
Jessica Abel et al., Life Sucks
Bryan Lee O'Malley, Scott Pilgrim's Precious Little Life
Justin Fox, The Myth of the Rational Market: A History of Risk, Reward, and Delusion on Wall Street
Full-length review in progress. In the meanwhile, read Steve Laniel. — The review is out.
Sarah Langan, The Keeper
Wow: that was well-written and creepy as hell.
Spoilers, in which someone who can't tell an anecdote to save his life complains about the construction of a work of fiction: While the book really is great, I thought the ending showed a failure of nerve on Langan's part which hurt it for me. Susan Marley becomes the town's scapegoat, only to return from the dead to bring retribution for all the sins they hoped to escape, somewhere between zombie and angel of literally apocalyptic destruction. And then she stops because she gets to see that, despite all the horrible things done to her, life is never without hope and a bright side, and becomes a benign tutelary spirit, the "keeper" of the town. (I've loaned out my copy already so I can't quote exactly, but that was the gist of it.) Well, life should never be without those things, but it's really not hard to think of lots of examples where, to all appearances, it has been. I like happy endings even to my horror stories (perhaps especially to them), but I feel like Langan cheated to get this one, that either something was left out that would make this convincing, or that the story really ends in utter desolation.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; The Dismal Science; Writing for Antiquity; Philosophy; The Commonwealth of Letters; Mathematics

Posted by crshalizi at July 31, 2009 23:59 | permanent link

June 30, 2009

Books to Read While the Algae Grow in Your Fur, June 2009

Walter Jon Williams, This Is Not a Game
Williams has been one of my favorite writers since I read Knight Moves in college, and I keep expecting him to get Discovered. With any luck, this novel — a techno-thriller about alternate reality games, gold-farming, third-world currency crises, social engineering on every possible scale, revenge, the power of story-telling, betrayal, moments of ad hoc solidarity, and generally what it's like to live in the early twenty-first century (only turned up to 10.5) — will do the trick. (But Dread Empire's Fall is remarkable too, to say nothing of the rest of his back-list.)
Update: see also Jo Walton.
Friedrich T. Sommer and Andrzej Wichert (eds.), Exploratory Analysis and Data Modeling in Functional Neuroimaging
Solid, if now a little dated. The chapters I found most interesting were those by Tang and Pearlmutter on independent component analysis, and by Tolias, Kourtzi and Logothetis on exploiting adaptation (habituation) to stimuli.
Kat Richardson, Underground
John Kenneth Galbraith, The Great Crash: 1929
I first read this book for a high school history class in October 1987, just before everything went ker-splat. It really does hold up exceptionally well: the origins of the bubble, its character, its collapse, and its aftermath, all elegantly laid out.
(One thing I've noticed from re-reading Galbraith is that I've gone from simply enjoying the way he could write to admiring it. Creating a style like that, maintaining it consistently and deploying it flexibly is hard.)
Laura E. Reeve, Peacekeeper
Mind-candy; about using alcohol to avoid dealing with guilt over having probably committed war crimes, but mind-candy. (Not only does the cover have nothing to do with the contents of the book, the heroine is clearly described in the first chapter and looks nothing like that.) Update: sequel.
Gillian Tett, Fool's Gold: How the Bold Dream of a Small Tribe at J. P. Morgan Was Corrupted by Wall Street Greed and Unleashed a Catastrophe
Full-length review: The Tragedy of Getting What You Want.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Enigmas of Chance; The Dismal Science; Minds, Brains, and Neurons

Posted by crshalizi at June 30, 2009 23:59 | permanent link

June 19, 2009

Page vs. Shalizi in Live

From now until 7:30, I am hosting a "salon" on Scott Page's The Difference at firedoglake. I would like to thank United Airlines for making sure that this announcement could not be made with any sort of reasonable lead-time.

(Title a silly reference to one of my favorite albums of all time.)

Posted by crshalizi at June 19, 2009 17:00 | permanent link

June 16, 2009

Because You Really Wished I'd Write More About Books

... I present for your amusement my reviews of Flynn on the Flynn Effect, and Tett on synthetic collateralized debt obligations and their kin. The former is the original version of something which had to be drastically cut down to fit into American Scientist; the latter appears nowhere else. (But, y'know, make me an offer.)

Anyone who tries to take this as an opportunity to drag me back into arguing about IQ will be ignored.

IQ; The Great Transformation; The Dismal Science; The Continuing Crises; Self-centered

Posted by crshalizi at June 16, 2009 15:37 | permanent link

On the Certainty of the Bayesian Fortune-Teller

Attention conservation notice: 2300 words of technical, yet pretentious and arrogant, dialogue on a point which came up in a manuscript-in-progress, as well as in my long-procrastinated review of Plight of the Fortune Tellers. Why don't you read that book instead?

Q: You really shouldn't write in library books, you know; and if you do, your marginalia should be more helpful, or less distracting, than just "wrong wrong wrong!"

A: No harm done; my pen and I are both transparent rhetorical devices. And besides, Rebonato is wrong in those passages.

Q: Really? Isn't his point that it's absurd to pretend you could actually estimate something like a probability of an interest rate jump so precisely that there's a real difference between calling it 0.500 000 and calling it 0.499 967? Isn't it yet more absurd to think that you could get the 99.5 percent annual value-at-risk — the amount of money you'd expect to lose once in two thousand years — down to four significant figures, from any data set, let alone one that covers just five years and so omits "not only the Black Death, the Thirty Years' War, the Barbarian invasions, and the fall of the Roman Empire, but even the economic recession of 1991 — the only meaningful recession in the last twenty years" (as of 2006), to say nothing of the "famous corporate loan book crises of the Paleochristian era" (p. 218)?

A: Of course all that's absurd, and Rebonato is right to call people on it. By the time his book came out it was too late to do much good, but if people had paid attention to such warnings I dare say we wouldn't be quite so badly off now, and they had better listen in the future.

Q: So what's your problem. Oh, wait, let me guess: you're upset because Rebonato's a Bayesian, aren't you? Don't bother, I can tell that that's it. Look, we all know that you've got objections to that approach, but at this point I'm starting to think that maybe you have issues. Isn't this sort of reflexive hostility towards a whole methodology — something you must run into every day of work — awkward and uncomfortable? Embarrassing, even? Have you thought about seeking help?

A: Actually, I have a serious point to make here. What Rebonato wants is entirely right-headed, but it fits very badly with his Bayesianism, because Bayesian agents are never uncertain about probabilities; at least, not about the probability of any observable event.

Q: But isn't Bayesianism about representing uncertainty, and making decisions under uncertainty?

A: Yes, but Bayesian agents never have the kind of uncertainty that Rebonato (sensibly) thinks people in finance should have.

Q: Let me try to pin you down in black and white. [Opens notebook] I have here on one side of the page our old friend, the well-known probability space (Omega, F, Prob). Coming out of it, in the middle, is a sequence of random variables X1, X2, ... , Xn, ... , which have some joint distribution or other. (And nothing really depends on its being a sequence, I could use a random field on a network or whatever you like, add in covariates, etc.) On the other side of the random variables, looking at them, I have a standard-issue Bayesian agent. The agent has a hypothesis space, each point m of which is a probability distribution for the random sequence. This hypothesis space is measurable, and the agent also has a probability measure, a.k.a. prior distribution, on this space. The agent uses Bayes's rule to update the distribution by conditioning, so it has a sequence of measures D0, D1, etc.

A: I think you are missing an "As you know, Bob", but yes, this is the set-up I have in mind.

Q: Now I pick my favorite observable event f, a set in the joint sigma-field of the Xi. For each hypothesis m, the probability m(f) is well-defined. The Bayesian thinks this is a random variable M(f), since it has a distribution D on the hypothesis space. How is that not being uncertain about the probability of f?

A: Well, in the first place —

Q: I am not interested in quibbles about D being a Dirac delta function.

A: Fine, assume that D doesn't put unit mass on any single hypothesis, and that it gives non-zero weight to hypotheses with different values of m(f). But remember how Bayesian updating works: The Bayesian, by definition, believes in a joint distribution of the random sequence X and of the hypothesis M. (Otherwise, Bayes's rule makes no sense.) This means that by integrating over M, we get an unconditional, marginal probability for f:

Pn(f) = EDn[M(f|X1=x1, X2=x2, ... , Xn=xn)]

Q: Wait, isn't that the denominator in Bayes's rule?

A: Not quite, that equation defines a measure — the predictive distribution — and the denominator in Bayes's rule is the density of that measure (with n=0) at the observed sequence.

Q: Oh, right, go on.

A: As an expectation value, Pn(f) is a completely precise number. The Bayesian has no uncertainty whatsoever in the probabilities it gives to anything observable.
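(A minimal sketch of this point in R, assuming a toy beta-binomial set-up of my own devising: the posterior over the coin's bias can be as spread out as you please, but the predictive probability of the observable event "the next toss comes up heads" is one exact number.)

a <- 2; b <- 2                           # posterior over the bias: Beta(2, 2)
curve(dbeta(x, a, b), 0, 1)              # plenty of spread about the bias itself
a / (a + b)                              # predictive P(next toss is heads): exactly 0.5
mean(rbinom(1e5, 1, rbeta(1e5, a, b)))   # Monte Carlo check of the same integral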

Q: But won't those probabilities change over time, as it gets new data?

A: Yes, but this just means that the random variables aren't independent (under the Bayesian's distribution over observables). Integrating m with respect to the prior D0 gives us the infinite-dimensional distribution of a stochastic process, one which is not (in general) equal to any particular hypothesis, though of course it lies in their convex hull; the simple hypotheses are extremal points. If the individual hypotheses are (laws of) independent, identically-distributed random sequences, their mixture will be exchangeable. If the individual hypotheses are ergodic, their mixture will be asymptotically mean-stationary.
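(A tiny numerical rendering of that dependence, with a made-up fifty-fifty mixture of two IID coins: under the mixture, which is the process the Bayesian actually predicts with, successive flips are exchangeable but positively correlated, hence not independent.)

set.seed(4)
flips <- t(replicate(1e5, {
  p <- sample(c(0.2, 0.8), 1)     # draw a hypothesis from the prior
  rbinom(2, 1, p)                 # two flips, IID given that hypothesis
}))
cor(flips[, 1], flips[, 2])       # about 0.36 under the mixture, not 0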

Q: Don't you mean "stationary" rather than "asymptotically mean-stationary"?

A: No; see chapter 25 here, or better yet that trifler's authority.

Q: You were saying.

A: Right. The Bayesian integrates out m and gets a stochastic process where the Xi are dependent. As far as anything observable goes, the Bayesian's predictions, and therefore its actions, are those of an agent which treats this stochastic process as certainly correct.

Q: What happens if the Bayesian agent uses some kind of hierarchical model, or the individual hypotheses are themselves exchangeable/stationary?

A: The only thing that would change, for these purposes, is the exact process the Bayesian is committed to. Convex mixtures of convex mixtures of points in a set C are still just convex mixtures of points in C.

Q: So to sum up, you're saying that the Bayesian agent is uncertain about the truth of the unobservable hypotheses (that's their posterior distribution), and uncertain about exactly which observable events will happen (that's their predictive distribution), but not uncertain about the probabilities of observables.

A: Right. (Some other time I'll explain how that helps make Bayesian models testable.) And — here's where we get back to Rebonato — all the things he is worried about, like values-at-risk and so forth, are probabilities of observable events. Put a Bayesian agent in the risk-modeling situation he talks about, and it won't just say that the 99.5% VaR is 109.7 million euros rather than 110 million, it will give you as many significant digits as you have time for.

Q: So let me read you something from pp. 194--195:

Once frequentists accept (at a given statistical level of confidence) the point estimate of a quantity (say, a percentile), they tend to act as if the estimated number were the true value of the parameter. Remember that, for a frequentist, a coin cannot have a 40% chance of being biased. Either the coin is fair or it is biased. Either we are in a recession or we are not. We simply accept or reject these black-or-white statements at a certain confidence level... A Bayesian approach automatically tells us that a parameter (say, a percentile) has a whole distribution of possible values attached to it, and that extracting a single number out of this distribution (as I suggested above, the average, the median, the mode, or whatever) is a possibly sensible, but always arbitrary, procedure. No single number distilled from the posterior distribution is a primus inter pares: only the full posterior distribution enjoys this privileged status, and it is our choice what use to make of it.

This seems entirely reasonable; where do you think it goes wrong?

A: You mean, other than the fact that point estimates do not have "statistical levels of confidence", and that Rebonato has apparently forgotten about actual confidence intervals?

Q: Let's come back to that.

A: He is running together parameters of the unobserved hypotheses, and the properties of the predictive distribution on which the Bayesian acts. I can take any function I like of the hypothesis, g(m) say, and use it as a parameter of the distribution. If I have enough parameters gi and they're (algebraically) independent of each other, there's a 1-1 map between hypotheses and parameter vectors — parameter vectors are unique names for hypotheses. I could make parts of those names be readily-interpretable aspects of the hypothetical distributions, like various percentiles or biases. The distribution over hypotheses then gives me a distribution over percentiles, that is, over the percentiles of the distribution conditional on the hypothesis M. But we don't know the true hypothesis, and on the next page Rebonato goes on to cast "ontological" doubt about whether it even exists. (How he can be uncertain about the state of something he thinks doesn't exist is a nice question.) We only have the earlier observations, so we need to integrate or marginalize out M, and this collapses the distribution of percentiles down to a single exact value for that percentile.

Q: Couldn't we avoid that integration somehow?

A: Integrating over the posterior distribution is the whole point of Bayesian decision theory.

Q: Let's go back to the VaR example. If you try estimating the size of once-in-two-thousand-year losses from five years of data, your posterior distribution has got to be pretty diffuse.

A: Actually, it can be arbitrarily concentrated by picking the right prior.

Q: Fine, for any reasonable prior it needs to be pretty diffuse. Shouldn't the Bayesian agent be able to use this information to avoid recklessness?

A: That depends on the loss function. If the loss involves which hypothesis happens to be true, sure, it'll make a difference. (That's how we get the classic proof that if the loss is the squared difference between the true parameter and the point estimate, the best decision is the posterior mean.) But if the loss function just involves what observable events actually take place, then no. Or, more exactly, it might make sense to show more caution if your posterior distribution is very diffuse, but that's not actually licensed by Bayesian decision theory; it is "irrational" and sets you up for a Dutch Book.

Q: Should I be worried about having a Dutch Book made against me?

A: I can't see why, but some people seem to find the prospect worrying.

Q: So what should people do?

A: I wish I had a good answer. Many of Rebonato's actual suggestions — things like looking at a range of scenarios, robust strategies, not treating VaR as the only thing you need, etc. — make a lot of sense. (When he is making these practical recommendations, he does not counsel people to engage in a careful quantitative elicitation of their subjective prior probabilities, and then calculate posterior distributions via Bayes's rule; I wonder why.) I would also add that there are such things as confidence intervals, which do let you make probabilistic guarantees about parameters.

Q: What on earth do you mean by a "probabilistic guarantee"?

A: That either the right value of the parameter is in the confidence set, or you get very unlucky with the data (how unlucky depends on the confidence level), or the model is wrong. Unlike coherence, coverage connects you to reality. This is basically why Haavelmo told the econometricians, back in the day, that they needed confidence intervals, not point estimates.
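(A minimal coverage check in R, assuming Gaussian data and the usual t-interval; the particular numbers are arbitrary.)

set.seed(2)
mu <- 7                                   # the unknown-to-the-interval truth
covered <- replicate(1e4, {
  x  <- rnorm(20, mean = mu, sd = 3)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  (ci[1] <= mu) && (mu <= ci[2])
})
mean(covered)                             # close to 0.95, whatever mu and sd are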

Q: So how did the econometricians come to make fetishes of unbiased point-estimators and significance tests of equality constraints?

A: No doubt for the same reason they became convinced that linear and logistic regression was all you'd ever need to deal with any empirical data ever.

Q: Anyway, that "get the model right" part seems pretty tricky.

A: Everyone is going to have to deal with that. (You certainly still have to worry about mis-specification with Bayesian updating.) You can test your modeling assumptions, and you can weaken them so you are less susceptible to mis-specification.

Q: Don't you get weaker conclusions — in this case, bigger confidence intervals — from weaker modeling assumptions?

A: That's an unavoidable trade-off, and it's certainly not evaded by going Bayesian (as Rebonato knows full well). With very weak, and therefore very defensible, modeling assumptions, the confidence interval on, say, the 99.5% VaR may be so broad that you can't devise any sensible strategy which copes with that whole range of uncertainty, but that's the math's way of telling you that you don't have enough data, and enough understanding of the data, to talk about once-in-two-thousand-year events. I suppose that, if they have financial engineers in the stationary state, they might eventually be able to look back on enough sufficiently-converged data to do something at the 99% or even 99.5% level.

Q: Wait, doesn't that suggest that there is a much bigger problem with all of this? The economy is non-stationary, right?

A: Sure looks like it.

Q: So how can we use statistical models to forecast it?

A: If you want someone to solve the problem of induction, the philosophy department is down the stairs and to the left.

Enigmas of Chance; The Dismal Science

Posted by crshalizi at June 16, 2009 09:50 | permanent link

Chaos, Complexity and Inference: 2009 Syllabus

See this earlier post or the course homepage for more. This should be an RSS feed for this page, so you can follow updates, which will generally be posted after lectures.

Final exam
take-home, due 12 May
Lecture 28, (Thursday, 30 April): Agents 4: A real-world example of agents on networks
Notes
Hedstrom and Aberg, "Quantitative research, agent-based modelling and theories of the social", ch. 6 (pp. 114--144) in Hedstrom, Dissecting the Social [PDF]; (*) Guttorp, ch. 4
Lecture 27, (Tuesday, 28 April): Networks 5: Community Discovery
No slides
Readings: Clauset, Moore and Newman, arxiv:0811.0484 (code); Girvan and Newman, arxiv:cond-mat/0112110; Hofman and Wiggins, arxiv:0709.3512; Newman, arxiv:physics/0605087; Reichardt and Bornholdt, arxiv:cond-mat/0603718; Reichardt and White, arxiv:0708.0958 [discussion]
Lecture 26, (Thursday, 23 April): Complex networks 4: inference for network models
Slides
Reading: Hanneke and Xing, "Discrete Temporal Models of Social Networks" [PDF]; Handcock and Jones, "Likelihood-based inference for stochastic models of sexual network formation" [PDF]; Hunter, Goodreau and Handcock, "Goodness of Fit of Social Network Models" [PDF]; Middendorf, Ziv and Wiggins, "Inferring Network Mechanisms: The Drosophila melanogaster Protein Interaction Network", arxiv:q-bio/0408010; Newman, "Structure and Function", sections IV B and V; Newman, Strogatz and Watts, "Random graphs with arbitrary degree distributions and their applications", arxiv:cond-mat/0007235; Wiuf, Brameier, Hagberg and Stumpf, "A likelihood approach to analysis of network data", Proceedings of the National Academy of Sciences (USA) 103 (2006): 7566--7570 [discussion]
Lecture 25, (Tuesday, 21 April): Agents 3: social complexity
Slides
Reading: Flake, ch. 17; Miller and Page, chs. 10--11; Krugman, ch. 6; Salganik, Dodds and Watts, "Experimental study of inequality and unpredictability in an artificial cultural market", Science 311 (2006):854--856; (*) Skyrms and Pemantle, "A Dynamic Model of Social Network Formation", arxiv:math.PR/0404101
Lecture 24, (Thursday, 9 April): Complex networks 3: contagion on networks
slides
Reading: Guttorp, sec. 2.11; Newman, "Structure and Function", sec. VIII; Davis, Trapman, Leirs, Begon and Heesterbeek, "The abundance threshold for plague as a critical percolation phenomenon", Nature 454 (2008): 634--637; Bell, Maiden, Munoz-Solomando and Reddy, "'Mind control experiences' on the Internet: Implications for the psychiatric diagnosis of delusions", Psychopathology 39 (2006): 87--91 [PDF]; Smith and Novella, "HIV Denial in the Internet Era", PLoS Medicine 4:8 (2007): e256; (*) Kenah and Robins, "Second look at the spread of epidemics on networks", arxiv:q-bio.QM/0610057
Lecture 23, (Tuesday, 7 April): Agents 2: collective phenomena and self-organization
Slides
Reading: Flake, ch. 16; Miller and Page, ch. 9; Krugman, introduction and ch. 1; Healy, Walking to School; Perlstein, The Meaning of Box 722
Lecture 22, (Thursday, 2 April): Agent-based models 1
Slides
Reading: Miller and Page, chs. 6 and 7; Flake, ch. 12
Lecture 21, (Tuesday, 30 March): Complex networks 2: growth models
Slides
Reading: Newman, "Structure and function", sec. VII
Lecture 20, (Thursday, 26 March): Complex networks 1: basics, network properties
Slides
Definition of networks and basic terms. Pictures, emphasizing sex and death. Bipartite networks and the Galois lattice construction. The small world problem. What makes a network complex? The Erdos-Renyi model: first properties and weaknesses.
Watts, "The 'New' Science of Networks", Annual Review of Sociology 30 (2004): 243--270
Newman, "The Structure and Function of Complex Networks", SIAM Review 45 (2003): 167--256, arxiv:cond-mat/0303516 (through sec. VI, but skipping or skimming IV B and V)
Lecture 19, (Tuesday, 24 March): Inference from simulations 2: direct and indirect estimation
Slides and R
The method of simulated moments. Example with the logistic map. Issues with the MSM. The trade-off between tractable models and good models. The indirect inference trick: don't match moments, match a bad-but-tractable model. Examples of indirect inference. Convergence properties.
Reading: Smith, "Indirect Inference"; Kendall et al., "Population Cycles in the Pine Looper Moth: Dynamical Tests of Mechanistic Hypotheses"; (*) Gourieroux, Monfort and Renault, "Indirect Inference" [JSTOR]
Lecture 18, (Thursday, 19 March): Inference from simulations 1
The virtues and limits of simulations, especially stochastic simulations. Active testing: deliberately trying to break your model to gain (or lose!) confidence in it. Bootstrapping once again.
Reading: ch. 5 and appendix B of Miller and Page; Miller, "Active Nonlinear Tests (ANTs) of Complex Simulation Models"
Lecture 17, (Tuesday, 17 March): Inference in general: error statistics and severe testing
Slides
Errors and reliable inference. Statistics as principled argument: mostly arguments that certain errors are (or are not) absent or small, unless we're very unlucky. Kinds of errors in statistical inferences. Construction of reliable hypotheses: confidence sets, "probably approximately correct", more. Severe testing. Null hypotheses. Testing for mis-specification, especially by looking at residuals. The two chief world systems.
Reading: Mayo and Cox, "Frequentist statistics as a theory of inductive inference", arxiv:math.ST/0610846; Mayo and Spanos, "Methodology in Practice: Statistical Misspecification Testing" [PDF]; idem, "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction" [journal]; Spanos, "Curve-Fitting, the Reliability of Inductive Inference and the Error-Statistical Approach" [journal]
Lecture 16, (Thursday, 5 March): Heavy-tailed distributions 4: Comparing models
Slides, R
Goodness-of-fit tests: Glivenko-Cantelli, "the fundamental theorem of statistics"; the Kolmogorov-Smirnov test; Monte Carlo version of the K-S test when parameters are estimated from data; general remarks on goodness-of-fit. Relative distribution methods for comparing distributions: their virtues; using relative distributions to check power laws; HTTP file size example. Vuong's likelihood-ratio test for model selection. The flight of the albatross. Zombie-hunting.
Readings: Clauset et al. continued; Handcock and Morris, "Relative Distribution Methods"; Freckleton and Sutherland, "Do power laws imply self-regulation?"; idem, "Do in-hospital waiting lists show self-regulation?" (thanks to Nick Watkins for pointing out those papers); Edwards et al., "Revisiting Lévy flight search patterns of wandering albatrosses, bumblebees and deer"; (*) Vuong, "Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses"
Lecture 15 (Tuesday, 3 March) Heavy-tailed distributions 3: Estimation
Slides
The right way to do it: maximum likelihood estimation for power laws; properties of the MLE. Nonparametric bootstrapping for error estimation. The wrong way to do it: linear regression on a log-log plot; why it should never be used. Determining the scaling range. Examples with city sizes. Non-parametric density estimation, and special considerations for heavy-tailed distributions.
Readings: Clauset, Shalizi and Newman, "Power law distributions in empirical data", arxiv:0706.1062; (*) Markovitch and Krieger, "Nonparametric estimation of long-tailed density functions and its application to the analysis of World Wide Web traffic"
Lecture 14, (Thursday, 26 February): Heavy-tailed distributions 2: how they arise
Slides
"How the distributions got their tails". Deflating explanations: measuring ex instead of x; measuring x1/a instead of x; the central limit theorem for multiplicative terms. Everyday explanations: exponential growth from random times; why isn't Acoma, N.M., not the largest city in the US?; the Yule-Simon mechanism for random growth. Exciting and mysterious explanations: why physicists expect everything to follow a Gaussian, except at phase transitions; self-organized criticality; a skeptical note.
Readings: Newman, "Power laws", section IV; Krugman, ch. 3 and the last part of ch. 8; Mitzenmacher, "A Brief History of Generative Models for Power Law and Lognormal Distributions"; video of Mitzenmacher giving a talk on this material; Sornette, "Mechanism for Powerlaws without Self-Organization", arxiv:cond-mat/0110426
Lecture 13, (Tuesday, 24 February): Heavy-tailed distributions 1: what they are
Slides and R
Highly skewed and heavy tailed distributions: definitions. Real-data examples: words, cities, money, papers, phone calls, the Internet, earthquakes, wars, terrorism, solar flares ("die, puny humans!"). Models: the pure power laws (Pareto and Zipf distributions). Properties of power laws: scale-freedom, 80-20, order statistics. Inequality of condition in America. Modifications: generalized Pareto, Yule-Simon. Non-power-law distributions: truncated power laws, log-normal, stretched exponential/Weibull distribution.
Newman, "Power laws, Pareto distributions and Zipf's law", arxiv:cond-mat/0412004 (through section III)
Lecture 12, (Thursday, 19 February): Self-organization 2
Slides
Readings: Shalizi, Klinkner and Haslinger, "Quantifying Self-Organization with Optimal Predictors", arxiv:nlin.AO/0409024; Shalizi, Haslinger, Rouquier, Klinkner and Moore, "Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems", arxiv:nlin.CG/0508001
Lecture 11, (Tuesday, 17 February): Cellular automata 2: excitable media; mechanisms and abstractions
Slides
The heart as an assemblage of excitable elements. Excitable media as an abstract mechanism. Philosophical excursus on "abstraction" and "mechanism". Examples of excitable media. Chemical oscillators: Belousov-Zhabotinsky type. Chemical oscillators: Turing type. City growth as Turing morphogenesis. Slime mold morphogenesis. The spiral ice-canyons of Mars. Cellular automata models of excitability: Greenberg-Hastings, cyclic CA. Some math for these models.
Readings: Fisch, Gravner and Griffeath, "Threshold-range scaling of excitable cellular automata" [PDF]; idem, "Cyclic Cellular Automata in Two Dimensions" [zipped PostScript]; Greenberg and Hastings, "Spatial Patterns for Discrete Models of Diffusion in Excitable Media" [JSTOR]; Griffeath, "Self-Organization of Random Cellular Automata: four snapshots" [zipped PostScript]; Krugman, ch. 2 and first part of ch. 8; ch. 4 of Guttorp
Lecture 10, (Thursday, 12 February): Cellular automata 1
Slides
What cellular automata are; basic elements. Examples: diffusion-limited aggregation, lattice gases, majority vote, Life. Mechanical self-reproduction.
Readings: Flake, ch. 15; Miller and Page, ch. 8
Lecture 9, (Tuesday, 10 February): Self-organization 1, some examples
Slides
Levels of description; micro/macro distinctions, aggregation. Emergence of higher levels from lower levels; how to be a good reductionist. Self-organization. Excursus into thermodynamic entropy and the second law of thermodynamics. How self-organization is compatible with the second law. Some examples from fluids. Classroom demo with fluids.
Reading: Miller and Page, ch. 4
Office of Charles and Ray Eames, Powers of Ten, narration by Philip Morrison
Lecture 8, (Thursday, 5 February): Randomness and determinism
Slides
Data compression and description length. Algorithmic information content as the length of the shortest effective description. Random sequences are incompressible; incompressible sequences have all the testable properties of randomness. "Any signal distinguishable from noise is insufficiently compressed." Algorithmic information content is uncomputable. Dynamical systems as sources of randomness; Brudno's theorem connecting algorithmic information content to entropy rate and so to Lyapunov exponents. "Any determinism distinguishable from randomness is insufficiently complicated."
Side-note: Algorithmic Information Content and Marginal Entropies
Readings: Flake, ch. 14; Smith, ch. 11; Poincaré, "Chance", from Science and Method [PDF]
Lecture 7 (Tuesday, 3 February): Information Theory
Slides
Entropy of random variables, interpretations (randomness, variability, uncertainty, coding). Mutual information, statistical independence, Markov properties. Relative entropy, a.k.a. Kullback-Leibler divergence. Relative entropy measures how easy it is to tell two distributions apart: statistical uses in sampling, large deviations, estimation (via Fisher information). Divergence is negative expected log likelihood; maximum likelihood attempts to minimize divergence. Entropy rates measure dynamical randomness; connection between metric entropy rate, topological entropy rate, and Lyapunov exponents. Asymptotic equipartition; convergence of log likelihoods to their expected values.
Reading: Feldman, "Information Theory" [PDF]
M.C. Hawking, "Entropy", from Fear of a Black Hole [lyrics; mp3 (radio-safe Brief History of Rhyme version)]
Ray and Charles Eames, A Communications Primer (via Britta Gustafson)
Lecture 6 (Thursday, 29 January): Statistical Inference for Discrete Random Sequences
Slides
Inference for Markov chains: the likelihood function; the maximum likelihood estimate of the transition matrix; its convergence; MLE for parameterized Markov chains. Error estimation and hypothesis testing for Markov chains. Parametric bootstrapping. Higher-order Markov chains. Random functions of Markov chains; stochastic finite automata and their inference. Variable length Markov chains. CSSR.
Handout: Maximum Likelihood Estimation for Markov Chains
Readings: Guttorp, 2.7--2.9 and 2.12 (I, II, III); Smith, ch. 10; Foulkes, "A Class of Machines Which Determine the Statistical Structure of a Sequence of Characters" [PDF]; (*) Smith, "The Maintenance of Uncertainty" [PDF]
Lecture 5 (Tuesday, 27 January): Symbolic Dynamics
Slides
Discrete coarse-grainings of the continuous state space. Symbol sequences determined by initial conditions. The shift map; moving the complexity from the update rule to the space. Motivation: "Continuous math is hard; let's go discretize." Refinement of partitions under dynamics. Generating partitions; discretizing continuous dynamics without loss of information. Symbolic dynamics as discrete stochastic processes; the full logistic map as the Bernoulli process. Allowed and forbidden words. The topological entropy rate; how many things can your process do? Formal languages as tools for describing discrete sequences ("we have ways of making your dynamics talk"). Regular expressions and their grammars. Regular expressions and their machines ("circles and arrows and a paragraph on the back of each one"). Examples of languages and machines from the logistic map. Finite-type vs. sofic processes.
Handout: More on the topological entropy rate
Readings: Daw, Finney and Tracy, "A review of symbolic analysis of experimental data" [reprint]
Lecture 4 (Thursday, 22 January): Attractor Reconstruction and Prediction
Slides, lorenz-generator.pl
"Geometry from a time series": how "time-delay embedding" and Takens's theorem let us reconstruct attractors without knowing equations. An example with the Lorenz system. Attractor reconstruction as a form of inference. Change of coordinates, change of observables. Choice of reconstruction settings, false nearest neighbor method. Prediction in the reconstructed state space. Nearest neighbor prediction; an example with the Lorenz system. Picking predictive settings via cross-validation.
Side-notes: "Smooth change of coordinates"; Nonlinear predictors
Readings: Kantz and Schreiber, chs. 3--4; Smith, chs. 7--9.
Lecture 3 (Tuesday, 20 January): Attractors and Mixing
Slides and R
Attractors as generalizations of fixed points and limit cycles. Geometry of attractors: where do trajectories go? The Hénon map, our first multi-dimensional dynamical system. Multiple dimensions and memory. Stretching and folding again. Stable and unstable directions. The Lorenz equations, our first continuous-time dynamical system. Lyapunov exponents to measure stability and instability. Probability on attractors. Invariant distributions again. Subtleties of convergence to the invariant distribution. Mixing and decay of correlations. A central limit theorem for mixing processes. CLT behavior of the logistic map.
Readings: Flake, ch. 11; Miller and Page, chs. 1--3; Smith, chs. 4--6.
Lecture 2 (Thursday, 15 January): More dynamics
Slides and R
Stability of fixed points and cycles. Bifurcations; bifurcation diagrams; the period-doubling route to chaos; periodic windows within chaos. Devaney's definition of chaos; how it implies sensitive dependence; stretching and folding. The Arnold cat map. Intermittency. More ergodicity: the evolution of densities and the Frobenius-Perron operator; time averages; the world's simplest ergodic theorem.
Readings: Flake, ch. 10 and sec. 11.1; Guttorp, ch. 1; Smith, chs. 1--3.
The Arnold Cat Map Movie (starring Marlowe the Cat, directed by Evelyn Sander)
Lecture 1 (Tuesday, 13 January): Introduction to the course; starting with dynamics
Slides and R
Administrivia. What is a model? What is a simulation? What is a dynamical system? What is chaos? First acquaintance with the logistic map. Fixed points and cycles. Exploring some properties of chaos.

Corrupting the Young; Complexity; Enigmas of Chance

Posted by crshalizi at June 16, 2009 09:24 | permanent link

May 31, 2009

Books to Read While the Algae Grow in Your Fur, May 2009

Chelsea Cain, Heartsick and Sweetheart
Psychological thriller with psycho killers and local color for Portland. (How accurate the color is, I couldn't say.) Very well written, especially the characterization of the plucky-yet-self-destructive reporter, and the extremely creepy situation of the lead detective. There are some graphic scenes of torture, which honestly I skimmed through because I really couldn't take it. — Updated: The sequel is also good, but not, I think, quite as good. The explanation it gives for the central relationship makes sense, but I feel that relationship worked better in the previous book, where it was left mysterious (at least to me; maybe I'm slow).
Andrea Camilleri, August Heat
Jason Aaron and R. M. Guera, Scalped: Casino Boogie, Dead Mothers, The Gravel in Your Gut
Indian Country noir. Getting better, which is to say harder to read, as it goes. (Reading volume 1 would help.) — Sequel.
House of Mystery, vol. 1: Room and Boredom
Tales from the bar, a la Jorkens or the White Hart. Only the bar is in the dreamlands, or somewhere else between the worlds, and some of its regulars are actually trapped there, in a house one of them seems to have designed in her dreams... (I will be very upset, yet somewhat impressed, if they turn out to be winging the story, rather than taking it somewhere.)
Jamie McKelvie, Suburban Glamour
Neil Gaiman and Michael Zulli, The Last Temptation and The Facts in the Case of the Departure of Miss Finch
Faith Erin Hicks, Zombies Calling
The most charming, life-affirming, self-referential zombie movie ever; naturally, it's a comic book.
Severance
Shockingly smart, scary, and blackly funny, horror movie. (It reinforces every negative impression I have had about team-building exercises.)
Larry Samuelson, Evolutionary Games and Equilibrium Selection
How to figure out which equilibrium your game will end up at, under some not-too-laughably-implausible assumptions about individual decision-making, learning and imitation. I suspect that a lot of the results could be generalized to much broader classes of models, particularly in the large-population limit. (Cf. Kurtz below.)
Thomas G. Kurtz, Approximation of Population Processes
Like a kindergarten version of Ethier and Kurtz. This means that the reader is still expected to understand measure-theoretic probability and Markov processes pretty well, but not necessarily to care about the intricacies of the various Skorokhod topologies...
Sarah Graves, Dead Cat Bounce, Triple Witch, Wicked Fix, Repair to Her Grave
Brain-candy mysteries. Good for when one is lying in bed with the flu.
It's surely not an original observation, but there is a severe problem with setting a mystery series in a small town. These four books cover, if I recall correctly, two years of narrative time, featuring about three homicides apiece, in a town of under 2000 inhabitants, meaning the murder rate is about 3 per 1000 per year, which is 50 times the national mean, and almost half of the Lancet survey's point estimate of the 2003--2006 death rate from the invasion of Iraq. (Anyone who takes that as an invitation to try to drag me into the Lancet controversy will be ignored.) The only remotely plausible explanation is that the series' amateur sleuth is really a serial killer, and the novels are her elaborately-coded confessions. If anyone has approached the problem from this angle, I'd be interested in hearing about it. (The closest approach I can think of is Dexter.)
Sequels: 5, 6--8.
Jane Haddam, Living Witness
Haddam tackles Intelligent Design creationism, with her usual compelling characterization. (Some of the characters whose viewpoints the reader takes on are rather unpleasant people.) — It strikes me that Haddam has been writing this series for longer than some of my students have been alive, and I wonder whether (if it weren't for the marketing hook) the stories she's telling nowadays really need the continuity of the recurring detective...
(Minor continuity/background errors: here in Pennsylvania, we don't have "package stores", we have state-run wine and spirit stores; "Leibniz" is mis-spelled as "Liebniz"; in one passage the only number in Gregor's speed-dial is Bennis's, later the only two numbers are his voice-mail, then Bennis's. Obviously none of these matter at all.)
Justine Musk, Blood Angel
Brain-candy. — I believe this was a first novel, which would make sense of the fact that there's enough material in here for at least three books...

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Enigmas of Chance; The Dismal Science

Posted by crshalizi at May 31, 2009 23:59 | permanent link

April 30, 2009

Books to Read While the Algae Grow in Your Fur, April 2009

Jérôme Dedecker, Paul Doukhan, Gabriel Lang, José Rafael León R., Sana Louhichi and Clémentine Prieur, Weak Dependence: With Examples and Applications
As every school-child knows, a sequence of probability measures Dn converges "weakly" or "in distribution" on a limit D if Dnf converges on Df for every bounded and continuous test function f. In particular, if Dn is the joint distribution of two random variables from a time series separated by a time-lag n, with marginal distributions P and Q, then the series is asymptotically independent if Dn converges in distribution on the product measure P*Q.
(This is "weak" convergence because the Dn can look very different from the limiting D. A classic example: Dn puts probability 1/n on the points 1/n, 2/n, ... n/n = 1. This converges weakly to the uniform distribution on the unit interval [0,1], despite apparent obstacles like the latter giving probability 1 to the irrational numbers, while all the Dn give them probability 0.)
A large literature has grown up since the 1950s which concerns itself with properties of asymptotically-independent stochastic processes, usually measuring the approach to independence by means of various "mixing coefficients", along the lines introduced by Rosenblatt. As the name implies, if these coefficients go to zero asymptotically, then the process is strongly mixing. Having these mixing coefficients go to zero is very nice, since it implies that many of the nice properties of IID sequences (central limit theorems, PAC-learning results, etc.), carry over to the dependent-but-mixing sequences. Unfortunately, strong mixing is a very strong property, which is hard to prove and in many cases is known to fail — even when there is asymptotic independence!
This book, which summarizes and extends a series of recent papers by the authors (in various combinations), is about proving limit theorems — laws of large numbers, central limit theorems, empirical-process convergence, etc. — for asymptotically-independent but non-mixing processes, which the authors call "weakly dependent". Their theorems in chs. 6--10 can, in large part, be seen as instances of the trick that I learned as "blocking", and which they attribute to S. N. Bernstein. To calculate the time average of some function of a (stationary) process, for instance, divide the series up into a sequence of long blocks, plus padding or filler in between them. The global average is, trivially, equal to the average of the within-block averages, plus a remainder term for contributions from the filler. By stationarity, the blocks all have the same distribution, and by weak dependence the blocks are nearly independent — more nearly with longer fillers. So one can show that the global average will have the same behavior as it would in the IID case, if one can control (1) the corrections due to the remaining dependence among the blocks, and (2) the remainder due to the fillers between the blocks. Controlling (2) is fairly straightforward; the new stuff here comes from controlling (1), the residual dependence.
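(A cartoon of the blocking trick in R, on an AR(1) series of my own choosing: the global average is exactly a weighted combination of the block and filler averages, and the block averages become nearly uncorrelated once the fillers are long.)

set.seed(3)
x <- as.numeric(arima.sim(list(ar = 0.9), n = 1e5))  # a stationary but dependent series
B <- 200; F.len <- 50                                # block and filler lengths
pos <- (seq_along(x) - 1) %% (B + F.len)
in.block <- pos < B
c(global = mean(x),                                  # the two numbers agree exactly
  pieced = weighted.mean(c(mean(x[in.block]), mean(x[!in.block])),
                         c(mean(in.block), mean(!in.block))))
block.id <- ((seq_along(x) - 1) %/% (B + F.len))[in.block]
block.means <- tapply(x[in.block], block.id, mean)
acf(block.means, lag.max = 2, plot = FALSE)          # lag-1 correlation is near zero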
Following lines laid down in mixing theory, the authors tackle (1) by constructing deviation inequalities, bounds on the probability that sample values differ from expectations by more than a certain amount. Typically, these inequalities (chs. 4 and 5) have a more-or-less combinatorial flavor, though they also use coupling, and a lot of manipulation of cumulants (without making the connection to large deviations theory via cumulant generating functions). — As usual with such inequalities, it's rarely clear whether the various factors of 33 (or whatever) are for real, or just the best one can do with the current method of proof; but they suffice for many purposes, and I suppose it's better to have the constants be explicit than to bury them inside O() or o().
In mixing theory, the analogous bounds are stated in terms of the mixing coefficients. Here the new bounds are stated in terms of a range of new dependence measures (ch. 2), all denoted by utterly un-mnemonic Greek letters (again following the mixing-theory convention). Some of them play directly off the definition of convergence in distribution in terms of expectations of test functions: how big can the difference between D_n f and P*Q f get, when f is restricted to some suitably well-behaved, normalized class of test functions? The other set of measures are "projective". The conditional expectation of a function f, given X_1, X_2, etc., is the projection of f on to the space of functions calculable from X_1, X_2, etc.; the unconditional expectation of f is the projection of f on to the space of constant functions, which are calculable from nothing. (By "calculable from" I mean "measurable in the minimal sigma-algebra of".) If f is independent of X, then its conditional expectation is still its projection on to the constants. The other class of dependence coefficients, therefore, looks at how far conditional expectations depart from constancy — either maximizing over some family of test-functions, or by making future values of the time series itself the function in question.
Applications to non-parametric regression, spectral-density estimation, and various econometric problems occupy chs. 11--13, and ch. 3 shows that many standard and some non-standard time series models are weakly dependent, with estimates of the dependence coefficients.
— The book shows definite signs of its patch-up job origins. The prose is disjointed and slightly repetitive. The notation, too, is not always consistent; sometimes the generic dependence measure is c, sometimes it's \mu, and it can switch back and forth from one line to the next. A bigger annoyance is terminology. What the authors (and, it must be said, many probabilists) call "mixing" is more properly strong mixing, i.e., convergence of joint distributions to product measure in the strong, not the weak, topology. (If the Adversary can pick any pair of events, and you can show that that pair become independent as the separation between them grows, you have proved ordinary mixing. If the Adversary gets to pick a different pair of events with each separation, and you can still show asymptotic independence, that's strong mixing.) In dynamical systems and ergodic theory, "mixing" unmodified refers to the weaker notion of mixing. (See for instance the detailed treatment in Lasota and Mackey.) Thus for someone of my background, the Bernoulli shift is one of the canonical mixing processes, but they give it as their first example of a non-mixing process! (Their proof, incidentally, is wrong, but easily fixed.) In fact, I'd say that the failure to connect what they have done to the work in dynamics on decay of correlations (e.g.) is a real lost opportunity for both fields — optimistically, a direction for future research.
In the end, however, these are all minor complaints. The authors have brought together a great deal of original and important work, as a result of which the classical probabilistic limit theorems can now be rigorously applied to a much wider range of stochastic models. It's cool stuff and I would be very happy to teach it. I strongly recommend the book to statisticians interested in inference for stochastic processes, learning theorists interested in dependent data, probabilists interested in new applications, and theoretical econometricians.
Warren Ellis and Paul Duffield, Freakangels, vol. 2
Jack Campbell, Lost Fleet, vol. 5: Relentless
Brain-candy, in the process of moving from the Anabasis to De Bello Civili. (Which does not mean it's a historical pastiche.) Earlier morsels: 1--3, 4.
David Freidel, Linda Schele and Joy Parker, Maya Cosmos: Three Thousand Years on the Shaman's Path
A detailed and intelligent, if very conjectural, exploration of ancient Maya cosmology, based on deciphering surviving inscriptions, artifacts, and extrapolation from the modern Maya. Marred by irritating and silly stabs at cultural relativism (mostly at the beginning and the end).
Peter D. Harrison, The Lords of Tikal: Rulers of an Ancient Maya City
History of the greatest Maya city from the viewpoint of its rulers (the only view we really have any access to), from the beginning to the end. Clear on the difference between what we actually know and what the Mayanists are merely guessing at. Very good on architecture.
Bruce Kitchens and Selim Tuncel, Finitary Measures for Subshifts of Finite Type and Sofic Systems
As you know, Bob, a shift dynamical system has a state-space consisting of infinite sequences (one- or two-sided), say x, whose entries are drawn from a finite alphabet. The time-evolution operator T, or shift, simply moves the sequence one step to the right, i.e., (Tx)_n = x_{n+1}. This moves all the complexity and distinctions between different systems into the set of initial conditions, and the space of allowed sequences, rather than into the time-evolution rule. The full shift has all possible sequences among its initial conditions (and so the space never shrinks); sub-shifts restrict the set of initial conditions. Suppose that the allowed values of x_n depend on x_{n-1}, x_{n-2}, ..., x_{n-k}, but not on earlier entries in x, and that this k is the same for all n and all sequences. Then we have a subshift of finite type, the type or order of the subshift being k. These are the symbolic-dynamical equivalents of Markov chains of order k. As with Markov chains, any higher-order subshift of finite type can be converted to a first-order subshift. (Define a new alphabet where each symbol is a block of length k from the original alphabet.) A sofic system can be defined in one of two equivalent ways. One introduces the notion of a follower set, the set of all one-sided infinite sequences which can succeed a given history; a subshift is sofic if it has only finitely many follower sets. Alternately, introduce the notion of factor maps — continuous functions from one sequence space to another which commute with the shift. A sofic system is the image of a subshift of finite type under a factor map. A strictly sofic system is one which is sofic, but not a subshift of finite type.
Said another way, for any sofic system there is a set of states, possibly hidden, where the current state determines what symbols can be seen next, and the current state plus the next symbol determines the next state. Sofic systems are (the languages generated by) deterministic finite automata.
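To make the hidden-state description concrete, a toy example of my own (not one from the book): the "even shift", the set of binary sequences in which any two successive 1s are separated by an even number of 0s, is sofic but not of finite type, and a two-state automaton (plus a start state for the unconstrained leading run of 0s) decides which finite blocks are allowed.
# The "even shift": between any two successive 1s there must be an even
# number of 0s.  The hidden state is the parity of the run of 0s since
# the last 1; no finite look-back at the symbols themselves suffices.
even_shift_ok <- function(w) {            # w: a vector of 0s and 1s
  state <- "start"                        # leading 0s are unconstrained
  for (s in w) {
    state <- switch(state,
      start = if (s == 1) "even" else "start",
      even  = if (s == 1) "even" else "odd",
      odd   = if (s == 1) return(FALSE) else "even")
  }
  TRUE
}
even_shift_ok(c(0, 1, 0, 0, 1, 1, 0))   # TRUE
even_shift_ok(c(1, 0, 0, 0, 1))         # FALSE: three 0s between the 1s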
All of this is at the purely topological, possible-or-not level. One can of course also add probability measures on sequences. A Markov measure is one under which the distribution of X_n depends on X_{n-1}, but not, given that, on previous symbols. (Similarly for higher-order chains.) In this monograph, Kitchens and Tuncel ask, what measures are to Markov measures as sofic systems are to subshifts of finite type? They call these measures finitary, and characterize them in several ways, including (1) measures where the number of follower distributions is finite, (2) factor-map images of Markov measures, (3) ones where the conditional distribution of the future is always independent of the remote past given a finite segment of the immediate past, though one of possibly variable, context-dependent length. (The last is what motivates the name "finitary".) They develop the theory of finitary measures in considerable detail, and in parallel to the usual theory of Markov measures, as that is found in ergodic theory and the thermodynamic formalism. Statistical aspects (the reconstruction of finitary measures from sample data) are not addressed — fortunately for me. Prior acquaintance with symbolic dynamics and the thermodynamic formalism is absolutely essential, and a lot of familiarity with manipulating semi-groups would help.
(Your best bet for obtaining this is directly from the publisher. However, the community would be better served by putting it on the arxiv.)
Misty Massey, Mad Kestrel
Brain-candy. More piracy and less romance than the blurb implies.
John McGowan, Democracy's Children: Intellectuals and the Rise of Cultural Politics
Nowhere near as cohesive as the subtitle suggests. Rather it's a set of essays on the conditions, activities and aspirations of academics in the humanities (especially literature as such) in the US from the mid-1980s through the mid-1990s, along with some thoughts about how they ought to try relating to the larger society they're part of. (I am quite sympathetic to the latter.) I wish McGowan would write a book about "intellectuals and the rise of cultural politics"; I think it would be very interesting.
(His mid-1990s remarks on the Web are more amusing in retrospect than they would have been at the time, since he's gone from sniffing at it to guest-blogging for Bérubé. To be fair, the idea that English departments would need specialists in hypertext does seem rather dated — because we have all become specialists in hypertext.)
C. J. Sansom, Revelation
Michelle Sagara, Cast in Fury

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Writing for Antiquity; Enigmas of Chance; Mathematics; The Progressive Forces; The Commonwealth of Letters

Posted by crshalizi at April 30, 2009 23:59 | permanent link

April 28, 2009

That Word Does Not Exist In Any Language

Attention conservation notice: 1500 words on a dispute at the intersection of paleography and computational linguistics, two areas in which I have no qualifications; also a sermon on a text from Lichtenberg: "We must not seek to abstract from the busts of the great Greeks and Romans rules for the visible form of genius as long as we cannot contrast them with Greek blockheads."

Over the weekend, I read Mark Liberman's post at Language Log about the new Rao et al. paper in Science, claiming to show information-theoretically that the symbols recovered on artifacts from the Harappan civilization in the Indus Valley are in fact a form of writing, as had long been supposed but was denied a few years ago by Farmer, Sproat and Witzel.

What Rao et al. claimed to show is that the sequences of Indus symbols possess information-theoretic properties which are distinctive of written language, as opposed to other symbol sequences, say ones which are completely random (IID, their "type 1 nonlinguistic") or completely deterministic (their "type 2 nonlinguistic"). Specifically, they examined the conditional entropy of sequence pairs (i.e., the entropy of the next symbol given the previous one). The claim is that the Indus symbols have the same pattern for their conditional entropy as writing systems, which is clearly distinguishable from non-linguistic symbol sequences by these means.

As someone who is very, very into information theory (especially conditional entropy), I was intrigued, but also very puzzled by Mark's account, from which it seemed that Rao et al. had a huge gap where the core of their paper should be. Actually reading the paper convinced me that Mark's account was correct, and there was a huge logical fallacy. I'll reproduce Figure 1A from the paper and explain what I mean.

Rao et al. worked with a corpus of Indus Valley inscriptions, which recognizes 417 distinct symbol types. This is their "Indus" line. The other language lines come from different corpora in the indicated language. In each corpus, they filtered out the less common symbols, and then fit a first-order Markov chain. (Transition probabilities were estimated with a smoothing estimator rather than straight maximum likelihood.) Then they calculated the conditional entropy of the chain, using the estimated transition probabilities and the observed symbol frequencies (rather than say the invariant distribution of the chain); that's the vertical axis. The horizontal axis shows how many symbol types were retained --- i.e., "100 tokens" means that only the 100 most common symbols in the corpus were kept, and the chain was fit to those sequences. (This is not explained in the paper but was made clear in later correspondence between Sproat and the authors.) There are two lines for English, depending on whether "token" was taken to mean "character" (differentiating upper and lower case) or to mean "word".

The bottom curve shows the estimated conditional entropy from a purely deterministic sequence; the actual conditional entropy is in fact zero, so I presume that the upward trend is an artifact of the smoothed transition probabilities. The top curve, on the other hand, is from a uniform IID sequence --- here the real conditional entropy is the same as the marginal entropy, but both grow as N increases because the size of the symbol set grows. (I.e., this is an artifact of keeping only the most common symbols.)

Here is the flaw: there is no demonstration that only linguistic sequences have this pattern in their conditional entropies. Rao et al. have shown that two really extreme non-linguistic processes don't, but that's not a proof or even a plausibility argument. I would settle for an argument that non-linguistic processes have to be really weird to show this pattern, but even that is lacking. In Mayo's terms, they have not shown that this test has any severity.

Of course the fact that they haven't shown their test is severe doesn't mean that it isn't, in fact, severe. So, by way of procrastinating, I spent some time yesterday constructing a counter-example. My starting point was what Mark had done, generating a sequence of IID draws from a geometric distribution (rather than a uniform one) and subjecting it to the same analysis as Rao et al. As it happens, I had already written a function in R to fit Markov chains and calculate their log-likelihood, and here the conditional entropy is the negative log-likelihood divided by the sequence length. (Admittedly this is only true using the maximum likelihood estimates for transition probabilities, rather than smoothed estimates as Rao et al. do, but my simulations had so much data this shouldn't matter.) Setting the rate of the geometric distribution to 0.075, here were my first results.
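
For concreteness, here is a minimal sketch of that kind of calculation (a sketch of mine, not the code linked at the end of this post): fit a first-order chain by maximum likelihood and take the conditional entropy to be minus the average log transition probability, in bits.

# Fit a first-order Markov chain by maximum likelihood and report the
# conditional entropy H(next | current) in bits.
markov_cond_entropy <- function(x) {
  trans <- table(head(x, -1), tail(x, -1))   # transition counts
  probs <- trans / rowSums(trans)            # MLE transition probabilities
  -sum((trans / sum(trans)) * log2(probs), na.rm = TRUE)
}
set.seed(1)
x <- rgeom(1e5, prob = 0.075)                # IID geometric "symbols"
p <- prop.table(table(x))
c(marginal = -sum(p * log2(p)),              # marginal entropy
  conditional = markov_cond_entropy(x))      # nearly equal, as they should be for IID data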

Mark Liberman and Richard Sproat did almost the same thing pretty much simultaneously, as you can see from the updates to Mark's post.

This was not entirely satisfactory, since (as Rao et al. point out in the online supplementary materials), there is a big gap between the marginal and conditional entropies for writing and for the Indus symbols. This was also, however, not too hard to finesse. In addition to the geometric sequence, I generated a Markov chain which alternated between the values +1 and -1, but where each positive or negative sign was 99% likely to be followed by the same sign. (That is, the signs were highly persistent.) I then multiplied the IID geometric variables (plus 1) by the Markov signs. This gave a larger set of symbols, but where knowing the sign of the current symbol (which "register" or "sub-vocabulary" it came from) was quite informative about the sign of the next symbol. (I added 1 to the geometric variables to exclude 0=-0, keeping the sub-vocabularies distinct.)
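
Here is a sketch of that construction, reusing markov_cond_entropy from the snippet above; the 99% persistence and the geometric rate of 0.075 are the values mentioned in the text.

# Highly persistent +/-1 signs multiply the (shifted) geometric draws, so
# the sign acts as a hidden "register" linking successive symbols.
set.seed(2)
n <- 1e5
sgn <- numeric(n); sgn[1] <- 1
for (t in 2:n) sgn[t] <- if (runif(1) < 0.99) sgn[t - 1] else -sgn[t - 1]
y <- sgn * (rgeom(n, prob = 0.075) + 1)      # the +1 keeps the two vocabularies disjoint
p <- prop.table(table(y))
c(marginal = -sum(p * log2(p)),
  conditional = markov_cond_entropy(y))      # now clearly below the marginal entropy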


[Figure: pluses show the marginal entropy; circles show the conditional entropy.]

A third experiment starts from the fact that the Indus symbol sequences are all extremely short, at most a dozen characters or so. Instead of having a Markov chain for the sign, I used another, independent set of random draws, uniform on the integers from 2 to 6, to divide the sequence into blocks, and gave all the symbols in each block the same (coin-toss) sign.
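
And a sketch of this third construction, again reusing the estimator above; the block lengths are uniform on the integers 2 through 6, as described, with an independent coin-toss sign per block.

# Short blocks of common sign: lengths uniform on 2..6, one coin-toss sign
# per block, applied to the shifted geometric draws.
set.seed(3)
n <- 1e5
lens <- sample(2:6, size = n, replace = TRUE)
lens <- lens[cumsum(lens) <= n]              # keep enough blocks to fill ~n symbols
sgn <- rep(sample(c(-1, 1), length(lens), replace = TRUE), times = lens)
z <- sgn * (rgeom(length(sgn), prob = 0.075) + 1)
p <- prop.table(table(z))
c(marginal = -sum(p * log2(p)),
  conditional = markov_cond_entropy(z))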


[Figure: pluses show the marginal entropy; circles show the conditional entropy.]

(Because I'm doing everything with a single long sequence, I artificially introduce transitions from positive to negative signs, which lowers the gap between the conditional and unconditional entropies. If I wanted to do this properly, I'd re-write my Markov estimator so it used many short sequences; but that would be too much like real work.)

The mechanism producing the gap between conditional and unconditional entropies is that the marginal distribution of symbols is a mixture of several pure distributions, and which mixture component we draw from now influences which component we'll draw from next (so the sequence can be Markov, exchangeable, etc.). Given the mixture components, the symbols are independent and the conditional and unconditional entropies are equal. Without that knowledge, the first symbol in effect is a cue for figuring out the mixture component, reducing the entropy of the second. There is nothing specifically linguistic about this; any hidden Markov model does as much. It would, for instance, work if the "symbols" were characters in randomly-selected comic books, who cluster (though slightly imperfectly); if that's too low-brow, think about Renaissance paintings, and the odds of seeing St. John the Baptist as opposed to a swan in one which contains Leda.

I have made no attempt to match the quantitative details Rao et al. report for the Indus symbols, just the qualitative patterns. Were I to set out seriously to do so, I'd get rid of the geometric distribution, and instead use a hidden Markov model with more than two states, each state having a distinct output alphabet, the distribution of which would be a Zipf (as used by Liberman or Sproat) or a Yule-Simon. (I might also try block-exchangeable sequences, as in my third example.) But this would approach real work, rather than a few hours of procrastination, and I think the point is made. Perhaps the specific results Rao et al. report can only be replicated by making distributional assumptions which are very implausible for anything except language, but I'd say that the burden of proof is on them. If, for instance, they analyzed lots of real-world non-linguistic symbol systems (like my comic books) and showed that all of them had very different conditional entropy curves than did actual writing, that would be a start.

I should in the interest of full disclosure say that a number of years ago Farmer and I corresponded about his work on the development of pre-modern cosmologies, which I find interesting and plausible (though very conjectural). But if anything I hope Farmer et al. are wrong and the Indus Valley civilization was literate, and I'd be extra pleased if the language were related to Tamil. Unfortunately, if that's the case it will need to be shown some other way, because these conditional entropy calculations have, so far as I can see, no relevance to the question at all.

My code (in R) is here if you want to play with this, or check if I'm doing something stupid. In the unlikely event you want more, I suggest reading the reply of Farmer et al., Rahul Siddharthan (especially the comments), or Fernando Pereira; the last is probably the wisest.

Manual trackback: Metadatta; Language Hat

Enigmas of Chance; Writing for Antiquity

Posted by crshalizi at April 28, 2009 22:30 | permanent link

April 10, 2009

Next Week at the Statistics Seminar: Bayes, Bayes, Baked Beans, Sausage and Bayes

Next week at the CMU statistics seminar, we give you all the Bayes you could want and more:

Tom Griffiths, "Connecting human and machine learning via probabilistic models of cognition"
Abstract: Human performance defines the standard that machine learning systems aspire to in many areas, such as forming new concepts, making scientific discoveries, and learning language. This suggests that studying human cognition may be a good way to develop better learning algorithms, as well as providing basic insights into how the mind works. However, in order for ideas to flow easily from psychology to computer science and vice versa, we need a common language for describing human and machine learning. I will summarize recent work exploring the hypothesis that probabilistic models of cognition, which view learning as a form of statistical inference, provide such a language, including results that illustrate how novel ideas from statistics can inform cognitive science.
Date and place: Monday, 13 April 2009, 4-5 pm in Porter Hall 125C
Peter Orbanz, "Conjugate Projective Limits"
Abstract: Bayesian nonparametric models are essentially Bayesian models on infinite-dimensional spaces. Most work along these lines in statistics focusses on probability models over the simplex. In machine learning, the problem has recently received much attention as well, and attempts have been made to define models on a wider range of infinite-dimensional objects, including measures, functions, and infinite permutations and graphs.
In my talk, I will discuss the construction of nonparametric Bayesian models from finite-dimensional Bayes equations, roughly analogous to the Kolmogorov extension of systems of measures to their projective limits. I will present an extension theorem applicable to regular conditional probabilities. This can be used to study whether "conditional" properties of the finite-dimensional marginal models, such as conjugacy and sufficiency, carry over to the infinite-dimensional projective limit model, and to determine the functional form of the nonparametric Bayesian posterior if the model is conjugate.
Date and Place: Friday, 17 April 2009, 1-2 pm, Porter Hall A18A
Both talks are free and open to the public.

Enigmas of Chance; Minds, Brains, and Neurons

Posted by crshalizi at April 10, 2009 13:08 | permanent link

In Another Green World

I will be completely offline from the 11th to 20th, while I contemplate whether certain referees deserve to be fed to jaguars, or whether it would not be more humane to sacrifice them to the great feathered serpent. (I mean humane to the cats, of course.)

Self-centered

Posted by crshalizi at April 10, 2009 13:00 | permanent link

April 03, 2009

Next Week at the Statistics Seminar: "Methods and Models for Time-Dependent Relational Data"

Next week at the CMU statistics seminar:

Andrew C. Thomas, "Methods and Models for Time-Dependent Relational Data"
Abstract: In recent years there has been a proliferation of relational data, including studies of social networks, where outcomes are measured over time. Methods that use binary outcomes traditionally rely on Markov chain assumptions that indicate the generation, maintenance and severing of network ties, often relying on dichotomized perceptions of relationships. I present a hierarchical latent variable model for time-dependent relations that makes use of dynamic programming techniques for its solution, and discuss the expansion of this model to non-binary ties, with several real and synthetic examples of social and professional interactions.
Date and place: Monday, 6 April 2009, 4-5 pm in Porter Hall 125C
Free and open to the public.

Networks; Enigmas of Chance

Posted by crshalizi at April 03, 2009 13:37 | permanent link

March 31, 2009

Books to Read While the Algae Grow in Your Fur, March 2009

Diana Pharaoh Francis, The Cipher and The Black Ship
Fantasy brain-candy. The first book does not, despite the title, involve cryptography.
Fall of Cthulhu, vol. 4, Godwar
Pointless if you have not been following along (1, 2, 3).
Carrie Vaughn, Kitty Raises Hell
Patricia Briggs, Bone Crossed
Dean Baker, Plunder and Blunder: The Rise and Fall of the Bubble Economy
The best thing I've read on the current crisis: short, plainly written, and totally accurate. (You can get a sense of its contents here.)
Robert Sharer with Loa Traxler, The Ancient Maya, 6th edition
Massive (~800 pp.) textbook on Maya archaeology, supplemented with ethnohistory and ethnography. Covers the whole period from first settlement through the Spanish conquests, though naturally emphasizing the Classic period (+250 to +900 or +1100, depending on where you are).
Taylor Anderson, Into the Storm, Crusade and Maelstrom
More enjoyable than a trilogy of military SF novels which could be summarized as "what these lemurs need is a boatload of vintage honkeys" has any right to be.
Felix Gilman, Thunderer: A Novel of High Fantasy
The city itself as the enchanted realm, with lost, mad and exploited gods, airships, music, feral children, and philosophes writing an encyclopedia. (He realizes that the ecology makes no sense.)

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; The Dismal Science; The Progressive Forces; The Continuing Crises; Writing for Antiquity; Cthulhiana

Posted by crshalizi at March 31, 2009 23:59 | permanent link

March 27, 2009

Another Idle Question

How many of the people currently pushing or exploiting conspiracy theories about the introduction of a global currency also claim to support returning to the gold standard?

(And where's Dan Sperber when we need him?)

The Dismal Science; The Running Dogs of Reaction; Psychoceramics

Posted by crshalizi at March 27, 2009 21:10 | permanent link

March 26, 2009

Some Bayesian Finger-Puzzle Exercises, or: Often Wrong, Never In Doubt

Attention conservation notice: Clearing out my drafts folder. 600+ words on some examples that I cut from a recent manuscript. Only of interest to (bored) statisticians.

The theme here is to construct some simple yet pointed examples where Bayesian inference goes wrong, though the data-generating processes are well-behaved, and the priors look harmless enough. In reality, however, there is no such thing as a prior without bias, and in these examples the bias is so strong that Bayesian learning reaches absurd conclusions.

Example 1

The data X_i, i=1,2,3,..., come from a 50/50 mixture of two Gaussians, with means at -1 and +1, both with standard deviation 1. (They are independent and identically distributed.) The prior, by coincidence, puts probability 1/2 on each of two hypotheses, that the data are Gaussian with mean -1 or Gaussian with mean +1, in both cases with standard deviation 1. So initially the posterior predictive distribution coincides exactly with the actual data-generating distribution. After n observations x_1, ..., x_n, whose sum is z, the likelihood ratio L(+1)/L(-1) is e^{2z}. Hence the posterior probability that the expectation is +1 is 1/(1+e^{-2z}), and the posterior probability that the expectation is -1 is 1/(1+e^{2z}). The sufficient statistic z itself follows an unbiased random walk, meaning that as n grows it tends to get further and further away from the origin, with a typical size growing roughly like n^{1/2}. It does keep returning to the origin, at intervals dictated by the arc sine law, but it spends more and more of its time very far away from it. The posterior estimate of the mean thus wanders from being close to +1 to being close to -1 and back erratically, hardly ever spending time near zero, even though (from the law of large numbers) the sample mean converges to zero.

This figure shows typical sample paths for z, for the posterior probability of the +1 mode, and for the relative entropy of the predictive distribution from the data-generating distribution. (The latter is calculated by Monte Carlo since I've forgotten how to integrate, so some of the fuzziness is MC noise.) Here is the R code.

[Figure: click through for the full-size PDF.]
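
For readers who do not want to chase the link, here is a minimal re-creation of Example 1 (my sketch, not the linked code): simulate the mixture, track the sufficient statistic z, and convert it into the posterior probability of the +1 mode.

# Example 1: 50/50 mixture of N(-1,1) and N(+1,1); the Bayesian puts prior
# probability 1/2 on each of the means -1 and +1.
set.seed(4)
n <- 1e4
x <- rnorm(n, mean = sample(c(-1, 1), n, replace = TRUE), sd = 1)
z <- cumsum(x)                               # sufficient statistic after each observation
post_plus <- 1 / (1 + exp(-2 * z))           # posterior probability that the mean is +1
plot(post_plus, type = "l", ylim = c(0, 1),
     xlab = "n", ylab = "posterior probability of the +1 mode")
mean(x)                                      # the sample mean, by contrast, is near 0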

Exercise 1: Confirm those calculations for the likelihood ratio and so for the posterior.

Exercise 2: Find the expected log-likelihood of an arbitrary-mean unit-variance Gaussian under this data-generating distribution.

Example 2

Keep the same data-generating distribution, but now let the prior be the conjugate prior for the mean of a unit-variance Gaussian, namely another Gaussian, centered at zero. The posterior is then another Gaussian, which is a function of the sample mean, since the latter is a sufficient statistic for the problem.

Exercise 3: Find the mean and variance of the posterior distribution as functions of the sample mean. (You could look them up, but that would be cheating.)

As we get more and more data, the sample mean converges almost surely to zero (by the law of large numbers), which here drives the mean and variance of the posterior to zero almost surely as well. In other words, the Bayesian becomes dogmatically certain that the data are distributed according to a standard Gaussian with mean 0 and variance 1. This is so even though the sample variance almost surely converges to the true variance, which is 2. This Bayesian, then, is certain that the data are really not that variable, and any time now will start settling down.
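
A quick numerical sketch of Example 2, taking the prior on the mean to be a standard Gaussian (the prior variance is my choice; any fixed value gives the same limit): the posterior piles up on 0 while the sample variance heads for 2.

# Unit-variance Gaussian model with prior N(0,1) on the mean:
# posterior precision = 1 + n, posterior mean = sum(x) / (1 + n).
set.seed(5)
n <- 1e5
x <- rnorm(n, mean = sample(c(-1, 1), n, replace = TRUE), sd = 1)
post_mean <- sum(x) / (1 + n)
post_sd <- sqrt(1 / (1 + n))
c(posterior_mean = post_mean, posterior_sd = post_sd, sample_variance = var(x))
# the posterior collapses towards a point mass at 0, while the sample variance is near 2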

Exercise 4: Suppose that we take the prior from the previous example, set it to 0 on the interval [-1,+1], and increase the prior everywhere else by a constant factor to keep it normalized. Show that the posterior density at every point except -1 and +1 will go to zero. (Hint: use exercise 2 and see here.)

Update in response to e-mails, 27 March: No, I'm not saying that actual Bayesian statisticians are this dumb. A sensible practitioner would, as Andy Gelman always recommends, run a posterior predictive check, and discover that his estimated model looks nothing at all like the data. But that sort of thing is completely outside the formal apparatus of Bayesian inference. What amuses me in these examples is that the formal machinery becomes so certain while being so wrong, while starting from the right answer (and this while Theorem 5 from my paper still applies!). See the second post by Brad DeLong, linked to below.

Manual trackback: Brad DeLong; and again Brad DeLong (with a simpler version of example 1!); The Statistical Mechanic

Enigmas of Chance

Posted by crshalizi at March 26, 2009 10:45 | permanent link

March 24, 2009

Where Did the Steelworkers Go?

Attention conservation notice: back-of-the-envelope calculations about why the US has only about a fifth as many steelworkers now as it did in 1960. Not backed by any actual knowledge of the steel industry. Utterly untimely, it was, I think, prompted by a comment thread on Unfogged, but so long ago I can't remember which.

In 1960, US primary steel production was 91 million tons, of which 2.95 million tons were exported; it also imported 3.24 million tons. This part of the industry employed 530,000 people in all capacities, for an annual output of 170 tons/employee.

In 2007, US primary steel production was 98.1 million tons, with exports of 10.1 million tons and imports of 30.2 million tons. Employment was only 97,540 people, coming to 1005 tons/employee.

Exports and imports in 1960 were a wash, nearly enough, so let's suppose trade patterns had remained comparable and say that all of the net imports were to be made up by higher domestic production: (20.1 million tons)/(1005 tons/worker) = 20,000 extra workers. This would be a substantial increase, but it would still leave employment in steel at only 22% of its 1960 level. Where did the other four-fifths of the industry go?

The most obvious explanation is productivity. The industry in 2007 produced more than it did in 1960, with many fewer employees. In fact, output per employee grew 5.9 times over that period. A slight rise in total demand, divided by a roughly six-fold increase in productivity, equals a roughly five-fold fall in employment.

Now, this calculation understates the effect of trade because it only considers net imports of steel. But steel is used as an input to producing many other things, and a washing machine made of steel shows up in this sort of official statistic as an import of a manufactured good, not an import of steel. So to really see what US steel production would be if we retained 1960 trade patterns, we'd need to see what the change in the (foreign*) steel content of US net imports has been. Since I don't have Leontief input-output matrices for the US and its trading partners in the two years, I can't do this.

Failing actual knowledge, I'll turn to guesswork. Suppose the steel content of imports was equal to net direct imports; this seems high, but what do I know? This would just add another 20,000 jobs, and bring us up to 26% of the size of the industry in 1960. To get the same level of employment in steel production now as in 1960, the net increase in the foreign steel content of our imports would have to satisfy

(530,000 workers) - (117,540 workers for domestic production and direct imports) = (increase in net indirect imports)/(1005 tons/worker)
or 414,522,300 tons, i.e., about 3.5 times total production plus net direct imports. This is highly implausible.
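
For anyone who wants to fiddle with the assumptions, the whole back-of-the-envelope calculation fits in a few lines (all figures as quoted above and in the sources below).

# Tons and workers as quoted in the text.
out1960 <- 91.0e6; emp1960 <- 530000
out2007 <- 98.1e6; emp2007 <- 97540
net_imports <- 30.2e6 - 10.1e6
prod1960 <- out1960 / emp1960                    # ~170 tons per employee
prod2007 <- out2007 / emp2007                    # ~1005 tons per employee
prod2007 / prod1960                              # ~5.9-fold productivity growth
extra_direct <- net_imports / prod2007           # ~20,000 workers to replace net direct imports
(emp2007 + extra_direct) / emp1960               # ~22% of 1960 employment
(emp2007 + 2 * extra_direct) / emp1960           # ~26%, if indirect steel content = direct imports
needed <- (emp1960 - (emp2007 + extra_direct)) * prod2007
c(needed, needed / (out2007 + net_imports))      # ~415 million tons, ~3.5 times output plus net imports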

I conclude that domestic employment in steel production has collapsed largely because increases in productivity have not been matched by increases in demand. If someone can point out where this reasoning goes wrong, I'd appreciate it.

*: Foreign steel content, because if the washing machine is made abroad of steel exported by the US, replacing that washing machine with a US-made one will not increase the demand for American steel.

Sources: 2007 employment figure from BLS (NAICS code 3311). 1960 employment figure from Table 1 on p. 2 of Lebergott. (It does not, however, appear to be affected by some of the well-known problems with Lebergott's series for the 1930s.) Annual production, import and export figures from USGS.

Manual trackback: The Inverse Square Blog; Nothing Funny About Feldspar (with more facts; go read)

The Dismal Science

Posted by crshalizi at March 24, 2009 10:49 | permanent link

March 22, 2009

Special Function Invocation

O Hive Mind, o Lazy Web, Urania's child, I invoke thee! Is there a name for the function
\[ 
f_n(\theta) = \sum_{k=0}^{n}{{n \choose k} \theta^k {(1-\theta)}^{n-k} \log{k!}} 
 \]
i.e., for $ \mathbb{E}[\log{X!}] $ when X is binomially distributed?
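
In case anyone wants to compute with it while thinking of a name, numerical evaluation is a one-liner, since lgamma(k+1) = log k!.

# E[log X!] for X ~ Binomial(n, theta)
Elogfact <- function(n, theta) {
  k <- 0:n
  sum(dbinom(k, size = n, prob = theta) * lgamma(k + 1))
}
Elogfact(20, 0.3)   # for example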

Enigmas of Chance

Posted by crshalizi at March 22, 2009 21:52 | permanent link

March 17, 2009

Idle Question of the Day

Exactly what bad consequences would follow if laws were passed by the relevant countries rendering credit default swap contracts void henceforth? (That is, canceling all the outstanding wagers because the bookies went bust.)

Update, 22 March: Well, one bad consequence would evidently be agreeing with Ben Stein. A bit from that link (by Felix Salmon, not Stein) is worth quoting:

There's a good chance, just for starters, that every major bank in America would go bust overnight: after all, they've been packaging up and selling off the credit risk on their multi-trillion-dollar loan portfolios [for years]. If Stein got his way, all that credit risk would suddenly reappear on the banks' balance sheets, and there's nothing they could do about it. Genius. Remember that those super-senior CDOs were the safest bits of the credit that they sold off. Just imagine what their balance sheets would look like if all the risky bits reappeared.

The issue he's raising is that if the banks can't say that they're covered for the risk of their loans defaulting (via the credit default swaps), they need to hold more capital as a protection against default. So as a legal or regulatory issue, ending the swaps would make the banks worse off. Substantively, however, this only makes sense if the swaps would, in fact, protect banks in the event of defaults — if they actually shifted the risk to the swap-sellers. Since we have just had pretty dramatic demonstrations that this is not something to be counted on, it's not at all clear to me that the banks ought to be able to keep that risk off their balance sheets. (In other words, the real value of the swaps to the banks is zero, or next to zero.) In any case, this objection could be countered by combining ending credit default swaps with public guarantees of the banks' existing positions — which is effectively what's happening anyway, only without making it harder to repeat the mistake in the future.

More broadly, ending credit default swaps would mean that those who sold such swaps would lose their stream of payments (a flow) but gain back their collateral and reserves (a stock); conversely buyers of default protection would gain a cash flow but take a hit to their capital stocks. Right now one imagines that even those selling the "end of the world trade" might prefer to get out of the game; I'd be interested to see an estimate of the effects of this on the stability of the financial world right now.

There is also the possibility that eliminating the swaps would deprive us of information about how risky different debts are. The value we should place on this, however, depends on how well these markets actually succeed in aggregating information about risk. I'd say there is abundant cause for skepticism about this — especially when things are, in fact, dangerous. Economic theory does not, in fact, provide any reason to think that such markets will be dominated by those with the most accurate beliefs (or even that the market as a whole will be more accurate than the best-informed trader), unless you assume a complete set of markets, which is a reductio ad absurdum if ever there were one. (When markets are incomplete, more markets are not necessarily better.)

To be clear, I am not asserting that credit default swaps should be ended. I honestly don't think I know enough to have an opinion about that, and while I'm obviously skeptical about their value, some serious and credible people (i.e., ones who do not have a vested interest in the matter) who've studied them in more depth see merit to them. If this is what a world with efficiently-allocated risk looks like, though, I'd hate to see a messed-up one.

(Thanks to readers D.H. and son1 for comments and pointers.)

The Dismal Science; The Continuing Crises; Modest Proposals

Posted by crshalizi at March 17, 2009 22:22 | permanent link

February 28, 2009

Books to Read While the Algae Grow in Your Fur, February 2009

Paul Krugman, The Return of Depression Economics and the Crisis of 2008
How can Paul Krugman get away with re-issuing a book he wrote ten years ago as being about the current financial meltdown? Because the problems are the same: nobody listened. (More exactly, our madmen in authority didn't listen.) The new chapter on the US "shadow banking system" and its collapse is especially good.
Steven N. Durlauf and H. Peyton Young (eds.), Social Dynamics
Economists attempting to assimilate non-market, non-"rational" (but adaptive) social interactions. Highlights: Blume and Durlauf's chapter on the statistical mechanics of adaptive games; Young's summary of his book on institutions; Bowles's summary of his work on group selection and the evolution of preferences (expanded on in his book, which everyone should read); Axtell, Epstein and Young's model of how invidious, inefficient, self-sustaining social distinctions can arise endogenously and persist for immensely long times, even though "in the long run" egalitarian conventions are more stable.
There are two chapters on the problems with actually detecting social interactions from survey data. Scheinkman and Glaeser simply assume certain forms for interactions a priori (unhelpfully cast in the form of utility functions, when only the behavioral decision rules matter), postulate an absurd topology for interactions, observe that the variance of aggregates (e.g., cities) will scale one way with the size of the aggregates if people make decisions independently, but another way if there are interactions, and proceed merrily to fit and estimate coefficients. (At no point do they do any specification-checking, at least not that I can see.) Moffitt considers many of the ways in which strictly linear social-interactions models with unique equilibria can fail to be identifiable --- though why that class of model should be thought plausible, I couldn't tell you. (It would have been interesting to read these authors' reactions to each others' papers.)
The final chapter, by Binmore on social contracts, did nothing for me.
Disclaimer: Axtell, Blume, Bowles, Durlauf, Epstein, Young and I are all affiliated with SFI.
Umbrella Academy: Apocalypse Suite; Proof: The Company of Men; The Walking Dead, book I
Various flavors of comic-book mind-candy.
Simon Oliver and Tony Moore, The Exterminators, vol. 5: Bug Brothers Forever
A fitting conclusion to the saga; but it will only make sense if you've read the previous parts [1, 2--4].
Alan Moore, Steve R. Bissette and John Totleben, The Saga of the Swamp Thing, vol. 1
Supposedly a classic among graphic novels. I read it in high school (loaned to me by a teacher!), but didn't really remember anything. On re-encountering it at my local avatar of The Android's Dungeon... the story-telling is decent enough (but: planarian worms? wasn't that pretty retro even then?), but it doesn't leave me wanting to see what happens next. And the art-work is ugly. Does it get better?
Warren Ellis and Paul Duffield, Freakangels, vol. I
Collects the beginning of the webcomic about life in London after the end of the world.
Paul R. Krugman, The Self-Organizing Economy
I read this, and reviewed it, twelve years ago (where does the time go?), but re-read it in preparation for teaching some of this material. I see no reason for altering my review.
John Tyler Bonner, The Social Amoebae: The Biology of Cellular Slime Molds
The cellular slime molds are exceedingly strange creatures. For most of their life-cycle, they are single-celled amoebae, crawling through the soil eating bacteria and reproducing by fission. When stressed by lack of food, however, they spontaneously aggregate, in a quite freaky-looking self-organizing process, which is actually an excitable medium. The result is a "grex" or "slug", which is to all appearances a differentiated, multi-cellular organism that responds to stimuli (light, heat, chemical gradients, etc.), and crawls up through the soil and then uphill. Having rooted itself in place, it further differentiates into a base or stalk, where all the cells die, and a fruit body full of spores, generally dispersed by insects. (They also have a version of sex, which is too weird for me to describe here.) They obviously raise a lot of questions: How do they do that? Why do they do that? How did they evolve to do that? Is that how our multicellularity evolved? etc.
There are many species of these oddities, but the most commonly studied one, on its way to being a standard model organism, is Dictyostelium discoideum. Bonner is about the second scientist in the world to study Dictyostelium (it was discovered by a previous graduate student of his thesis adviser), and has been plugging away at it for sixty years now. This is his second book on the subject (his old monograph is long out of print, and anyway came just before his lab made some crucial discoveries), intended as a synthesis of what we know about the cellular slime molds in general, and Dictyostelium in particular. The result is a slim but comprehensive, well-written and thought-provoking book from an old master, which I strongly recommend to anyone interested in evolution or development, or indeed in self-organization.
Lois McMaster Bujold, The Sharing Knife, vol. 4: Horizon
Jeff Lindsay, Dearly Devoted Dexter vs. Dexter: Season 2
Words I never thought I'd write: the TV show is better. In particular it does a much better job of creating and sustaining other characters. (Such as the manic pixie dream girl from hell.) I realize that from Dexter's point of view, everybody else is a one-dimensional obstacle-or-resource, but that quickly gets old.
Carrie Vaughn, Kitty and the Dead Man's Hand
Mind-candy; makes me very glad we did not elope to Vegas. Previous installments [1, 2, 3, 4] probably not absolutely necessary but would definitely help.
Read this month but not exactly recommended:
Charles P. Kindleberger, Manias, Panics and Crashes: A History of Financial Crises
This comes highly recommended from usually reliable authorities (Samuelson and Solow most notably), but, at least in the edition I read (the 5th, of 2005, revised by Robert Aliber), it's a disappointment. There is a lot of information in it, and the basic story about the credit cycle is sound, but the writing is fragmented, disjointed and repetitive. More fundamentally, it tries to be an analysis, or even an anatomy, of the recurring parts and phases of crises, and it tries to illustrate this by recounting the appropriate sections of the histories of many crises in each chapter, without giving these tales any kind of narrative or historical context. (For instance, I don't think he ever explains what the South Sea Bubble was, or John Law's land-company bank in France, despite many consequential references to both.) This might work if you already know the stories of all the major and many of the minor financial crises of the capitalist core from the 18th century until now, but not otherwise.
George Cooper, The Origin of Financial Crises: Central Banks, Credit Bubbles and the Efficient Market Fallacy
A good exposition of credit cycle ideas and well-deserved mockery of the efficient market hypothesis, melded with a very unfair take on actual economics (apparently books and papers like these do not exist), and some truly weird and regressive Friedmanite policy proposals. (Not all his policy ideas are bad, but he doesn't seem to realize that "keep the money supply stable" isn't exactly a new idea.) The invocation of "Maxwell and Mandelbrot" for, respectively, control theory and heavy-tailed distributions is pretentious, and if I took it seriously (as invited by the fact that he reprints Maxwell's paper "On Governors"!), dumb. (For example: Maxwell's remark about how there are only four possible kinds of responses to perturbations only applies to linear systems.) But it would be more charitable to regard these bits as ill-advised rhetorical flourishes.
(Thanks to John Burke for pointing out typos.)

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Complexity; The Dismal Science; Biology; The Progressive Forces; The Continuing Crises; Incestuous Amplification

Posted by crshalizi at February 28, 2009 23:59 | permanent link

February 14, 2009

On Tao's Structure and Randomness

My review of the book of Terence Tao's blog is now out in the March-April 2009 issue of American Scientist. I wrote about Tao's blog a while ago, but there's a lot more in the review. Hopefully at least some of it is comprehensible.

— Having spent over fifteen years reading the magazine, especially for its book reviews, I am happy to finally have something printed there, but worry it means their standards are slipping.

Manual trackback: The Browser

Mathematics; Self-Centered

Posted by crshalizi at February 14, 2009 22:48 | permanent link

January 31, 2009

Books to Read While the Algae Grow in Your Fur, January 2009

Charles Tilly, Trust and Rule
Tilly's principal book on "trust networks", how they sustain themselves or fail to do so, and how they relate to states and other forms of political power. Trust networks "consists of ramified interpersonal connections, consisting mainly of strong ties, within which people set valued, consequential, long-term resources and enterprises at risk to the malfeasance, mistakes, or failures of others" (p. 12). This defines trust as a relationship, one of exposing oneself to serious risk through another's malice or mistakes, a definition which is pointedly silent about the feelings accompanying the relationship. I think this nails it.
Tilly goes on (p. 13, italics his):
Most networks support little or no trust. We will sometimes recognize segments of networks that qualify as trust-connected cliques. But the networks of drug use, blood distribution, and sexual contact through which HIV spreads, the networks through which routine political information flows, and the networks established by shared membership in voluntary associations mostly do not qualify. More generally, single-stranded networks containing few triads and sustaining little intimacy among their nodes rarely or never become trust networks.

Characteristic enterprises in which trust networks figure importantly include cohabitation, procreation, provision for children, transmission of property, communication with supernatural forces, joint control of agricultural resources, long-distance trade, protection from [human] predators, maintenance of health, and collective response to disaster. With marked variation from setting to setting, trust networks often take the forms of religious sects and solidarities, lineages, trade diasporas, patron-client chains, credit networks, mutual aid societies, age grades, and local communities.

This is all very good, and I also like that Tilly does not romanticize trust networks, being explicit that terrorist cells, pirates, Russian mobsters, etc., all qualify, and that (as these examples suggest) the risky undertaking enabled by a trust network can be preying on others. Even without that: "Powerful figures within trust networks sometimes tyrannize their members: instill strange beliefs in them, put them through painful initiations, force youngsters into distasteful careers, require shows of respect for unworthy elders, murder young women who challenge their sexual or marital prescriptions. By no means does membership in a trust network guarantee happiness, much less freedom." Still, he convinces the reader, or at least me, that trust networks are enduring and important parts of society. He also offers some intriguing generalizations, like the one about the importance of triads in the network graph. (I think, but I don't recall him ever quite saying this explicitly, this is because triads make monitoring and reputation possible.) There are, as usual, many excellently-presented historical cases, spanning the globe and the centuries.
Nonetheless, I find myself less than fully satisfied. (1) Nobody except Tilly talks about "trust networks", at least not in this sense, and we rarely have historical cases where we can identify them with any precision. So there is a lot of guesswork here. (2) Tilly's stories about the kinds of mechanisms at work sound plausible, as usual, but I despair of ever being able to test them. (3) He offers no guidance about when we should expect different mechanisms to be engaged. Perhaps, to be fair, no guidance can be offered at this level — perhaps it always depends on high-precision historical details. (More minorly: [4], Tilly sometimes, as on p. 81, insists that the ties linking members of a trust network must be a relationship for which the participants have a name: why? I don't even think all his examples meet this criterion. [5] This already-brief book would have been even shorter if he didn't repeat his definitions of terms over and over.) I can't help but think that Tilly's dislike of game theory may have been a liability.
It's interesting to think about science in terms of trust networks. Scientific collaboration is placing "valued, consequential, long-term resources and enterprises" — viz., the scientist's reputation and career — "at risk to the malfeasance, mistakes, or failures" of the scientist's collaborators. Might this go some way towards explaining features of scientific collaboration networks, like the very high density of triads, and persistent collaborative cliques? As for "strange beliefs", "painful initiations", "distasteful careers", and "deference to unworthy elders" (but fortunately not murder), the jokes draw themselves.
Dave Lapham, Silverfish
Noir crime-fiction in comic-book form, with teenage protagonists. Well-told and well-drawn.
Lois McMaster Bujold, The Sharing Knife, vol. 3, Passage
Mathukumalli Vidyasagar, Learning and Generalization: With Applications to Neural Networks
A very nice textbook on statistical learning theory (a la Vapnik) which, properly, treats it as a branch or extension of empirical process theory, and emphasizes function learning (instead of just classifier learning). Among the many nice features here is the recognition that data in the real world are dependent, and a discussion of conditions under which learning procedures designed for independent data will still work with dependent data, albeit with an efficiency penalty reflecting how quickly correlations decay. (Beta-mixing, for example, is sufficient but not necessary; an interesting open question is what the necessary and sufficient conditions on a mixing process are for probably-approximately-correct learning to remain possible.) Vidyasagar is also good at building connections to computational learning theory, which introduces considerations of time- and sample- complexity.
No prior knowledge of learning theory or even of measure-theoretic probability is required; all the necessary mathematical material is built here. Basic mathematical maturity, of the kind one would expect from graduate students in statistics, computer science, electrical engineering, physics, economics, etc., is essential.
The last two chapters consider, respectively, neural networks and problems in control theory. (Despite the back-cover blurb, support vector machines are discussed on only one page.) The neural network chapter is fairly self-contained, but the control chapter will be largely incomprehensible to those without previous exposure to the subject. This is a shame, since it contains material about randomized algorithms for probably-approximately-correct solutions to intractable problems, and about system identification, which would be of interest to readers whose eyes will glaze over (to say the least) at the sight of transfer functions. This is, however, at most a minor flaw.
If I were teaching a class on statistical learning theory, I would definitely consider using this book.
(Thanks to Dr. Vidyasagar for some interesting correspondence, which prompted me to read his book.)
James K. Galbraith, The Predator State: How Conservatives Abandoned the Free Market and Why Liberals Should Too
I really have nothing important to add to Aaron Swartz's summary, other than to say: read this book. (I am buying copies for friends and relatives.)
(On an un-important note, I think it is unfair but inevitable to compare this J. K. Galbraith writing about economics and public purposes to the other one, who happens to have been his father. [Or it's at least inevitable that I'd make the comparison, since the elder Galbraith is one of my heroes.] This book is in some ways an act of filial piety, losing few opportunities to point out places where the senior Galbraith has been vindicated by events. More broadly, its great theme is the collapse of what JKG I called "countervailing power", the thing which made American capitalism tolerable and even progressive — or, more exactly, the deliberate destruction of such countervailing power. For the most part this Galbraith avoids his father's style — smooth as silk, and as hard to produce — in favor of more workmanlike prose; there are a few places, but only a few, where he is positively infelicitous, in ways his father would never have allowed into print.)
Dan Simmons, The Terror
Historical horror fiction, based on the Franklin expedition of the 1840s in search of the northwest passage. Some recurring Simmons themes appear here: the characters with the unlikely fondness for classical Greek (whom he should not have had discuss natural selection; I can't decide whether this is a greater offense against historical plausibility or against sheer narrative flow), and the contrast between adapted indigenous cultures (here, the "Esqimaux") and blundering, greedy, self-defeating westerners, though he doesn't hit the reader over the head with that last quite as bluntly as in his master-work Hyperion. (Oddly, non-western high civilizations come off very poorly in Simmons's fiction, as in the brilliant and terrifying yet borderline-racist Song of Kali.) Creepy and intensely compelling.
Steven Johnson, The Invention of Air: A History of Science, Faith, Revolution, and the Birth of America
Popular biography of Joseph Priestley. Johnson tries hard to keep in view both Priestley's individual biography and the larger networks and movements he participated in, so in part this is a bit of a ramble through the 18th century English-speaking house of intellect, which is not a bad thing.
There is, as the subtitle indicates, special emphasis — more than perhaps is truly warranted — on his American connections. This is because the book is very much an attempt by Johnson to claim Priestly as part of a usable American past of Enlightenment progressivism, in which there is no tension between rational religion and scientific advance. There is nothing wrong with this — quite the reverse; this is a part of our national traditions, and we should emphasize it — but at the same time it leads Johnson into some choices in his writing which feel like they make this book more transient than it needed to be.
Annoyances: (1) the formulaic opening scene; (2) the idea that the physiological effects of coffee sparked the Age of Reason, which I tossed off as a joke, complete with counter-examples, several years ago. I am displeased to see the same conceit here (pp. 54ff) being taken seriously, not just by Johnson but evidently by others — not, to be clear, because I think I deserve credit (I'm sure it's not original), but because it is stupid.
Errata: p. 20, for "1850s" read "1750s"; p. 22, for "mid-seventeenth" read "mid-eighteenth".
H. P. Lovecraft, The Tomb, and Other Tales
Someone has already expressed my sentiments in lolcat form. (To say the least, Red Hook's changed.) But despite that there is a certain power to the stories.
Jeffrey Alford and Naomi Duguid, Mangoes and Curry Leaves: Culinary Travels through the Great Subcontinent
Very good recipes and nice travel writing, plus lickable-looking photographs.
Hiroshi Kondo, The Book of Saké (a.k.a. Saké: A Drinker's Guide)
Social history, lore and etiquette, a lovingly geeky description of the fermentation process (complete with graphs!), and specific recommendations. (Many thanks to K. for the gift, and for the Yuki no Bosha junmai ginjo.)

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; The Progressive Forces; Networks; The Dismal Science; Enigmas of Chance; Writing for Antiquity; The Great Transformation; Food; The Running Dogs of Reaction; The Beloved Republic

Posted by crshalizi at January 31, 2009 23:59 | permanent link

January 30, 2009

Bayes < Darwin-Wallace

ALL YOUR BAYES ARE BELONG TO US
Attention Conservation Notice: jargony, self-promotional ramblings about a new paper on the frequentist properties of non-parametric Bayesian procedures, which is exactly as statistics-geeky as it sounds. Plus a weird analogy to mathematical models of evolution. Even if this sounds vaguely interesting, you could always check back later and see if peer review exposed it as a tissue of fallacies.

Here's the new preprint:

CRS, "Dynamics of Bayesian Updating with Dependent Data and Misspecified Models", arxiv:0901.1342
Abstract: Recent work on the convergence of posterior distributions under Bayesian updating has established conditions under which the posterior will concentrate on the truth, if the latter has a perfect representation within the support of the prior, and under various dynamical assumptions, such as the data being independent and identically distributed or Markovian. Here I establish sufficient conditions for the convergence of the posterior distribution in non-parametric problems even when all of the hypotheses are wrong, and the data-generating process has a complicated dependence structure. The main dynamical assumption is the generalized asymptotic equipartition (or "Shannon-McMillan-Breiman") property of information theory. I derive a kind of large deviations principle for the posterior measure, and discuss the advantages of predicting using a combination of models known to be wrong. An appendix sketches connections between the present results and the "replicator dynamics" of evolutionary theory.

There are people out there who see Bayes's Rule as the key to all methodologies, something essential to rationality. I find this view thoroughly misguided and not even a regulative ideal. But I've written about that at great length elsewhere and won't repeat myself here.

While there are certainly plenty of statisticians who embrace the Bayesian way in all its Savagery (some of them will be on my tenure committee), I think it's fair to say that most of the time, Bayesian methods are not used in contemporary statistics and machine learning to impose coherence on statisticians', or computers', personal beliefs. When people like Andy Gelman, Aleks Jakulin &c. estimate logistic regressions with a particular default prior, what they are doing is really regularization, which is something that a frequentist like me can understand.

Actually, Bayesian inference is a regularization-by-smoothing technique, only in the space of probability distributions instead of the sample space. In ordinary least-squares regression, one asks "what straight line comes closest (in the Euclidean sense) to all the data points?" In subtler methods, one has a whole menagerie of possible curves, and asks which superposition of curves comes closest to the data points. To avoid over-fitting, one needs to keep the weight given to very wiggly basis curves under control, e.g., by simply ignoring possible terms which are not sufficiently smooth. (Through the magic of Lagrange multipliers, this is equivalent to penalizing wiggliness.) As one gets more and more data, finer and finer features of the regression curve can be reliably discerned, so the amount of imposed smoothing should be relaxed. For estimating regression curves, this is all thoroughly understood and even automated (making the persistence of linear regression a mystery; but another time).
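To make the smoothing analogy concrete, here is a minimal R sketch (my own toy example, with made-up data; nothing here comes from the paper) of fitting a one-dimensional curve with too much, too little, and a data-chosen amount of smoothing:

set.seed(42)
n <- 50
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)       # unknown "true" curve plus noise
fit.stiff  <- smooth.spline(x, y, spar = 1.0)   # heavy penalty: low variance, high bias
fit.wiggly <- smooth.spline(x, y, spar = 0.2)   # light penalty: chases the noise
fit.cv     <- smooth.spline(x, y, cv = TRUE)    # cross-validation picks the penalty,
                                                # which shrinks as n grows
plot(x, y)
lines(predict(fit.stiff), lty = 2)
lines(predict(fit.wiggly), lty = 3)
lines(predict(fit.cv), lty = 1)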

All of this carries over to estimating entire distributions rather than just curves. In the space of probability distributions, the entire sample is a single point, the empirical distribution. (I'm ignoring some subtleties about time series here.) The family of models we're willing to consider forms a manifold (or set of manifolds; more details) in the space of probability measures. In this space, the right way to measure distance isn't the Euclidean metric, but rather the relative entropy, a.k.a. the Kullback-Leibler divergence, which generates the information geometry. The maximum likelihood estimate is simply the geometric projection of the empirical distribution on to the manifold of models. In other words, the most likely predictive distribution comes from this single point on the manifold. (There are cleverer ways to get frequentist predictive distributions, though.) The Bayesian posterior predictive distribution gives some weight to all points on the manifold; this is the smoothing. The weights are proportional to the prior, and also decline exponentially with the product of sample size and relative entropy. The effect of the prior is to blunt the impact of what we actually see, keeping us from simply trying to match it — exactly like the effect of smoothing. The amount of smoothing done by the prior declines as the sample size grows.
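In symbols (this is just my shorthand for the heuristic above, not a formula quoted from the paper): write $L_n(\theta)$ for the likelihood of the first $n$ observations under model $\theta$, $\pi_0$ for the prior, $P$ for the data-generating distribution, and $D$ for the relative entropy (or its rate, in the dependent-data case). Then, to leading exponential order,

\[
\pi_n(d\theta) \;\propto\; \pi_0(d\theta)\, L_n(\theta) \;\approx\; \pi_0(d\theta)\, e^{-n\, D(P \| P_\theta)},
\]

and the posterior predictive distribution is the $\pi_n$-weighted average of the model predictive distributions $P_\theta$.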

Why smooth or regularize in the first place? As every school-child knows, the answer lies in the bias-variance trade-off. (Pedantically, bias-variance is for mean-squared error, but other loss functions have similar decompositions.) Regularized estimates are less responsive to differences between data sets, which means that they have less variance. No matter what the data look like, the regularized estimates all have the same shape, so to speak. But unless reality also has that shape, this creates a systematic error, i.e., bias. So the more aggressively we smooth, the more we decrease variance, but the more we increase bias. If some degree of smoothing removes more variance than it adds bias, we come out ahead. Of course, the variance should fall automatically as we get more data, but the bias depends on the strength of smoothing, so the latter should shrink as the sample grows. The important thing in analyzing estimation or prediction schemes of this type is to check whether they relax their regularization at the right rate, so that they don't get killed by either bias or variance, but rather consistently converge on the truth, or at least on the best approximation to the truth among the available models.
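As a toy numerical check (my own illustration, with arbitrary numbers; not anything from the paper), shrink a sample mean toward zero: a little shrinkage trades a small squared bias for a larger reduction in variance, while too much shrinkage lets the bias dominate.

## Bias-variance in miniature: estimate mu by shrinking the sample mean toward 0.
set.seed(2)
mu <- 1; sigma <- 2; n <- 10
xbar <- replicate(1e5, mean(rnorm(n, mean = mu, sd = sigma)))
mse <- function(lambda) mean((lambda * xbar - mu)^2)   # Monte Carlo risk
round(sapply(c(1, 0.9, 0.7, 0.5), mse), 3)
## roughly 0.40, 0.33, 0.29, 0.35: moderate shrinkage beats both no shrinkage and over-shrinkage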

All of this applies to Bayesian learning. Like any other regularization scheme, it is a way of deliberately introducing bias into our inferences, not so much on Burkean/Quinean grounds that our prejudices providentially point to the truth, but simply to reduce variance, the extent to which our inferences are at the mercy of Fortune (in her role as the goddess of sampling fluctuations). The question then becomes whether it gets the trade-off right, and manages to converge despite its bias. In other words, when does Bayesian inference possess frequentist consistency?

A surprisingly large number of people have been satisfied with a result given by the great probabilist J. L. Doob, which essentially says that under some reasonable-looking conditions, the Bayesian learner's prior probability of being inconsistent is zero. (More exactly, the posterior probability of any set of models containing the truth goes to 1, except on a set of sample paths whose prior probability is zero.) Even taken at face value, this just says that each Bayesian is convinced a priori that they'll converge on the truth, not that they actually are almost sure to find the truth.

Examining the reasonable-looking conditions of Doob's result, one finds that they entail the existence of a consistent non-Bayesian estimator. (Doob's original assumptions can actually be weakened to just the existence of such an estimator; see Schervish's Theory of Statistics, ch. 7.) It is a curious fact that every proof of the consistency of Bayesian learning I know of requires the existence of a consistent non-Bayesian estimator. (Though it need not be the maximum likelihood estimator.) There don't seem to be any situations where Bayesian updating is the only convergent route to the truth.

It turns out that Bayesian inference is not consistent in general. The late David Freedman and the very-much-alive Persi Diaconis showed that if you choose a prior distribution which is badly adapted to the actual data-generating process, your posterior distribution will happily converge on the wrong model, even though the rest of the set-up is very tame — independent and identically distributed data in a boringly well-behaved sample space, etc.

Still, there are many situations where Bayesian learning does seem to work reasonably effectively, which in light of the Freedman-Diaconis results needs explaining, ideally in a way which gives some guidance as to when we can expect it to work. This is the origin of the micro-field of Bayesian consistency or Bayesian nonparametrics, and it's here that I find I've written a paper, rather to my surprise.

I never intended to work on this. In the spring of 2003, I was going to the statistics seminar in Ann Arbor, and one week the speaker happened to be Yoav Freund, talking about this paper (I think) on model averaging for classifiers. I got hung up on why the weights of different models went down exponentially with the number of errors they'd made. It occurred to me that this was what would happen in a very large genetic algorithm, if a solution's fitness was inversely proportional to the number of errors it made, and there was no mutation or cross-over. The model-averaged prediction would just be voting over the population. This made me feel better about why model averaging was working, because using a genetic algorithm to evolve classifier rules was something I was already pretty familiar with.

The next day it struck me that this story would work just as well for Bayesian model averaging, with weights depending on the likelihood rather than the number of errors. In fact, I realized, Bayes's rule just is the discrete-time replicator equation, with different hypotheses being so many different replicators, and the fitness function being the conditional likelihood.

As you know, Bob, the replicator dynamic is a mathematical representation of the basic idea of natural selection. There are different kinds of things, the kinds being called "replicators", because things of one kind cause more things of that same kind to come into being. The average number of descendants per individual is the replicator's fitness; this can depend not only on the properties of the replicator and on time and chance, but also on the distribution of replicators in the population; in that case the fitness is "frequency dependent". In its basic form, fitness-proportional selection is the only evolutionary mechanism: no sampling, no mutation, no cross-over, and of course no sex. The result is that replicators with above-average fitness increase their share of the population, while replicators with below-average fitness dwindle.

This is a pretty natural way of modeling half of the mechanism Darwin and Wallace realized was behind evolution, the "selective retention" part — what it leaves out is "blind variation". Even with this omission, the replicator equation is a surprisingly interesting kind of dynamical system, especially when fitness is frequency-dependent, which opens up deep connections to evolutionary game theory. (ObBook.) Interestingly, however, Bayes is a very limited special case of the replicator equation, since fitness is frequency independent.
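Here is a toy R sketch of the identity (my own illustration, with made-up numbers; the paper itself works in much greater generality): three candidate coin biases, data from a coin matching none of them, and the update written once as Bayes's rule and once as a replicator step with the conditional likelihood as fitness.

set.seed(1)
theta <- c(0.2, 0.5, 0.9)                  # the hypotheses, i.e., the replicators
x <- rbinom(200, size = 1, prob = 0.6)     # truth is 0.6, so every model is wrong
bayes <- replicator <- rep(1/3, 3)         # uniform prior = initial population shares
for (xt in x) {
  fit <- if (xt == 1) theta else 1 - theta            # conditional likelihood = fitness
  bayes <- bayes * fit / sum(bayes * fit)             # prior times likelihood, renormalized
  replicator <- replicator * fit / sum(replicator * fit)  # share times (fitness / mean fitness)
}
all.equal(bayes, replicator)   # TRUE: the two updates are the same map
round(bayes, 3)                # essentially all weight on theta = 0.5, the
                               # hypothesis closest to the truth in relative entropy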

"Selective retention" is also the idea that lies behind reinforcement learning and Thorndike's law of effect. Crudely, these are all variations on the theme of "do more of what worked, and less of what didn't". Less crudely, there are at least three independent discoveries of how reinforcement learning, itself, leads to the replicator equation in the appropriate limit. So Bayesian updating isn't just a special case of evolutionary optimization; it's also something like habit formation.

Initially, I wrote this all up in the spring of 2003 and then set it aside, because, after making the formal connections, I didn't see what it was good for. Then, that summer, I went to the first "Science et Gastronomie" workshop, talked about this idea, and realized, from the conversations, that I actually could do something with it. The fitness function was going to end up being the relative entropy rate, and this would control where the posterior distribution concentrated in the long run. This would let me say something about the convergence of Bayesian learning with non-independent and even non-Markovian data, and also about what happens when the true distribution is not "in the support of the prior", i.e., when all the models really are wrong.

So I spent a lot of time in Lyon sweating over the asymptotic behavior of the integrated likelihood and so forth (literally, given the great heat-wave). By the end of the summer, I had versions of what are now Theorems 2, 3 and 4 in the paper. These say, respectively, that the posterior density goes to zero wherever fitness isn't maximized, at a rate given by the fitness difference; that eventually the posterior distribution concentrates on the global peaks of the fitness landscape; and that the posterior distribution in a subset of model space not including those peaks is driven by the highest fitness in the subset. All of this was pretty nice and I was fairly pleased with myself.

Then I saw that if my results were correct, Bayesian updating should always be consistent. So I put the thing aside for a while....

Eventually I figured out what I was doing wrong; I was presuming that all the points in model-space were equally well-behaved. I was explicitly assuming that the (log) likelihood would eventually converge to the relative entropy for each hypothesis. This is on the one hand mathematically harmless (it's asymptotic equipartition), and on the other hand statistically I can't see how any likelihood-based method can hope to converge unless performance over a sufficiently long past is indicative of future results. (This is why such methods do not solve the problem of induction.) But I was further assuming, implicitly, that there was no way for the likelihood of a hypothesis with very high relative entropy to also converge very slowly. That is, I was assuming the Bayesian learner could not acquire bad habits. But of course it can; the bad habits just need to seem like good ideas at first. In human terms:

No one ever forgets how to do something that's worked for them in the past. Just replacing it with another behavior can be hard enough, and the old behavior is still going to be lurking there underneath it. Thieves keep stealing. Liars keep lying. Drunks never forget about chemically modifying their nervous systems.
Similarly, the Bayesian learner never forgets about the model which matches the data perfectly — until it turns out that the model had just memorized the first 13,127 observations, and then repeats them forever. When that model crashes, however, there is always another one which memorized the first 26,254 observations...

What's needed is some restriction on the prior distribution which keeps it from putting too much weight on these bad hypotheses, though actually in non-parametric problems one doesn't want to give them strictly zero weight, because, hey, maybe a particular sequence of 13,127 observations really will repeat forever. (Metaphorically: a little lying, stealing and drinking is part of a complete life.) We are back to regularization, which has the duty of promoting virtue and suppressing vice.

The most common ways of doing this divide the space of models into distinct classes, and then use statistical learning theory or empirical process theory to gauge the risk of over-fitting within these constrained subspaces. Typical arguments involve things like showing that every hypothesis in the constrained set is close to one of a finite number of hypotheses which "cover" or "bracket" the space, which gives uniform convergence. Then one relaxes the constraint as more data arrives, according to some deterministic schedule ("the method of sieves") or to optimize the trade-off between actual performance and a bound on over-fitting ("structural risk minimization"), etc.

Existing proofs of Bayesian consistency in non-parametric problems basically work similarly. To simplify just a bit, the trick has two parts. The first is to find constraints on the hypotheses which are strong enough to ensure uniform convergence, but can be relaxed to include everything; this is what you'd do anyway if you wanted a consistent non-parametric procedure. (Some people don't explicitly posit constraint sets, but do other things with much the same effect.) The second is to construct a prior whose bias towards the most-constrained sets is strong enough to keep the wild, high-capacity parts of the hypothesis space from dominating the posterior distribution, but not so strong that the constraint can't be overcome with enough data.

This is what I ended up having to do. (There was a gap of several years between seeing that some constraint was needed, and seeing what constraint would work. I am, in fact, a sloth.) My constraint was, roughly, "the log likelihood converges at least this fast". Unfortunately, I wasn't able to express it in relatively nice terms, like covering numbers, though I suspect someone cleverer could replace it with something along those lines. (It's about one-sided deviations of time-averages from their limiting values, so it feels empirical-process-y.) Anyone who actually cares will be better served by reading the paper than by my trying to re-express it verbally.

I didn't figure out the right constraint to impose, and the right way to relax it, until the summer of 2008. (This was what was on my mind when I read Chris Anderson's ode to overfitting.) Once I did, everything else was pretty direct, especially since it turned out I could salvage most (but definitely not all) of what I'd done in Lyon. One of the bigger efforts in actually writing the paper was uniformly eliminating talk of replicators and fitness, except for a small appendix, in favor of more statistical jargon. (I hope I succeeded.) Adding an example, namely trying to use Markov chains to predict a sofic system, took about a day. This is an amusing case, because the posterior distribution never converges — you can always do strictly better by moving to Markov chains of higher order. Nonetheless, the (Hellinger) distance between the posterior predictive distribution and the actual predictive distribution goes to zero.
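For the curious, here is a rough R sketch in the same spirit (my own toy version, not the worked example from the paper): simulate the "even process", a standard sofic system in which runs of 1s always have even length, and fit binary Markov chains of increasing order by counting; the in-sample log-likelihood per symbol keeps creeping up with the order, because no finite-order chain captures the process exactly.

simulate.even <- function(n) {      # state A: free choice; state B: must finish a pair of 1s
  x <- integer(n); state <- "A"
  for (t in 1:n) {
    if (state == "A") {
      if (runif(1) < 0.5) x[t] <- 0 else { x[t] <- 1; state <- "B" }
    } else { x[t] <- 1; state <- "A" }
  }
  x
}
markov.loglik <- function(x, k) {   # maximized log-likelihood per symbol of an order-k chain
  n <- length(x)
  contexts <- sapply((k + 1):n, function(t) paste(x[(t - k):(t - 1)], collapse = ""))
  nexts <- x[(k + 1):n]
  ll <- 0
  for (ctx in unique(contexts)) {
    p1 <- mean(nexts[contexts == ctx])          # MLE of P(next symbol = 1 | context)
    ll <- ll + sum(log(ifelse(nexts[contexts == ctx] == 1, p1, 1 - p1)))
  }
  ll / (n - k)
}
set.seed(3)
x <- simulate.even(1e4)
round(sapply(1:6, function(k) markov.loglik(x, k)), 4)   # creeps upward with the order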

Having submitted this, I'm going to put it aside again until I hear from the referees, because there's a lot of other work which needs finishing. (When I do hear from the referees, I anticipate a certain amount of gnashing of teeth.) But in the meanwhile, I'll use this space to jot down some ideas.

Manual Trackback: 3 Quarks Daily

Thanks to Nicolas Della Penna, Shiva Kaul and Chris Wiggins for typo-spotting.

Enigmas of Chance; Self-Centered

Posted by crshalizi at January 30, 2009 22:47 | permanent link

January 29, 2009

Deep Thought (On the Ethical Relevance of Economic Analysis)

Restoring stolen goods to their rightful owners is not Pareto-improving.

The Dismal Science

Posted by crshalizi at January 29, 2009 09:08 | permanent link

January 20, 2009

"To choose our better history"

Now would be an excellent time to begin working like we are living in the early days of a better nation.

Let those who do justice and love mercy say amen.

The Beloved Republic; The Progressive Forces

Posted by crshalizi at January 20, 2009 23:11 | permanent link

January 10, 2009

Geoghegan for Congress

If, even a month ago, you had asked me who I'd most like to see elected to Congress, I would not have mentioned Tom Geoghegan, because it wouldn't have even occurred to me that it was possible. Had you asked me about him in particular, I would have replied something on the order of "We should be so lucky". But he's running in the special election to replace Rahm Emanuel, and, well, we the people should be so lucky. The estimable Kathy G. has explained why Geoghegan deserves the support of progressives everywhere. I have nothing to add, other than to say that (1) if you haven't read Which Side Are You On?, you should buy it right now and read it at once; and (2) I have just put my money where my mouth is.

The Progressive Forces; The Beloved Republic

Posted by crshalizi at January 10, 2009 21:45 | permanent link

But It's Only a Little Demon

Attention conservation notice: ~500 words of excessively cute foundations-of-statistical-mechanics geekery. Inspired by this post at The Statistical Mechanic.

I have here on the table before me my favorite classical-mechanical assemblage of interacting particles, with 2n degrees of freedom, n being a macroscopically large number. (The factor of 2 is both because there are always position and velocity degrees of freedom, and to avoid some factors of 1/2 later.) It is in turn part of a larger assemblage with many more degrees of freedom, say 2N. Both the smaller and larger assemblages are highly unstable dynamically, so I can expect statistical mechanics to work quite well. (Really, I can.) On the other hand, I presume that they are very thoroughly isolated from the rest of the universe, so I can ignore interactions with the outside. (Don't ask me how I know what's going on in there in that case, though.)

I have also an Aberdeen Mfg. Mk. II "Neat-fingered" Maxwellian demon, which is capable of instantaneously reversing all the velocities of the particles in the small assemblage (i.e., it can flip the sign of n velocity degrees of freedom). If I had a bigger research budget, I could have bought a Mk. V "Vast and Considerable" demon, which could reverse the whole assemblage's N velocity degrees of freedom, but I don't have to tell you about grants these days.

Now, with the Mk. V, I'd know what to expect: it's the old familiar myth about time's arrow running backwards: sugar spontaneously crystallizing out of sweetened coffee, forming granules and leaping out of the cup into the tea-spoon, etc. But the Mk. II isn't capable of reversing the arrow of time for the whole assemblage, just for part of it. And so there are N-n degrees of freedom in the larger assemblage whose arrow of time points the same way as before. So what happens?

My intuition is that at first the arrow of time is reversed in the small assemblage, leading to the local equivalent of coffee unsweetening. ("At first" according to who? Don't go there.) Eventually, however, interactions with the N-n unreversed degrees of freedom should bring the n degrees of freedom back into line. If interactions are spatially local, then I imagine the time-reversed region gradually shrinking. Mythologically: The sugar crystallizes and forms granules, say, and even starts to leap out of the cup, but neither the air molecules nor the spoon are in the right place at the right time to take them exactly back to the sugar-jar, so they spill and make a mess, etc. More generally, an observer within the larger assemblage will first see a small region where, bizarrely, things happen in reverse, then a kind of hard-to-describe crawling molecular chaos, and then a restoration of the ordinary macroscopic natural order, albeit from a weird starting point. But this guess may be excessively shaped by the fluctuation-dissipation theorem. Does a single arrow of time have to get established at all? If so, how long does it typically take? (Intuition again, this time from large deviations: exponential in 2n-N.) Can the n reversed degrees of freedom ever impose their direction on the whole assemblage?

Somebody must have already looked into all this. Where?

Update, later that afternoon: I was probably unconsciously remembering this post by Sean Carroll. (Sean was very polite in pointing this out.) Also, John Burke answers my final "where" question with "Budapest", which sounds about right.

Physics

Posted by crshalizi at January 10, 2009 13:45 | permanent link

January 08, 2009

Chaos, Complexity and Inference: 2009 Course Announcement

I will be teaching 36-462, "topics in statistics", in the spring. This is a special topics course for advanced undergraduates, intended to expose them to ideas they wouldn't see going through the ordinary curriculum; this year, like last, the subject will be "chaos, complexity and inference". It worked pretty well last time, though I am not sure what to make of the fact that half of those currently registered are graduate students from other departments...

Anyone interested in the readings, assignments and notes can follow them either at the class syllabus page, or from this post and its RSS feed. (Last year's post.) — There have been some requests (as in, more than one!) for podcasts of the lectures. If someone will point me at an idiot's guide to setting one up, and tell me about cheap but adequate microphones, I'm willing to try.

Description
This course will cover some key parts of modern theories of nonlinear dynamics ("chaos") and complex systems, and their connections to fundamental aspects of probability and statistics. By studying systems with many strongly-interacting components, students will learn how stochastic models can illuminate phenomena beyond the usual linear/Gaussian/independent realm, as well as gain a deeper understanding of why stochastic models work at all. Topics will include: chaos theory and nonlinear prediction; information; the distinction between randomness and determinism; self-organization and emergence; heavy-tailed and "scale-free" distributions; social and other complex networks, and the analysis of network data; interacting agents; and inference from simulations.
Full Syllabus
At the course webpage, together with links to readings
Venue
Tuesdays and Thursdays 12:00--1:20 in Scaife Hall 208. Office hours in 229C Baker Hall, Wednesdays 10--11 and Thursdays 4--5.
Required Textbooks
Gary William Flake, The Computational Beauty of Nature
John Miller and Scott Page, Complex Adaptive Systems
Leonard Smith, Chaos: A Very Short Introduction
Optional Textbooks
Peter Guttorp, Stochastic Modeling of Scientific Data
Paul Krugman, The Self-Organizing Economy
Andrew M. Fraser, Hidden Markov Models and Dynamical Systems
W. John Braun and Duncan J. Murdoch, A First Course in Statistical Programming with R (Use of R is not required, but ask before using other languages.)
Prerequisites
A previous course in mathematical statistics (such as 36-310, 36-401, or 36-625/626) and a course in probability including random processes (such as 36-217, 36-225/226, 36-410, or 36-625/626)
or consent of instructor.
(See the handout for more on required background.)
Some programming experience will be extremely helpful.

Corrupting the Young; Complexity; Enigmas of Chance

Posted by crshalizi at January 08, 2009 23:59 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems