October 21, 2013

Simulation III: Monte Carlo and Markov Chain Monte Carlo (Introduction to Statistical Computing)

Lecture 16: The Monte Carlo principle for numerical integrals: write your integral as an expectation, take a sample. Examples. Importance sampling: draw from a distribution other than the one you really want, then weight the sample values. Markov chain Monte Carlo for sampling from a distribution we do not completely know: the Metropolis algorithm. Gibbs sampling. Bayesian inference via MCMC.
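A minimal R sketch of two of these ideas, plain Monte Carlo integration and a random-walk Metropolis sampler; the integrand, target density and tuning constants are arbitrary illustrations, not anything from the handouts.

    # Monte Carlo principle: write the integral of exp(-x^2) over [0,1]
    # as E[exp(-U^2)] for U ~ Unif(0,1), then average over a big sample
    u <- runif(1e5)
    mean(exp(-u^2))    # compare integrate(function(x) exp(-x^2), 0, 1)

    # Random-walk Metropolis: sample from a density known only up to a constant
    metropolis <- function(log.target, n, x0, scale = 1) {
      x <- numeric(n)
      x[1] <- x0
      for (t in 2:n) {
        proposal <- x[t - 1] + rnorm(1, sd = scale)
        log.ratio <- log.target(proposal) - log.target(x[t - 1])
        x[t] <- if (log(runif(1)) < log.ratio) proposal else x[t - 1]
      }
      return(x)
    }
    # e.g., a standard Gaussian, via its unnormalized log-density
    chain <- metropolis(function(x) -x^2 / 2, n = 1e4, x0 = 0)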

Readings: Handouts on Markov Chains and Monte Carlo, and on Markov Chain Monte Carlo

Optional readings: Charles Geyer, "Practical Markov Chain Monte Carlo", Statistical Science 7 (1992): 473--483; "One Long Run"; Burn-In is Unnecessary; On the Bogosity of MCMC Diagnostics.

Update, 22 December: If you do read Geyer, it's also worth reading two posts by Andrew Gelman (A Centipede Many Times Over and A Tale of Two Discussion Papers), and Gelman and Rubin's "Inference from Iterative Simulation Using Multiple Sequences" (Statistical Science 7 (1992): 457--472). Thanks to Andy for reminding me (politely!) about these pieces.

Introduction to Statistical Computing

Posted by crshalizi at October 21, 2013 13:58 | permanent link

Simulation II: Markov Chains (Introduction to Statistical Computing)

Lecture 15: Combining multiple dependent random variables in a simulation; ordering the simulation to do the easy parts first. Markov chains as a particular example of doing the easy parts first. The Markov property. How to write a Markov chain simulator. Verifying that the simulator works by looking at conditional distributions. Variations on Markov models: hidden Markov models, interacting processes, continuous time, chains with complete connections. Asymptotics of Markov chains via linear algebra; the law of large numbers (ergodic theorem) for Markov chains: we can approximate expectations as soon as we can simulate.
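A minimal R sketch of the kind of simulator the lecture describes, for a finite-state chain; the three-state transition matrix is a made-up example.

    # Simulate a finite-state Markov chain from a transition matrix
    # (rows = current state, columns = next state; each row sums to 1)
    rmarkovchain <- function(n, transition, initial.state = 1) {
      states <- numeric(n)
      states[1] <- initial.state
      for (t in 2:n) {
        states[t] <- sample(ncol(transition), size = 1,
                            prob = transition[states[t - 1], ])
      }
      return(states)
    }
    P <- matrix(c(0.9, 0.1, 0.0,
                  0.2, 0.7, 0.1,
                  0.0, 0.3, 0.7), nrow = 3, byrow = TRUE)
    x <- rmarkovchain(1e4, P)
    table(x) / length(x)   # long-run frequencies approach the invariant distribution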

Readings: Handouts on Markov chains and Monte Carlo

Introduction to Statistical Computing

Posted by crshalizi at October 21, 2013 13:57 | permanent link

Simulation I: Generating Random Variables (Introduction to Statistical Computing)

Lecture 14: Why simulate? Generating random variables as first step. The built-in R commands: rnorm, runif, etc.; sample. Some uses of sampling: permutation tests; bootstrap standard errors and confidence intervals. Transforming uniformly-distributed random variables into other distributions: the quantile trick; the rejection method; illustration of the rejection method. Understanding pseudo-random number generators: irrational rotations; the Arnold cat map as a toy example of an unstable dynamical system; illustrations of the Arnold cat map. Controlling the random number seed.
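Minimal R sketches of the quantile trick and the rejection method; the exponential and Beta(2,2) targets are arbitrary examples, not necessarily the ones from lecture.

    # Quantile trick: if U ~ Unif(0,1), then F^{-1}(U) has distribution function F
    # e.g., the Exponential(1) quantile function is -log(1 - u)
    u <- runif(1e5)
    x.exp <- -log(1 - u)              # should look just like rexp(1e5)

    # Rejection method: propose from Unif(0,1), accept with probability f(x) / M
    f <- function(x) dbeta(x, 2, 2)   # target density, bounded above by M = 1.5 on [0, 1]
    proposals <- runif(1e5)
    accepted <- proposals[runif(1e5) < f(proposals) / 1.5]
    hist(accepted, probability = TRUE); curve(f, add = TRUE)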

Readings: Matloff, chapter 8; The R Cookbook, chapter 8

Introduction to Statistical Computing

Posted by crshalizi at October 21, 2013 13:56 | permanent link

"Significance Tests for Adaptive Modelling" (Today at the Statistics Seminar)

Attention conservation notice: Late notice of a very technical presentation about theoretical statistics in a city you don't live in.

Today's speaker needs no introduction for those interested in modern, high-dimensional statistics (but will get an introduction anyway):

Robert Tibshirani, "Significance Tests for Adaptive Modelling"
Abstract: In this talk we consider testing the significance of the terms in a fitted regression, fit via the lasso. We propose a novel test statistic for this problem, and show that it has a simple asymptotic null distribution. This work builds on the least angle regression approach for fitting the lasso, and the notion of degrees of freedom for adaptive models (Efron 1986) and for the lasso (Efron et al. 2004; Zou et al. 2007). We give examples of this procedure and discuss extensions to generalized linear models. In the second part of the talk we will discuss extensions of this work to graphical models, specifically inference for connected components.
Joint work with Richard Lockhart, Jon Taylor, Ryan Tibshirani, and Max Jacob Grazier G'Sell.
Time and place: 4:30--5:30 pm on Monday, 21 October 2013, in Baker Hall 136A

As always, the talk is free and open to the public.

Enigmas of Chance

Posted by crshalizi at October 21, 2013 13:55 | permanent link

October 11, 2013

Midterm Exam (Introduction to Statistical Computing)

Midterm Exam: eight questions about thirteen lines of code.

Introduction to Statistical Computing

Posted by crshalizi at October 11, 2013 17:48 | permanent link

Split, Apply, Combine: Using plyr (Introduction to Statistical Computing)

Lecture 13, Split/apply/combine II: using plyr. Abstracting the split/apply/combine pattern: using a single function to appropriately split up the input, apply the function, and combine the results, depending on the type of input and output data. Syntax details. Examples: standardizing measurements for regularly-sampled spatial data; standardizing measurements for irregularly-sampled spatial data; more fun with strikes and left-wing politics. Limitations of the split/apply/combine pattern.
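A small illustration of the plyr version of the pattern; the strikes data frame below is a made-up stand-in, not the actual data set from lecture.

    library(plyr)
    # A made-up stand-in for the strikes data: one row per (country, year)
    strikes <- data.frame(country = rep(c("A", "B"), each = 5),
                          year = rep(2001:2005, 2),
                          strike.volume = rpois(10, lambda = 20))
    # Standardize within each country: split by country, apply, recombine
    strikes.std <- ddply(strikes, .(country), transform,
                         volume.std = (strike.volume - mean(strike.volume)) / sd(strike.volume))
    # Or collapse each group to a one-row summary
    ddply(strikes, .(country), summarize, mean.volume = mean(strike.volume))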

Shorter lecture 13: The lecturer is a gushing Hadley Wickham fanboy.

(This week's lectures are ripped off from slides by Vince Vu, with permission.)

Introduction to Statistical Computing

Posted by crshalizi at October 11, 2013 17:47 | permanent link

Split, Apply, Combine: Using Basic R (Introduction to Statistical Computing)

Lecture 12: Design patterns and their benefits: clarity on what is to be done, flexibility about how to do it, ease of adapting others' solutions. The split/apply/combine pattern: divide big structured data sets up into smaller, related parts; apply the same analysis to each part independently; combine the results of the analyses. Trivial example: row and column means. Further examples. Iteration as a verbose, painful and clumsy implementation of split/apply/combine. Tools for split/apply/combine in basic R: the apply function for arrays, lapply for lists, mapply, etc.; split. Detailed example with a complicated data set: the relation between strikes and parliamentary politics.
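A small illustration of the same pattern with base R tools; the strikes data frame is again a made-up stand-in.

    # The trivial example: row and column means via apply
    m <- matrix(rnorm(20), nrow = 4)
    apply(m, 1, mean)    # split into rows, apply mean, combine into a vector
    apply(m, 2, mean)    # same for columns (see also rowMeans, colMeans)

    # split + sapply for group-wise calculations
    strikes <- data.frame(country = rep(c("A", "B"), each = 5),
                          strike.volume = rpois(10, lambda = 20))
    sapply(split(strikes$strike.volume, strikes$country), mean)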

Introduction to Statistical Computing

Posted by crshalizi at October 11, 2013 17:46 | permanent link

Homework: I Made You a Likelihood Function, But I Ate It (Introduction to Statistical Computing)

In which we continue to practice using functions as arguments and as return values, while learning something about the standard error of maximum likelihood estimates, and about the modularity of methods like the jack-knife.

Assignment

Introduction to Statistical Computing

Posted by crshalizi at October 11, 2013 17:45 | permanent link

Lab: I Can Has Likelihood Surface? (Introduction to Statistical Computing)

In which we practice passing functions as arguments to other functions, by way of an introduction to likelihood and its maximization; and, incidentally, work more with plotting in R.

Assignment

Introduction to Statistical Computing

Posted by crshalizi at October 11, 2013 17:44 | permanent link

Abstraction and Refactoring (Introduction to Statistical Computing)

Lecture 11: Abstraction as a way to make programming more friendly to human beings. Refactoring as a form of abstraction. The rectification of names. Consolidation of related values into objects. Extracting common operations. Defining general operations. Extended example with the jackknife.
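A minimal sketch of the sort of refactoring the lecture aims at: a jackknife whose estimator is an argument rather than being hard-coded. The details here are mine, not necessarily the lecture's.

    # A general jackknife standard error: the estimator is any function of a data vector
    jackknife.se <- function(estimator, x) {
      n <- length(x)
      leave.one.out <- sapply(1:n, function(i) estimator(x[-i]))
      sqrt(((n - 1)^2 / n) * var(leave.one.out))
    }
    jackknife.se(mean, rnorm(100))     # for the mean, this reproduces the usual sd/sqrt(n)
    jackknife.se(median, rnorm(100))   # the same code now serves a different estimator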

Reading: sections 14.1--14.3 in Matloff.

Introduction to Statistical Computing

Posted by crshalizi at October 11, 2013 17:43 | permanent link

Simple Optimization (Introduction to Statistical Computing)

Lecture 10: Basics from calculus about minima. Taylor series. Gradient descent and Newton's method. Curve-fitting by optimization. Illustrations with optim and nls. R for examples.
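A minimal illustration of the two workhorse functions on a made-up curve-fitting problem; the model and starting values are arbitrary.

    # Made-up data from y = 2 * x^0.5 + noise
    x <- runif(100, 1, 10)
    y <- 2 * sqrt(x) + rnorm(100, sd = 0.1)

    # General-purpose optimization: minimize the mean squared error with optim
    mse <- function(theta) mean((y - theta[1] * x^theta[2])^2)
    optim(par = c(1, 1), fn = mse)$par        # should end up near c(2, 0.5)

    # Special-purpose curve fitting: nonlinear least squares with nls
    nls(y ~ a * x^b, start = list(a = 1, b = 1))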

Reading: recipes 13.1 and 13.2 in The R Cookbook; chapters I.1, II.1 and II.2 in Red Plenty

Introduction to Statistical Computing

Posted by crshalizi at October 11, 2013 17:42 | permanent link

Homework: Dimensions of Anomaly (Introduction to Statistical Computing)

In which we continue to practice the arts of debugging and testing, while learning about making our code more general, handling awkward special cases, and pondering what it means to say that an observation is an outlier.

Assignment, data, deliberately-buggy code

Introduction to Statistical Computing

Posted by crshalizi at October 11, 2013 17:41 | permanent link

Lab: Testing Our Way to Outliers (Introduction to Statistical Computing)

In which we use Tukey's rule for identifying outliers as an excuse to learn about debugging and testing.

Assignment

Introduction to Statistical Computing

Posted by crshalizi at October 11, 2013 17:40 | permanent link

Functions as Objects (Introduction to Statistical Computing)

Lecture 9: Functions in R are objects, just like everything else, and so can be both arguments to and return values of functions, with no special machinery required. Examples from math (especially calculus) of functions with other functions as arguments. Some R syntax relating to functions. Examples with curve. Using sapply to extend functions of single numbers to functions of vectors; its combination with curve. We write functions with lower-level functions as arguments to abstract out a common pattern of operations. Example: calculating a gradient. Numerical gradients by first differences, done two different ways. (Limitations of taking derivatives by first differences.) Incorporating this as a part of a larger algorithm, such as gradient descent. Using adapters, like wrapper functions and anonymous functions, to fit different functions together. Examples from math (especially calculus) of operators, which turn one function into another. The importance of scoping when using functions as return values. Example: creating a linear predictor. Example: implementing the gradient operator (two different ways). Example: writing surface, as a two-dimensional analog to the standard curve. The use of eval and substitute to control when and in what context an expression is evaluated. Three increasingly refined versions of surface, employing eval.

R for examples.
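As a minimal taste of the pattern (and not the lecture's own code): a numerical gradient by first differences, with the function to be differentiated passed in as an argument.

    # Numerical gradient of f at x by forward first differences
    gradient <- function(f, x, h = 1e-6) {
      sapply(seq_along(x), function(i) {
        x.plus.h <- x
        x.plus.h[i] <- x[i] + h
        (f(x.plus.h) - f(x)) / h
      })
    }
    gradient(function(v) sum(v^2), c(1, 2, 3))   # should be close to c(2, 4, 6)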

Introduction to Statistical Computing

Posted by crshalizi at October 11, 2013 17:39 | permanent link

Triple Header (Next Week at the Statistics / Machine Learning Seminars)

Attention conservation notice: Only relevant if you (1) really care about statistics, and (2) will be in Pittsburgh on Monday.

Through a fortuitous concourse of calendars, we will have three outstanding talks on Monday, 14 October 2013. In chronological order:

Michael I. Jordan, "On the Computational and Statistical Interface and 'Big Data'" (special joint statistics/ML seminar)
Abstract: The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend the statistical and computational sciences. That classical perspectives from these fields are not adequate to address emerging problems in "Big Data" is apparent from their sharply divergent nature at an elementary level---in computer science, the growth of the number of data points is a source of "complexity" that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. Indeed, if data are a data analyst's principal resource, why should more data be burdensome in some sense? Shouldn't it be possible to exploit the increasing inferential strength of data at scale to keep computational complexity at bay? I present three research vignettes that pursue this theme, the first involving the deployment of resampling methods such as the bootstrap on parallel and distributed computing platforms, the second involving large-scale matrix completion, and the third introducing a methodology of "algorithmic weakening," whereby hierarchies of convex relaxations are used to control statistical risk as data accrue.
(Joint work with Venkat Chandrasekaran, Ariel Kleiner, Lester Mackey, Purna Sarkar, and Ameet Talwalkar.)
Time and place: Noon, Rangos 2, University Center
David Choi, "Testing for Coordination and Peer Influence in Network Data" (machine learning and the social sciences seminar)
Abstract: Many tests have been proposed for the detection of "viral" peer influence in observational studies involving social network data. However, these tests typically make strong (and sometimes unstated) modeling assumptions on participant behavior. We propose a test which holds under less restrictive assumptions, and which controls for unobserved homophily variables that are unaccounted for in existing methods. We discuss conditions under which the test is valid, and give preliminary results on its effectiveness.
Time and place: 3 pm in Gates Hall 4405
Genevera Allen, "Sparse and Functional Principal Components Analysis"
Abstract: Regularized principal components analysis, especially Sparse PCA and Functional PCA, has become widely used for dimension reduction in high-dimensional settings. Many examples of massive data, however, may benefit from estimating both sparse AND functional factors. These include neuroimaging data where there are discrete brain regions of activation (sparsity) but these regions tend to be smooth spatially (functional). Here, we introduce an optimization framework that can encourage both sparsity and smoothness of the row and/or column PCA factors. This framework generalizes many of the existing approaches to Sparse PCA, Functional PCA and two-way Sparse PCA and Functional PCA, as these are all special cases of our method. In particular, our method permits flexible combinations of sparsity and smoothness that lead to improvements in feature selection and signal recovery as well as more interpretable PCA factors. We demonstrate our method on simulated data and a neuroimaging example of EEG data. This work provides a unified optimization framework for regularized PCA that can form the foundation for a cohesive approach to regularization in high-dimensional multivariate analysis.
Time and place: 4 pm in Doherty Hall 1212

As always, the talks are free and open to the public.

Enigmas of Chance; Networks

Posted by crshalizi at October 11, 2013 17:27 | permanent link

October 01, 2013

October First Is Too Late

Have some delayed book-chat from April, May, June, July, August and September.

Posted by crshalizi at October 01, 2013 17:50 | permanent link

September 30, 2013

Books to Read While the Algae Grow in Your Fur, September 2013

Betsy Sinclair, The Social Citizen: Peer Networks and Political Behavior
Introspectively, it seems pretty obvious that our political behavior and attitudes are influenced by our friends and family; this is an attempt to document and quantify that influence, in the context of modern American politics. Sinclair looks at influence on going to the polls, on donating to campaigns, on who we vote for, and on party identification.
The evidence on voter turnout is experimental: randomly encouraging one person to vote makes those in the same household, who were not experimented on, more likely to vote themselves. This has the same compelling logic as any good experiment: manipulating A makes B move, so there must be some mechanism connecting A to B.
For donations, candidate choice, and party identification, Sinclair's evidence is observational, from surveys. These studies generally do not, in fact, collect social networks, but rather rely on people reporting on attributes of their networks and their friends and family, or even proxying social networks by demographic information about neighborhoods. What Sinclair reports here is intuitively plausible, but all of it is vulnerable to confounding due to homophily. I don't have a better idea about how to address these problems in observational studies, and Sinclair and her co-authors have done about[1] as well as anyone could have with this data, but it's still only suggestive.
[1]: Assume for the sake of argument that every attribute on which people are homophilous is accurately measured by the survey and included in the regression. To support the sort of strong scientific interpretation that Sinclair (reasonably!) wants to give the regression results, we'd also have to know that the functional form of the regression was well-specified. This can be checked, but apparently wasn't. (This lapse is no less serious for being so very common.) I also think there is a logical error in the discussion of the panel data on candidate preferences (pp. 95ff). I see no reason why homophily couldn't lead to voters selecting friends who will respond similarly to what happens during a political campaign, which would tend to make their vote choices increasingly aligned. ^
(Disclaimer: Prof. Sinclair was kind enough to invite me to a conference she organized on this subject earlier this year. I should maybe also mention that I'm happy with how this book cites my work.)
Manning Marable, Malcolm X: A Life of Reinvention
A really magnificent biography of a remarkable American, by a great scholar coming from a viewpoint close enough to be sympathetic but distant enough to be sharply critical.
Tarquin Hall, The Case of the Missing Servant: From the Files of Vish Puri, Most Private Investigator; The Case of the Man Who Died Laughing; The Case of the Deadly Butter Chicken
Mind candy: humorous murder mysteries from contemporary India, principally Delhi.

Books to Read While the Algae Grow in Your Fur; Pleasures of Detection, Portraits of Crime; The Beloved Republic; Networks; Commit a Social Science

Posted by crshalizi at September 30, 2013 23:59 | permanent link

September 24, 2013

"Binomial Likelihoods and the Polya-Gamma Distribution" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about computational statistics, and (2) will be in Pittsburgh next Monday.

Having a talk on Bayesian computational statistics by a Dr. Scott worked so well last time, we're doing it again:

James Scott, "Binomial Likelihoods and the Polya-Gamma Distribution"
Abstract: Bayesian inference for the logistic regression model has long been recognized as a hard problem. By comparison, Bayesian inference for the probit model is much easier, owing to the simple latent-variable method of Albert and Chib (1993) for posterior sampling.
In the two decades since the work of Albert and Chib on the probit model, there have been many attempts to apply a similar computational strategy to the logit model. These efforts have had mixed results: all such methods are either approximate, or are significantly more complicated than the Albert/Chib method. Perhaps as a result, the Bayesian treatment of the logit model has not seen widespread adoption by non-statisticians in the way that, for example, the Bayesian probit model is used extensively in political science, market research, and psychometrics. The lack of a standard computational approach also makes it more difficult to use the logit link in the kind of rich hierarchical models that have become routine in Bayesian statistics.
In this talk, I propose a new latent-variable representation for binomial likelihoods. It appeals to a new class of distributions, called the Polya-Gamma family. Although our method involves a different missing-data mechanism from that of Albert and Chib, it is nonetheless a direct analogue of their construction, in that it is both exact and simple. I will describe the Polya-Gamma method in detail; demonstrate its superior efficiency; and highlight a few examples where it has proven helpful. I will conclude by drawing an interesting connection with variational methods.
Joint work with Jesse Windle and Nicholas Polson.
Time and place: 4:30--5:30 pm on Monday, 30 September 2013, place TBA (note unusual time)

Enigmas of Chance

Posted by crshalizi at September 24, 2013 16:00 | permanent link

Causation, Prediction and Search +20

Attention conservation notice: Log-rolling promotion a conference at the intersection of the margins of several academic fields.

I've written before about how Causation, Prediction and Search was one of the books which awakened my interest in machine learning and modern statistics. The point of the book was to explore when and how one can actually discover causal relations from observations. The CMU philosophy department being what it is, they did this by devising computational representations of causal structure, and effective procedures for causal discovery, and proving that the procedures would work under specific (sane) conditions. This message has shaped my research and my teaching ever since. It's one of the reasons I was so eager to come to CMU.

Of course, for good pragmatists, the proof of any method is in its results, and that's why I'm very pleased to help publicize this:

Case Studies of Causal Discovery with Model Search
Description: Computer scientists, statisticians, and philosophers have created a precise mathematical framework for representing causal systems called "Graphical Causal Models." This framework has supported the rigorous description of causal model spaces and the notion of empirical indistinguishability/equivalence within such spaces, which has in turn enabled computer scientists to develop asymptotically reliable model search algorithms for efficiently searching these spaces. The conditions under which these methods are practically useful in applied science are the topic of this workshop. The workshop will bring together scholars from genetics, biology, economics, fMRI-based cognitive neuroscience, climate research, education research, and several other disciplines, all of whom have successfully applied computerized search for causal models toward a scientifically challenging problem. The goals for the workshop are: (1) to identify strategies for applying causal model search to diverse domain-specific scientific questions; (2) to identify and discuss methodological challenges that arise when applying causal model search to real-world scientific problems; and (3) to take concrete steps toward creating an interdisciplinary community of researchers interested in applied causal model search. We welcome junior scholars and graduate students, and we will host a free introductory tutorial on model search the first morning of the workshop.
Time and place: 25--27 October 2013, Carnegie Mellon University

(Yet another sign of the passage of time is that one of the organizers is Lizzie Silver, who helped perpetrate this when she took 36-350.)

Constant Conjunction Necessary Connexion; Kith and Kin

Posted by crshalizi at September 24, 2013 15:00 | permanent link

September 23, 2013

The Scope of Names (Introduction to Statistical Computing)

Undelivered optional lecture on Scope: R looks for the values of names in the current environment; if it cannot find a value, it looks for the name in the environment which spawned this one, and so on up the tree to the common, global environment. Assignment is modifying the name/value association list which represents the environment. The scope of a name is limited by the current environment. Implications: changes within the current scope do not propagate back to the larger environments; changes in the larger environment do propagate to all smaller ones which it encloses, unless over-ridden by local names. Subtlety: the larger environment for a function is the one in which it was defined, not the one in which it is called. Some implications for design. Examination of the last homework from this stance.
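A small illustration of the two key points: assignments inside a function do not propagate back out, and a function looks up names in the environment where it was defined.

    x <- 7
    f <- function() { x <- 0; x }
    f()          # 0: the assignment creates a local x
    x            # still 7: the change did not propagate back out

    make.adder <- function(a) { function(b) a + b }
    add3 <- make.adder(3)    # a = 3 lives in the environment where the inner function was defined
    add3(10)                 # 13, even though no "a" exists in the global environment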

(I've decided not to actually give this lecture this year, in the interest of fitting in more substantive statistical and data-processing content, but also to post it for reference.)

Introduction to Statistical Computing

Posted by crshalizi at September 23, 2013 11:30 | permanent link

Debugging (Introduction to Statistical Computing)

Lecture 8, Debugging: Debugging as differential diagnosis: characterize the bug, localize it in the code, try corrections. Tactics for characterizing the bug. Tactics for localizing the bug: traceback, print, warning, stopifnot. Test cases and dummy input generators. Interactive debuggers. Programming with an eye to debugging: writing code with comments and meaningful names; designing the code in a top-down, modular, functional manner; writing and using tests. A hint at the exception-handling system.
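Minimal illustrations of two of the localization tools, stopifnot and traceback; the buggy function is a made-up example.

    # stopifnot as a cheap sanity check that fails early and loudly
    half <- function(x) {
      stopifnot(is.numeric(x))
      x / 2
    }
    # half("6")      # errors immediately at the stopifnot, not somewhere downstream

    # traceback() after an error reports the chain of calls that led to it
    inner.fn <- function(x) x - "one"        # deliberately buggy
    outer.fn <- function(x) inner.fn(x)
    # outer.fn(4)    # Error: non-numeric argument to binary operator
    # traceback()    # shows the error arose in inner.fn, called from outer.fn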

Introduction to Statistical Computing

Posted by crshalizi at September 23, 2013 10:30 | permanent link

September 20, 2013

Lab: Like a Jackknife to the Heart (Introduction to Statistical Computing)

In which we meet the jackknife, by way of seeing how much error there is in our estimates from the last lab.

Lab 4 (R)

Introduction to Statistical Computing

Posted by crshalizi at September 20, 2013 10:30 | permanent link

September 19, 2013

Homework: Standard Errors of the Cat Heart (Introduction to Statistical Computing)

In which we meet the parametric bootstrap, traveling incognito, and probe the precision of our estimation method from the last lab, by seeing how well it would work when the model is true and we know the parameters.


Assignment

Introduction to Statistical Computing

Posted by crshalizi at September 19, 2013 11:55 | permanent link

September 18, 2013

Testing (Introduction to Statistical Computing)

Lecture 7: Our code implements a method for solving problems we expect to encounter in the future; but why should we trust those solutions? We establish the reliability of the code by testing it. To respect the interfaces of the code, we test the substance of the answers, not the procedure used to obtain them, even though it is the reliability of the procedure we ultimately care about. We test both for the actual answer in particular cases and by cross-checking different uses of the same code which should lead to the same answer. Because we do not allow our tests to give us any false alarms, their power to detect errors is limited, and must be focused at particular kinds of errors. We make a virtue of necessity by using a diverse battery of tests, and shaping the tests so that they tell us where errors arise. The testing-programming cycle alternates between writing code and testing its correctness, adding new tests as new errors are discovered. The logical extreme of this is test-driven development, where tests represent the specification of the software's behavior in terms of practical consequences. Drawbacks of testing. Some pointers to more advanced tools for writing, maintaining and using tests in R.
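A minimal illustration of the idea in base R, testing the substance of the answers against known cases and cross-checks; the function under test is a toy.

    # Toy function under test
    sample.var <- function(x) sum((x - mean(x))^2) / (length(x) - 1)

    # Known answer: a constant vector has zero variance
    stopifnot(sample.var(c(1, 1, 1)) == 0)
    # Cross-check: a different route to the same answer, via the built-in var()
    stopifnot(isTRUE(all.equal(sample.var(1:10), var(1:10))))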

(Why yes, this lecture was something of a lay sermon on epistemology.)

Introduction to Statistical Computing

Posted by crshalizi at September 18, 2013 10:30 | permanent link

September 16, 2013

Top-Down Design (Introduction to Statistical Computing)

Lecture 6: Top-down design is a recursive heuristic for solving problems by writing functions: start with a big-picture view of the problem; break it into a few big sub-problems; figure out how to integrate the solutions to each sub-problem; and then repeat for each part.

Top-down design forces you to think not just about the problem, but also about the method of solution, i.e., it forces you to think algorithmically; this is why it deserves to be part of your education in the liberal arts.

Exemplification: how we could write the lm function for linear regression, if it did not exist and it were necessary to invent it.
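A minimal sketch of where that exemplification ends up, with the linear-algebra core separated from the bookkeeping; this ignores formulas, factors, and all the checking the real lm does.

    # Innermost sub-problem: least-squares coefficients from a design matrix
    least.squares.coefs <- function(X, y) solve(t(X) %*% X, t(X) %*% y)

    # Top-level wrapper handles the easier bookkeeping (here, adding an intercept column)
    my.lm <- function(x, y) {
      X <- cbind(intercept = 1, x)
      least.squares.coefs(X, y)
    }
    my.lm(cars$speed, cars$dist)   # compare coef(lm(dist ~ speed, data = cars))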

Additional optional reading: Herbert Simon, The Sciences of the Artificial.

Introduction to Statistical Computing

Posted by crshalizi at September 16, 2013 10:30 | permanent link

September 14, 2013

"Bayes and Big Data" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you care a lot about computational statistics.

For our first seminar of the year, we are very pleased to have a talk which will combine two themes close to the heart of the statistics department:

Steve Scott, "Bayes and Big Data"
Abstract: A useful definition of "big data" is data that is too big to fit on a single machine, either because of processor, memory, or disk bottlenecks. Graphics processing units can alleviate the processor bottleneck, but memory or disk bottlenecks can only be alleviated by splitting "big data" across multiple machines. Communication between large numbers of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging the individual Monte Carlo draws. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single machine algorithm for a very long time. Examples of consensus Monte Carlo will be shown for simple models where single-machine solutions are available, for large single-layer hierarchical models, and for Bayesian additive regression trees (BART).
Time and place: 4--5 pm on Monday, 16 September 2013, in 1212 Doherty Hall

As always, the talk is free and open to the public.

— A slightly cynical historical-materialist take on the rise of Bayesian statistics is that it reflects a phase in the development of the means of computation, namely the PC era. The theoretical or ideological case for Bayesianism was pretty set by the early 1960s, say with Birnbaum's argument for the likelihood principle[1]. It nonetheless took a generation or more for Bayesian statistics to actually become common. This is because, under the material conditions of the early 1960s, such ideas could only be defended and not applied. What changed this was not better theory, or better models, or a sudden awakening to the importance of shrinkage and partial pooling. Rather, it became possible to actually calculate posterior distributions. Specifically, Monte Carlo methods developed in statistical mechanics permitted stochastic approximations to non-trivial posteriors. These Monte Carlo techniques quickly became (pardon the expression) hegemonic within Bayesian statistics, to the point where I have met younger statisticians who thought Monte Carlo was a Bayesian invention[2]. One of the ironies of applied Bayesianism, in fact, is that nobody actually knows the posterior distribution which supposedly represents their beliefs, but rather (nearly[3]) everyone works out that distribution by purely frequentist inference from Monte Carlo samples. ("How do I know what I think until I see what the dice say?", as it were.)

So: if you could do Monte Carlo, you could work out (approximately) a posterior distribution, and actually do Bayesian statistics, instead of talking about it. To do Monte Carlo, you needed enough computing power to be able to calculate priors and likelihoods, and to do random sampling, in a reasonable amount of time. You needed a certain minimum amount of memory, and you needed clock speed. Moreover, to try out new models, to tweak specifications, etc., you needed to have this computing power under your control, rather than being something expensive and difficult to access. You needed, in other words, a personal computer, or something very like it.

The problem now is that while our computers keep getting faster, and their internal memory keeps expanding, our capacity to generate, store, and access data is increasing even more rapidly. This is a problem if your method requires you to touch every data point, and especially a problem if you not only have to touch every data point but do all possible pairwise comparisons, because, say, your model says all observations are dependent. This raises the possibility that Bayesian inference will become computationally infeasible again in the near future, not because our computers have regressed but because the size and complexity of interesting data sets will have rendered Monte Carlo infeasible. Bayesian data analysis would then have been a transient historical episode, belonging to the period when a desktop machine could hold a typical data set in memory and thrash through it a million times in a weekend.

Of course, I don't know that Bayesian inference is doomed to become obsolete because it will grow computationally intractable. One possibility is that "Bayesian inference" will be redefined in ways which depart further and further from the noble-Savage ideal, but are computationally tractable — variational Bayes, approximate Bayesian computation, and the generalized updating of Bissiri et al. are three (very different) moves in that direction. Another possibility is that algorithm designers are going to be clever enough to make distributed Monte Carlo approximations for posteriors as feasible as, say, a distributed bootstrap. This is, implicitly, the line Scott is pursuing. I wish him and those like him every success; whatever the issues with Bayesianism and some of its devotees, the statistical world would lose something valuable if Bayes as we know it were to diminish into a relic.

Update, 16 September 2013: It apparently needs saying that the ill-supported speculations here about the past and future of Bayesian computing are mine, not Dr. Scott's.

Update, 18 December 2013: "Asymptotically Exact, Embarrassingly Parallel MCMC" by Neiswanger et al. (arxiv:1311.4780) describes and analyses a very similar scheme to that proposed by Dr. Scott.

Manual trackback: Homoclinic Orbit

[1]: What Birnbaum's result actually proves is another story for another time; in the meanwhile, see Mayo, Evans and Gandenberger. ^

[2]: One such statistician persisted in this belief after reading Geyer and Thompson, and even after reading Metropolis et al., though there were other issues at play in his case. ^

[3]: The most interesting exception to this I know of is Rasmussen and Ghahramani's "Bayesian Monte Carlo" (NIPS, 2002). But despite its elegance and the reputations of its authors, it's fair to say this work has not had much impact. ^

Enigmas of Chance

Posted by crshalizi at September 14, 2013 23:22 | permanent link

September 13, 2013

Homework: Hitting Bottom and Calling for a Shovel (Introduction to Statistical Computing)

In which we see how to minimize the mean squared error when there are two parameters, in the process learning about writing functions, decomposing problems into smaller steps, testing the solutions to the smaller steps, and minimization by gradient descent.

Assignment

Introduction to Statistical Computing

Posted by crshalizi at September 13, 2013 11:30 | permanent link

Lab: Of Big- and Small- Hearted Cats (Introduction to Statistical Computing)

In which we practice the arts of writing functions and of estimating distributions, while contemplating just how little room there is in the heart of a cat.

Lab

Introduction to Statistical Computing

Posted by crshalizi at September 13, 2013 10:30 | permanent link

September 11, 2013

Writing Multiple Functions (Introduction to Statistical Computing)

Lecture 5: Using multiple functions to solve multiple problems; to sub-divide awkward problems into more tractable ones; to re-use solutions to recurring problems. Value of consistent interfaces for functions working with the same object, or doing similar tasks. Examples: writing prediction and plotting functions for power-law scaling models. Advantages of splitting big problems into smaller ones with their own functions: understanding, modification, design, re-use of work. Trade-off between internal sub-functions and separate functions. Re-writing the plotting function to use the prediction function. Recursion. Example: re-writing the resource allocation code to be more modular and recursive. R for examples.
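A minimal sketch of the consistent-interface idea for the power-law example, with the plotting function re-using the prediction function rather than duplicating the formula; the parameter names and numbers are made up.

    # Prediction function for a power-law scaling model, y = y0 * x^a
    powerlaw.predict <- function(model, x) {
      model$y0 * x^model$a
    }
    # The plotting function calls the prediction function instead of repeating the formula
    powerlaw.plot <- function(model, from, to, ...) {
      curve(powerlaw.predict(model, x), from = from, to = to, ...)
    }
    toy.model <- list(y0 = 2, a = 0.5)   # made-up parameter values
    powerlaw.plot(toy.model, from = 1, to = 100)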

Introduction to Statistical Computing

Posted by crshalizi at September 11, 2013 10:30 | permanent link

September 10, 2013

Writing and Calling Functions (Introduction to Statistical Computing)

Lecture 4: Just as data structures tie related values together into objects, functions tie related commands together into objects. Declaring functions. Arguments (inputs) and return values (outputs). Named arguments, defaults, and calling functions. Interfaces: controlling what the function can see and do; first sketch of scoping rules. The importance of the interface. An example of writing and improving a function, for fitting the West et al. model of urban economies by nonlinear least squares. R for examples.
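A small example of the basic mechanics of declaring and calling a function with named arguments and defaults; the function itself is just an illustration, not the West et al. fit from lecture.

    # One required argument (x), two named arguments with defaults (scale, exponent);
    # the value of the last expression is returned
    scaled.power <- function(x, scale = 1, exponent = 1) {
      scale * x^exponent
    }
    scaled.power(2)                    # positional matching, defaults for scale and exponent
    scaled.power(2, exponent = 0.5)    # named arguments can be supplied in any order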

Introduction to Statistical Computing

Posted by crshalizi at September 10, 2013 15:52 | permanent link

Homework: Tweaking Resource-Allocation-by-Tweaking (Introduction to Statistical Computing)

In which we make incremental improvements to our code for planning by incremental improvements.

Assignment, code.

Introduction to Statistical Computing

Posted by crshalizi at September 10, 2013 15:44 | permanent link

Lab: Only the Answers Have Changed (Introduction to Statistical Computing)

In which we try our hand at controlling the flow of our code, and learn about the reproducibility of randomized procedures.

Lab, R code

Introduction to Statistical Computing

Posted by crshalizi at September 10, 2013 15:42 | permanent link

September 04, 2013

Rainfall, Data Structures, Obsessive Doodling (Introduction to Statistical Computing)

In which we practice working with data frames, grapple with some of the subtleties of R's system of data types, and calculate the consequences of doodling while bored in lecture.

Assignment, due at 11:59 pm on Thursday, 5 September 2013

Introduction to Statistical Computing

Posted by crshalizi at September 04, 2013 01:43 | permanent link

Lab: Basic Probability, Basic Data Structures (Introduction to Statistical Computing)

In which we play around with basic data structures and convince ourselves that the laws of probability are, in fact, right. (Or perhaps that R's random number generator is pretty good.)

Lab

Introduction to Statistical Computing

Posted by crshalizi at September 04, 2013 01:42 | permanent link

Flow Control, Iteration, Vectorizing (Introduction to Statistical Computing)

Lecture 3: Conditioning the calculation on the data: if; what is truth?; Boolean operators again; switch. Iteration to repeat similar calculations: for and iterating over a vector; while and conditional iteration (reducing for to while); repeat and unconditional iteration, with break to exit loops (reducing while to repeat). Avoiding iteration with "vectorized" operations and functions: the advantages of the whole-object view; some examples and techniques: mathematical operators and functions, ifelse; generating arrays with repetitive structure.
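A small example contrasting an explicit loop with the vectorized ifelse version of the same calculation.

    x <- rnorm(1e5)

    # Iterative version: clamp negative values to zero, one element at a time
    clamped <- numeric(length(x))
    for (i in seq_along(x)) {
      clamped[i] <- if (x[i] < 0) 0 else x[i]
    }

    # Vectorized version: one whole-object operation
    clamped.vec <- ifelse(x < 0, 0, x)
    identical(clamped, clamped.vec)    # TRUE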

Introduction to Statistical Computing

Posted by crshalizi at September 04, 2013 01:41 | permanent link

Basic Data Types and Data Structures (Introduction to Statistical Computing)

Introduction to the course: statistical programming for autonomy, honesty, and clarity of thought. The functional programming idea: write code by building functions to transform input data into desired outputs. Basic data types: Booleans, integers, characters, floating-point numbers. Subtleties of floating point numbers. Operators as basic functions. Variables and names. An example with resource allocation. Related pieces of data are bundled into larger objects called data structures. Most basic data structures: vectors. Some vector manipulations. Functions of vectors. Naming of vectors. Continuing the resource-allocation example. Building more complicated data structures on top of vectors. Arrays as a first vector structure. Matrices as a special type of array; functions for matrix arithmetic and algebra: multiplication, transpose, determinant, inversion, solving linear systems. Using names to make calculations clearer and safer: resource-allocation mini-example. Lists for combining multiple types of values; accessing sub-lists and individual elements; ways of adding and removing parts of lists. Lists as key-value pairs. Data frames: the data structure for classic tabular data, one column per variable, one row per unit; data frames as hybrids of matrices and lists. Structures of structures: using lists recursively to create complicated objects; example with eigen.
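A few minimal examples of the structures mentioned, purely for illustration.

    # Vectors: the most basic data structure, with optional names
    v <- c(height = 1.7, weight = 70)
    v["weight"]

    # Matrices and basic linear algebra
    A <- matrix(c(2, 1, 1, 3), nrow = 2)
    solve(A) %*% A                     # inverse times A gives the identity

    # Lists bundle values of different types; data frames are lists of equal-length columns
    run <- list(label = "pilot", params = c(a = 1, b = 2), converged = TRUE)
    run$params["b"]
    d <- data.frame(city = c("A", "B"), population = c(5e5, 2e6))
    d$population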

Slides

Introduction to Statistical Computing

Posted by crshalizi at September 04, 2013 01:40 | permanent link

Class Announcement: 36-350, Statistical Computing

It's that time of year again:

36-350, Statistical Computing
Description: Computational data analysis is an essential part of modern statistics. Competent statisticians must not just be able to run existing programs, but to understand the principles on which they work. They must also be able to read, modify and write code, so that they can assemble the computational tools needed to solve their data-analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to programming, targeted at statistics majors with minimal programming knowledge, which will give them the skills to grasp how statistical software works, tweak it to suit their needs, recombine existing pieces of code, and when needed create their own programs.
Students will learn the core ideas of programming — functions, objects, data structures, flow control, input and output, debugging, logical design and abstraction — through writing code to assist in numerical and graphical statistical analyses. Students will in particular learn how to write maintainable code, and to test code for correctness. They will then learn how to set up stochastic simulations, how to parallelize data analyses, how to employ numerical optimization algorithms and diagnose their limitations, and how to work with and filter large data sets. Since code is also an important form of communication among scientists, students will learn how to comment and organize code.
The class will be taught in the R language.
Pre-requisites: This is an introduction to programming for statistics students. Prior exposure to statistical thinking, to data analysis, and to basic probability concepts is essential, as is some prior acquaintance with statistical software. Previous programming experience is not assumed, but familiarity with the computing system is. Formally, the pre-requisites are "Computing at Carnegie Mellon" (or consent of instructor), plus one of either 36-202 or 36-208, with 36-225 as either a pre-requisite (preferable) or co-requisite (if need be).
The class may be unbearably redundant for those who already know a lot about programming. The class will be utterly incomprehensible for those who do not know statistics and probability.

Further details can be found at the class website. Teaching materials (lecture slides, homeworks, labs, etc.), will appear both there and here.

Corrupting the Young; Enigmas of Chance; Introduction to Statistical Computing

Posted by crshalizi at September 04, 2013 01:39 | permanent link

Self-Evaluation and Lessons Learned (Advanced Data Analysis from an Elementary Point of View)

Attention conservation notice: Navel-gazing about teaching.

Well, that was exhausting. Everything seemed to take me at least as much effort and time as the year before, despite recycling a lot of material.

There continues to not be enough manpower to give detailed feedback on 90-odd students' weekly data analysis homework.

I had to abandon the quality-control sampling early on the semester, because there just wasn't time on my end. I regret this.

The second exam didn't work well at all; it will need to be replaced in the future. The students' struggles with it suggest that I ought to teach more about categorical data, but I'm not sure what I'd cut to make room for it.

This is all pretty negative, and indeed I'm fairly dis-satisfied with my own performance in teaching this, since it feels like I worked harder and the students didn't learn any more. But the student evaluations for the class and for me were much higher this year than before. I don't know whether this means I've started doing something right without realizing it, or forgotten something important, or indeed changed how I teach at all. But even the kids who ended the course hating my guts claimed they'd learned from it, so I'll count that as a win.

(The main thing to do at this point is to actually finish the book. The biggest job is going to be integrating the homework problems and the exams into the text. The most tedious job is going to be making sure the notation is uniform throughout.)

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at September 04, 2013 01:22 | permanent link

August 31, 2013

Books to Read While the Algae Grow in Your Fur, August 2013

Chelsea Cain, Kill You Twice and Let Me Go
Mind candy. The series is steadily ratcheting up its sheer gothic weirdness, and I'm not sure I can recommend these for new-comers, but they were fun.
Erin Morgenstern, The Night Circus
Mind candy: lovely literary fantasy about an enchanted traveling carnival, and the wizard's-duel/love-affair of two apprentice magicians. (Thanks to RCS for a copy.)
Daniel Schwartz, Travelling through the Eye of History
Beautiful contemporary photographs of Central Asia; the text strives for profundity but only achieves pretentiousness.
Jessica Hagy, How to Be Interesting (In 10 Simple Steps)
Inspirational life advice. I think it's good advice for people who are in a position to follow it; too many people aren't.
Linda Nagata, The Red: First Light
Mind candy: cynical, intelligent and gripping near-future military hard SF. Stefan Raets's review at tor.com pre-empts most of what I wanted to say.
Tracy Thompson, The New Mind of the South
Journalistic impressions from a native daughter; despite the title, not really an attempt to update Cash. It's well-written (for journalism) and seems very plausible, but I lack the knowledge to say if it's right. (She doesn't seem to do enough comparison to the rest of the US, though, like the Midwest or the mountain West, when discussing the hollowing-out of rural areas.)
Nilanjana Roy, The Wildings
Mind candy. Fantasy novel about free-living cats in Delhi.
Kevin D. Hoover, Causality in Macroeconomics
Valuable, but I did want to argue with it a lot. In particular, I don't understand his objection to the causal Markov condition, and so to what he calls the approach of "Glymour et al." Saying it won't hold for some of the official economic statistics because they're measured too slowly confuses, by his own principles, the causal structure of the macroeconomy with the econometric issue of discovering that structure. Hoover's own positive principle for determining the direction of causation is that if \( X \) causes \( Y \), interventions which change the distribution of \( X \) shouldn't alter the conditional distribution of \( Y \) given \( X \). If the causality goes the other way, however, interventions which alter the distribution of \( X \) will also change \( Y|X \). This is sound, but also easily explicated within the graphical-model tradition. (It's how we calculate effects of interventions from "surgery" on graphs.)
J. Robert Lennon, Castle
Mind candy. Extraordinarily creepy thriller, in which the headgames played on the narrator echo those played on the reader. (Thanks to Larry for a copy.)
Avram Davidson, The Wailing of the Gaulish Dead
A lost "adventure in unhistory", thankfully restored to the shores of light.
Kristin Landon, The Hidden Worlds
Mind candy. Enjoyably angsty romantic science fiction. (Many fewer ray-guns than the cover suggests.)
Caitlin R. Kiernan, Alabaster: Wolves
Comic-book mind candy, featuring a minor character from Threshold.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Philosophy; Writing for Antiquity; Constant Conjunction Necessary Connection; The Dismal Science; Afghanistan and Central Asia; The Beloved Republic

Posted by crshalizi at August 31, 2013 23:59 | permanent link

July 31, 2013

Books to Read While the Algae Grow in Your Fur, July 2013

Attention conservation notice: I have no taste.

Howard Andrew Jones, The Bones of the Old Ones
Mind candy. In which our heroes battle the malign spirits of the ice age. (Previously.)
Warren Ellis, Global Frequency
Mind candy. I remember liking this a lot when it first came out, but reading the re-issue I can't recall why; it seems like Ellis recycling themes he dealt with much better elsewhere.
László Lovász, Large Networks and Graph Limits
Lovász is one of the inventors and main developers of the theory of graph limits; I will not attempt to explain that here, but refer to my notebook on the subject. This is his most systematic and comprehensive account of the theory, and indeed the fullest account I know of. It's very much a mathematician's book, rather than, say, a computer scientist's, physicist's, or statistician's — I confess I still don't see the point of some of the most purely algebraic parts — but extremely clear for anyone with the necessary firm grounding, which should include graph theory, especially the theory of random graphs, some real analysis and probability (enough to appreciate weak and \( L_1 \) convergence and why they differ), and a fair chunk of abstract algebra.
Brian Hayes, Infrastructure: A Field Guide to the Industrial Landscape
How-stuff-works for grown-ups, guided by an aesthetic of fascination and awe, and integrally bound up to lovely pictures.
(This book came out years ago, but I have been reading it very slowly, to savor. Like Adina Levin, I'd like to see it "appear 50 years from now like a tour guide to Colonial Williamsburg".)
Mark Charan Newton, Nights of Viljamur
Mind candy: Yet Another Epic Fantasy. Picked up because of the dying-earth setting, finished because I started it. (I didn't find either the characterization or the politics convincing, and there was a lot of both.) I won't be looking for sequels.
Karin Slaughter, Unseen
Mind candy. The first time in years I'd figured out whodunnit long before the end.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Networks; Mathematics; Enigmas of Chance

Posted by crshalizi at July 31, 2013 23:59 | permanent link

June 30, 2013

Books to Read While the Algae Grow in Your Fur, June 2013

Attention conservation notice: I have no taste.

Stanislaw Lem, Summa Technologiae (trans. Joanna Zylinska)
This was Lem's main attempt to think through the relations between human intellect and human technology. It's not a book with a central thesis so much as a series of explorations along related lines, with Lem pushing various ideas as far as he can go. This means that it considers information, feedback, self-organization, complexity, biological evolution, evolutionary optimization algorithms, artificial life, complexity, virtual reality, the social organization of scientific research, the extent to which public policy can guide technological development, the search for extraterrestrial intelligence, "miracles in the heavens", the ethical implications of duplicating people by transmitting descriptions of them, the minimum description length principle, and much, much else.
The themes Lem keeps returning to are the comparison between human cognition and natural evolution, the importance of self-organization and homeostasis, and the possibilities of one process creating another of a radically different nature. All three themes are clearly very influenced by cybernetics, especially Wiener (e.g., God & Golem, Inc.) and Ashby (e.g., Design for a Brain). In some ways the influence makes this a bit of a period piece, but as always Lem had good taste in ideas, and isn't distracted by the less valuable parts of cybernetics. This means that while some of Lem's ideas about implementations are obsolete, his general notions are still very relevant. (What is this but a step towards an information farm?) I'd go on, but David Auerbach's review is already out there, so just go read that, and then Summa.
(Please ignore the attempt, in the introduction to the translation, to push Lem as the Next Big Thing in Continental theory; we owe the translator a debt for her labor of love on this book.)
Ben Aaronovitch, Midnight Riot (= Rivers of London), Moon over Soho and Whispers Underground
Mind candy. Adventures and investigations of London constable Peter Grant, apprentice sorcerer. I learned about these from Kate Nepveu's review, and devoured them.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; The Great Transformation; Complexity

Posted by crshalizi at June 30, 2013 23:59 | permanent link

May 31, 2013

Books to Read While the Algae Grow in Your Fur, May 2013

Attention conservation notice: I have no taste.

Jeannine Hall Gailey, Unexplained Fevers
More fairytales retold, as modern verse. (Previously: 1, 2.)
Jay Lake, Kalimpura
Mind candy; the conclusion of Green's story. The writing continues to be excellent, but I found the ending a bit too easy. (Previously: 1, 2.)
John Scalzi, The Human Division
Mind candy. Fun if you've been enjoying the series, not worthwhile otherwise. (Previously: 1, 2 and 3.)
Chrystia Freeland, Plutocrats: The Rise of the New Global Super-Rich and the Fall of Everyone Else
High quality popular social science about rising inequality, primarily in the US but also giving a lot of attention to China, Russia and India.
Mary E. Burfisher, Introduction to Computable General Equilibrium Models
Computable general equilibrium models in economics work something like this. There are several types of consumers, who spend their income on different types of goods, with the division across goods depending on their absolute and relative prices. (There is the usual mythology about how this is from utility maximization, but all that's even close to empirical here are the demand functions, which are aggregate demands from vast numbers of consumers, so individual utility is superfluous.) Producers of goods in turn adjust quantities produced in response to prices. (Again, there's mythology about utility.) Generally speaking, prices are adjusted so that demand equals supply and all markets clear, especially all markets for factors of production. (Alternately, if one wants to admit that unemployment happens, one has to hold wages constant "by hand".) The income which goes to goods producers is then divided up similarly among factors of production, and thence to consumers. Everything balances, so that every dollar spent on one side of a transaction is received on the other. (Including taxes and foreign trade complicates things only a little.)
These models are very widely used for policy, with a typical exercise being to "shock" some price or quantity or tax rate, and purport to calculate the effects. This book is an introduction to their use, for an implied reader who mostly remembers a basic microeconomic course but otherwise has little economic sophistication, and less math. (They also need a lot of hand-holding.) Since I am very far from the implied reader, it's hard for me to judge how well it would teach them to plug-and-chug in such models, but I suspect it will work at least OK. It will certainly not teach them how to think critically about the model's output.
It is really quite astonishing that one and the same economics profession can simultaneously proclaim the importance of microfoundations, and elaborate and accept models like these. (Cf.) It is really quite scary to think that important decisions get made using models this crude. It is also scary to think that decision-makers are getting advice from people who understand economic models so little as to find this book useful.
Matt Fraction, David Aja and Javier Pullido, Hawkeye: My Life as a Weapon
Comic-book mind candy.
George Polya, How To Solve It: A New Aspect of Mathematical Method
The book which started the revival of "heuristics" (I think). I'd never read it before; I see why it's a classic, I'll certainly borrow from it (and recommend it) for teaching purposes. But — and I realize that this is a person of very small accomplishments criticizing a great mathematician — there is a school-room air to the sort of problem-solving Polya describes. "Have you used all the data?" and "Have you used all of the condition?" are good pieces of advice for a pedagogical problem which is (as Polya says) "perfectly posed", with no superfluities or missing parts, and a fixed goal. Getting to that point is much of the work of doing original mathematics, let alone empirical science, or even original data analysis. Now, to be fair, Polya realizes this (see his entry on "practical problems"), but it still seems like a huge gap. — I will assign this to the Kids in the research projects course in the spring, but it won't really be what they need.
Charles Tilly, Durable Inequality
Tilly's attempt to sketch out the families of mechanisms which generate and maintain categorical inequalities, as between races, or sexes, or nationalities, or ethnicities. It deserves a full review but I can't give it one.
(That said, I found the discussion of opportunity hoarding much more satisfying than that of exploitation. It's not clear to me how, analytically, we are supposed to recognize exploitation, as opposed to fair-but-unequal rewards. It's also not clear to me that it's really a mechanism, in the sense Tilly (very sensibly) wants to find, i.e., a recurring pattern of causal interaction.)

Books to Read While the Algae Grow in Your Fur; The Dismal Science; Scientifiction and Fantastica; Commit a Social Science; Mathematics; Corrupting the Young; The Commonwealth of Letters

Posted by crshalizi at May 31, 2013 23:59 | permanent link

April 30, 2013

Books to Read While the Algae Grow in Your Fur, April 2013

Attention conservation notice: I have no taste.

Seanan McGuire, Midnight Blue-Light Special
Mind candy, continuing the story from Discount Armageddon.
"Paula Brandon", The Traitor's Daughter, The Ruined City, The Wanderers
Mind candy; but mind candy worthy of comment.
"Brandon" is, if I can trust Wikipedia, a pseudonym of Paula Volsky. To indulge in a bit of self-plagiarism, Volsky is herself the poor person's Jack Vance: she has a similar talent for creating self-referential social worlds which are very compelling to their inhabitants yet provide endless possibilities for irony to the reader, and for placing grandiloquent rhetoric in subtly inappropriate mouths, without ever quite reaching the same pitch of accomplishment as Vance. Since Vance is, by common consent, one of the greatest stylists in science fiction and fantasy, I do not regard this as damning with faint praise at all; to be almost as good a Vancean as Vance is really remarkable. Volsky's novels have given me with a great deal of unalloyed pleasure, but no more appeared after her (very good) The Grand Ellipse in 2000.
The Traitor's Daughter is a return to form, with a complex multi-stranded plot set in a well-realized fantasy world, whose social intricacies and harsh realities are of primary concern to the viewpoint characters, even as something much vaster — a novel sort of zombie apocalypse rooted simultaneously in Lovecraft and Childhood's End — unfolds around them. The major protagonist, Aureste Belandor, is, in his combination of determination to not just come first but to dominate, scheming manipulation of everyone around him, and sheer grasping self-centeredness, one of the most vivid characters Volsky has ever created. (He ranks with Vance's Cugel the Clever, though the comparison would pain Aureste.) His affection for, and anguish over, his kidnapped daughter Jianna is portrayed quite convincingly --- and it comes across as being driven by his seeing her as a possession, or, perhaps more charitably, as an extension of himself. But while Aureste is unquestionably the most vivid character here, they are all compelling. I recommend this series very strongly, if this makes it sound at all interesting.
(Inexplicably, or only too explicably, the blurbs for the books make it sound like they focus on Jianna's romance sub-plot, which, while well done in its way, isn't even the focus of Jianna's part of the story.)

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica

Posted by crshalizi at April 30, 2013 23:59 | permanent link

Final Exam (Advanced Data Analysis from an Elementary Point of View)

In which we are devoted to two problems of political economy, viz., strikes, and macroeconomic forecasting.

Assignment; macro.csv

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 30, 2013 11:50 | permanent link

Time Series (Advanced Data Analysis from an Elementary Point of View)

What time series are. Properties: autocorrelation or serial correlation; other notions of serial dependence; strong and weak stationarity. The correlation time and the world's simplest ergodic theorem; effective sample size. The meaning of ergodicity: a single increasingly long time series becomes representative of the whole process. Conditional probability estimates; Markov models; the meaning of the Markov property. Autoregressive models, especially additive autoregressions; conditional variance estimates. Bootstrapping time series. Trends and de-trending.
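
As a toy illustration of the correlation time and effective sample size ideas (my own sketch, not code from the handouts):

# Simulate an AR(1) process and compare naive and effective sample sizes.
set.seed(42)
n <- 1e4
rho <- 0.9
x <- arima.sim(model = list(ar = rho), n = n)

# Lag-1 autocorrelation, estimated from the data
rho.hat <- acf(x, lag.max = 1, plot = FALSE)$acf[2]

# For an AR(1), the effective sample size for estimating the mean is
# roughly n * (1 - rho) / (1 + rho)
n.eff <- n * (1 - rho.hat) / (1 + rho.hat)

# Standard error of the sample mean, ignoring vs. respecting dependence
se.naive <- sd(x) / sqrt(n)
se.adjusted <- sd(x) / sqrt(n.eff)
c(n.eff = n.eff, se.naive = se.naive, se.adjusted = se.adjusted)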

Reading: Notes, chapter 25; R for examples; gdp-pc.csv

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 30, 2013 10:30 | permanent link

April 25, 2013

Discovering Causal Structure from Observations (Advanced Data Analysis from an Elementary Point of View)

How do we get our causal graph? Comparing rival DAGs by testing selected conditional independence relations (or dependencies). Equivalence classes of graphs. Causal arrows never go away no matter what you condition on ("no causation without association"). The crucial difference between common causes and common effects: conditioning on common causes makes their effects independent, conditioning on common effects makes their causes dependent. Identifying colliders, and using them to orient arrows. Inducing orientation to enforce consistency. The SGS algorithm for discovering causal graphs; why it works. The PC algorithm: the SGS algorithm for lazy people. What about latent variables? Software: TETRAD and pcalg; examples of working with pcalg. Limits to observational causal discovery: universal consistency is possible (and achieved), but uniform consistency is not.
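
For anyone who wants to try pcalg themselves, the basic usage, following the package's own examples, looks roughly like this (d is a hypothetical data frame of continuous variables; this is a sketch, not the class demo):

# PC algorithm for causal discovery, assuming the pcalg package is installed
# and d holds roughly Gaussian, continuous variables.
library(pcalg)

suffStat <- list(C = cor(d), n = nrow(d))   # sufficient statistics for the CI tests
pc.fit <- pc(suffStat, indepTest = gaussCItest, alpha = 0.01,
             labels = colnames(d))
plot(pc.fit)   # plotting the estimated graph needs the Rgraphviz package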

Reading: Notes, chapter 24

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 25, 2013 10:30 | permanent link

April 24, 2013

"Going Online: Should we do it? How? Why? What do we gain? What do we lose?" (Next Week, Instead of the Statistics Seminar)

Next week, instead of the regular seminar, the CMU statistics department will be hosting a panel on experience with online statistics education, including massive open online courses:

"Going Online: Should we do it? How? Why? What do we gain? What do we lose?"
Panelists: Emma Brunskill; Brian Caffo; Jeff Leek; Marsha Lovett; Roger Peng; Chad Schafer
Moderators: Rebecca Nugent and Ryan Tibshirani
Time and place: 3:30--5:00 pm on Monday, 29 April 2013, in Baker Hall A51 ("Giant Eagle Auditorium")
I look forward to this very much, even if I do plan to channel Tim Burke and/or Adam Kotsko.

Corrupting the Young

Posted by crshalizi at April 24, 2013 22:54 | permanent link

April 23, 2013

Growth and Debt (Advanced Data Analysis from an Elementary Point of View)

In which the relationship (if any) between GDP growth and government debt forms a bridge between causal inference and time series analysis.

Assignment, debt.csv

Update, 2 May: I have given up on posting solutions, but in this case I will make an exception, as well as directing the reader to the relevant comment from Biostatistics Ryan Gosling.

Posted by crshalizi at April 23, 2013 11:50 | permanent link

Estimating Causal Effects from Observations (Advanced Data Analysis from an Elementary Point of View)

Estimating graphical models: substituting consistent estimators into the formulas for front and back door identification; average effects and regression; tricks to avoid estimating marginal distributions; propensity scores and matching; propensity scores as computational short-cuts in back-door adjustment. Instrumental variables estimation: the Wald estimator, two-stage least-squares. Summary recommendations for estimating causal effects.
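
To make the regression approach to back-door adjustment concrete, here is a minimal sketch, with hypothetical variable names and a linear specification chosen purely for brevity:

# Back-door adjustment by regression: estimate E[Y | do(X = x)] by averaging
# predictions over the observed distribution of the back-door covariates Z.
# df is a hypothetical data frame with columns y, x, z1, z2.
fit <- lm(y ~ x + z1 + z2, data = df)

effect.of.x <- function(x.value, fit, data) {
  # Set everyone's X to x.value, keep their actual Zs, average the predictions
  data$x <- x.value
  mean(predict(fit, newdata = data))
}

# Average causal effect of moving X from 0 to 1, if {z1, z2} satisfies the
# back-door criterion and the linear specification is adequate
effect.of.x(1, fit, df) - effect.of.x(0, fit, df)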

Reading: Notes, chapter 23

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 23, 2013 10:30 | permanent link

April 16, 2013

Brought to You by the Letters D, A, and G (Advanced Data Analysis from an Elementary Point of View)

In which the arts of estimating causal effects from observational data are practiced on Sesame Street.

Assignment, sesame.csv

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 16, 2013 11:50 | permanent link

Identifying Causal Effects from Observations (Advanced Data Analysis from an Elementary Point of View)

Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Summary recommendations for identifying causal effects.

Reading: Notes, chapter 22

Optional reading: Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 16, 2013 10:30 | permanent link

April 11, 2013

Graphical Causal Models (Advanced Data Analysis from an Elementary Point of View)

Probabilistic prediction is about passively selecting a sub-ensemble, leaving all the mechanisms in place, and seeing what turns up after applying that filter. Causal prediction is about actively producing a new ensemble, and seeing what would happen if something were to change ("counterfactuals"). Graphical causal models are a way of reasoning about causal prediction; their algebraic counterparts are structural equation models (generally nonlinear and non-Gaussian). The causal Markov property. Faithfulness. Performing causal prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules for linear models.

Reading: Notes, chapter 21

Optional reading: Cox and Donnelly, chapter 9; Pearl, "Causal Inference in Statistics", section 1, 2, and 3 through 3.2

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 11, 2013 10:30 | permanent link

April 09, 2013

Choosing a Better History (Advanced Data Analysis from an Elementary Point of View)

Exam 2: in which we examine how the citizens of ex-communist country X look at history and human rights, as a way of practicing multivariate data analysis.

Assignment; the data set is still confidential and so not public.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 09, 2013 11:50 | permanent link

Graphical Models (Advanced Data Analysis from an Elementary Point of View)

Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth?

Reading: Notes, chapter 20

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 09, 2013 10:30 | permanent link

April 04, 2013

"Prediction in Complex Networks" (Next Week at the Statistics Seminar)

All of the statistics department's seminars are, of course, fascinating presentations of important work, but next week's could hardly be more relevant to my interests if I had arranged it myself.

Jennifer Neville, "Prediction in complex networks: The impact of structure on learning and prediction"
Abstract: The recent popularity of online social networks and social media has increased the amount of information available about users' behavior — including current activities and interactions among friends and family. This rich relational information can be exploited to predict user interests and preferences even when individual data is sparse, as the relationships are a critical source of information that identify potential statistical dependencies among people. Although network data offer several opportunities to improve prediction, the characteristics of real world datasets present a number of challenges to accurately incorporate relational information into machine learning algorithms. In this talk, I will discuss the effects of sampling, parameter tying, and model roll-out on the properties of the resulting statistical models — which occurs through a complex interaction between local model properties, global network structure, and the availability of observed attributes. By understanding the impact of these interactions on algorithm performance (e.g., learning, inference, and evaluation), we can develop more accurate and efficient analysis methods for large, partially-observable social network and social media datasets.
Place and time: Scaife Hall 125, 4--5 pm on Monday, 8 April 2013

Enigmas of Chance; Networks

Posted by crshalizi at April 04, 2013 13:29 | permanent link

April 02, 2013

Mixture Models (Advanced Data Analysis from an Elementary Point of View)

From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points at which we place copies of the kernel. Probabilistic formulation of mixture models. Geometry: planes again. Probabilistic clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.

Extended example: Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components.
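
Separately from the class code, here is a bare-bones sketch of the EM iteration for a two-component Gaussian mixture, just to show the moving parts:

# Minimal EM for a two-component Gaussian mixture in one dimension.
em.2gauss <- function(x, max.iter = 500, tol = 1e-8) {
  # crude starting values: split the data at the median
  mu <- c(mean(x[x <= median(x)]), mean(x[x > median(x)]))
  sigma <- rep(sd(x), 2)
  lambda <- c(0.5, 0.5)
  old.loglik <- -Inf
  for (i in 1:max.iter) {
    # E step: how probable is each point under each (weighted) component?
    d1 <- lambda[1] * dnorm(x, mu[1], sigma[1])
    d2 <- lambda[2] * dnorm(x, mu[2], sigma[2])
    loglik <- sum(log(d1 + d2))       # log-likelihood at the current parameters
    r <- d1 / (d1 + d2)               # responsibility of component 1 for each point
    # M step: weighted proportions, means, and standard deviations
    lambda <- c(mean(r), 1 - mean(r))
    mu <- c(weighted.mean(x, r), weighted.mean(x, 1 - r))
    sigma <- c(sqrt(weighted.mean((x - mu[1])^2, r)),
               sqrt(weighted.mean((x - mu[2])^2, 1 - r)))
    if (abs(loglik - old.loglik) < tol) break
    old.loglik <- loglik
  }
  list(lambda = lambda, mu = mu, sigma = sigma, loglik = loglik)
}

# For instance, on simulated data with two clear components:
x <- c(rnorm(300, 0, 1), rnorm(200, 4, 0.5))
em.2gauss(x)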

Reading: Notes, chapter 19; mixture-examples.R

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at April 02, 2013 10:30 | permanent link

March 31, 2013

Books to Read While the Algae Grow in Your Fur, March 2013

Attention conservation notice: I have no taste.

John Levi Martin, Social Structures
The best approach to a theory of social networks I have ever seen from the hands of a sociologist. Specifically, it is about relating the content of different social relationships to the form of the larger networks to which they give rise, and how both form and content are linked to the ways participants think about the relationship — but not just to how they think. Martin draws very deeply on a huge range of scholarship, everything from studies of American teenagers at summer camps through the ethology of dominance hierarchies (including the origins of the concept "pecking order") to the history of European militaries and the development of party politics in colonial New York and Virginia. Astonishingly, he really pulls it all together, and writes much more than decently.
To over-simplify a lot, Martin wants to identify what allows some types of relationships to ramify into very large social structures, where the participants can nonetheless have some grasp of the structure and how it is organized. (It's not enough if the organization only becomes evident to an outsider after detailed sociometric analysis.) Principles like homophily don't work, because they would lead merely to cliques, or at best to dense clusters, and people are simply unable to handle dense social networks at any large scale; real networks are, and must be, sparse. "Balance", the transitive closure of the idea that "the enemy of my enemy is my friend, and the friend of my enemy is my enemy", fails because it's strategic suicide. Relationships of exchange (of gifts, children in marriage, etc.) are more promising, but are also necessarily fragile, and hard to expand. Human beings simply don't do pecking orders, unless they are confined without possibilities of escape (like schoolchildren), and even then sorting a truly large pool of people by peck-ability doesn't scale.
The most robust possibility for creating large-scale, comprehensible social structures is an anti-symmetric patron-client relationship, grounded in some pre-existing inequality of status or resources (or ideally both), where clients offer services to patrons and patrons protect clients from other members of the patron class. If patrons can be clients of more important patrons, such ties can be concatenated into vast pyramids. In doing so, they do not lose comprehensibility (everyone remains a client of a single patron, who is superior on some recognized dimension), nor do they require dense networks, a carefully balanced flow of resources, or vast efforts on the part of participants. Patronage is not a transitive relationship (my patron's patron is not my patron, whom I must serve), but patronage pyramids can as it were harden into command hierarchies, where subordination is transitive; Martin explores the role of this process in creating the modern corporation, army, political party, and state. (On the corporation, he is particularly good [pp. 236--241] on the fact that many people like being in control, and technical or economic efficiency be damned.)
To some extent, I feel that he leaves off just where things are getting interesting, by mentioning that in the case of parties, people create transitive relationships with each other by imagining that they have a binary relationship with the party, or perhaps with its ideology. Parties and ideologies, in this sense, are, to use the poet's words, "consensual hallucinations", but nonetheless very important to how any really large social organization happens. I wish this book said more about how they worked, but Martin seems to treat them as black boxes. (The phrase "imagined community" does not, to the best of my recollection, ever appear in the text.) I hope that this will be dealt with in a later book.
(Read on Kieran Healy's recommendation)
Update: "JLM" has a page of replies to reviews; the tone of the book itself is rather more staid.
Elizabeth Bear, Shattered Pillars
Sequel to the magnificent Range of Ghosts, continuing the action where that book left off. And there is a lot of action: escapes, betrayals, ambushes, storms, fires, eruptions, un-natural plagues, miraculous births; also cities (variously thriving, burning, seething with discontent, rebuilding and ruined), wondrous beasts, ghouls, forbidden books, slave poetesses, the intersection of chemical thermodynamics with wizardry, and cramped tunnels and vast skies. As an act of story-telling, what strikes me most, having just put the book down, is how much of the novel is told from perspectives of those who are (not to put too fine a point on it) villains, yet in such a way that the reader is invited to comprehend and even sympathize, though not approve. (Imagine key parts of The Two Towers being told from the perspective of Grima Wormtongue, who thought he was shouldering a burden by doing unpleasant things for the good of Rohan.) At the same time, Temur and Samarkar only grow on me as heroes. There will be a third book, for which I can hardly wait.
Elizabeth Bear, Bone and Jewel Creatures
Mind candy about dueling necromancers, but candy of a high grade. (How often, in mind candy or elsewhere, is the protagonist an old woman with arthritis?) In the same world as Range of Ghosts and Shattered Pillars, but thematically distinct.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Networks; Commit a Social Science

Posted by crshalizi at March 31, 2013 23:59 | permanent link

March 28, 2013

Factor Analysis (Advanced Data Analysis from an Elementary Point of View)

Adding noise to PCA to get a statistical model. The factor model, or linear regression with unobserved independent variables. Assumptions of the factor model. Implications of the model: observable variables are correlated only through shared factors; "tetrad equations" for one factor models, more general correlation patterns for multiple factors. Our first look at latent variables and conditional independence. Geometrically, the factor model says the data cluster on some low-dimensional plane, plus noise moving them off the plane. Estimation by heroic linear algebra; estimation by maximum likelihood. The rotation problem, and why it is unwise to reify factors. Other models which produce the same correlation patterns as factor models.
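
A minimal illustration with base R's factanal and simulated data (the class examples use factors.R and the sleep data listed below):

# One-factor model: x_j = w_j * f + noise; check that factanal recovers
# something like the loadings, and that the tetrad pattern roughly holds.
set.seed(17)
n <- 1000
f <- rnorm(n)
w <- c(0.9, 0.8, 0.7, 0.6)                       # true loadings
x <- sapply(w, function(wj) wj * f + rnorm(n, sd = sqrt(1 - wj^2)))

fa <- factanal(x, factors = 1)
fa$loadings                                       # should be close to w (up to sign)

# One-factor models imply the "tetrad equations" among the correlations,
# e.g. rho_12 * rho_34 = rho_13 * rho_24; check approximately:
R <- cor(x)
c(R[1, 2] * R[3, 4], R[1, 3] * R[2, 4])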

Reading: Notes, chapter 18; factors.R and sleep.txt

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 28, 2013 10:30 | permanent link

March 26, 2013

How the Recent Mammals Got Their Size Distribution (Advanced Data Analysis from an Elementary Point of View)

Homework 8: in which returning to paleontology gives us an excuse to work with simulations, and to compare distributions.

Assignment; MOM_data_full.txt

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 26, 2013 11:50 | permanent link

Principal Components Analysis (Advanced Data Analysis from an Elementary Point of View)

Principal components is the simplest, oldest and most robust of dimensionality-reduction techniques. It works by finding the line (plane, hyperplane) which passes closest, on average, to all of the data points. This is equivalent to maximizing the variance of the projection of the data on to the line/plane/hyperplane. Actually finding those principal components reduces to finding eigenvalues and eigenvectors of the sample covariance matrix. Why PCA is a data-analytic technique, and not a form of statistical inference. An example with cars. PCA with words: "latent semantic analysis"; an example with real newspaper articles. Visualization with PCA and multidimensional scaling. Cautions about PCA; the perils of reification; illustration with genetic maps.
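
For the record, the basic computation is a one-liner with prcomp; a minimal sketch with made-up data (the class examples below use the cars data):

# PCA by hand vs. prcomp: the principal components are the eigenvectors
# of the sample covariance matrix.
set.seed(1)
z <- matrix(rnorm(200 * 2), ncol = 2) %*% matrix(c(2, 1, 0, 0.5), 2, 2)

pca <- prcomp(z)                # centers the variables by default
eig <- eigen(cov(z))

pca$rotation                    # loadings = eigenvectors of the covariance (up to sign)
eig$vectors
pca$sdev^2                      # variances along the components = eigenvalues
eig$values

head(pca$x)                     # scores: projections of the data onto the components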

Reading: Notes, chapter 17; pca.R, pca-examples.Rdata, and cars-fixed04.dat

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 26, 2013 10:30 | permanent link

March 21, 2013

Relative Distributions and Smooth Tests (Advanced Data Analysis from an Elementary Point of View)

Applying the correct CDF to a continuous random variable makes it uniformly distributed. How do we test whether some variable is uniform? The smooth test idea, based on series expansions for the log density. Asymptotic theory of the smooth test. Choosing the basis functions for the test and its order. Smooth tests for non-uniform distributions through the transformation. Dealing with estimated parameters. Some examples. Non-parametric density estimation on [0,1]. Checking conditional distributions and calibration with smooth tests. The relative distribution idea: comparing whole distributions by seeing where one set of samples falls in another distribution. Relative density and its estimation. Illustrations of relative densities. Decomposing shifts in relative distributions.
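
The core of the relative-distribution idea fits in a few lines of R (simulated data; Handcock and Morris, in the optional reading, develop the real machinery):

# Relative distribution of a comparison sample y1 with respect to a
# reference sample y0: push y1 through the reference's empirical CDF.
set.seed(7)
y0 <- rnorm(5000, mean = 0, sd = 1)            # reference
y1 <- rnorm(5000, mean = 0.3, sd = 1.3)        # comparison: shifted and spread out

r <- ecdf(y0)(y1)   # relative data, which live on [0,1]

# If the two distributions were identical, r would be uniform;
# departures from uniformity show where y1 sits in y0's distribution.
hist(r, breaks = 20, probability = TRUE,
     main = "Relative distribution", xlab = "reference proportion")
abline(h = 1, lty = 2)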

Reading: Notes, chapter 16

Optional reading: Bera and Ghosh, "Neyman's Smooth Test and Its Applications in Econometrics"; Handcock and Morris, "Relative Distribution Methods"

P.S.: nauroz mubarak! (Happy Nowruz.)

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 21, 2013 01:54 | permanent link

March 20, 2013

Red Brain, Blue Brain (Advanced Data Analysis from an Elementary Point of View)

Homework 7: in which we try to predict political orientation from bumps on the skull, or rather from the volume of brain regions determined by MRI and adjusted by (unknown) formulas.

Assignment, n90_pol.csv data

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 20, 2013 20:30 | permanent link

March 19, 2013

Density Estimation (Advanced Data Analysis from an Elementary Point of View)

The desirability of estimating not just conditional means, variances, etc., but whole distribution functions. Parametric maximum likelihood is a solution, if the parametric model is right. Histograms and empirical cumulative distribution functions are non-parametric ways of estimating the distribution: do they work? The Glivenko-Cantelli law on the convergence of empirical distribution functions, a.k.a. "the fundamental theorem of statistics". More on histograms: they converge on the right density, if bins keep shrinking but the number of samples per bin keeps growing. Kernel density estimation and its properties: convergence on the true density if the bandwidth shrinks at the right rate; superior performance to histograms; the curse of dimensionality again. An example with cross-country economic data. Kernels for discrete variables. Estimating conditional densities; another example with the OECD data. Some issues with likelihood, maximum likelihood, and non-parametric estimation. Simulating from kernel density estimates and from histograms.
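
A minimal sketch of the kernel-vs.-histogram comparison, with made-up data rather than the cross-country data from the notes:

# Kernel density estimate vs. histogram for the same sample.
set.seed(3)
x <- rgamma(500, shape = 2, rate = 1)

hist(x, breaks = 30, probability = TRUE, col = "grey", border = "white")
lines(density(x), lwd = 2)                      # default rule-of-thumb bandwidth
lines(density(x, bw = "SJ"), lwd = 2, lty = 2)  # Sheather-Jones bandwidth
curve(dgamma(x, shape = 2, rate = 1), add = TRUE, col = "blue")  # the truth

# Simulating from the kernel density estimate: pick a data point at random,
# then add kernel noise with the estimated bandwidth.
bw <- density(x)$bw
x.new <- sample(x, 1000, replace = TRUE) + rnorm(1000, sd = bw)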

Reading: Notes, chapter 15

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 19, 2013 10:30 | permanent link

March 18, 2013

Estimating by Minimizing

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Expectwrt}[2]{\mathbb{E}_{#2}\left[ #1 \right]} \newcommand{\Varwrt}[2]{\mathrm{Var}_{#2}\left[ #1 \right]} \DeclareMathOperator*{\argmin}{argmin} \]

Attention conservation notice: > 1800 words on basic statistical theory. Full of equations, and even a graph and computer code, yet mathematically sloppy.

The basic ideas underlying asymptotic estimation theory are very simple; most presentations rather cloud the basics, because they include lots of detailed conditions needed to show rigorously that everything works. In the spirit of the world's simplest ergodic theorem, let's talk about estimation.

We have a statistical model, which tells us, for each sample size \( n \), the probability that the observations \( X_1, X_2, \ldots X_n \equiv X_{1:n} \) will take on any particular value \( x_{1:n} \), or the probability density if the observables are continuous. This model contains some unknown parameters, bundled up into a single object \( \theta \), which we need to calculate those probabilities. That is, the model's probabilities are \( m(x_{1:n};\theta) \), not just \( m(x_{1:n}) \). Because this is just baby stats., we'll say that there are only a finite number of unknown parameters, which don't change with \( n \), so \( \theta \in \mathbb{R}^d \). Finally, we have a loss function, which tells us how badly the model's predictions fail to align with the data: \[ \lambda_n(x_{1:n}, m(\cdot ;\theta)) \] For instance, each \( X_i \) might really be a \( (U_i, V_i) \) pair, and we try to predict \( V_i \) from \( U_i \), with loss being mean-weighted-squared error: \[ \lambda_n = \frac{1}{n}\sum_{i=1}^{n}{\frac{{\left( v_i - \Expectwrt{V_i|U_i=u_i}{\theta}\right)}^2}{\Varwrt{V_i|U_i=u_i}{\theta}}} \] (If I don't subscript expectations \( \Expect{\cdot} \) and variances \( \Var{\cdot} \) with \( \theta \), I mean that they should be taken under the true, data-generating distribution, whatever that might be. With the subscript, calculate assuming that the \( m(\cdot; \theta) \) distribution is right.)

Or we might look at the negative mean log likelihood, i.e., the cross-entropy, \[ \lambda_n = -\frac{1}{n}\sum_{i=1}^{n}{\log{m(x_i|x_{1:i-1};\theta)}} \]

Being simple folk, we try to estimate \( \theta \) by minimizing the loss: \[ \widehat{\theta}_n = \argmin_{\theta}{\lambda_n} \]

We would like to know what happens to this estimate as \( n \) grows. To do this, we will make two assumptions, which put us at the mercy of two sets of gods.

The first assumption is about what happens to the loss functions. \( \lambda_n \) depends both on the parameter we plug in and on the data we happen to see. The latter is random, so the loss at any one \( \theta \) is really a random quantity, \( \Lambda_n(\theta) = \lambda_n(X_{1:n},\theta) \). Our first assumption is that these random losses tend towards non-random limits: for each \( \theta \), \[ \Lambda_n(\theta) \rightarrow \ell(\theta) \] where \( \ell \) is a deterministic function of \( \theta \) (and nothing else). It doesn't particularly matter to the argument why this is happening, though we might have our suspicions[1], just that it is. This is an appeal to the gods of stochastic convergence.

Our second assumption is that we always have a unique interior minimum with a positive-definite Hessian: with probability 1, \[ \begin{eqnarray*} \nabla {\Lambda_n}(\widehat{\theta}_n) & = & 0\\ \nabla \nabla \Lambda_n (\widehat{\theta}_n) & > & 0 \end{eqnarray*} \] (All gradients and other derivatives will be with respect to \( \theta \); the dimensionality of \( x \) is irrelevant.) Moreover, we assume that the limiting loss function \( \ell \) also has a nice, unique interior minimum at some point \( \theta^* \), the minimizer of the limiting, noise-free loss: \[ \begin{eqnarray*} \theta^* & = & \argmin_{\theta}{\ell}\\ \nabla {\ell}(\theta^*) & = & 0\\ \nabla \nabla \ell (\theta^*) & > & 0 \end{eqnarray*} \] Since the Hessians will be important, I will abbreviate \( \nabla \nabla \Lambda_n (\widehat{\theta}_n) \) by \( \mathbf{H}_n \) (notice that it's a random variable), and \( \nabla \nabla \ell (\theta^*) \) by \( \mathbf{j} \) (notice that it's not random).

These assumptions about the minima, and the derivatives at the minima, place us at the mercy of the gods of optimization.

To see that these assumptions are not empty, here's an example. Suppose that our models are Pareto distributions for \( x \geq 1 \), \( m(x;\theta) = (\theta - 1) x^{-\theta} \). Then \( \lambda_n(\theta) = \theta \overline{\log{x}}_n - \log{(\theta - 1)} \), where \( \overline{\log{x}}_n = n^{-1} \sum_{i=1}^{n}{\log{x_i}} \), the sample mean of the log values. From the law of large numbers, \( \ell(\theta) = \theta \Expect{\log{X}} - \log{(\theta-1)} \). To show the convergence, the figure plots \( \lambda_{10} \), \( \lambda_{1000} \) and \( \lambda_{10^5} \) for a particular random sample, and the corresponding \( \ell \). I chose this example in part because the Pareto distribution is heavy tailed, and I actually generated data from a parameter value where the variance of \( X \) is infinite (or undefined, for purists). The objective functions, however, converge just fine.

Figure: convergence of the objective functions (here, negative average log-likelihoods) for the Pareto example. The limiting, \( n = \infty \) objective function (solid blue line) is extremely close to what we see at \( n = 10^5 \) already. (The generating code is in pareto.R.)
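
For readers who want to reproduce something like this themselves, here is a minimal sketch of the calculation (my own choices of true parameter and sample sizes, not necessarily those behind the figure or in pareto.R):

# Convergence of the empirical objective function lambda_n to its limit ell
# for the Pareto model m(x; theta) = (theta - 1) x^{-theta}, x >= 1.
set.seed(27)
theta.true <- 2.5      # mean of X is finite, variance is not; E[log X] = 1/(theta-1)
rpareto <- function(n, theta) runif(n)^(-1 / (theta - 1))   # inverse-CDF trick

lambda.n <- function(theta, x) theta * mean(log(x)) - log(theta - 1)
ell <- function(theta) theta / (theta.true - 1) - log(theta - 1)

x <- rpareto(1e5, theta.true)
theta.grid <- seq(1.5, 4, length.out = 200)

curve(ell, from = 1.5, to = 4, lwd = 2, col = "blue",
      xlab = expression(theta), ylab = "objective")
for (n in c(10, 1000, 1e5)) {
  lines(theta.grid, sapply(theta.grid, lambda.n, x = x[1:n]), lty = 2)
}

# Here the minimizer even has a closed form, theta.hat = 1 + 1/mean(log(x)),
# which a numerical minimizer agrees with:
c(closed.form = 1 + 1 / mean(log(x)),
  numerical = optimize(lambda.n, interval = c(1.01, 10), x = x)$minimum)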

With these assumptions made, we use what is about the only mathematical device employed in statistical theory at this level, which is a Taylor expansion. Specifically, we expand the gradient \( \nabla \Lambda_n \) around \( \theta^* \): \[ \begin{eqnarray*} \nabla {\Lambda_n}(\widehat{\theta}_n) & = & 0\\ & \approx & \nabla {\Lambda_n}(\theta^*) + \mathbf{H}_n(\widehat{\theta}_n - \theta^*)\\ \widehat{\theta}_n & = & \theta^* - \mathbf{H}_n^{-1}\nabla {\Lambda_n}(\theta^*) \end{eqnarray*} \] The first term on the right hand side, \( \theta^* \), is the population/ensemble/true minimizer of the loss. If we had \( \ell \) rather than just \( \Lambda_n \), we'd get that for the location of the minimum. But since we see \( \ell \) through a glass, darkly corrupted by noise, we need to include the extra term \( - \mathbf{H}_n^{-1}\nabla {\Lambda_n}(\theta^*) \). The Hessian \( \mathbf{H}_n \) tells us how sharply curved \( \Lambda_n \) is near its minimum; the bigger this is, the easier, all else being equal, to find the location of the minimum. The other factor, \( \nabla {\Lambda_n}(\theta^*) \), indicates how much noise there is — not so much in the function being minimized, as in its gradient, since after all \( \nabla {\ell}(\theta^*) = 0 \). We would like \( \widehat{\theta}_n \rightarrow \theta^* \), so we have to convince ourselves that the rest is asymptotically negligible, that \( \mathbf{H}_n^{-1}\nabla {\Lambda_n}(\theta^*) = o(1) \).

Start with the Hessian. \( \mathbf{H}_n \) is the matrix of second derivatives of a random function: \[ \mathbf{H}_n(\widehat{\theta}_n) = \nabla \nabla \Lambda_n(\widehat{\theta}_n) \] Since \( \Lambda_n \rightarrow \ell \), it would be reasonable to hope that \[ \mathbf{H}_n(\widehat{\theta}_n) \rightarrow \nabla \nabla \ell(\widehat{\theta}_n) = \mathbf{j}(\widehat{\theta}_n) \] We'll assume that everything is nice ("regular") enough to let us "exchange differentiation and limits" in this way. Since \( \mathbf{H}_n(\widehat{\theta}_n) \) is tending to \( \mathbf{j}(\widehat{\theta}_n) \), it follows that \( \mathbf{H}_n = O(1) \), and consequently \( \mathbf{H}_n^{-1} = O(1) \) by continuity. With more words: since \( \Lambda_n \) is tending towards \( \ell \), the curvature of the former is tending towards the curvature of the latter. But this means that the inverse curvature is also stabilizing.

Our hope then has to be the noise-in-the-gradient factor, \( \nabla {\Lambda_n}(\theta^*) \). Remember again that \[ \nabla \ell (\theta^*) = 0 \] and that \( \Lambda_n \rightarrow \ell \). So if, again, we can exchange taking derivatives and taking limits, we do indeed have \[ \nabla {\Lambda_n}(\theta^*) \rightarrow 0 \] and we're done. More precisely, we've established consistency: \[ \widehat{\theta}_n \rightarrow \theta^* \]

Of course it's not enough just to know that an estimate will converge, one also wants to know something about the uncertainty in the estimate. Here things are mostly driven by the fluctuations in the noise-in-the-gradient term. We've said that \( \nabla {\Lambda_n}(\theta^*) = o(1) \); to say anything more about the uncertainty in \( \widehat{\theta}_n \), we need to posit more. It is very common to be able to establish that \( \nabla {\Lambda_n}(\theta^*) = O(1/\sqrt{n}) \), often because \( \Lambda_n \) is some sort of sample- or time- average, as in my examples above, and so an ergodic theorem applies. In that case, because \( \mathbf{H}_n^{-1} = O(1) \), we have \[ \widehat{\theta}_n - \theta^* = O(1/\sqrt{n}) \]

If we can go further, and find \[ \Var{\nabla {\Lambda_n}(\theta^*)} = \mathbf{k}_n \] then we can get a variance for \( \widehat{\theta}_n \), using propagation of error: \[ \begin{eqnarray*} \Var{\widehat{\theta}_n} & = & \Var{\widehat{\theta}_n - \theta^*}\\ & = & \Var{-\mathbf{H}_n^{-1} \nabla {\Lambda_n}(\theta^*)}\\ & \approx & \mathbf{j}^{-1}(\theta^*) \Var{ \nabla {\Lambda_n}(\theta^*)} \mathbf{j}^{-1}(\theta^*)\\ & = & \mathbf{j}^{-1} \mathbf{k}_n \mathbf{j}^{-1} \end{eqnarray*} \] the infamous sandwich covariance matrix. If \( \nabla {\Lambda_n}(\theta^*) = O(1/\sqrt{n} ) \), then we should have \( n \mathbf{k}_n \rightarrow \mathbf{k} \), for a limiting variance \( \mathbf{k} \). Then \( n \Var{\widehat{\theta}_n} \rightarrow \mathbf{j}^{-1} \mathbf{k} \mathbf{j}^{-1} \).

Of course we don't know \( \mathbf{j}(\theta^*) \), because that involves the parameter we're trying to find, but we can estimate it by \( \mathbf{j}(\widehat{\theta}_n) \), or even by \( \mathbf{H}_n(\widehat{\theta}_n) \). That still leaves getting an estimate of \( \mathbf{k}_n \). If \( \Lambda_n \) is an average over the \( x_i \), as in my initial examples, then we can often use the variance of the gradients at each data point as an estimate of \( \mathbf{k}_n \). In other circumstances, we might actually have to think.
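
To make the sandwich concrete, here is how it plays out in the Pareto example, where the per-observation gradients and the Hessian have closed forms; this reuses the sample x simulated in the earlier sketch, and is an illustration rather than a general recipe:

# Sandwich standard error for the Pareto MLE, continuing the code above.
theta.hat <- 1 + 1 / mean(log(x))
n <- length(x)

# Per-observation gradient of the loss at theta.hat: log(x_i) - 1/(theta - 1)
grads <- log(x) - 1 / (theta.hat - 1)
k.n <- var(grads) / n                  # estimate of Var[ grad Lambda_n(theta*) ]

# Hessian of Lambda_n is 1/(theta - 1)^2, free of the data in this model
H.n <- 1 / (theta.hat - 1)^2

sandwich.var <- k.n / H.n^2            # j^{-1} k_n j^{-1}, all scalars here
c(theta.hat = theta.hat, se.sandwich = sqrt(sandwich.var),
  se.fisher = sqrt(1 / (n * H.n)))     # for comparison: plain inverse-information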

Finally, if \( \nabla \Lambda_n(\theta^*) \) is asymptotically Gaussian, because it's governed by a central limit theorem, then so is \( \widehat{\theta}_n \), and we can use Gaussians for hypothesis testing, confidence regions, etc.

A case where we can short-circuit a lot of thinking is when the model is correctly specified, so that the data-generating distribution is \( m(\cdot;\theta^*) \), and the loss function is the negative mean log-likelihood. (That is, we are maximizing the likelihood.) Then the limiting Hessian \( \mathbf{j} \) is the Fisher information. More importantly, under reasonable conditions, one can show that \( \mathbf{j} = \mathbf{k} \), that the variance of the gradient is exactly the limiting Hessian. Then the variance of the estimate simplifies to just \( \mathbf{j}^{-1} \). This turns out to actually be the best variance you can hope for, at least with unbiased estimators (the Cramér-Rao inequality). But the bulk of the analysis doesn't depend on the fact that we're estimating by maximum likelihood; it goes the same way for minimizing any well-behaved objective function.

Now, there are a lot of steps here where I am being very loose. (To begin with: In what sense is the random function \( \Lambda_n \) converging on \( \ell \), and does it have to converge everywhere, or just in a neighborhood of \( \theta^* \)?) That is, I am arguing like a physicist. Turning this sketch into a rigorous argument is the burden of good books on asymptotic statistics. But this is the core. It does not require the use of likelihood, or correctly specified models, or independent data, just that the loss function we minimize be converging, in a well-behaved way, to a function which is nicely behaved around its minimum.

Further reading:

[1]: "In fact, all epistemologic value of the theory of the probability is based on this: that large-scale random phenomena in their collective action create strict, nonrandom regularity." — B. V. Gnedenko and A. N. Kolmogorov, Limit Distributions for Sums of Independent Random Variables, p. 1. On the other hand, sometimes we can get limits like this as our sensors get better tuned to the signal. ^

Manual trackback: Homoclinic Orbit

Enigmas of Chance

Posted by crshalizi at March 18, 2013 00:20 | permanent link

March 16, 2013

"Frequentist Accuracy of Bayesian Estimates" (This Week at the Statistics Seminar)

A speaker who needs no introduction (but will get one), on a topic whose closeness to my heart needs no elaboration (but will get it):

Bradley Efron, "Frequentist Accuracy of Bayesian Estimates"
Abstract: In the absence of prior information, popular Bayesian estimation techniques usually begin with some form of "uninformative" prior, intended to have minimal inferential influence. Bayes's rule will still produce nice-looking estimates and credible intervals, but these lack the logical force attached to genuine priors, and require further justification. This talk concerns computational formulas that produce frequentist accuracy assessments for Bayesian estimates. Both encouraging and cautionary examples will be presented.
Time and place: 4:30--5:30 pm on Monday, 18 March 2013 in the McConomy Auditorium, University Center

Memo to self: ask Efron what he thinks of Fraser's "Is Bayes Posterior Just Quick and Dirty Confidence?" (arxiv:1112.5582).

Enigmas of Chance; Bayes, anti-Bayes

Posted by crshalizi at March 16, 2013 21:24 | permanent link

Ten Years of Sloth

Attention conservation notice: I look back on my works with smug complacency.

I started this weblog in January 2003; I don't remember exactly when, and the dates on files got messed up by various changes from Blogger to Movable Type to Blosxom, where it has remained ever since. So let's say, ten years and a bit. I have also just turned 39, so I will indulge myself by looking back at this companion of my thirties.

It's been a big undertaking — 600,000 words in 1071 posts before this — but it's also been good to me. It's gotten me recognition and prizes, and friends, and even helped me get my job, which I love. There have been downsides, over and above the time and effort, but they're mostly personal, and I've learned not to broadcast those. If I had realized how much of my public persona this was going to be, I'd have thought more about giving it such an obscure name (it comes from an old family joke, about how I did everything very slowly as a boy), but I like to think I'd have done it in the end.

As far as I can judge, the blog's best days were 2005--2010, after I'd learned how to use the medium but before work and (what else?) sloth reduced me to merely posting notices about lecture notes. I imagine I will keep it up for the next ten years, in some mode or other, if only because I am a creature of habit. Whether anyone will still bother reading it then, who knows?

Have I, as the poet asks, ever really helped anybody but myself with this? If not, it wouldn't be worth much. I feel that I've never really gotten across why I do this, and I'm not very happy with what follows, but it's better than the discarded drafts at least: Through no merit of my own (at best, persistence in exploiting luck), I have a position and skills which mean a few people will pay some attention to me about a handful of subjects, and this obliges me to try to do some good with this sliver of influence. (I do less good in the world than those following any of a hundred thankless and anonymous callings — which I would hate and be bad at.) The means I am most comfortable with are negative — critique, debunking, sarcasm; only rarely do I praise or build up, and my efforts in those directions are unconvincing even to me. My negative posts have helped give me a reputation for erudition and for venom, but their value, if they have any, is helping readers see how they could do better themselves. We have it in ourselves, together, to discover wonders and create marvels, and yet our world buries us in nonsense and inflicts pointless cruelty. We can do better, and I hope what I write contributes, in the tiniest of ways, to helping my readers find their way there.

I'll finish by being really self-indulgent, and pointing out what I think of as twenty of my best pieces. (I'd pick a different twenty another day.)

  1. Persecution and the Art of Neoconservative Writing
  2. On the History of Inner Asia, as Reflected in Its Intestinal Flora
  3. Liberty! What Fallacies Are Committed in Thy Name!
  4. Networks and Netwars
  5. George W. Bush Darkens Counsel by Words without Knowledge
  6. Our New Filtering Techniques Are Unstoppable!
  7. Cronyism, Corruption and Incompetence: A Network Analysis
  8. Graphs, Trees, Materialism, Fishing
  9. A. R. Luria: The Neuropsychology of Praxis
  10. Why Oh Why Can't We Have Better Econophysics?
  11. The Khaldun Option
  12. Yet More on the Heritability and Malleability of IQ
  13. g, a Statistical Myth
  14. Ghost Peaks, Buried in Ice
  15. Bayes < Darwin-Wallace
  16. In which Dunning-Krueger meets Slutsky-Yule, and they make music together
  17. The Neutral Model of Inquiry (or, What Is the Scientific Literature, Chopped Liver?)
  18. The Singularity in Our Past Light-Cone
  19. "They've traded more for cigarettes / than I've managed to express"; or, Dives, Lazarus, and Alice
  20. In Soviet Union, Optimization Problem Solves You

Manual trackback: the blog formerly known as The Statistical Mechanic; Brad DeLong

Self-Centered

Posted by crshalizi at March 16, 2013 21:11 | permanent link

March 05, 2013

Multivariate Distributions (Advanced Data Analysis from an Elementary Point of View)

Reminders about multivariate distributions. The multivariate Gaussian distribution: definition, relation to the univariate or scalar Gaussian distribution; effect of linear transformations on the parameters; plotting probability density contours in two dimensions; using eigenvalues and eigenvectors to understand the geometry of multivariate Gaussians; conditional distributions in multivariate Gaussians and linear regression; computational aspects, specifically in R. General methods for estimating parametric distributional models in arbitrary dimensions: moment-matching and maximum likelihood; asymptotics of maximum likelihood; bootstrapping; model comparison by cross-validation and by likelihood ratio tests; goodness of fit by the random projection trick.
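
A small sketch of the eigendecomposition point, with a made-up covariance matrix:

# Use eigenvalues/eigenvectors of a covariance matrix to simulate from,
# and understand the geometry of, a bivariate Gaussian.
Sigma <- matrix(c(2, 1.5, 1.5, 2), 2, 2)
mu <- c(1, -1)
eig <- eigen(Sigma)

# A symmetric square root of Sigma from its eigendecomposition
Sigma.sqrt <- eig$vectors %*% diag(sqrt(eig$values)) %*% t(eig$vectors)

set.seed(5)
z <- matrix(rnorm(2 * 1000), ncol = 2)       # rows are standard bivariate normals
x <- sweep(z %*% Sigma.sqrt, 2, mu, "+")     # rows are now N(mu, Sigma)

colMeans(x)     # moment / maximum-likelihood estimate of mu
cov(x)          # nearly the MLE of Sigma (cov() divides by n-1 rather than n)

# The principal axes of the density contours point along the eigenvectors,
# with half-lengths proportional to the square roots of the eigenvalues:
eig$vectors
sqrt(eig$values)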

Reading: Notes, chapter 14

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at March 05, 2013 10:30 | permanent link

February 28, 2013

Books to Read While the Algae Grow in Your Fur, February 2013

My reading this month was either student papers, university busy-work, or else unsatisfying, with two exceptions.

Catherine Jinks, The Reformed Vampire Support Group
Mind candy. A lot has been written about vampires as somehow-metaphorical expressions of fears about AIDS (and so a link between sex, blood, and death), but I don't think I've ever seen them done, consistently, as hapless sufferers from a debilitating chronic illness before. Funny, though it could probably have been compressed a bit without loss.
Alastair Reynolds, The Prefect
Mind candy: a hybrid of space opera and police procedural, where the plot turns on the not-really-parliamentary procedures of a solar-system-spanning electronic participatory democracy, with each little worldlet free to follow its own bizarre social path, so long as universal suffrage for the polls is respected. (I can't tell if there's a direct debt to Nozick, or only an indirect one through MacLeod.) In the same universe as Reynolds's Revelation Space and sequels (more specifically Chasm City), but stand-alone.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica

Posted by crshalizi at February 28, 2013 23:59 | permanent link

Generalized Linear Models and Generalized Additive Models (Advanced Data Analysis from an Elementary Point of View)

Iteratively re-weighted least squares for logistic regression re-examined: coping with nonlinear transformations and model-dependent heteroskedasticity. The common pattern of generalized linear models and IRWLS. Binomial and Poisson regression. The extension to generalized additive models.

Extended example: building a weather forecaster for Snoqualmie Falls, Wash., with logistic regression. Exploratory examination of the data. Predicting wet or dry days from the amount of precipitation the previous day. First logistic regression model. Finding predicted probabilities and confidence intervals for them. Comparison to spline smoothing and a generalized additive model. Model comparison test detects significant mis-specification. Re-specifying the model: dry days are special. The second logistic regression model and its comparison to the data. Checking the calibration of the second model.
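
Schematically, and with a hypothetical data frame snoq standing in for the real records, the model-building looks something like this (the notes' code is more careful):

# Logistic regression vs. a logistic-additive model for wet days.
# snoq is a hypothetical data frame with a binary column wet and a
# numeric column precip.yesterday.
glm.fit <- glm(wet ~ precip.yesterday, data = snoq, family = binomial)

library(mgcv)
gam.fit <- gam(wet ~ s(precip.yesterday), data = snoq, family = binomial)

# Predicted probabilities on the response (probability) scale
new <- data.frame(precip.yesterday = seq(0, 300, by = 5))
p.glm <- predict(glm.fit, newdata = new, type = "response")
p.gam <- predict(gam.fit, newdata = new, type = "response")

# Crude comparison of fit; the class uses a proper specification test
c(glm = deviance(glm.fit), gam = deviance(gam.fit))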

Reading: Notes, chapter 13
Faraway, section 3.1, chapter 6, chapter 7

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 28, 2013 10:30 | permanent link

February 26, 2013

Exam: Nice Demo City, But Will It Scale? (Advanced Data Analysis from an Elementary Point of View)

In which we compare the power-law scaling model of urban economies due to Bettencourt, West, et al. to an alternative in which city size is actually irrelevant.

This was a one-week take-home exam, intended to use more or less everything taught so far.

Assignment, master data set (each student got a random subset of this)

Not exactly a model exam paper.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 26, 2013 11:50 | permanent link

Logistic Regression (Advanced Data Analysis from an Elementary Point of View)

Modeling conditional probabilities; using regression to model probabilities; transforming probabilities to work better with regression; the logistic regression model; maximum likelihood; numerical maximum likelihood by Newton's method and by iteratively re-weighted least squares; comparing logistic regression to logistic-additive models.
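
For concreteness, Newton's method / IRWLS for logistic regression fits in a dozen lines; this is a bare-bones sketch, not the class code, and glm() does all of it better:

# Iteratively re-weighted least squares for logistic regression.
# X: model matrix (including an intercept column); y: 0/1 responses.
logistic.irwls <- function(X, y, max.iter = 100, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (i in 1:max.iter) {
    eta <- as.vector(X %*% beta)        # linear predictor
    p <- 1 / (1 + exp(-eta))            # fitted probabilities
    w <- p * (1 - p)                    # IRWLS weights
    z <- eta + (y - p) / w              # working response
    beta.new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
    if (max(abs(beta.new - beta)) < tol) { beta <- as.vector(beta.new); break }
    beta <- as.vector(beta.new)
  }
  beta
}

# Check against glm() on simulated data:
set.seed(2)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 2 * x))
X <- cbind(1, x)
logistic.irwls(X, y)
coef(glm(y ~ x, family = binomial))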

Reading: Notes, chapter 12
Faraway, chapter 2 (skipping sections 2.11 and 2.12)

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 26, 2013 10:30 | permanent link

February 19, 2013

Homework: How the Hyracotherium Got Its Mass (Advanced Data Analysis from an Elementary Point of View)

In which extinct charismatic megafauna give us an excuse to specification test.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 19, 2013 11:50 | permanent link

Testing Regression Specifications (Advanced Data Analysis from an Elementary Point of View)

Non-parametric smoothers can be used to test parametric models. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using simulation from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.
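
The basic simulation-based test can be sketched quickly; here smooth.spline stands in for the notes' kernel smoother, and the linear model is deliberately misspecified:

# Specification test: is a linear model adequate, compared to a non-parametric
# smoother? Test statistic: difference of in-sample MSEs; null distribution by
# simulating from the fitted parametric model.
set.seed(11)
x <- runif(200, 0, 3)
y <- sin(2 * x) + rnorm(200, sd = 0.3)     # the truth is not linear

t.stat <- function(x, y) {
  lin <- lm(y ~ x)
  np <- smooth.spline(x, y)
  mean(residuals(lin)^2) - mean((y - predict(np, x)$y)^2)
}

observed <- t.stat(x, y)

# Simulate from the *fitted linear model* to get the null distribution
lin <- lm(y ~ x)
sigma.hat <- summary(lin)$sigma
null.stats <- replicate(200, {
  y.sim <- predict(lin) + rnorm(length(y), sd = sigma.hat)
  t.stat(x, y.sim)
})

mean(null.stats >= observed)   # approximate p-value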

Reading: Notes, chapter 10; R for in-class demos
Cox and Donnelly, chapter 7
Optional reading: Spain et al., "Testing the Form of Theoretical Models by Relaxing Assumptions: Comparing Parametric and Nonparametric Models", ssrn/2164297

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 19, 2013 10:30 | permanent link

February 18, 2013

Elsewhere

Blogging will stay non-existent (well, sporadic) while I struggle to get far enough ahead of the 96 students in ADA that I can do some research, that is, devote myself to contemplating the mysteries of the universe and helping young minds develop their own powers. In the meanwhile, if you want more:

Manual trackbacks (of sorts): The Browser; Brad DeLong; Radio Free Europe/Radio Liberty (!); Marginal Revolution

Self-centered; Bayes, anti-Bayes; The Collective Use and Evolution of Concepts; The Progressive Forces

Posted by crshalizi at February 18, 2013 15:30 | permanent link

February 14, 2013

Lecture: Additive Models (Advanced Data Analysis from an Elementary Point of View)

The "curse of dimensionality" limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimension. Parametric models do not have this problem, but have bias and do not let us discover anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example, where each input variable has a "partial response function", which add together to get the total regression function; the partial response functions are unconstrained. This goes beyond linear models but still evades the curse of dimensionality. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. (This is like automatically finding an optimal transformation of each predictor before putting it into a linear model.) Examples in R using the California house-price data. Conclusion: there are no statistical reasons to prefer linear models to additive models, hardly any scientific reasons, and increasingly few computational ones; the continued thoughtless use of linear regression is a scandal.

Reading: Notes, chapter 9

Optional reading: Faraway, chapter 12

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 14, 2013 10:30 | permanent link

February 12, 2013

Homework: It's Not the Heat that Gets to You, It's the Sustained Heat with Pollution (Advanced Data Analysis from an Elementary Point of View)

In which spline regression becomes a matter of life and death in Chicago.

Assignment

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 12, 2013 11:30 | permanent link

Lecture: Splines (Advanced Data Analysis from an Elementary Point of View)

Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, plus a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression.
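
In base R the relevant function is smooth.spline; a minimal sketch of the bias-variance knob, with made-up data:

# Smoothing splines: the penalty strength (spar) moves us between
# near-interpolation and a straight line; cross-validation picks it automatically.
set.seed(6)
x <- sort(runif(150, 0, 10))
y <- sin(x) + rnorm(150, sd = 0.4)

plot(x, y, col = "grey")
lines(smooth.spline(x, y, spar = 0.1), lwd = 2, lty = 3)  # barely penalized: wiggly
lines(smooth.spline(x, y, spar = 1.2), lwd = 2, lty = 2)  # heavily penalized: nearly linear
lines(smooth.spline(x, y, cv = TRUE), lwd = 2)            # leave-one-out CV choice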

Reading: Notes, chapter 8

Optional reading: Faraway, section 11.2.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 12, 2013 10:30 | permanent link

February 07, 2013

Lecture: Heteroskedasticity, Weighted Least Squares, and Variance Estimation (Advanced Data Analysis from an Elementary Point of View)

Weighted least squares estimates, to give more emphasis to particular data points. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.
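
One pass of the iterative scheme can be sketched as follows (made-up heteroskedastic data; the notes iterate to convergence and use a kernel smoother rather than smooth.spline):

# Estimate the regression and the variance function together:
# 1. fit by OLS; 2. smooth the log squared residuals to get a variance
# function; 3. re-fit by weighted least squares with weights 1/variance.
set.seed(9)
x <- runif(300, 0, 5)
y <- 2 + 3 * x + rnorm(300, sd = 0.5 + x)   # noise grows with x

ols <- lm(y ~ x)
log.sq.res <- log(residuals(ols)^2)
var.fit <- smooth.spline(x, log.sq.res)
v.hat <- exp(predict(var.fit, x)$y)          # estimated conditional variance

wls <- lm(y ~ x, weights = 1 / v.hat)
rbind(ols = coef(ols), wls = coef(wls))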

Reading: Notes, chapter 7
Optional reading: Faraway, section 11.3.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 07, 2013 10:30 | permanent link

February 05, 2013

Homework: How the North American Mammalian Paleofauna Got a Crook in Its Regression Line (Advanced Data Analysis from an Elementary Point of View)

In which extinct charismatic megafauna give us an excuse to practice basic programming, fitting nonlinear models, and bootstrapping.

Assignment, nampd.csv data set, R

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 05, 2013 11:30 | permanent link

Lecture: Writing R Code (Advanced Data Analysis from an Elementary Point of View)

R programs are built around functions: pieces of code that take inputs or arguments, do calculations on them, and give back outputs or return values. The most basic use of a function is to encapsulate something we've done in the terminal, so we can repeat it, or make it more flexible. To assure ourselves that the function does what we want it to do, we subject it to sanity-checks, or "write tests". To make functions more flexible, we use control structures, so that the calculation done, and not just the result, depends on the argument. R functions can call other functions; this lets us break complex problems into simpler steps, passing partial results between functions. Programs inevitably have bugs: debugging is the cycle of figuring out what the bug is, finding where it is in your code, and fixing it. Good programming habits make debugging easier, as do some tricks. Avoiding iteration. Re-writing code to avoid mistakes and confusion, to be clearer, and to be more flexible.
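
In that spirit, a tiny example of encapsulating a calculation in a function and then writing sanity checks for it (not from the notes):

# Encapsulate a calculation in a function, then write tests for it.
# Here: the (biased, divide-by-n) sample variance.
sample.var <- function(x) {
  stopifnot(is.numeric(x), length(x) >= 1)   # check the arguments make sense
  mean((x - mean(x))^2)
}

# Sanity checks: cases where we know the right answer
stopifnot(sample.var(c(1, 1, 1)) == 0)
stopifnot(all.equal(sample.var(c(0, 2)), 1))
stopifnot(all.equal(sample.var(1:5), var(1:5) * 4 / 5))  # relation to var()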

Reading: Notes, Appendix A

Optional reading: Slides from 36-350, introduction to statistical computing, especially through lecture 18.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at February 05, 2013 10:30 | permanent link

January 31, 2013

Books to Read While the Algae Grow in Your Fur, January 2013

Attention conservation notice: I have no taste.

Kage Baker, The Hotel Under the Sand
Mind candy, of the "utterly charming fantasy for kids" variety. (With thanks to Rosemary M. for sharing her copy.)
Bruce Sterling, Love Is Strange: A Paranormal Romance
In which Bruce Sterling tries his hand at a paranormal romance novel, and the result is, well, about what you'd expect if the contents of Beyond the Beyond were blended with the love-story of a futurist accountant and a witch, who meet at a futurological congress, and most (all?) of the formal conventions of the romance novel are respected — I won't reveal whether there is a happily-ever-after.
ObChairmanBruce: "Paranormal Romance is a tremendous, bosom-heaving, Harry-Potter-sized, Twilight-shaped commercial success. It sorta says everything about modern gender relations that the men have to be supernatural. It also says everything about humanity that we're so methodically training ourselves to be intimate partners of entities that aren't human."
Update: Remarks by Warren Ellis. (I liked the protagonists more than he did.)
Felix Gilman, The Rise of Ransom City
A sequel to The Half-Made World, though it can be read independently. It takes up further strands of the Matter of America: the myths of the self-made man and of the inventor. It's about traveling wonder-shows; celebrity; bad debts and borrowings that cannot be acknowledged or repaid; corporate take-overs; energy too cheap to meter; the worship of the bitch-goddess Success; the collapse of Powers sustained by faith; beginning the world over again in the blank spaces of the map; the Bomb. Above all, it's a fantastical portrait of the hustling imposter in the same skin as the self-taught genius.
Like its predecessor, it's vastly entertaining, but it goes beyond mere mind-candy. I have a hard time imagining finishing a better novel this year.
Colin Renfrew, Prehistory: The Making of the Human Mind
Decent over-view of the history of the study of pre-history, including summaries of findings about, e.g., human evolution, the dispersal out of Africa, the development of early states in various parts of the world, etc., without attempting anything like a comprehensive account of human pre-history. The second half is an extended plea for "cognitive archaeology", and for studying how our ancestors' "engagement" with the material world led to the development of new concepts*, and to the links between concepts and institutions. This seems fine as far as it goes, but fairly obvious and not specific enough to be really helpful. I'd even say it's full of loose ends; he introduces it by talking about the "sapient paradox", the gap between when anatomically modern humans appear and when we find archaeological evidence of (comparatively) rapid cumulative cultural development, but never really explains it, so far as I can see**.
*: That this sounds rather like Marx and Engels in The German Ideology is no coincidence, comrades. The book however leaves me unclear whether this is through direct exposure, or second hand via modern epigones.
**: The same chapters come down hard in favor of the mind being embodied and social. I am sympathetic to both positions, but his arguments are weak. E.g., he repeatedly conflates whether some skill or other is "in the brain" with whether it could be learned, or exercised, by a brain which wasn't attached to a body, which is very bizarre. (The brain always changes when acquiring a new manual skill; the muscles of the hands may do so.)
Warren Ellis, Gun Machine
A distinguished English author records his impressions of the domestic manners of the Americans. That is to say, mind candy about assassination, ancestor-grabbing, privatized policing and surveillance, and high-frequency trading in Manhattan.

Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Writing for Antiquity; The Collective Use and Evolution of Concepts; The Beloved Republic; Pleasures of Detection, Portraits of Crime; Natural Science of the Human Species

Posted by crshalizi at January 31, 2013 23:59 | permanent link

Lecture: The Bootstrap (Advanced Data Analysis from an Elementary Point of View)

The sampling distribution is the source of all knowledge regarding statistical uncertainty. Unfortunately, the true sampling distribution is inaccessible, since it is a function of exactly the quantities we are trying to infer. One exit from this vicious circle is the bootstrap principle: approximate the true sampling distribution by simulating from a good model of the process, and treating the simulated data just like the real data. The simplest form of this is parametric bootstrapping, i.e., simulating from the fitted model. Nonparametric bootstrapping means simulating by re-sampling, i.e., by treating the observed sample as a complete population and drawing new samples from it. Bootstrapped standard errors, biases, confidence intervals, p-values. Tricks for making the simulated distribution closer to the true sampling distribution (pivotal intervals, studentized intervals, the double bootstrap). Bootstrapping regression models: by parametric bootstrapping; by resampling residuals; by resampling cases. Many, many examples. When does the bootstrap fail?
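
A minimal sketch of the nonparametric bootstrap in R (mine, not chapter 6's code), here for the uncertainty in a sample median:

    x <- rexp(100)                                # stand-in for the observed sample
    theta.hat <- median(x)
    boot.medians <- replicate(1000, median(sample(x, replace = TRUE)))
    sd(boot.medians)                              # bootstrap standard error
    quantile(boot.medians, c(0.025, 0.975))       # a crude percentile confidence interval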

Note: Thanks to Prof. Christopher Genovese for delivering this lecture while I was enjoying the hospitality of the fen-folk.

Reading: Notes, chapter 6 (R for figures and examples; pareto.R; wealth.dat);
Lecture slides; R for in-class examples
Cox and Donnelly, chapter 8

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 31, 2013 10:30 | permanent link

January 29, 2013

Homework: An Insufficiently Random Walk Down Wall Street (Advanced Data Analysis from an Elementary Point of View)

In which, by way of getting comfortable with simulations and the bootstrap, we look at just how detached from reality standard financial models are. (And, in the hidden curriculum, get comfortable writing R functions.)
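
Not the assignment itself, just its flavor (the drift and volatility numbers below are made up): the standard model says log returns are IID Gaussian, so simulating a price path takes only a couple of lines, which can then be compared with the real index.

    log.returns <- rnorm(2500, mean = 5e-4, sd = 0.01)   # made-up drift and volatility
    prices <- 100 * exp(cumsum(log.returns))             # a geometric random walk
    plot(prices, type = "l", xlab = "trading day", ylab = "simulated index level")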

Assignment, SPhistory.short.csv

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 29, 2013 11:30 | permanent link

Lecture: Simulation (Advanced Data Analysis from an Elementary Point of View)

Simulation is implementing the story encoded in the model, step by step, to produce something which should look like the data. Stochastic models have random components and so require some random steps. Stochastic models specified through conditional distributions are simulated by chaining together random variables. Methods for generating random variables with specified distributions: the transformation or inverse-quantile method; the rejection method; Markov chain Monte Carlo (Metropolis or Metropolis-Hastings method). Simulation shows us what a model predicts (expectations, higher moments, correlations, regression functions, sampling distributions); analytical probability calculations are short-cuts for exhaustive simulation. Simulation lets us check aspects of the model: does the data look like typical simulation output? if we repeat our exploratory analysis on the simulation output, do we get the same results? Simulation-based estimation: the method of simulated moments.
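
Two of the generation methods in miniature (my sketch, not the chapter's code): the inverse-quantile method pushes uniforms through a quantile function; the rejection method draws from a proposal distribution and keeps each draw with probability proportional to the ratio of target to proposal densities.

    # inverse-quantile method: Exponential(rate = 2) from Uniform(0, 1)
    u <- runif(1e4)
    x.exp <- -log(1 - u) / 2
    # rejection method: half-normal draws from an Exponential(1) proposal
    proposal <- rexp(1e4)
    M <- sqrt(2 * exp(1) / pi)                    # bound on the density ratio
    accept <- runif(1e4) < 2 * dnorm(proposal) / (M * dexp(proposal))
    x.halfnorm <- proposal[accept]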

Reading: Notes, chapter 5 (but sections 5.4--5.6 are optional this year); R

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 29, 2013 10:30 | permanent link

January 26, 2013

When Fen-Dwelling Bayesians Can't Handle the Truth

Attention conservation notice: Self-promotion of an academic talk, based on a three-year-old paper, on the arcana of how Bayesian methods work from a frequentist perspective.

Because it is snowing relentlessly and the occasional bout of merely freezing air is a blessed relief, I will be escaping to a balmier clime next week: Cambridgeshire.

"When Bayesians Can't Handle the Truth", statistics seminar, Cambridge University
Abstract: There are elegant results on the consistency of Bayesian updating for well-specified models facing IID or Markovian data, but both completely correct models and fully observed states are vanishingly rare. In this talk, I give conditions for posterior convergence that hold when the prior excludes the truth, which may have complex dependencies. The key dynamical assumption is the convergence of time-averaged log likelihoods (Shannon-McMillan-Breiman property). The main statistical assumption is that the prior builds in a form of capacity control related to the method of sieves. With these, I derive posterior and predictive convergence, and a large deviations principle for the posterior, even in infinite-dimensional hypothesis spaces; and clarify the role of the prior and of model averaging as regularization devices. (Paper)
Place and time: 1 February 2013, 4--5 pm in MR 12, CMS

Manual trackback: Brad DeLong

Self-Centered; Bayes, anti-Bayes

Posted by crshalizi at January 26, 2013 22:24 | permanent link

Homework: The Advantages of Backwardness (Advanced Data Analysis from an Elementary Point of View)

In which we try to discern whether poor countries grow faster, by way of practicing with data-set-splitting, cross-validation, and nonparametric smoothing.

Assignment, R, penn-select.csv data set

Incidentally, I have no idea who is responsible for this — [image omitted] — and would prefer not to be told.

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2013 21:38 | permanent link

Lecture: Smoothing Methods in Regression (Advanced Data Analysis from an Elementary Point of View)

Lecture 4: The bias-variance trade-off tells us how much we should smooth. Some heuristic calculations with Taylor expansions for general linear smoothers. Adapting to unknown roughness with cross-validation; detailed examples. How quickly does kernel smoothing converge on the truth? Using kernel regression with multiple inputs. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Average predictive comparisons.
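
A small illustration of adapting to unknown roughness (mine, using base R's ksmooth and a single train/test split rather than the np package or the full cross-validation of the lecture):

    x <- runif(500); y <- sin(4 * x) + rnorm(500, sd = 0.3)
    train <- sample(500, 250); test <- setdiff(1:500, train)
    bandwidths <- seq(0.05, 0.5, by = 0.05)
    mse <- sapply(bandwidths, function(h) {
      fit <- ksmooth(x[train], y[train], kernel = "normal", bandwidth = h,
                     x.points = x[test])
      mean((y[test][order(x[test])] - fit$y)^2, na.rm = TRUE)
    })
    bandwidths[which.min(mse)]                    # bandwidth with the smallest held-out error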

Reading: Notes, chapter 4 (R)
Optional readings: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2013 21:37 | permanent link

Lecture: Model Evaluation, Error and Inference (Advanced Data Analysis from an Elementary Point of View)

Lecture 3, Model evaluation: error and inference. Statistical models have three main uses: as ways of summarizing (reducing, compressing) the data; as scientific models, facilitating actual scientific inference; and as predictors. Both summarizing and scientific inference are linked to prediction (though in different ways), so we'll focus on prediction. In particular, for now we focus on the expected error of prediction, under some particular measure of error. The distinction between in-sample error and generalization error, and why the former is almost invariably optimistic about the latter. Over-fitting. Examples of just how spectacularly one can over-fit really very harmless data. A brief sketch of the ideas of learning theory and capacity control. Data-set-splitting as a first attempt at practically controlling over-fitting. Cross-validation for estimating generalization error and for model selection. Justifying model-based inferences.
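
Five-fold cross-validation in a few lines of R (a generic sketch, not chapter 3's code; the data frame and model are toy stand-ins):

    df <- data.frame(x = runif(200))
    df$y <- df$x^2 + rnorm(200, sd = 0.1)         # toy data
    fold <- sample(rep(1:5, length.out = nrow(df)))
    cv.mse <- sapply(1:5, function(k) {
      fit <- lm(y ~ x, data = df[fold != k, ])    # swap in whichever model is being compared
      mean((df$y[fold == k] - predict(fit, newdata = df[fold == k, ]))^2)
    })
    mean(cv.mse)                                  # estimated generalization error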

Reading: Notes, chapter 3 (R)
Cox and Donnelly, ch. 6

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2013 21:36 | permanent link

Lecture: The Truth About Linear Regression (Advanced Data Analysis from an Elementary Point of View)

Lecture 2, The truth about linear regression: Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions (non-Gaussian noise, non-independent noise, non-constant variance, dependent predictors). Linear regression coefficients will change with the distribution of the input variables: examples. Why \( R^2 \) is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable effects). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What it really means when coefficients are significantly non-zero. What "controlled for in a linear regression" really means.
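
One of the claims above in a few lines of simulation (my example, not the notes'): when the truth is nonlinear, the best linear coefficient depends on where the inputs happen to fall, not just on the underlying relationship.

    x1 <- runif(1e4, 0, 1); y1 <- x1^2 + rnorm(1e4, sd = 0.1)
    x2 <- runif(1e4, 1, 2); y2 <- x2^2 + rnorm(1e4, sd = 0.1)
    coef(lm(y1 ~ x1))                             # slope close to 1
    coef(lm(y2 ~ x2))                             # slope close to 3, same mechanism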

Reading: Notes, chapter 2 (R); Notes, appendix B
Optional reading: Faraway, rest of chapter 1

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 26, 2013 21:35 | permanent link

January 16, 2013

Aaron Swartz, requiescat in pace

I can't claim he was a close friend. We knew each other through a correspondence which began in 2006. We only met face to face once, when I was able to break away from a conference in the city he was living in for an afternoon of conversation. He was the same modest, charming, morally serious, funny and multi-facetedly brilliant young man in person that he was on paper.

For whatever unfathomable reason, he liked my posts. For my part, when I write I try to imagine the reaction of an ideal audience of a few people, and Aaron was one of them. More: I've often asked myself why I don't do as much to act on my convictions as he did; he pricked my conscience. If, on Friday morning, you'd asked me which, if any, of my acquaintances would ever be remembered as a great American, I'd have named him without even thinking about it. Who else was so ridiculously accomplished at such an age, so devoted to good principles without being a prig, so focused on removing the real stumbling blocks to progress?

I have been angry with Aaron, off and on but always stupidly, ever since I heard the news. But I am furious at the idiots who hounded him to death for the sake of the profits of parasites, and the restriction of knowledge.

Linkage

Posted by crshalizi at January 16, 2013 23:35 | permanent link

Homework: What's That Got to do with the Price of Condos in California? (Advanced Data Analysis from an Elementary Point of View)

In which we explore some of the limitations of linear models (and torment students who have spent half a year learning linear regression), by way of warming up for more flexible ones.

assignment, data

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 16, 2013 23:18 | permanent link

Lecture: Regression: Predicting and Relating Quantitative Features (Advanced Data Analysis from an Elementary Point of View)

Statistics is the branch of mathematical engineering which designs and analyzes methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.
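
The two "other linear smoothers" above, each in a couple of lines of R (an illustrative sketch, not chapter 1's code), predicting at a single point x0 from simulated data:

    x <- runif(200); y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
    x0 <- 0.5
    k <- 20
    knn.pred <- mean(y[order(abs(x - x0))[1:k]])  # k-nearest-neighbor average
    h <- 0.1
    w <- dnorm((x - x0) / h)                      # Gaussian kernel weights
    kernel.pred <- sum(w * y) / sum(w)            # kernel-weighted average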

Reading: Notes, chapter 1 (examples.dat for running example; ckm.csv data set for optional exercises)

Advanced Data Analysis from an Elementary Point of View

Posted by crshalizi at January 16, 2013 23:15 | permanent link

January 11, 2013

End of Year Inventory, 2012

Attention conservation notice: Navel-gazing.

Paper manuscripts completed: 5
Papers accepted: 2
Papers rejected: 2 (fools! we'll show you all!)
Papers in refereeing limbo: 5
Papers with co-authors waiting for me to revise: 3
Invited papers finished: 2
Invited papers in progress: 2
Other papers in progress: I won't look in that directory and you can't make me

Grant proposals submitted: 3
Grant proposals funded: 1 (from last year)
Grant proposals rejected: 1 (fools! we'll show you all!)
Grant proposals in refereeing limbo: 2
Grant proposals in progress for next year: 1
Grant proposals refereed: 16

Talks given and conferences attended: 23, in 13 cities

Classes taught: 2 [i, ii]
New classes taught: 0
Summer school classes taught: 1
New summer school classes taught: 1
Pages of new course material written: about 400

Manuscripts refereed: 47
Number of times I was asked to referee my own manuscript: 2
Number of times I did so: 0
Manuscripts waiting for me to referee: 6
Manuscripts for which I was the responsible associate editor at Annals of Applied Statistics: 4
Senior program committees joined: 1
Book proposals reviewed: 3
Book proposals submitted: 1 (see "New course material", above)
Book proposals accepted: 1
Book manuscripts completed: 0

Students who completed their dissertations: 2 (i, ii)
Students preparing to propose in the spring: 2
Letters of recommendation sent: 40+

Promotion received: 1 (to associate professor without tenure)
Months until tenure packet is due: 18

Book reviews published on dead trees: 1

Weblog posts: 133
Substantive posts: 40, counting algal growths
Incomplete posts in the drafts folder: 36
Incomplete posts transferred to the papers-in-progress folder: 2
Weblog posts that won prizes: 1
Thoughtful, detailed guest posts by eminent statisticians which provoked responses from Nobel-Prize-winning economists at Princeton: 1
Splenetic comments written elsewhere, after too much to drink at a party, which provoked responses from Nobel-Prize-winning economists at Princeton: 1

Books acquired: 320
Books begun: 188
Books finished: 160
Books given up: 5
Books sold: 1040 (not a typo)
Books donated: 130

Legal recognitions of major life transitions: 1
Actual major life transitions: 0
Real estate transactions: 1

Self-Centered

Posted by crshalizi at January 11, 2013 22:25 | permanent link

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems