Lecture 16: The Monte Carlo principle for numerical integrals: write your integral as an expectation, take a sample. Examples. Importance sampling: draw from a distribution other than the one you really want, then weight the sample values. Markov chain Monte Carlo for sampling from a distribution we do not completely know: the Metropolis algorithm. Gibbs sampling. Bayesian inference via MCMC.
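For concreteness, a minimal sketch of a random-walk Metropolis sampler in R (my illustration, not code from the handouts; the target density dtarget need only be known up to a constant, and the Gaussian proposal width is an arbitrary tuning choice):

rw.metropolis <- function(dtarget, x0, n, proposal.sd = 1) {
  x <- numeric(n)
  x[1] <- x0
  for (t in 2:n) {
    prop <- x[t - 1] + rnorm(1, sd = proposal.sd)  # symmetric random-walk proposal
    # accept with probability min(1, ratio of target densities);
    # normalizing constants cancel, so dtarget need only be proportional
    if (runif(1) < dtarget(prop) / dtarget(x[t - 1])) {
      x[t] <- prop
    } else {
      x[t] <- x[t - 1]
    }
  }
  x
}
# e.g., draws <- rw.metropolis(dnorm, x0 = 0, n = 1e4)
# mean(draws^2)  # Monte Carlo estimate of E[X^2] = 1 under the Gaussian target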
Readings: Handouts on Markov Chains and Monte Carlo, and on Markov Chain Monte Carlo
Optional readings: Charles Geyer, "Practical Markov Chain Monte Carlo", Statistical Science 7 (1992): 473--483; "One Long Run"; "Burn-In is Unnecessary"; "On the Bogosity of MCMC Diagnostics".
Update, 22 December: If you do read Geyer, it's also worth reading two posts by Andrew Gelman (A Centipede Many Times Over and A Tale of Two Discussion Papers), and Gelman and Rubin's "Inference from Iterative Simulation Using Multiple Sequences" (Statistical Science 7 (1992): 457--472). Thanks to Andy for reminding me (politely!) about these pieces.
Posted by crshalizi at October 21, 2013 13:58 | permanent link
Lecture 15: Combining multiple dependent random variables in a simulation; ordering the simulation to do the easy parts first. Markov chains as a particular example of doing the easy parts first. The Markov property. How to write a Markov chain simulator. Verifying that the simulator works by looking at conditional distributions. Variations on Markov models: hidden Markov models, interacting processes, continuous time, chains with complete connections. Asymptotics of Markov chains via linear algebra; the law of large numbers (ergodic theorem) for Markov chains: we can approximate expectations as soon as we can simulate.
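A minimal sketch of such a simulator, with a made-up transition matrix rather than one from the handouts:

r.markov <- function(n, trans, x0) {
  x <- integer(n)
  x[1] <- x0
  for (t in 2:n) {
    # the Markov property: the next state's distribution depends only on
    # the current state, through the corresponding row of the transition matrix
    x[t] <- sample(ncol(trans), size = 1, prob = trans[x[t - 1], ])
  }
  x
}
# trans <- matrix(c(0.9, 0.1,
#                   0.5, 0.5), nrow = 2, byrow = TRUE)
# chain <- r.markov(1e4, trans, x0 = 1)
# mean(chain == 1)  # ergodic theorem: close to the invariant probability, 5/6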
Readings: Handouts on Markov chains and Monte Carlo
Posted by crshalizi at October 21, 2013 13:57 | permanent link
Lecture 14: Why simulate? Generating random variables as first step. The built-in R commands: rnorm, runif, etc.; sample. Some uses of sampling: permutation tests; bootstrap standard errors and confidence intervals. Transforming uniformly-distributed random variables into other distributions: the quantile trick; the rejection method; illustration of the rejection method. Understanding pseudo-random number generators: irrational rotations; the Arnold cat map as a toy example of an unstable dynamical system; illustrations of the Arnold cat map. Controlling the random number seed.
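Minimal sketches of the two transformation methods, with illustrative target distributions rather than the lecture's examples:

# The quantile trick: the quantile function applied to Unif(0,1) draws
# has the corresponding distribution; here, exponentials.
rexp.by.quantile <- function(n, rate = 1) {
  -log(1 - runif(n)) / rate  # the exponential quantile function
}

# The rejection method: propose from Unif(0,1) and accept with probability
# f(x)/M, here for the target density f(x) = 2x on [0,1], with envelope M = 2.
rlinear.by.rejection <- function(n) {
  draws <- c()
  while (length(draws) < n) {
    x <- runif(n)                # proposals
    u <- runif(n)
    draws <- c(draws, x[u < x])  # accept iff u < f(x)/M = x
  }
  draws[1:n]
}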
Readings: Matloff, chapter 8; The R Cookbook, chapter 8
Posted by crshalizi at October 21, 2013 13:56 | permanent link
Attention conservation notice: Late notice of a very technical presentation about theoretical statistics in a city you don't live in.
Today's speaker needs no introduction for those interested in modern, high-dimensional statistics (but will get an introduction anyway):
As always, the talk is free and open to the public.
Posted by crshalizi at October 21, 2013 13:55 | permanent link
Midterm Exam: eight questions about thirteen lines of code.
Posted by crshalizi at October 11, 2013 17:48 | permanent link
Lecture 13, Split/apply/combine II: using plyr. Abstracting the split/apply/combine pattern: using a single function to appropriately split up the input, apply the function, and combine the results, depending on the type of input and output data. Syntax details. Examples: standardizing measurements for regularly-sampled spatial data; standardizing measurements for irregularly-sampled spatial data; more fun with strikes and left-wing politics. Limitations of the split/apply/combine pattern.
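For concreteness, a minimal sketch of the pattern with ddply; the data frame measurements and its columns are stand-ins, not the lecture's data:

library(plyr)
# split measurements up by station, standardize within each piece, recombine:
standardized <- ddply(measurements, .(station), transform,
                      value.std = (value - mean(value)) / sd(value))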
Shorter lecture 13: The lecturer is a gushing Hadley Wickham fanboy.
(This week's lectures are ripped off from slides by Vince Vu, with permission.)
Posted by crshalizi at October 11, 2013 17:47 | permanent link
Lecture 12: Design patterns and their benefits: clarity on what is to be done, flexibility about how to do it, ease of adapting others' solutions. The split/apply/combine pattern: divide big structured data sets up into smaller, related parts; apply the same analysis to each part independently; combine the results of the analyses. Trivial example: row and column means. Further examples. Iteration as a verbose, painful and clumsy implementation of split/apply/combine. Tools for split/apply/combine in basic R: the apply function for arrays, lapply for lists, mapply, etc.; split. Detailed example with a complicated data set: the relation between strikes and parliamentary politics.
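A minimal sketch of the pattern in base R, with made-up data:

df <- data.frame(group = rep(c("a", "b"), each = 50), value = rnorm(100))
pieces <- split(df$value, df$group)  # split
means <- sapply(pieces, mean)        # apply to each piece; combine into a named vector
# for arrays, apply does the same job: apply(M, 1, mean) gives row means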
Posted by crshalizi at October 11, 2013 17:46 | permanent link
In which we continue to practice using functions as arguments and as return values, while learning something about the standard error of maximum likelihood estimates, and about the modularity of methods like the jackknife.
Posted by crshalizi at October 11, 2013 17:45 | permanent link
In which we practice passing functions as arguments to other functions, by way of an introduction to likelihood and its maximization; and, incidentally, work more with plotting in R.
Posted by crshalizi at October 11, 2013 17:44 | permanent link
Lecture 11: Abstraction as a way to make programming more friendly to human beings. Refactoring as a form of abstraction. The rectification of names. Consolidation of related values into objects. Extracting common operations. Defining general operations. Extended example with the jackknife.
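A minimal sketch of where the refactoring ends up: a generic jackknife standard error, with the estimator abstracted out into an argument (my reconstruction, not the lecture's code):

jackknife <- function(estimator, x) {
  n <- length(x)
  omit.one <- sapply(1:n, function(i) estimator(x[-i]))  # leave-one-out estimates
  sqrt(((n - 1)^2 / n) * var(omit.one))                  # jackknife standard error
}
# jackknife(mean, rnorm(100))  # should be close to 1/sqrt(100)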
Reading: sections 14.1--14.3 in Matloff.
Posted by crshalizi at October 11, 2013 17:43 | permanent link
Lecture 10: Basics from calculus about minima. Taylor series. Gradient descent and Newton's method. Curve-fitting by optimization. Illustrations with optim and nls. R for examples.
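A minimal sketch of curve-fitting by optimization; the power-law model, squared-error loss, and starting values are all illustrative:

# squared-error loss for the model y = a * x^b, as a function of the parameters
mse <- function(params, x, y) {
  mean((y - params[1] * x^params[2])^2)
}
# fit <- optim(par = c(1, 1), fn = mse, x = x, y = y)  # Nelder-Mead by default
# fit$par                                              # the fitted (a, b)
# nls does the same job through a formula interface:
# nls(y ~ a * x^b, start = list(a = 1, b = 1))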
Reading: recipes 13.1 and 13.2 in The R Cookbook; chapters I.1, II.1 and II.2 in Red Plenty
Posted by crshalizi at October 11, 2013 17:42 | permanent link
In which we continue to practice the arts of debugging and testing, while learning about making our code more general, handling awkward special cases, and pondering what it means to say that an observation is an outlier.
Posted by crshalizi at October 11, 2013 17:41 | permanent link
In which we use Tukey's rule for identifying outliers as an excuse to learn about debugging and testing.
Posted by crshalizi at October 11, 2013 17:40 | permanent link
Lecture 9: Functions in R are objects, just like everything else, and so can be both arguments to and return values of functions, with no special machinery required. Examples from math (especially calculus) of functions with other functions as arguments. Some R syntax relating to functions. Examples with curve. Using sapply to extend functions of single numbers to functions of vectors; its combination with curve. We write functions with lower-level functions as arguments to abstract out a common pattern of operations. Example: calculating a gradient. Numerical gradients by first differences, done two different ways. (Limitations of taking derivatives by first differences.) Incorporating this as a part of a larger algorithm, such as gradient descent. Using adapters, like wrapper functions and anonymous functions, to fit different functions together. Examples from math (especially calculus) of operators, which turn one function into another. The importance of scoping when using functions as return values. Example: creating a linear predictor. Example: implementing the gradient operator (two different ways). Example: writing surface, as a two-dimensional analog to the standard curve. The use of eval and substitute to control when and in what context an expression is evaluated. Three increasingly refined versions of surface, employing eval.
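A minimal sketch of the gradient-as-operator example: a function which takes a function and returns (an approximation to) its gradient function, by first differences:

gradient <- function(f, h = 1e-6) {
  function(x) {
    sapply(seq_along(x), function(i) {
      x.plus.h <- x
      x.plus.h[i] <- x[i] + h
      (f(x.plus.h) - f(x)) / h  # first difference in coordinate i
    })
  }
}
# g <- gradient(function(x) sum(x^2))
# g(c(1, 2))  # approximately c(2, 4)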
Posted by crshalizi at October 11, 2013 17:39 | permanent link
Attention conservation notice: Only relevant if you (1) really care about statistics, and (2) will be in Pittsburgh on Monday.
Through a fortuitous concourse of calendars, we will have three outstanding talks on Monday, 14 October 2013. In chronological order:
As always, the talks are free and open to the public.
Posted by crshalizi at October 11, 2013 17:27 | permanent link
Have some delayed book-chat from April, May, June, July, August and September.
Posted by crshalizi at October 01, 2013 17:50 | permanent link
Books to Read While the Algae Grow in Your Fur; Pleasures of Detection, Portraits of Crime; The Beloved Republic; Networks; Commit a Social Science
Posted by crshalizi at September 30, 2013 23:59 | permanent link
Attention conservation notice: Only of interest if you (1) care about computational statistics, and (2) will be in Pittsburgh next Monday.
Having a talk on Bayesian computational statistics by a Dr. Scott worked so well last time, we're doing it again:
Posted by crshalizi at September 24, 2013 16:00 | permanent link
Attention conservation notice: Log-rolling promotion of a conference at the intersection of the margins of several academic fields.
I've written before about how Causation, Prediction and Search was one of the books which awakened my interest in machine learning and modern statistics. The point of the book was to explore when and how one can actually discover causal relations from observations. The CMU philosophy department being what it is, they did this by devising computational representations of causal structure, and effective procedures for causal discovery, and proving that the procedures would work under specific (sane) conditions. This message has shaped my research and my teaching ever since. It's one of the reasons I was so eager to come to CMU.
Of course, for good pragmatists, the proof of any method is in its results, and that's why I'm very pleased to help publicize this:
(Yet another sign of the passage of time is that one of the organizers is Lizzie Silver, who helped perpetrate this when she took 36-350.)
Posted by crshalizi at September 24, 2013 15:00 | permanent link
Undelivered optional lecture on Scope: R looks for the values of names in the current environment; if it cannot find a value, it looks for the name in the environment which spawned this one, and so on up the tree to the common, global environment. Assignment is modifying the name/value association list which represents the environment. The scope of a name is limited by the current environment. Implications: changes within the current scope do not propagate back to the larger environments; changes in the larger environment do propagate to all smaller ones which it encloses, unless over-ridden by local names. Subtlety: the larger environment for a function is the one in which it was defined, not the one in which it is called. Some implications for design. Examination of the last homework from this stance.
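A minimal sketch of the defined-not-called subtlety, using the standard function-factory example rather than anything from the homework:

make.counter <- function() {
  count <- 0             # lives in make.counter's environment
  function() {
    count <<- count + 1  # modifies the enclosing environment, not the global one
    count
  }
}
# counter <- make.counter()
# counter(); counter()   # returns 1, then 2; count is invisible at top level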
(I've decided not to actually give this lecture this year, in the interest of fitting in more substantive statistical and data-processing content, but to post it here for reference.)
Posted by crshalizi at September 23, 2013 11:30 | permanent link
Lecture 8, Debugging: Debugging as differential diagnosis: characterize the bug, localize it in the code, try corrections. Tactics for characterizing the bug. Tactics for localizing the bug: traceback, print, warning, stopifnot. Test cases and dummy input generators. Interactive debuggers. Programming with an eye to debugging: writing code with comments and meaningful names; designing the code in a top-down, modular, functional manner; writing and using tests. A hint at the exception-handling system.
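A tiny illustration of programming with an eye to debugging (the function and its checks are made up):

estimate.rate <- function(x) {
  stopifnot(is.numeric(x), all(x > 0))  # fail early, near the bug, with a clear message
  1 / mean(x)
}
# after an error, traceback() prints the chain of calls that led to it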
Posted by crshalizi at September 23, 2013 10:30 | permanent link
In which we meet the jackknife, by way of seeing how much error there is in our estimates from the last lab.
Posted by crshalizi at September 20, 2013 10:30 | permanent link
In which we meet the parametric bootstrap, traveling incognito, to probe the precision of our estimation method from the last lab, by seeing how well it would work when the model is true and we know the parameters.
Posted by crshalizi at September 19, 2013 11:55 | permanent link
Lecture 7: Our code implements a method for solving problems we expect to encounter in the future; but why should we trust those solutions? We establish the reliability of the code by testing it. To respect the interfaces of the code, we test the substance of the answers, not the procedure used to obtain them, even though it is the reliability of the procedure we ultimately care about. We test both for the actual answer in particular cases and by cross-checking different uses of the same code which should lead to the same answer. Because we do not allow our tests to give us any false alarms, their power to detect errors is limited, and must be focused at particular kinds of errors. We make a virtue of necessity by using a diverse battery of tests, and shaping the tests so that they tell us where errors arise. The testing-programming cycle alternates between writing code and testing its correctness, adding new tests as new errors are discovered. The logical extreme of this is test-driven development, where tests represent the specification of the software's behavior in terms of practical consequences. Drawbacks of testing. Some pointers to more advanced tools for writing, maintaining and using tests in R.
(Why yes, this lecture was something of a lay sermon on epistemology.)
Posted by crshalizi at September 18, 2013 10:30 | permanent link
Lecture 6: Top-down design is a recursive heuristic for solving problems by writing functions: start with a big-picture view of the problem; break it into a few big sub-problems; figure out how to integrate the solutions to each sub-problem; and then repeat for each part.
Exemplification: how we could write the lm function for linear regression, if it did not exist and it were necessary to invent it.
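The core of such a re-invention is only a few lines; a minimal sketch, ignoring everything the real lm does around the edges (formulas, factors, weights, rank deficiency, and so on):

my.lm <- function(y, X) {
  X <- cbind(intercept = 1, X)
  # least squares: solve the normal equations (X'X) beta = X'y
  solve(t(X) %*% X, t(X) %*% y)
}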
Additional optional reading: Herbert Simon, The Sciences of the Artificial.
Posted by crshalizi at September 16, 2013 10:30 | permanent link
Attention conservation notice: Only of interest if you care a lot about computational statistics.
For our first seminar of the year, we are very pleased to have a talk which will combine two themes close to the heart of the statistics department:
As always, the talk is free and open to the public.
— A slightly cynical historical-materialist take on the rise of Bayesian statistics is that it reflects a phase in the development of the means of computation, namely the PC era. The theoretical or ideological case for Bayesianism was pretty set by the early 1960s, say with Birnbaum's argument for the likelihood principle [1]. It nonetheless took a generation or more for Bayesian statistics to actually become common. This is because, under the material conditions of the early 1960s, such ideas could only be defended, and not applied.

What changed this was not better theory, or better models, or a sudden awakening to the importance of shrinkage and partial pooling. Rather, it became possible to actually calculate posterior distributions. Specifically, Monte Carlo methods developed in statistical mechanics permitted stochastic approximations to non-trivial posteriors. These Monte Carlo techniques quickly became (pardon the expression) hegemonic within Bayesian statistics, to the point where I have met younger statisticians who thought Monte Carlo was a Bayesian invention [2]. One of the ironies of applied Bayesianism, in fact, is that nobody actually knows the posterior distribution which supposedly represents their beliefs, but rather (nearly [3]) everyone works out that distribution by purely frequentist inference from Monte Carlo samples. ("How do I know what I think until I see what the dice say?", as it were.)
So: if you could do Monte Carlo, you could work out (approximately) a posterior distribution, and actually do Bayesian statistics, instead of talking about it. To do Monte Carlo, you needed enough computing power to be able to calculate priors and likelihoods, and to do random sampling, in a reasonable amount of time. You needed a certain minimum amount of memory, and you needed clock speed. Moreover, to try out new models, to tweak specifications, etc., you needed to have this computing power under your control, rather than being something expensive and difficult to access. You needed, in other words, a personal computer, or something very like it.
The problem now is that while our computers keep getting faster, and their internal memory keeps expanding, our capacity to generate, store, and access data is increasing even more rapidly. This is a problem if your method requires you to touch every data point, and especially a problem if you not only have to touch every data point but do all possible pairwise comparisons, because, say, your model says all observations are dependent. This raises the possibility that Bayesian inference will become computationally infeasible again in the near future, not because our computers have regressed but because the size and complexity of interesting data sets will have rendered Monte Carlo infeasible. Bayesian data analysis would then have been a transient historical episode, belonging to the period when a desktop machine could hold a typical data set in memory and thrash through it a million times in a weekend.
Of course, I don't know that Bayesian inference is doomed to become obsolete because it will grow computationally intractable. One possibility is that "Bayesian inference" will be redefined in ways which depart further and further from the noble-Savage ideal, but are computationally tractable — variational Bayes, approximate Bayesian computation, and the generalized updating of Bissiri et al. are three (very different) moves in that direction. Another possibility is that algorithm designers are going to be clever enough to make distributed Monte Carlo approximations for posteriors as feasible as, say, a distributed bootstrap. This is, implicitly, the line Scott is pursuing. I wish him and those like him every success; whatever the issues with Bayesianism and some of its devotees, the statistical world would lose something valuable if Bayes as we know it were to diminish into a relic.
Update, 16 September 2013: It apparently needs saying that the ill-supported speculations here about the past and future of Bayesian computing are mine, not Dr. Scott's.
Update, 18 December 2013: "Asymptotically Exact, Embarrassingly Parallel MCMC" by Neiswanger et al. (arxiv:1311.4780) describes and analyses a very similar scheme to that proposed by Dr. Scott.
Manual trackback: Homoclinic Orbit
[1]: What Birnbaum's result actually proves is another story for another time; in the meanwhile, see Mayo, Evans and Gandenberger. ^
[2]: One such statistician persisted in this belief after reading Geyer and Thompson, and even after reading Metropolis et al., though there were other issues at play in his case. ^
[3]: The most interesting exception to this I know of is Rasmussen and Ghahramani's "Bayesian Monte Carlo" (NIPS, 2002). But despite its elegance and the reputations of its authors, it's fair to say this work has not had much impact. ^
Posted by crshalizi at September 14, 2013 23:22 | permanent link
In which we see how to minimize the mean squared error when there are two parameters, in the process learning about writing functions, decomposing problems into smaller steps, testing the solutions to the smaller steps, and minimization by gradient descent.
Posted by crshalizi at September 13, 2013 11:30 | permanent link
In which we practice the arts of writing functions and of estimating distributions, while contemplating just how little room there is in the heart of a cat.
Posted by crshalizi at September 13, 2013 10:30 | permanent link
Lecture 5: Using multiple functions to solve multiple problems; to sub-divide awkward problems into more tractable ones; to re-use solutions to recurring problems. Value of consistent interfaces for functions working with the same object, or doing similar tasks. Examples: writing prediction and plotting functions for power-law scaling models. Advantages of splitting big problems into smaller ones with their own functions: understanding, modification, design, re-use of work. Trade-off between internal sub-functions and separate functions. Re-writing the plotting function to use the prediction function. Recursion. Example: re-writing the resource allocation code to be more modular and recursive. R for examples.
Posted by crshalizi at September 11, 2013 10:30 | permanent link
Lecture 4: Just as data structures tie related values together into objects, functions tie related commands together into objects. Declaring functions. Arguments (inputs) and return values (outputs). Named arguments, defaults, and calling functions. Interfaces: controlling what the function can see and do; first sketch of scoping rules. The importance of the interface. An example of writing and improving a function, for fitting the West et al. model of urban economies by nonlinear least squares. R for examples.
Posted by crshalizi at September 10, 2013 15:52 | permanent link
In which we make incremental improvements to our code for planning by incremental improvements.
Posted by crshalizi at September 10, 2013 15:44 | permanent link
In which we try our hand at controlling the flow of our code, and learn about the reproducibility of randomized procedures.
Posted by crshalizi at September 10, 2013 15:42 | permanent link
In which we practice working with data frames, grapple with some of the subtleties of R's system of data types, and calculate the consequences of doodling while bored in lecture.
Assignment, due at 11:59 pm on Thursday, 5 September 2013
Posted by crshalizi at September 04, 2013 01:43 | permanent link
In which we play around with basic data structures and convince ourselves that the laws of probability are, in fact, right. (Or perhaps that R's random number generator is pretty good.)
Posted by crshalizi at September 04, 2013 01:42 | permanent link
Lecture 3: Conditioning the calculation on the data: if; what is truth?; Boolean operators again; switch. Iteration to repeat similar calculations: for and iterating over a vector; while and conditional iteration (reducing for to while); repeat and unconditional iteration, with break to exit loops (reducing while to repeat). Avoiding iteration with "vectorized" operations and functions: the advantages of the whole-object view; some examples and techniques: mathematical operators and functions, ifelse; generating arrays with repetitive structure.
Posted by crshalizi at September 04, 2013 01:41 | permanent link
Introduction to the course: statistical programming for autonomy, honesty, and clarity of thought. The functional programming idea: write code by building functions to transform input data into desired outputs. Basic data types: Booleans, integers, characters, floating-point numbers. Subtleties of floating point numbers. Operators as basic functions. Variables and names. An example with resource allocation. Related pieces of data are bundled into larger objects called data structures. Most basic data structures: vectors. Some vector manipulations. Functions of vectors. Naming of vectors. Continuing the resource-allocation example. Building more complicated data structures on top of vectors. Arrays as a first vector structure. Matrices as a special type of array; functions for matrix arithmetic and algebra: multiplication, transpose, determinant, inversion, solving linear systems. Using names to make calculations clearer and safer: resource-allocation mini-example. Lists for combining multiple types of values; accessing sub-lists and individual elements; ways of adding and removing parts of lists. Lists as key-value pairs. Data frames: the data structure for classic tabular data, one column per variable, one row per unit; data frames as hybrids of matrices and lists. Structures of structures: using lists recursively to create complicated objects; example with eigen.
Posted by crshalizi at September 04, 2013 01:40 | permanent link
It's that time of year again:
Further details can be found at the class website. Teaching materials (lecture slides, homeworks, labs, etc.), will appear both there and here.
Corrupting the Young; Enigmas of Chance; Introduction to Statistical Computing
Posted by crshalizi at September 04, 2013 01:39 | permanent link
Attention conservation notice: Navel-gazing about teaching.
Well, that was exhausting. Everything seemed to take me at least as much effort and time as the year before, despite recycling a lot of material.
There continues to not be enough manpower to give detailed feedback on 90-odd students' weekly data analysis homework.
I had to abandon the quality-control sampling early in the semester, because there just wasn't time on my end. I regret this.
The second exam didn't work well at all; it will need to be replaced in the future. The students' struggles with it suggest that I ought to teach more about categorical data, but I'm not sure what I'd cut to make room for it.
This is all pretty negative, and indeed I'm fairly dis-satisfied with my own performance in teaching this, since it feels like I worked harder and the students didn't learn any more. But the student evaluations for the class and for me were much higher this year than before. I don't know whether this means I've started doing something right without realizing it, or forgotten something important, or indeed changed how I teach at all. But even the kids who ended the course hating my guts claimed they'd learned from it, so I'll count that as a win.
(The main thing to do at this point is to actually finish the book. The biggest job is going to be integrating the homework problems and the exams into the text. The most tedious job is going to be making sure the notation is uniform throughout.)
Posted by crshalizi at September 04, 2013 01:22 | permanent link
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Philosophy; Writing for Antiquity; Constant Conjunction Necessary Connection; The Dismal Science; Afghanistan and Central Asia; The Beloved Republic
Posted by crshalizi at August 31, 2013 23:59 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Networks; Mathematics; Enigmas of Chance
Posted by crshalizi at July 31, 2013 23:59 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; The Great Transformation; Complexity
Posted by crshalizi at June 30, 2013 23:59 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; The Dismal Science; Scientifiction and Fantastica; Commit a Social Science; Mathematics; Corrupting the Young; The Commonwealth of Letters
Posted by crshalizi at May 31, 2013 23:59 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica
Posted by crshalizi at April 30, 2013 23:59 | permanent link
In which we are devoted to two problems of political economy, viz., strikes, and macroeconomic forecasting.
Posted by crshalizi at April 30, 2013 11:50 | permanent link
What time series are. Properties: autocorrelation or serial correlation; other notions of serial dependence; strong and weak stationarity. The correlation time and the world's simplest ergodic theorem; effective sample size. The meaning of ergodicity: a single increasing long time series becomes representative of the whole process. Conditional probability estimates; Markov models; the meaning of the Markov property. Autoregressive models, especially additive autoregressions; conditional variance estimates. Bootstrapping time series. Trends and de-trending.
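A minimal sketch of the effective-sample-size calculation, assuming an AR(1)-style process, for which the lag-one autocorrelation rho inflates the variance of the sample mean by about (1 + rho)/(1 - rho):

x <- arima.sim(model = list(ar = 0.9), n = 1e4)
rho <- acf(x, lag.max = 1, plot = FALSE)$acf[2]  # lag-one autocorrelation
n.eff <- length(x) * (1 - rho) / (1 + rho)       # roughly this many independent samples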
Posted by crshalizi at April 30, 2013 10:30 | permanent link
How do we get our causal graph? Comparing rival DAGs by testing selected conditional independence relations (or dependencies). Equivalence classes of graphs. Causal arrows never go away no matter what you condition on ("no causation without association"). The crucial difference between common causes and common effects: conditioning on common causes makes their effects independent, conditioning on common effects makes their causes dependent. Identifying colliders, and using them to orient arrows. Inducing orientation to enforce consistency. The SGS algorithm for discovering causal graphs; why it works. The PC algorithm: the SGS algorithm for lazy people. What about latent variables? Software: TETRAD and pcalg; examples of working with pcalg. Limits to observational causal discovery: universal consistency is possible (and achieved), but uniform consistency is not.
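A minimal sketch of running the PC algorithm with pcalg on Gaussian data; the data frame d and the significance level are stand-ins:

library(pcalg)
suff.stat <- list(C = cor(d), n = nrow(d))  # sufficient statistics for Gaussian tests
pc.fit <- pc(suffStat = suff.stat, indepTest = gaussCItest,
             alpha = 0.01, labels = colnames(d))
# plot(pc.fit)  # the estimated equivalence class, as a partially directed graph
#               # (plotting the graph needs the Rgraphviz package)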
Reading: Notes, chapter 24
Posted by crshalizi at April 25, 2013 10:30 | permanent link
Next week, instead of the regular seminar, the CMU statistics department will be hosting a panel on experience with online statistics education, including massive open online courses:
Posted by crshalizi at April 24, 2013 22:54 | permanent link
In which the relationship (if any) between GDP growth and government debt forms a bridge between causal inference and time series analysis.
Update, 2 May: I have given up on posting solutions, but in this case I will make an exception, as well as directing the reader to the relevant comment from Biostatistics Ryan Gosling.
Posted by crshalizi at April 23, 2013 11:50 | permanent link
Estimating graphical models: substituting consistent estimators into the formulas for front and back door identification; average effects and regression; tricks to avoid estimating marginal distributions; propensity scores and matching and propensity scores as computational short-cuts in back-door adjustment. Instrumental variables estimation: the Wald estimator, two-stage least-squares. Summary recommendations for estimating causal effects.
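Minimal sketches of the two instrumental-variable estimators, with stand-in variables (z the instrument, x the cause, y the effect):

beta.wald <- cov(y, z) / cov(x, z)         # the Wald estimator
# two-stage least squares, by hand:
x.hat <- fitted(lm(x ~ z))                 # stage 1: project the cause on the instrument
beta.2sls <- coef(lm(y ~ x.hat))["x.hat"]  # stage 2: regress the effect on the projection
# (done by hand like this, the point estimate is right but the naive
# standard errors from the second stage are not)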
Reading: Notes, chapter 23
Posted by crshalizi at April 23, 2013 10:30 | permanent link
In which the arts of estimating causal effects from observational data are practiced on Sesame Street.
Posted by crshalizi at April 16, 2013 11:50 | permanent link
Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Summary recommendations for identifying causal effects.
Reading: Notes, chapter 22
Optional reading: Pearl, "Causal Inference in Statistics", sections 3.3--3.5, 4, and 5.1
Posted by crshalizi at April 16, 2013 10:30 | permanent link
Probabilistic prediction is about passively selecting a sub-ensemble, leaving all the mechanisms in place, and seeing what turns up after applying that filter. Causal prediction is about actively producing a new ensemble, and seeing what would happen if something were to change ("counterfactuals"). Graphical causal models are a way of reasoning about causal prediction; their algebraic counterparts are structural equation models (generally nonlinear and non-Gaussian). The causal Markov property. Faithfulness. Performing causal prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules for linear models.
Reading: Notes, chapter 21
Optional reading: Cox and Donnelly, chapter 9; Pearl, "Causal Inference in Statistics", section 1, 2, and 3 through 3.2
Posted by crshalizi at April 11, 2013 10:30 | permanent link
Exam 2: in which we examine how the citizens of ex-communist country X look at history and human rights, as a way of practicing multivariate data analysis.
Assignment; the data set is still confidential and so not public.
Posted by crshalizi at April 09, 2013 11:50 | permanent link
Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth?
Reading: Notes, chapter 20
Posted by crshalizi at April 09, 2013 10:30 | permanent link
All of the statistics department's seminars are, of course, fascinating presentations of important work, but next week's could hardly be more relevant to my interests if I had arranged it myself.
Posted by crshalizi at April 04, 2013 13:29 | permanent link
From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry: planes again. Probabilistic clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.
Extended example: Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components.
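To make the E and M steps concrete, a minimal sketch of EM for a one-dimensional, two-component Gaussian mixture (my illustration, not the code from the notes; starting values are arbitrary, and a fixed number of passes stands in for a real convergence check):

em.2gauss <- function(x, mu = c(-1, 1), sigma = c(1, 1), p = 0.5, iters = 100) {
  for (i in 1:iters) {
    # E step: posterior probability that each point came from component 1
    d1 <- p * dnorm(x, mu[1], sigma[1])
    d2 <- (1 - p) * dnorm(x, mu[2], sigma[2])
    r <- d1 / (d1 + d2)
    # M step: maximize the expected log-likelihood, given those responsibilities
    mu <- c(weighted.mean(x, r), weighted.mean(x, 1 - r))
    sigma <- c(sqrt(weighted.mean((x - mu[1])^2, r)),
               sqrt(weighted.mean((x - mu[2])^2, 1 - r)))
    p <- mean(r)
  }
  list(mu = mu, sigma = sigma, p = p, responsibilities = r)
}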
Reading: Notes, chapter 19; mixture-examples.R
Posted by crshalizi at April 02, 2013 10:30 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Networks; Commit a Social Science
Posted by crshalizi at March 31, 2013 23:59 | permanent link
Adding noise to PCA to get a statistical model. The factor model, or linear regression with unobserved independent variables. Assumptions of the factor model. Implications of the model: observable variables are correlated only through shared factors; "tetrad equations" for one factor models, more general correlation patterns for multiple factors. Our first look at latent variables and conditional independence. Geometrically, the factor model says the data cluster on some low-dimensional plane, plus noise moving them off the plane. Estimation by heroic linear algebra; estimation by maximum likelihood. The rotation problem, and why it is unwise to reify factors. Other models which produce the same correlation patterns as factor models.
Reading: Notes, chapter 18; factors.R and sleep.txt
Posted by crshalizi at March 28, 2013 10:30 | permanent link
Homework 8: in which returning to paleontology gives us an excuse to work with simulations, and to compare distributions.
Posted by crshalizi at March 26, 2013 11:50 | permanent link
Principal components is the simplest, oldest and most robust of dimensionality-reduction techniques. It works by finding the line (plane, hyperplane) which passes closest, on average, to all of the data points. This is equivalent to maximizing the variance of the projection of the data on to the line/plane/hyperplane. Actually finding those principal components reduces to finding eigenvalues and eigenvectors of the sample covariance matrix. Why PCA is a data-analytic technique, and not a form of statistical inference. An example with cars. PCA with words: "latent semantic analysis"; an example with real newspaper articles. Visualization with PCA and multidimensional scaling. Cautions about PCA; the perils of reification; illustration with genetic maps.
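A minimal sketch of PCA in R; prcomp does the eigendecomposition (via the singular value decomposition), and the stock USArrests data set stands in for the lecture's examples:

pca <- prcomp(USArrests, scale. = TRUE)  # center and scale, then decompose
pca$rotation[, 1:2]  # loadings (directions) of the first two components
summary(pca)         # share of the variance along each component
biplot(pca)          # data and variables projected on the first two components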
Reading: Notes, chapter 17; pca.R, pca-examples.Rdata, and cars-fixed04.dat
Posted by crshalizi at March 26, 2013 10:30 | permanent link
Applying the correct CDF to a continuous random variable makes it uniformly distributed. How do we test whether some variable is uniform? The smooth test idea, based on series expansions for the log density. Asymptotic theory of the smooth test. Choosing the basis functions for the test and its order. Smooth tests for non-uniform distributions through the transformation. Dealing with estimated parameters. Some examples. Non-parametric density estimation on [0,1]. Checking conditional distributions and calibration with smooth tests. The relative distribution idea: comparing whole distributions by seeing where one set of samples falls in another distribution. Relative density and its estimation. Illustrations of relative densities. Decomposing shifts in relative distributions.
Reading: Notes, chapter 16
Optional reading: Bera and Ghosh, "Neyman's Smooth Test and Its Applications in Econometrics"; Handcock and Morris, "Relative Distribution Methods"
Posted by crshalizi at March 21, 2013 01:54 | permanent link
Homework 7: in which we try to predict political orientation from bumps on the skull (or rather, from the volume of brain regions determined by MRI and adjusted by unknown formulas).
Posted by crshalizi at March 20, 2013 20:30 | permanent link
The desirability of estimating not just conditional means, variances, etc., but whole distribution functions. Parametric maximum likelihood is a solution, if the parametric model is right. Histograms and empirical cumulative distribution functions are non-parametric ways of estimating the distribution: do they work? The Glivenko-Cantelli law on the convergence of empirical distribution functions, a.k.a. "the fundamental theorem of statistics". More on histograms: they converge on the right density, if bins keep shrinking but the number of samples per bin keeps growing. Kernel density estimation and its properties: convergence on the true density if the bandwidth shrinks at the right rate; superior performance to histograms; the curse of dimensionality again. An example with cross-country economic data. Kernels for discrete variables. Estimating conditional densities; another example with the OECD data. Some issues with likelihood, maximum likelihood, and non-parametric estimation. Simulating from kernel density estimates and from histograms.
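A minimal sketch with R's built-in kernel density estimator; the gamma target and the Sheather-Jones bandwidth rule are illustrative choices:

x <- rgamma(1000, shape = 2)
fhat <- density(x, bw = bw.SJ(x))  # kernel density estimate, data-driven bandwidth
plot(fhat)
curve(dgamma(x, shape = 2), add = TRUE, lty = 2)  # the true density, for comparison
# simulating from the KDE: resample the data, then add kernel noise
x.new <- sample(x, 500, replace = TRUE) + rnorm(500, sd = fhat$bw)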
Reading: Notes, chapter 15
Posted by crshalizi at March 19, 2013 10:30 | permanent link
\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Expectwrt}[2]{\mathbb{E}_{#2}\left[ #1 \right]} \newcommand{\Varwrt}[2]{\mathrm{Var}_{#2}\left[ #1 \right]} \DeclareMathOperator*{\argmin}{argmin} \]
Attention conservation notice: > 1800 words on basic statistical theory. Full of equations, and even a graph and computer code, yet mathematically sloppy.
The basic ideas underlying asymptotic estimation theory are very simple; most presentations rather cloud the basics, because they include lots of detailed conditions needed to show rigorously that everything works. In the spirit of the world's simplest ergodic theorem, let's talk about estimation.
We have a statistical model, which tells us, for each sample size \( n \), the probability that the observations \( X_1, X_2, \ldots X_n \equiv X_{1:n} \) will take on any particular value \( x_{1:n} \), or the probability density if the observables are continuous. This model contains some unknown parameters, bundled up into a single object \( \theta \), which we need to calculate those probabilities. That is, the model's probabilities are \( m(x_{1:n};\theta) \), not just \( m(x_{1:n}) \). Because this is just baby stats., we'll say that there are only a finite number of unknown parameters, which don't change with \( n \), so \( \theta \in \mathbb{R}^d \). Finally, we have a loss function, which tells us how badly the model's predictions fail to align with the data: \[ \lambda_n(x_{1:n}, m(\cdot ;\theta)) \] For instance, each \( X_i \) might really be a \( (U_i, V_i) \) pair, and we try to predict \( V_i \) from \( U_i \), with loss being mean-weighted-squared error: \[ \lambda_n = \frac{1}{n}\sum_{i=1}^{n}{\frac{{\left( v_i - \Expectwrt{V_i|U_i=u_i}{\theta}\right)}^2}{\Varwrt{V_i|U_i=u_i}{\theta}}} \] (If I don't subscript expectations \( \Expect{\cdot} \) and variances \( \Var{\cdot} \) with \( \theta \), I mean that they should be taken under the true, data-generating distribution, whatever that might be. With the subscript, calculate assuming that the \( m(\cdot; \theta) \) distribution is right.)
Or we might look at the negative mean log likelihood, i.e., the cross-entropy, \[ \lambda_n = -\frac{1}{n}\sum_{i=1}^{n}{\log{m(x_i|x_{1:i-1};\theta)}} \]
Being simple folk, we try to estimate \( \theta \) by minimizing the loss: \[ \widehat{\theta}_n = \argmin_{\theta}{\lambda_n} \]
We would like to know what happens to this estimate as \( n \) grows. To do this, we will make two assumptions, which put us at the mercy of two sets of gods.
The first assumption is about what happens to the loss functions. \( \lambda_n \) depends both on the parameter we plug in and on the data we happen to see. The latter is random, so the loss at any one \( \theta \) is really a random quantity, \( \Lambda_n(\theta) = \lambda_n(X_{1:n},\theta) \). Our first assumption is that these random losses tend towards non-random limits: for each \( \theta \), \[ \Lambda_n(\theta) \rightarrow \ell(\theta) \] where \( \ell \) is a deterministic function of \( \theta \) (and nothing else). It doesn't particularly matter to the argument why this is happening, though we might have our suspicions [1], just that it is. This is an appeal to the gods of stochastic convergence.
Our second assumption is that we always have a unique interior minimum with a positive-definite Hessian: with probability 1, \[ \begin{eqnarray*} \nabla {\Lambda_n}(\widehat{\theta}_n) & = & 0\\ \nabla \nabla \Lambda_n (\widehat{\theta}_n) & > & 0 \end{eqnarray*} \] (All gradients and other derivatives will be with respect to \( \theta \); the dimensionality of \( x \) is irrelevant.) Moreover, we assume that the limiting loss function \( \ell \) also has a nice, unique interior minimum at some point \( \theta^* \), the minimizer of the limiting, noise-free loss: \[ \begin{eqnarray*} \theta^* & = & \argmin_{\theta}{\ell}\\ \nabla {\ell}(\theta^*) & = & 0\\ \nabla \nabla \ell (\theta^*) & > & 0 \end{eqnarray*} \] Since the Hessians will be important, I will abbreviate \( \nabla \nabla \Lambda_n (\widehat{\theta}_n) \) by \( \mathbf{H}_n \) (notice that it's a random variable), and \( \nabla \nabla \ell (\theta^*) \) by \( \mathbf{j} \) (notice that it's not random).
These assumptions about the minima, and the derivatives at the minima, place us at the mercy of the gods of optimization.
To see that these assumptions are not empty, here's an example. Suppose that our models are Pareto distributions for \( x \geq 1 \), \( m(x;\theta) = (\theta - 1) x^{-\theta} \). Then \( \lambda_n(\theta) = \theta \overline{\log{x}}_n - \log{(\theta - 1)} \), where \( \overline{\log{x}}_n = n^{-1} \sum_{i=1}^{n}{\log{x_i}} \), the sample mean of the log values. From the law of large numbers, \( \ell(\theta) = \theta \Expect{\log{X}} - \log{(\theta-1)} \). To show the convergence, the figure plots \( \lambda_{10} \), \( \lambda_{1000} \) and \( \lambda_{10^5} \) for a particular random sample, and the corresponding \( \ell \). I chose this example in part because the Pareto distribution is heavy tailed, and I actually generated data from a parameter value where the variance of \( X \) is infinite (or undefined, for purists). The objective functions, however, converge just fine.
[Figure: convergence of objective functions, here negative average log-likelihoods, for \( n = 10, 10^3, 10^5 \). Note that the limiting, \( n = \infty \) objective function (solid blue line) is extremely close to what we see at \( n = 10^5 \) already. (See here, or more generally here, for the generating R code, pareto.R.)]
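For readers without that file handy, a rough sketch of the kind of code behind the figure (not pareto.R itself):

theta.true <- 2.5                              # Var(X) is infinite for theta <= 3
x <- (1 - runif(1e5))^(-1 / (theta.true - 1))  # Pareto draws, by the quantile trick
lambda <- function(theta, x) { theta * mean(log(x)) - log(theta - 1) }
curve(sapply(theta, lambda, x = x[1:10]), from = 1.5, to = 4, xname = "theta",
      xlab = expression(theta), ylab = "negative mean log-likelihood")
curve(sapply(theta, lambda, x = x[1:1000]), add = TRUE, lty = 2, xname = "theta")
curve(sapply(theta, lambda, x = x), add = TRUE, lty = 3, xname = "theta")
curve(theta / (theta.true - 1) - log(theta - 1), add = TRUE, col = "blue",
      xname = "theta")                         # the n = infinity limit, using E[log X]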
With these assumptions made, we use what is about the only mathematical device employed in statistical theory at this level, which is a Taylor expansion. Specifically, we expand the gradient \( \nabla \Lambda_n \) around \( \theta^* \): \[ \begin{eqnarray*} \nabla {\Lambda_n}(\widehat{\theta}_n) & = & 0\\ & \approx & \nabla {\Lambda_n}(\theta^*) + \mathbf{H}_n(\widehat{\theta}_n - \theta^*)\\ \widehat{\theta}_n & = & \theta^* - \mathbf{H}_n^{-1}\nabla {\Lambda_n}(\theta^*) \end{eqnarray*} \] The first term on the right hand side, \( \theta^* \), is the population/ensemble/true minimizer of the loss. If we had \( \ell \) rather than just \( \Lambda_n \), we'd get that for the location of the minimum. But since we see \( \ell \) through a glass, darkly, corrupted by noise, we need to include the extra term \( - \mathbf{H}_n^{-1}\nabla {\Lambda_n}(\theta^*) \). The Hessian \( \mathbf{H}_n \) tells us how sharply curved \( \Lambda_n \) is near its minimum; the bigger this is, the easier it is, all else being equal, to find the location of the minimum. The other factor, \( \nabla {\Lambda_n}(\theta^*) \), indicates how much noise there is — not so much in the function being minimized, as in its gradient, since after all \( \nabla {\ell}(\theta^*) = 0 \). We would like \( \widehat{\theta}_n \rightarrow \theta^* \), so we have to convince ourselves that the rest is asymptotically negligible, that \( \mathbf{H}_n^{-1}\nabla {\Lambda_n}(\theta^*) = o(1) \).
Start with the Hessian. \( \mathbf{H}_n \) is the matrix of second derivatives of a random function: \[ \mathbf{H}_n(\widehat{\theta}_n) = \nabla \nabla \Lambda_n(\widehat{\theta}_n) \] Since \( \Lambda_n \rightarrow \ell \), it would be reasonable to hope that \[ \mathbf{H}_n(\widehat{\theta}_n) \rightarrow \nabla \nabla \ell(\widehat{\theta}_n) = \mathbf{j}(\widehat{\theta}_n) \] We'll assume that everything is nice ("regular") enough to let us "exchange differentiation and limits" in this way. Since \( \mathbf{H}_n(\widehat{\theta}_n) \) is tending to \( \mathbf{j}(\widehat{\theta}_n) \), it follows that \( \mathbf{H}_n = O(1) \), and consequently \( \mathbf{H}_n^{-1} = O(1) \) by continuity. With more words: since \( \Lambda_n \) is tending towards \( \ell \), the curvature of the former is tending towards the curvature of the latter. But this means that the inverse curvature is also stabilizing.
Our hope then has to rest on the noise-in-the-gradient factor, \( \nabla {\Lambda_n}(\theta^*) \). Remember again that \[ \nabla \ell (\theta^*) = 0 \] and that \( \Lambda_n \rightarrow \ell \). So if, again, we can exchange taking derivatives and taking limits, we do indeed have \[ \nabla {\Lambda_n}(\theta^*) \rightarrow 0 \] and we're done. More precisely, we've established consistency: \[ \widehat{\theta}_n \rightarrow \theta^* \]
Of course it's not enough just to know that an estimate will converge, one also wants to know something about the uncertainty in the estimate. Here things are mostly driven by the fluctuations in the noise-in-the-gradient term. We've said that \( \nabla {\Lambda_n}(\theta^*) = o(1) \); to say anything more about the uncertainty in \( \widehat{\theta}_n \), we need to posit more. It is very common to be able to establish that \( \nabla {\Lambda_n}(\theta^*) = O(1/\sqrt{n}) \), often because \( \Lambda_n \) is some sort of sample- or time- average, as in my examples above, and so an ergodic theorem applies. In that case, because \( \mathbf{H}_n^{-1} = O(1) \), we have \[ \widehat{\theta}_n - \theta^* = O(1/\sqrt{n}) \]
If we can go further, and find \[ \Var{\nabla {\Lambda_n}(\theta^*)} = \mathbf{k}_n \] then we can get a variance for \( \widehat{\theta}_n \), using propagation of error: \[ \begin{eqnarray*} \Var{\widehat{\theta}_n} & = & \Var{\widehat{\theta}_n - \theta^*}\\ & = & \Var{-\mathbf{H}_n^{-1} \nabla {\Lambda_n}(\theta^*)}\\ & \approx & \mathbf{j}^{-1}(\theta^*) \Var{ \nabla {\Lambda_n}(\theta^*)} \mathbf{j}^{-1}(\theta^*)\\ & = & \mathbf{j}^{-1} \mathbf{k}_n \mathbf{j}^{-1} \end{eqnarray*} \] the infamous sandwich covariance matrix. If \( \nabla {\Lambda_n}(\theta^*) = O(1/\sqrt{n} ) \), then we should have \( n \mathbf{k}_n \rightarrow \mathbf{k} \), for a limiting variance \( \mathbf{k} \). Then \( n \Var{\widehat{\theta}_n} \rightarrow \mathbf{j}^{-1} \mathbf{k} \mathbf{j}^{-1} \).
Of course we don't know \( \mathbf{j}(\theta^*) \), because that involves the parameter we're trying to find, but we can estimate it by \( \mathbf{j}(\widehat{\theta}_n) \), or even by the observed Hessian \( \mathbf{H}_n \). That still leaves getting an estimate of \( \mathbf{k}_n \). If \( \Lambda_n \) is an average over the \( x_i \), as in my initial examples, then we can often use the variance of the gradients at each data point as an estimate of \( \mathbf{k}_n \). In other circumstances, we might actually have to think.
Finally, if \( \nabla \Lambda_n(\theta^*) \) is asymptotically Gaussian, because it's governed by a central limit theorem, then so is \( \widehat{\theta}_n \), and we can use Gaussians for hypothesis testing, confidence regions, etc.
A case where we can short-circuit a lot of thinking is when the model is correctly specified, so that the data-generating distribution is \( m(\cdot;\theta^*) \), and the loss function is the negative mean log-likelihood. (That is, we are maximizing the likelihood.) Then the limiting Hessian \( \mathbf{j} \) is the Fisher information; since the loss is already the negative log-likelihood, no extra sign flip is needed. More importantly, under reasonable conditions, one can show that \( \mathbf{j} = \mathbf{k} \), that the variance of the gradient is exactly the limiting Hessian. Then the variance of the estimate simplifies to just \( \mathbf{j}^{-1} \). This turns out to actually be the best variance you can hope for, at least with unbiased estimators (the Cramér-Rao inequality). But the bulk of the analysis doesn't depend on the fact that we're estimating by maximum likelihood; it goes the same way for minimizing any well-behaved objective function.
Now, there are a lot of steps here where I am being very loose. (To begin with: In what sense is the random function \( \Lambda_n \) converging on \( \ell \), and does it have to converge everywhere, or just in a neighborhood of \( \theta^* \)?) That is, I am arguing like a physicist. Turning this sketch into a rigorous argument is the burden of good books on asymptotic statistics. But this is the core. It does not require the use of likelihood, or correctly specified models, or independent data, just that the loss function we minimize be converging, in a well-behaved way, to a function which is nicely behaved around its minimum.
Further reading:
[1]: "In fact, all epistemologic value of the theory of the probability is based on this: that large-scale random phenomena in their collective action create strict, nonrandom regularity." — B. V. Gnedenko and A. N. Kolmogorov, Limit Distributions for Sums of Independent Random Variables, p. 1. On the other hand, sometimes we can get limits like this as our sensors get better tuned to the signal. ^
Manual trackback: Homoclinic Orbit
Posted by crshalizi at March 18, 2013 00:20 | permanent link
A speaker who needs no introduction (but will get one), on a topic whose closeness to my heart needs no elaboration (but will get it):
Memo to self: ask Efron what he thinks of Fraser's "Is Bayes Posterior Just Quick and Dirty Confidence?" (arxiv:1112.5582).
Posted by crshalizi at March 16, 2013 21:24 | permanent link
Attention conservation notice: I look back on my works with smug complacency.
I started this weblog in January 2003; I don't remember exactly when, and the dates on the files got messed up by various changes from Blogger to Movable Type to Blosxom, where it has remained ever since. So let's say, ten years and a bit. I have also just turned 39, so I will indulge myself by looking back at this companion of my thirties.
It's been a big undertaking — 600,000 words in 1071 posts before this — but it's also been good to me. It's gotten me recognition and prizes, and friends, and even helped me get my job, which I love. There have been downsides, over and above the time and effort, but they're mostly personal, and I've learned not to broadcast those. If I had realized how much of my public persona this was going to be, I'd have thought more about giving it such an obscure name (it comes from an old family joke, about how I did everything very slowly as a boy), but I like to think I'd have done it in the end.
As far as I can judge, the blog's best days were 2005--2010, after I'd learned how to use the medium but before work and (what else?) sloth reduced me to merely posting notices about lecture notes. I imagine I will keep it up for the next ten years, in some mode or other, if only because I am a creature of habit. Whether anyone will still bother reading it then, who knows?
Have I, as the poet asks, ever really helped anybody but myself with this? If not, it wouldn't be worth much. I feel that I've never really gotten across why I do this, and I'm not very happy with what follows, but it's better than the discarded drafts at least: Through no merit of my own (at best, persistence in exploiting luck), I have a position and skills which mean a few people will pay some attention to me about a handful of subjects, and this obliges me to try to do some good with this sliver of influence. (I do less good in the world than those following any of a hundred thankless and anonymous callings — which I would hate and be bad at.) The means I am most comfortable with are negative — critique, debunking, sarcasm; only rarely do I praise or build up, and my efforts in those directions are unconvincing even to me. My negative posts have helped give me a reputation for erudition and for venom, but their value, if they have any, is helping readers see how they could do better themselves. We have it in ourselves, together, to discover wonders and create marvels, and yet our world buries us in nonsense and inflicts pointless cruelty. We can do better, and I hope what I write helps my readers, in the tiniest of ways, find their way there.
I'll finish by being really self-indulgent, and pointing out what I think of as twenty of my best pieces. (I'd pick a different twenty another day.)
Manual trackback: the blog formerly known as The Statistical Mechanic; Brad DeLong
Posted by crshalizi at March 16, 2013 21:11 | permanent link
Reminders about multivariate distributions. The multivariate Gaussian distribution: definition, relation to the univariate or scalar Gaussian distribution; effect of linear transformations on the parameters; plotting probability density contours in two dimensions; using eigenvalues and eigenvectors to understand the geometry of multivariate Gaussians; conditional distributions in multivariate Gaussians and linear regression; computational aspects, specifically in R. General methods for estimating parametric distributional models in arbitrary dimensions: moment-matching and maximum likelihood; asymptotics of maximum likelihood; bootstrapping; model comparison by cross-validation and by likelihood ratio tests; goodness of fit by the random projection trick.
Reading: Notes, chapter 14
Posted by crshalizi at March 05, 2013 10:30 | permanent link
My reading this month was either student papers, university busy-work, or else unsatisfying, with two exceptions.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica
Posted by crshalizi at February 28, 2013 23:59 | permanent link
Iteratively re-weighted least squares for logistic regression re-examined: coping with nonlinear transformations and model-dependent heteroskedasticity. The common pattern of generalized linear models and IRWLS. Binomial and Poisson regression. The extension to generalized additive models.
Extended example: building a weather forecaster for Snoqualmie Falls, Wash., with logistic regression. Exploratory examination of the data. Predicting wet or dry days from the amount of precipitation the previous day. First logistic regression model. Finding predicted probabilities and confidence intervals for them. Comparison to spline smoothing and a generalized additive model. Model comparison test detects significant mis-specification. Re-specifying the model: dry days are special. The second logistic regression model and its comparison to the data. Checking the calibration of the second model.
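The core of the first model is a one-liner with glm; a minimal sketch, with stand-in variable and data-frame names (the real code is in the notes):

# wet or dry today, as a function of yesterday's precipitation
fit <- glm(wet.today ~ precip.yesterday, data = snoqualmie, family = binomial)
predict(fit, type = "response")  # fitted probabilities of a wet day
# glm fits this by exactly the iteratively re-weighted least squares described above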
Reading: Notes, chapter 13
Faraway, section 3.1, chapter 6, chapter 7
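To make the IRWLS pattern concrete, a minimal sketch (my own illustration, not the notes' code) of the logistic-regression update, which can be checked against R's built-in glm:

```r
# Iteratively re-weighted least squares for logistic regression (a sketch)
irwls.logistic <- function(x, y, iterations = 25) {
  x <- cbind(1, x)                    # add an intercept column
  beta <- rep(0, ncol(x))
  for (i in 1:iterations) {
    eta <- drop(x %*% beta)           # linear predictor
    p <- 1 / (1 + exp(-eta))          # predicted probabilities
    w <- p * (1 - p)                  # model-based variances = weights
    z <- eta + (y - p) / w            # "working response"
    beta <- coef(lm(z ~ x - 1, weights = w))  # one weighted-least-squares step
  }
  beta
}
# Check: with simulated data, these should agree closely
# x <- rnorm(500); y <- rbinom(500, 1, 1/(1 + exp(-(1 + 2*x))))
# irwls.logistic(x, y); coef(glm(y ~ x, family = binomial))
```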
Posted by crshalizi at February 28, 2013 10:30 | permanent link
In which we compare the power-law scaling model of urban economies due to Bettencourt, West, et al. to an alternative in which city size is actually irrelevant.
This was a one-week take-home exam, intended to use more or less everything taught so far.
Assignment, master data set (each student got a random subset of this)
Posted by crshalizi at February 26, 2013 11:50 | permanent link
Modeling conditional probabilities; using regression to model probabilities; transforming probabilities to work better with regression; the logistic regression model; maximum likelihood; numerical maximum likelihood by Newton's method and by iteratively re-weighted least squares; comparing logistic regression to logistic-additive models.
Reading: Notes, chapter 12
Faraway, chapter 2 (skipping sections 2.11 and 2.12)
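As an illustration of numerical maximum likelihood (a sketch with made-up simulated data, not the chapter's example), one can maximize the log-likelihood directly with a general-purpose optimizer and compare to glm:

```r
# Logistic regression by direct numerical maximization of the likelihood
neg.log.lik <- function(beta, x, y) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = 1 / (1 + exp(-(0.5 + 2 * x))))
optim(c(0, 0), neg.log.lik, x = x, y = y)$par  # numerical optimum
coef(glm(y ~ x, family = binomial))            # should nearly coincide
```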
Posted by crshalizi at February 26, 2013 10:30 | permanent link
In which extinct charismatic megafauna give us an excuse to specification test.
Posted by crshalizi at February 19, 2013 11:50 | permanent link
Non-parametric smoothers can be used to test parametric models. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using simulation from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.
Reading: Notes, chapter 10; R for in-class demos
Cox and Donnelly, chapter 7
Optional reading: Spain et al., "Testing the Form of Theoretical Models by Relaxing Assumptions: Comparing Parametric and Nonparametric Models", ssrn/2164297
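A stripped-down version of the testing recipe (illustrative only, with a spline standing in for the non-parametric smoother; the in-class demos differ in details):

```r
# Simulation-based specification test: the gap between parametric and
# non-parametric in-sample MSEs, calibrated by simulating from the fitted
# parametric model (function and variable names are mine)
mse.gap <- function(x, y) {
  lin <- lm(y ~ x)
  sm <- smooth.spline(x, y)
  mean(residuals(lin)^2) - mean((y - predict(sm, x)$y)^2)
}
spec.test <- function(x, y, B = 200) {
  lin <- lm(y ~ x)
  t.obs <- mse.gap(x, y)
  t.null <- replicate(B, {
    y.sim <- fitted(lin) + rnorm(length(y), sd = summary(lin)$sigma)
    mse.gap(x, y.sim)   # re-compute the test statistic on fake data
  })
  mean(t.null >= t.obs) # p-value: fraction of simulations with as big a gap
}
```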
Posted by crshalizi at February 19, 2013 10:30 | permanent link
Blogging will stay sporadic while I struggle to get enough ahead of the 96 students in ADA that I can devote myself to contemplating the mysteries of the universe and helping young minds develop their own powers. In the meanwhile, if you want more:
For example, one open question now is: How can an Artificial Intelligence do statistics? In the old-fashioned view of Bayesian data analysis as inference-within-a-supermodel, it's simple enough: an AI (or a brain) just runs a Stan-like program to learn from the data and make predictions as necessary. But in a modern view of Bayesian data analysis — iterating the steps of model-building, inference-within-a-model, and model-checking — here, it's not quite clear how the AI works. It needs not just an inference engine, but also a way to construct new models and a way to check models. Currently, those steps are performed by humans, but the AI would have to do it itself, without the aid of a "homunculus" to come up with new models or check the fit of existing ones. This philosophical quandary points to new statistical methods, for example a language-like approach to recursively creating new models from a specified list of distributions and transformations, and an automatic approach to checking model fit, based on some way of constructing quantities of interest and evaluating their discrepancies from simulated replications.

I also don't know how to teach a computer to do applied statistics, obviously, or else I'd be doing it. (Or trying to, teaching permitting.) My guess is that chunks of the come-up-with-new-models part can be done through evolutionary processes. (What's Bayesian updating itself, after all?) As for model checking, there are (at least) two highly non-trivial issues. One is deciding which aspects of the model to check — strategy, or at least tactics, in the choice of test. This somehow feels like the most important piece, but also the one where I have the least notion of how to articulate what a good data analyst does. The other, perhaps less vital direction for automating model checking is devising tests which don't just say "the model is broken" (when and only when it is broken), but at least hint at how to fix the model. It's this which attracts me both to Neyman smooth tests (for distributions) and to the PC algorithm (for conditional independence graphs). In both cases, when they tell you to junk your model, they give you a very strong indication of how it can be improved. We tried for something similar in CSSR, with what success it's not for me to say.
The goal of this workshop is to bring together mathematicians, physicists, and social, information, and computer scientists to explore the dynamics of social learning and cultural evolution. Of particular interest will be ways of using data from social media and online experiments to address questions of interest, which include but are not limited to: How do individual attributes and cognitive constraints affect the dynamics and evolution of social behavior? How does network structure both within and between groups (including online networks and communities) affect social learning and cultural evolution? What are the similarities and differences between cultural and genetic evolution? How do social norms emerge and evolve? What are the main mechanisms driving social learning and the evolution of culture?

This will happen in January 2014, which currently seems like part of the same mythical future as fusion power, human-like AI, and Brazilian world domination, but, unlike them all, will actually arrive...
Manual trackbacks (of sorts): The Browser; Brad DeLong; Radio Free Europe/Radio Liberty (!); Marginal Revolution
Self-centered; Bayes, anti-Bayes; The Collective Use and Evolution of Concepts; The Progressive Forces
Posted by crshalizi at February 18, 2013 15:30 | permanent link
The "curse of dimensionality" limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimension. Parametric models do not have this problem, but have bias and do not let us discover anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example, where each input variable has a "partial response function", which add together to get the total regression function; the partial response functions are unconstrained. This goes beyond linear models but still evades the curse of dimensionality. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. (This is like automatically finding an optimal transformation of each predictor before putting it into a linear model.) Examples in R using the California house-price data. Conclusion: there are no statistical reasons to prefer linear models to additive models, hardly any scientific reasons, and increasingly few computational ones; the continued thoughtless use of linear regression is a scandal.
Reading: Notes, chapter 9
Optional reading: Faraway, chapter 12
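The backfitting idea fits in a few lines. The following toy sketch (mine, with smooth.spline standing in for whatever one-dimensional smoother you prefer) shows the iterate-until-self-consistent structure for two predictors:

```r
# Toy backfitting for an additive model with two predictors
backfit <- function(x1, x2, y, sweeps = 10) {
  alpha <- mean(y)
  f1 <- f2 <- rep(0, length(y))
  for (s in 1:sweeps) {
    # smooth the partial residuals against each predictor in turn
    f1 <- predict(smooth.spline(x1, y - alpha - f2), x1)$y
    f1 <- f1 - mean(f1)   # center each partial response, for identifiability
    f2 <- predict(smooth.spline(x2, y - alpha - f1), x2)$y
    f2 <- f2 - mean(f2)
  }
  list(alpha = alpha, f1 = f1, f2 = f2)
}
```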
Posted by crshalizi at February 14, 2013 10:30 | permanent link
In which spline regression becomes a matter of life and death in Chicago.
Posted by crshalizi at February 12, 2013 11:30 | permanent link
Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, plus a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. A worked example with real data, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression.
Reading: Notes, chapter 8
Optional reading: Faraway, section 11.2.
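For instance, with simulated data (illustrative, not the lecture's example), R's smooth.spline will pick the penalty strength by cross-validation:

```r
# Spline smoothing with the penalty chosen by leave-one-out cross-validation
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)
fit <- smooth.spline(x, y, cv = TRUE)  # cv=TRUE: leave-one-out CV, not GCV
plot(x, y, pch = 16, cex = 0.4)
lines(predict(fit, sort(x)), col = "blue", lwd = 2)
fit$lambda                             # the selected penalty strength
```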
Posted by crshalizi at February 12, 2013 10:30 | permanent link
Weighted least squares estimates, to give more emphasis to particular data points. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.
Reading: Notes, chapter 7
Optional reading: Faraway, section 11.3.
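A bare-bones sketch of the iterative scheme (mine; a linear mean function and a spline for the variance function are assumed purely to keep the illustration short):

```r
# Iteratively re-estimate the regression and the variance function
iterative.wls <- function(x, y, iterations = 5) {
  fit <- lm(y ~ x)                            # initial, unweighted fit
  for (i in 1:iterations) {
    log.sq.res <- log(residuals(fit)^2)
    var.fit <- smooth.spline(x, log.sq.res)   # smooth the log squared residuals
    v.hat <- exp(predict(var.fit, x)$y)       # estimated variance function
    fit <- lm(y ~ x, weights = 1 / v.hat)     # re-fit, weighting by 1/variance
  }
  list(fit = fit, variance = v.hat)
}
```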
Posted by crshalizi at February 07, 2013 10:30 | permanent link
In which extinct charismatic megafauna give us an excuse to practice basic programming, fitting nonlinear models, and bootstrapping.
Assignment, nampd.csv data set, R
Posted by crshalizi at February 05, 2013 11:30 | permanent link
R programs are built around functions: pieces of code that take inputs or arguments, do calculations on them, and give back outputs or return values. The most basic use of a function is to encapsulate something we've done in the terminal, so we can repeat it, or make it more flexible. To assure ourselves that the function does what we want it to do, we subject it to sanity-checks, or "write tests". To make functions more flexible, we use control structures, so that the calculation done, and not just the result, depends on the argument. R functions can call other functions; this lets us break complex problems into simpler steps, passing partial results between functions. Programs inevitably have bugs: debugging is the cycle of figuring out what the bug is, finding where it is in your code, and fixing it. Good programming habits make debugging easier, as do some tricks. Avoiding iteration. Re-writing code to avoid mistakes and confusion, to be clearer, and to be more flexible.
Reading: Notes, Appendix A
Optional reading: Slides from 36-350, introduction to statistical computing, especially through lecture 18.
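For example (a made-up function, not from the notes), encapsulation plus sanity checks might look like this:

```r
# Center by the median and scale by the IQR: more outlier-resistant than scale()
robust.scale <- function(x) {
  stopifnot(is.numeric(x), length(x) > 1)  # basic input checks
  (x - median(x)) / IQR(x)
}
# Sanity checks: the output should always have median 0 and IQR 1
stopifnot(isTRUE(all.equal(median(robust.scale(rnorm(100))), 0)))
stopifnot(isTRUE(all.equal(IQR(robust.scale(rnorm(100))), 1)))
```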
Posted by crshalizi at February 05, 2013 10:30 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Writing for Antiquity; The Collective Use and Evolution of Concepts; The Beloved Republic; Pleasures of Detection, Portraits of Crime; Natural Science of the Human Species
Posted by crshalizi at January 31, 2013 23:59 | permanent link
The sampling distribution is the source of all knowledge regarding statistical uncertainty. Unfortunately, the true sampling distribution is inaccessible, since it is a function of exactly the quantities we are trying to infer. One exit from this vicious circle is the bootstrap principle: approximate the true sampling distribution by simulating from a good model of the process, and treating the simulated data just as we would the real data. The simplest form of this is parametric bootstrapping, i.e., simulating from the fitted model. Nonparametric bootstrapping means simulating by re-sampling, i.e., by treating the observed sample as a complete population and drawing new samples from it. Bootstrapped standard errors, biases, confidence intervals, p-values. Tricks for making the simulated distribution closer to the true sampling distribution (pivotal intervals, studentized intervals, the double bootstrap). Bootstrapping regression models: by parametric bootstrapping; by resampling residuals; by resampling cases. Many, many examples. When does the bootstrap fail?
Note: Thanks to Prof. Christopher Genovese for delivering this lecture while I was enjoying the hospitality of the fen-folk.
Reading: Notes, chapter 6 (R for figures and examples; pareto.R; wealth.dat); Lecture slides; R for in-class examples
Cox and Donnelly, chapter 8
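As a reminder of how little code the resampling idea needs, here is a minimal sketch (mine, not the lecture's) of a bootstrap pivotal confidence interval:

```r
# Nonparametric bootstrap: resample cases, re-estimate, take quantiles
boot.ci <- function(x, estimator, B = 1000, level = 0.95) {
  t.hat <- estimator(x)
  t.boot <- replicate(B, estimator(sample(x, replace = TRUE)))
  alpha <- (1 - level) / 2
  # pivotal ("basic") interval: flip the bootstrap quantiles around t.hat
  ci <- 2 * t.hat - unname(quantile(t.boot, c(1 - alpha, alpha)))
  names(ci) <- c("lower", "upper")
  ci
}
# e.g., boot.ci(rexp(100), median)
```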
Posted by crshalizi at January 31, 2013 10:30 | permanent link
In which, by way of getting comfortable with simulations and the bootstrap, we look at just how detached from reality standard financial models are. (And, in the hidden curriculum, get comfortable writing R functions.)
Posted by crshalizi at January 29, 2013 11:30 | permanent link
Simulation is implementing the story encoded in the model, step by step, to produce something which should look like the data. Stochastic models have random components and so require some random steps. Stochastic models specified through conditional distributions are simulated by chaining together random variables. Methods for generating random variables with specified distributions: the transformation or inverse-quantile method; the rejection method; Markov chain Monte Carlo (Metropolis or Metropolis-Hastings method). Simulation shows us what a model predicts (expectations, higher moments, correlations, regression functions, sampling distributions); analytical probability calculations are short-cuts for exhaustive simulation. Simulation lets us check aspects of the model: does the data look like typical simulation output? if we repeat our exploratory analysis on the simulation output, do we get the same results? Simulation-based estimation: the method of simulated moments.
Reading: Notes, chapter 5 (but sections 5.4--5.6 are optional this year); R
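The transformation method, for instance, takes one line once you know the quantile function; a sketch, with the exponential distribution as the worked case:

```r
# Inverse-quantile method: if U ~ Unif(0,1), then Q(U) has the distribution
# whose quantile function is Q; for the exponential, Q(u) = -log(1-u)/lambda
r.exp.by.quantile <- function(n, lambda) {
  -log(1 - runif(n)) / lambda
}
# Check against the built-in generator:
# qqplot(r.exp.by.quantile(1e4, 2), rexp(1e4, 2))
```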
Posted by crshalizi at January 29, 2013 10:30 | permanent link
Attention conservation notice: Self-promotion of an academic talk, based on a three-year-old paper, on the arcana of how Bayesian methods work from a frequentist perspective.
Because it is snowing relentlessly and the occasional bout of merely freezing air is a blessed relief, I will be escaping to a balmier clime next week: Cambridgeshire.
Manual trackback: Brad DeLong
Posted by crshalizi at January 26, 2013 22:24 | permanent link
In which we try to discern whether poor countries grow faster, by way of practicing with data-set-splitting, cross-validation, and nonparametric smoothing.
Assignment, R, penn-select.csv data set
Incidentally, I have no idea who is responsible for this — "#memerizing shalizi's homeworks" (quickmeme.com/meme/3snfm1/, tweeted by @stratosaur, January 22, 2013) — and would prefer not to be told.
Posted by crshalizi at January 26, 2013 21:38 | permanent link
Lecture 4: The bias-variance trade-off tells us how much we should smooth. Some heuristic calculations with Taylor expansions for general linear smoothers. Adapting to unknown roughness with cross-validation; detailed examples. How quickly does kernel smoothing converge on the truth? Using kernel regression with multiple inputs. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Average predictive comparisons.
Reading: Notes, chapter 4 (R)
Optional readings: Faraway, section 11.1; Hayfield and Racine, "Nonparametric Econometrics: The np Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]
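A rough sketch of bandwidth selection by leave-one-out cross-validation (using base R's ksmooth for brevity; the readings above discuss the np package, which does this properly):

```r
# Leave-one-out CV error for a Gaussian kernel smoother at bandwidth h
cv.mse <- function(h, x, y) {
  errs <- sapply(seq_along(x), function(i) {
    fit <- ksmooth(x[-i], y[-i], kernel = "normal", bandwidth = h,
                   x.points = x[i])
    (y[i] - fit$y)^2
  })
  mean(errs, na.rm = TRUE)  # NA if x[i] has no neighbors within the bandwidth
}
# Pick the bandwidth minimizing the cross-validated error over a grid:
# h.grid <- seq(0.1, 2, by = 0.1)
# h.best <- h.grid[which.min(sapply(h.grid, cv.mse, x = x, y = y))]
```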
Posted by crshalizi at January 26, 2013 21:37 | permanent link
Lecture 3, Model evaluation: error and inference. Statistical models have three main uses: as ways of summarizing (reducing, compressing) the data; as scientific models, facilitating actually scientific inference; and as predictors. Both summarizing and scientific inference are linked to prediction (though in different ways), so we'll focus on prediction. In particular, for now we focus on prediction error, and on how to estimate generalization performance from data.
Reading: Notes, chapter 3 (R)
Cox and Donnelly, ch. 6
Posted by crshalizi at January 26, 2013 21:36 | permanent link
Lecture 2, The truth about linear regression: Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions (non-Gaussian noise, non-independent noise, non-constant variance, dependent predictors). Linear regression coefficients will change with the distribution of the input variables: examples. Why \( R^2 \) is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable effects). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What it really means when coefficients are significantly non-zero. What "controlled for in a linear regression" really means.
Reading: Notes, chapter 2 (R); Notes, appendix B
Optional reading: Faraway, rest of chapter 1
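The coefficients-change-with-the-input-distribution point is easy to see by simulation (a made-up example, not the notes'):

```r
# With a nonlinear truth, the best linear approximation -- and hence the
# OLS coefficient -- shifts when the distribution of X shifts
set.seed(42)
truth <- function(x) log(1 + x)
x1 <- runif(1e4, 0, 1);  y1 <- truth(x1) + rnorm(1e4, sd = 0.05)
x2 <- runif(1e4, 2, 3);  y2 <- truth(x2) + rnorm(1e4, sd = 0.05)
coef(lm(y1 ~ x1))  # slope roughly 0.68 when X is uniform on (0,1)
coef(lm(y2 ~ x2))  # much shallower, roughly 0.29, when X is uniform on (2,3)
```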
Posted by crshalizi at January 26, 2013 21:35 | permanent link
I can't claim he was a close friend. We knew each other through a correspondence which began in 2006. We only met face to face once, when I was able to break away from a conference in the city he was living in for an afternoon of conversation. He was the same modest, charming, morally serious, funny and multi-facetedly brilliant young man in person that he was on paper.
For whatever unfathomable reason, he liked my posts. For my part, when I write I try to imagine the reaction of an ideal audience of a few people, and Aaron was one of them. More: I've often asked myself why I don't do as much to act on my convictions as he did; he pricked my conscience. If, on Friday morning, you'd asked me which, if any, of my acquaintances would ever be remembered as a great American, I'd have named him without even thinking about it. Who else was so ridiculously accomplished at such an age, so devoted to good principles without being a prig, so focused on removing the real stumbling blocks to progress?
I have been angry with Aaron, off and on but always stupidly, ever since I heard the news. But I am furious at the idiots who hounded him to death for the sake of the profits of parasites, and the restriction of knowledge.
Posted by crshalizi at January 16, 2013 23:35 | permanent link
In which we explore some of the limitations of linear models, by way of warming up for more flexible ones.
Posted by crshalizi at January 16, 2013 23:18 | permanent link
Statistics is the branch of mathematical engineering which designs and analyzes methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.
Reading: Notes, chapter 1 (examples.dat for running example; ckm.csv data set for optional exercises)
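To make "regression as smoothing" concrete, here is a hand-rolled kernel-weighted-averaging smoother (a sketch of the Nadaraya-Watson idea, not the chapter's code):

```r
# Predict at each new point by a kernel-weighted average of nearby responses
kernel.smoother <- function(x, y, x.new, h) {
  sapply(x.new, function(x0) {
    w <- dnorm((x - x0) / h)   # Gaussian kernel weights
    sum(w * y) / sum(w)        # weighted average of the y's
  })
}
# x <- runif(200, 0, 10); y <- sin(x) + rnorm(200, sd = 0.3)
# plot(x, y); grid <- seq(0, 10, 0.1)
# lines(grid, kernel.smoother(x, y, grid, h = 0.5), col = "blue")
```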
Posted by crshalizi at January 16, 2013 23:15 | permanent link
Attention conservation notice: Navel-gazing.
Paper manuscripts completed: 5
Papers accepted: 2
Papers rejected: 2 (fools! we'll show you all!)
Papers in refereeing limbo: 5
Papers with co-authors waiting for me to revise: 3
Invited papers finished: 2
Invited papers in progress: 2
Other papers in progress: I won't look in that directory and you can't make me
Grant proposals submitted: 3
Grant proposals funded: 1 (from last year)
Grant proposals rejected: 1 (fools! we'll show you all!)
Grant proposals in refereeing limbo: 2
Grant proposals in progress for next year: 1
Grant proposals refereed: 16
Talks given and conferences attended: 23, in 13 cities
Classes taught: 2 [i, ii]
New classes taught: 0
Summer school classes taught: 1
New summer school classes taught: 1
Pages of new course material written: about 400
Manuscripts refereed: 47
Number of times I was asked to referee my own manuscript: 2
Number of times I did so: 0
Manuscripts waiting for me to referee: 6
Manuscripts for which I was the responsible associate editor at Annals of Applied Statistics: 4
Senior program committees joined: 1
Book proposals reviewed: 3
Book proposals submitted: 1 (see "New course material", above)
Book proposals accepted: 1
Book manuscripts completed: 0
Students who completed their dissertations: 2 (i, ii)
Students preparing to propose in the spring: 2
Letters of recommendation sent: 40+
Promotion received: 1 (to associate professor without tenure)
Months until tenure packet is due: 18
Book reviews published on dead trees: 1
Weblog posts: 133
Substantive posts: 40, counting algal growths
Incomplete posts in the drafts folder: 36
Incomplete posts transferred to the papers-in-progress folder: 2
Weblog posts that won prizes: 1
Thoughtful, detailed guest posts by eminent statisticians which provoked responses from Nobel-Prize-winning economists at Princeton: 1
Splenetic comments written elsewhere, after too much to drink at a party, which provoked responses from Nobel-Prize-winning economists at Princeton: 1
Books acquired: 320
Books begun: 188
Books finished: 160
Books given up: 5
Books sold: 1040 (not a typo)
Books donated: 130
Legal recognitions of major life transitions: 1
Actual major life transitions: 0
Real estate transactions: 1
Posted by crshalizi at January 11, 2013 22:25 | permanent link