At an intersection of Enigmas of Chance and Corrupting the Young.
Self-Evaluation and Lessons Learned
Class announcement. Lectures with no links haven't been delivered yet, and the order and topics may change.
Posted by crshalizi at September 15, 2014 22:38 | permanent link
In which we practice working with data frames, grapple with some of the subtleties of R's system of data types, and think about how to make sequences.
(Hidden agendas: data cleaning; practice using R Markdown; practice reading R help files)
Assignment, due at 11:59 pm on Thursday, 4 September 2014
Posted by crshalizi at August 29, 2014 11:30 | permanent link
In which we play around with basic data structures and convince ourselves that the laws of probability are, in fact, right. (Or perhaps that R's random number generator is pretty good.) Also, we learn to use R Markdown.
— Getting everyone randomly matched for pair programming with a deck of cards worked pretty well. It would have worked better if the university's IT office hadn't broken R on the lab computers.
Lab (and its R Markdown source)
Posted by crshalizi at August 29, 2014 10:30 | permanent link
Matrices as a special type of array; functions for matrix arithmetic and algebra: multiplication, transpose, determinant, inversion, solving linear systems. Using names to make calculations clearer and safer: resource-allocation mini-example. Lists for combining multiple types of values; accessing sub-lists and individual elements; ways of adding and removing parts of lists. Lists as key-value pairs. Data frames: the data structure for classic tabular data, one column per variable, one row per unit; data frames as hybrids of matrices and lists. Structures of structures: using lists recursively to create complicated objects; example with eigen.
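A minimal R sketch of the structures the lecture covers (the variable names here are my own toy examples, not the lecture's resource-allocation data):

```r
# Matrices: arithmetic and algebra
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # 2x2 matrix, filled column-by-column
b <- c(5, 10)
solve(A, b)              # solves the linear system A x = b
t(A); det(A); solve(A)   # transpose, determinant, inverse

# Lists: bundling multiple types of values, as key-value pairs
person <- list(name = "Ada", scores = c(90, 85))
person$scores[2]         # access an individual element of a sub-list
person$scores <- NULL    # remove a component

# Data frames: one column per variable, one row per unit
d <- data.frame(x = 1:3, y = c(2.1, 3.9, 6.2))
d$y[d$x > 1]             # column access like a list, subsetting like a matrix
```

Note that `eigen()` itself returns a list (with components `values` and `vectors`), which is the lecture's "structures of structures" example.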
Posted by crshalizi at August 27, 2014 10:30 | permanent link
Introduction to the course: statistical programming for autonomy, honesty, and clarity of thought. The functional programming idea: write code by building functions to transform input data into desired outputs. Basic data types: Booleans, integers, characters, floating-point numbers. Operators as basic functions. Variables and names. Related pieces of data are bundled into larger objects called data structures. Most basic data structures: vectors. Some vector manipulations. Functions of vectors. Naming of vectors. Our first regression. Subtleties of floating point numbers and of integers.
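A few lines of R illustrating the vector manipulations and the floating-point subtlety mentioned above (my own illustrative values):

```r
x <- c(0.1, 0.2, 0.3)                 # a numeric (floating-point) vector
x[1] + x[2] == x[3]                   # FALSE: floating-point round-off
isTRUE(all.equal(x[1] + x[2], x[3]))  # TRUE: comparison with a tolerance
names(x) <- c("a", "b", "c")
x["b"]                                # access by name
2 * x + 1                             # vectorized arithmetic, no loop needed
```

The moral of the second line is the usual one: never test floating-point numbers for exact equality.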
Posted by crshalizi at August 25, 2014 11:30 | permanent link
Fourth time is the charm:
Further details can be found at the class website. Teaching materials (lecture slides, homeworks, labs, etc.), will appear both there and here.
— The class is much bigger than in any previous year --- we currently have 50 students enrolled in two back-to-back lecture sections, and another twenty-odd on the waiting list, pending more space for labs. Most of the ideas tossed out in my last self-evaluation are going to be at least tried; I'm particularly excited about pair programming for the labs. Also, I at least am enjoying re-writing the lectures in R Markdown's presentation mode.
Manual trackback: Equitablog
Corrupting the Young; Enigmas of Chance; Introduction to Statistical Computing
Posted by crshalizi at August 25, 2014 10:30 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Commit a Social Science; Minds, Brains, and Neurons; The Beloved Republic; The Dismal Science; Corrupting the Young; The Commonwealth of Letters
Posted by crshalizi at July 31, 2014 23:59 | permanent link
Attention conservation notice: Leaden academic sarcasm about methodology.
The following statement was adopted unanimously by the editorial board of the journal, and reproduced here in full:
We wish to endorse, in its entirety and without reservation, the recent essay "On the Emptiness of Failed Replications" by Jason Mitchell. In Prof. Mitchell's field, scientists attempt to detect subtle patterns of association between faint environmental cues and measured behaviors, or to relate remote proxies for neural activity to differences in stimuli or psychological constructs. We are entirely persuaded by his arguments that the experimental procedures needed in these fields are so delicate and so tacit that failures to replicate published findings must indicate incompetence on the part of the replicators, rather than the original report being due to improper experimental technique or statistical fluctuations. While the specific obstacles to transmitting experimental procedures for social priming or functional magnetic resonance imaging are not the same as those for reading the future from the conformation and coloration of the liver of a sacrificed sheep, goat, or other bovid, we see no reason why Prof. Mitchell's arguments are not at least as applicable to the latter as to the former. Instructions to referees for JEBH will accordingly be modified to enjoin them to treat reports of failures to replicate published findings as "without scientific value", starting immediately. We hope by these means to ensure that the field of haruspicy, and perhaps even all of the mantic sciences, is spared the painful and unprofitable controversies over replication which have so distracted our colleagues in psychology.
Questions about this policy should be directed to the editors; I'm just the messenger here.
Manual trackback: Equitablog; Pete Warden
Posted by crshalizi at July 11, 2014 15:50 | permanent link
Attention conservation notice: I have no taste, and I am about to recommend a lot of books.
Somehow, I've not posted anything about what I've been reading since September. So: have October, November, December, January, February, March, April, May, and June.
Posted by crshalizi at July 06, 2014 17:26 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Philosophy; The Running-Dogs of Reaction; Writing for Antiquity; Pleasures of Detection, Portraits of Crime; Learned Folly
Posted by crshalizi at June 30, 2014 23:59 | permanent link
\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\zprime}{z^{\prime}} \newcommand{\Zprime}{Z^{\prime}} \newcommand{\Eta}{H} \newcommand{\equdist}{\stackrel{d}{=}} \newcommand{\indep}{\mathrel{\perp\llap{\perp}}} \]
Attention conservation notice: 2700+ words, expounding a mathematical paper on statistical learning theory. Largely written months ago, posted now in default of actual content.
For the CMU statistical learning theory reading group, I decided to present this:
The question being grappled with here is how we can learn from one example, really from one realization of a stochastic process. Our usual approach in statistics and machine learning is to assume we have many, independent examples from the same source. It seems very odd to say that if we see a single big, internally-dependent example, we're as much in the dark about the data source and its patterns as if we'd observed a single one-dimensional measurement, but that's really all a lot of our theory can do for us. Since we know that animals and machines often can successfully learn generalizable patterns from single realizations, there needs to be some explanation of how the trick is turned...
This paper is thus relevant to my interests in dependent learning, time series and spatio-temporal data, and networks. I read it when it first came out, but I wasn't at all convinced that I'd really understood it, which was why I volunteered to present it. Even so, I skipped sections 6 and 7, which specialize from pretty general learning theory to certain kinds of graphical models. It's valuable to show that the assumptions of the general theory can be realized, and by a non-trivial class of models at that, but they're not really my bag.
At a very high level, the strategy used to prove a generalization-error bound here is fairly familiar in learning theory. Start by establishing a deviation inequality for a single well-behaved function. Then prove that the functions are "stable", in the sense that small changes to their inputs can't alter their outputs too much. The combination of point-wise deviation bounds and stability then yields concentration bounds which hold uniformly over all functions. The innovations are in how this is all made to work when we see one realization of a dependent process.
The data here is an $n$-dimensional vector of random variables, $Z = (Z_1, Z_2, \ldots, Z_n)$. N.B., $n$ here is NOT the number of samples, but the dimensionality of our one example. (I might have preferred something like $p$ here personally.) We do not assume that the $Z_i$ are independent, Markov, exchangeable, stationary, etc., just that $Z$ obeys some stochastic process or other.
We are interested in functions of the whole of $Z$, $g(Z)$. We're going to assume that they have a "bounded difference" property: that if $z$ and $\zprime$ are two realizations of $Z$, which differ in only a single coordinate, then $|g(z) - g(\zprime)| \leq c/n$ for some $c$ which doesn't depend on which coordinate we perturb.
With this assumption, if the $Z_i$ were IID, the ordinary (McDiarmid) bounded differences inequality would say \[ \Prob{g(Z) - \Expect{g(Z)} \geq \epsilon} \leq \exp{\left\{ -\frac{2n\epsilon^2}{c^2} \right\} } \]
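As a quick sanity check on the constants (my own aside, not from the paper): taking $g$ to be the sample mean of variables bounded in $[0,1]$ recovers Hoeffding's inequality:

```latex
g(z) = \frac{1}{n}\sum_{i=1}^{n}{z_i}, \qquad z_i \in [0,1]
\Rightarrow |g(z) - g(\zprime)| \leq \frac{1}{n} \quad \text{(so } c = 1 \text{)}
\Rightarrow \Prob{g(Z) - \Expect{g(Z)} \geq \epsilon} \leq e^{-2n\epsilon^2}
```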
This sort of deviation inequality is the bread-and-butter of IID learning theory, but now we need to make it work under dependence. This needs a probabilistic assumption: the bounded-difference property already ensures that changing one coordinate alone can't change $g$ too much, but under dependence a change in one coordinate might propagate, forcing changes to many other coordinates, and that propagation is what we now have to control.
The way London et al. quantify this is to use the $\eta$-dependence coefficients introduced by Aryeh "Absolutely Regular" Kontorovich. Specifically, pick some ordering of the $Z_i$ variables. Then the $\eta$-dependence between positions $i$ and $j$ is \[ \eta_{ij} = \sup_{z_{1:i-1}, z_i, \zprime_i}{{\left\|P\left(Z_{j:n}\middle| Z_{1:i-1}= z_{1:i-1}, Z_i = z_i\right) - P\left(Z_{j:n}\middle| Z_{1:i-1}= z_{1:i-1}, Z_i = \zprime_i\right) \right\|}_{TV}} \] I imagine that if you are Aryeh, this is transparent, but the rest of us need to take it apart to see how it works...
Fix $z_{1:i-1}$ for the moment. Then the expression above says how much changing $Z_i$ can matter for what happens from $j$ onwards; we might call it how much influence $Z_i$ has, in the context $z_{1:i-1}$. Taking the supremum over $z_{1:i-1}$ shows how much influence $Z_i$ could have, if we set things up just right.
Now, for book-keeping, set $\theta_{ij} = \eta_{ij}$ if $i < j$, $=1$ if $i=j$, and $0$ if $i > j$. This lets us say that $\sum_{j=1}^{n}{\theta_{ij}}$ is (roughly) how much influence $Z_i$ could exert over the whole future.
Since we have no reason to pick out a particular $Z_i$, we ask how influential the most influential $Z_i$ could get: \[ \|\Theta_n\|_{\infty} = \max_{i\in 1:n}{\sum_{j=1}^{n}{\theta_{ij}}} \] Because this quantity is important and keeps coming up, while the matrix of $\theta$'s doesn't, I will depart from the paper's notation and give it an abbreviated name, $\Eta_n$.
Now we have the tools to assert Theorem 1 of London et al., which is (as they say) essentially Theorem 1.1 of Kontorovich and Ramanan:
Theorem 1: Suppose that $g$ is a real-valued function which has the bounded-differences property with constant $c/n$. Then \[ \Prob{g(Z) - \Expect{g(Z)} \geq \epsilon} \leq \exp{\left\{ -\frac{2n\epsilon^2}{c^2 \Eta_n^2} \right\} } \] That is, the effective sample size is $n/\Eta_n^2$, rather than $n$, because of the dependence between observations. (We have seen similar deflations of the number of effective observations before, when we looked at mixing, and even in the world's simplest ergodic theorem.) I emphasize that we are not assuming any Markov property/conditional independence for the observations, still less that $Z$ breaks up into independent chunks (as in an $m$-dependent sequence). We aren't even assuming a bound or a growth rate for $\Eta_n$. If $\Eta_n = O(1)$, then for each $i$, $\eta_{ij} \rightarrow 0$ as $j \rightarrow \infty$, and we have what Kontorovich and Ramanan call an $\eta$-mixing process. It is not clear whether this is stronger than, say, $\beta$-mixing. (Two nice questions, though tangential here, are whether $\beta$-mixing would be enough, and, if not, whether our estimator of $\beta$-mixing could be adapted to get $\eta_{ij}$ coefficients.)
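To make the deflation concrete (my example, not the paper's): if the dependence coefficients happen to decay geometrically, say $\eta_{ij} \leq \rho^{j-i}$ for some $\rho < 1$, then

```latex
\Eta_n \leq 1 + \sum_{k=1}^{\infty}{\rho^k} = \frac{1}{1-\rho}
\quad\Rightarrow\quad \frac{n}{\Eta_n^2} \geq n(1-\rho)^2
```

so with, e.g., $\rho = 1/2$, we pay a constant factor of four in effective sample size, uniformly in $n$.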
To sum up, if we have just one function $f$ with the bounded-difference property, then we have a deviation inequality: we can bound how far below its mean it should be. Ultimately the functions we're going to be concerned with are the combinations of models with a loss function, so we want to control deviations for not just one model but for a whole model class...
(In fact, at some points in the paper London et al. distinguish between the dimension of the data ($n$) and the dimension of the output vector ($N$). Their core theorems presume $n=N$, but I think one could maintain the distinction, just at some cost in notational complexity.)
Ordinarily, when people make stability arguments in learning theory, they have the stability of algorithms in mind: perturbing (or omitting) one data point should lead to only a small change in the algorithm's output. London et al., in contrast, are interested in the stability of hypotheses: small tweaks to $z$ should lead to only small changes in the vector $f(z)$.
Definition. A vector-valued function $f$ is collectively $\beta$-stable iff, when $z$ and $\zprime$ are off-by-one, then $\| f(z) - f(\zprime) \|_1 \leq \beta$. The function class $\mathcal{F}$ is uniformly collectively $\beta$-stable iff every $f \in \mathcal{F}$ is $\beta$-stable.
Now we need to de-vectorize our functions. (Remember, ultimately we're interested in the loss of models, so it would make sense to average their losses over all the dimensions over which we're making predictions.) For any $f$, set \[ \overline{f}(z) \equiv \frac{1}{n}\sum_{i=1}^{n}{f_i(z)} \]
(In what seems to me a truly unfortunate notational choice, London et al. wrote what I'm calling $\overline{f}(z)$ as $F(z)$, and wrote $\Expect{\overline{f}(Z)}$ as $\overline{F}$. I, and much of the reading-group audience, found this confusing, so I'm trying to streamline.)
Now notice that if $\mathcal{F}$ is uniformly collectively $\beta$-stable, then for any $f$ in $\mathcal{F}$, its sample average $\overline{f}$ must obey the bounded difference property with constant $\beta/n$. So sample averages of collectively stable functions will obey the deviation bound in Theorem 1.
Can we extend this somehow into a concentration inequality, a deviation bound that holds uniformly over $\mathcal{F}$?
Let's look at the worst case deviation: \[ \Phi(z) = \sup_{f \in \mathcal{F}}{\Expect{\overline{f}(Z)} - \overline{f}(z)} \] (Note: Strictly speaking, $\Phi$ is also a function of $\mathcal{F}$ and $n$, but I am suppressing that in the notation. [The authors included the dependence on $\mathcal{F}$.])
To see why controlling $\Phi$ gives us concentration, start with the fact that, by the definition of $\Phi$, \[ \Expect{\overline{f}(Z)} - \overline{f}(Z) \leq \Phi(Z) \] so \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + \Phi(Z) \] not just almost surely but always. If in turn $\Phi(Z) \leq \Expect{\Phi(Z)} + \epsilon$, at least with high probability, then we've got \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + \Expect{\Phi(Z)} + \epsilon \] with the same probability.
There are many ways one could try to show that $\Phi$ obeys a deviation inequality, but the one which suggests itself in this context is that of showing $\Phi$ has bounded differences. Pick any $z, \zprime$ which differ in just one coordinate. Then \begin{eqnarray*} \left|\Phi(z) - \Phi(\zprime)\right| & = & \left| \sup_{f\in\mathcal{F}}{\left\{ \Expect{\overline{f}(Z)} - \overline{f}(z)\right\}} - \sup_{f\in\mathcal{F}}{\left\{ \Expect{\overline{f}(Z)} - \overline{f}(\zprime)\right\}} \right|\\ & \leq & \left| \sup_{f \in \mathcal{F}}{ \Expect{\overline{f}(Z)} - \overline{f}(z) - \Expect{\overline{f}(Z)} + \overline{f}(\zprime)}\right| ~ \text{(supremum over differences is at least difference in suprema)}\\ & = & \left|\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\zprime) - f_i(z)}}\right| \\ &\leq& \sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{|f_i(\zprime) - f_i(z)|}} ~ \text{(Jensen's inequality)}\\ & = & \frac{1}{n}\sup_{f\in \mathcal{F}}{\|f(\zprime) - f(z)\|_1} ~ \text{(definition of} \ \| \|_1) \\ & \leq & \frac{\beta}{n} ~ \text{(uniform collective stability)} \end{eqnarray*} Thus Theorem 1 applies to $\Phi$: \[ \Prob{\Expect{\Phi(Z)} - \Phi(Z) \geq \epsilon} \leq \exp{\left\{ -\frac{2n\epsilon^2}{\beta^2 \Eta_n^2} \right\}} \] Set the right-hand side to $\delta$ and solve for $\epsilon$: \[ \epsilon = \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \] Then we have, with probability at least $1-\delta$, \[ \Phi(Z) \leq \Expect{\Phi(Z)} + \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \] Hence, with the same probability, uniformly over $f \in \mathcal{F}$, \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + \Expect{\Phi(Z)} + \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \]
Our next step is to replace the expected supremum of the empirical process, $\Expect{\Phi(Z)}$, with something more tractable and familiar-looking. Really any bound on this could be used, but the authors provide a particularly nice one, in terms of the Rademacher complexity.
Recall how the Rademacher complexity works when we have a class $\mathcal{G}$ of scalar-valued functions $g$ of an IID sequence $X_1, \ldots X_n$: it's \[ \mathcal{R}_n(\mathcal{G}) \equiv \Expect{\sup_{g\in\mathcal{G}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i g(X_i)}}} \] where we introduce the Rademacher random variables $\sigma_i$, which are $\pm 1$ with equal probability, independent of each other and of the $X_i$. Since the Rademacher variables are the binary equivalent of white noise, this measures how well our functions can seem to correlate with noise, and so how well they can seem to match any damn thing.
What the authors do in Definition 2 is adapt the definition of Rademacher complexity to their setting in the simplest possible way: \[ \mathcal{R}_n(\mathcal{F}) \equiv \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(Z)}}} \] In the IID version of Rademacher complexity, each summand involves applying the same function ($g$) to a different random variable ($X_i$). Here, in contrast, each summand applies a different function ($f_i$) to the same random vector ($Z$). This second form can of course include the first as a special case.
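To spell out the special-case claim (my gloss, not the paper's): if $Z = (X_1, \ldots, X_n)$ with the $X_i$ IID, and every $f \in \mathcal{F}$ acts coordinate-wise through a single scalar function, $f_i(z) = g(z_i)$ for some $g \in \mathcal{G}$, then the two definitions coincide:

```latex
\Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(Z)}}}
= \Expect{\sup_{g\in\mathcal{G}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i g(X_i)}}}
= \mathcal{R}_n(\mathcal{G})
```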
Now we would like to relate the Rademacher complexity somehow to the expectation of $\Phi$. Let's take a closer look at the definition there: \[ \Expect{\Phi(Z)} = \Expect{\sup_{f\in\mathcal{F}}{\Expect{\overline{f}(Z)} - \overline{f}(Z)}} \] Let's introduce an independent copy of $Z$, say $\Zprime$, i.e., $Z \equdist \Zprime$, $Z\indep \Zprime$. (These are sometimes called "ghost samples".) Then of course $\Expect{\overline{f}(Z)} = \Expect{\overline{f}(\Zprime)}$, so \begin{eqnarray} \nonumber \Expect{\Phi(Z)} & = & \Expect{\sup_{f\in\mathcal{F}}{\Expect{\overline{f}(\Zprime)} - \overline{f}(Z)}} \\ \nonumber & \leq & \Expect{\sup_{f\in\mathcal{F}}{\overline{f}(\Zprime) - \overline{f}(Z)}} ~ \text{(Jensen's inequality again)}\\ \nonumber & = & \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\Zprime) - f_i(Z)}}}\\ & = & \Expect{\Expect{ \sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\Zprime) - f_i(Z)}} \middle| \sigma}} ~ \text{(law of total expectation)} \label{eqn:phi-after-symmetrizing} \end{eqnarray} Look at the summands. No matter what $f_i$ might be, $f_i(\Zprime) - f_i(Z) \equdist f_i(Z) - f_i(\Zprime)$, because $Z$ and $\Zprime$ have the same distribution but are independent. Since multiplying something by $\sigma_i$ randomly flips its sign, this suggests we should be able to introduce $\sigma_i$ terms without changing anything. This is true, but it needs a bit of trickery, because of the (possible) dependence between the different summands. Following the authors, but simplifying the notation a bit, define \[ T_i = \left\{ \begin{array}{cc} Z & \sigma_i = +1\\ \Zprime & \sigma_i = -1 \end{array} \right. ~ , ~ T^{\prime}_i = \left\{ \begin{array}{cc} \Zprime & \sigma_i = +1 \\ Z & \sigma_i = -1 \end{array}\right. 
\] Now notice that if $\sigma_i = +1$, then \[ f_i(\Zprime) - f_i(Z) = \sigma_i(f_i(\Zprime) - f_i(Z)) = \sigma_i(f_i(T^{\prime}_i) - f_i(T_i)) \] On the other hand, if $\sigma_i = -1$, then \[ f_i(\Zprime) - f_i(Z) = \sigma_i(f_i(Z) - f_i(\Zprime)) = \sigma_i(f_i(T^{\prime}_i) - f_i(T_i)) \] Since $\sigma_i$ is either $+1$ or $-1$, we have \begin{equation} f_i(\Zprime) - f_i(Z) = \sigma_i(f_i(T^{\prime}_i) - f_i(T_i)) \label{eqn:symmetric-difference-in-terms-of-rad-vars} \end{equation} Substituting \eqref{eqn:symmetric-difference-in-terms-of-rad-vars} into \eqref{eqn:phi-after-symmetrizing} \begin{eqnarray*} \Expect{\Phi(Z)} & \leq & \Expect{\Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{f_i(\Zprime) - f_i(Z)}} \middle| \sigma}} \\ & = & \Expect{\Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i (f_i(T^{\prime}_i) - f_i(T_i))}} \middle | \sigma}} \\ & = & \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i(f_i(\Zprime) - f_i(Z))}}}\\ & \leq & \Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(\Zprime)}} + \sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(Z)}}}\\ & = & 2\Expect{\sup_{f\in\mathcal{F}}{\frac{1}{n}\sum_{i=1}^{n}{\sigma_i f_i(Z)}}}\\ & = & 2\mathcal{R}_n(\mathcal{F}) \end{eqnarray*}
This is, I think, a very nice way to show that Rademacher complexity still controls over-fitting with dependent data. (This result in fact subsumes our result in arxiv:1106.0730, and London et al. have, I think, a more elegant proof.)
Now we put everything together.
Suppose that $\mathcal{F}$ is uniformly collectively $\beta$-stable. Then with probability at least $1-\delta$, uniformly over $f \in \mathcal{F}$, \[ \Expect{\overline{f}(Z)} \leq \overline{f}(Z) + 2\mathcal{R}_n(\mathcal{F}) + \beta \Eta_n \sqrt{\frac{\log{1/\delta}}{2n}} \] This is not quite Theorem 2 of London et al., because they go through some additional steps to relate the collective stability of predictions to the collective stability of loss functions, but at this point I think the message is clear.
That message, as promised in the abstract, has three parts. The three conditions which are jointly sufficient to allow generalization from a single big, inter-dependent instance are:
I suspect this trio of conditions is not jointly necessary as well, but that's very much a topic for the future. I also have some thoughts about whether, with dependent data, we really want to control $\Expect{\overline{f}(Z)}$, or rather whether the goal shouldn't be something else, but that'll take another post.
Posted by crshalizi at June 22, 2014 10:54 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Pleasures of Detection, Portraits of Crime; Scientifiction and Fantastica; Writing for Antiquity; Tales of Our Ancestors; Physics; The Progressive Forces
Posted by crshalizi at May 31, 2014 23:59 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Afghanistan and Central Asia; Islam; Philosophy; Writing for Antiquity
Posted by crshalizi at April 30, 2014 23:59 | permanent link
Attention conservation notice: Only of interest if you (1) care about estimating complicated statistical models, and (2) will be in Pittsburgh on Monday.
Much of what I know about graphical models I learned from Prof. Lauritzen's book. His work on sufficient statistics and extremal models, and their connections to symmetry and prediction, has shaped how I think about big chunks of statistics, including stochastic processes and networks. I am really looking forward to this.
(To add some commentary purely of my own: I sometimes encounter the idea that frequentist statistics is somehow completely committed to maximum likelihood, and has nothing to offer when that fails, as it sometimes does [1]. While I can't of course speak for every frequentist statistician, this seems silly. Frequentism is a family of ideas about when probability makes sense, and it leads to some ideas about how to evaluate statistical models and methods, namely, by their error properties. What justifies maximum likelihood estimation, from this perspective, is not the intrinsic inalienable rightness of taking that function and making it big. Rather, it's that in many situations maximum likelihood converges to the right answer (consistency), and in a somewhat narrower range will converge as fast as anything else (efficiency). When those fail, so much the worse for maximum likelihood; use something else that is consistent. In situations where maximizing the likelihood has nice mathematical properties but is computationally intractable, so much the worse for maximum likelihood; use something else that's consistent and tractable. Estimation by minimizing a well-behaved objective function has many nice features, so when we give up on likelihood it's reasonable to try minimizing some other proper scoring function, but again, there's nothing which says we must.)
[1]: It's not worth my time today to link to particular examples; I'll just say that from my own reading and conversation, this opinion is not totally confined to the kind of website which proves that rule 34 applies even to Bayes's theorem. ^
Posted by crshalizi at April 01, 2014 10:45 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; The Beloved Republic; Commit a Social Science; The Dismal Science; Linkage
Posted by crshalizi at March 31, 2014 23:59 | permanent link
Lo these many years ago, I blogged about how a paper of John Norton's had led me to have doubts about Landauer's Principle. Prof. Norton has continued to work on this topic, and I am very happy to share the news about his upcoming talk at CMU's "Energy and Information" seminar:
(For the record, I remain of at least two minds about Landauer's principle. The positive arguments for it seem either special cases or circular, but the conclusion makes so much sense...)
Manual trackback / update, 1 April 2014: Eric Drexler's Metamodern, who objects that "the stages of computation themselves need not be in equilibrium with one another, and hence subject to back-and-forth fluctuations" (his italics). In particular, Drexler suggests introducing an external time-varying potential that "can carry a system deterministically through a series of stages while the system remains at nearly perfect thermodynamic equilibrium at each stage". But I think this means that the whole set-up is not in equilibrium, and in fact this proposal seems quite compatible with sec. 2.2 of Norton's "No-Go" paper. Norton agrees that "there is no obstacle to introducing a slight disequilibrium in a macroscopic system in order to nudge a thermodynamically reversible process to completion"; his claim is that the magnitude of the required disequilibria, measured in terms of free energy, is large compared to Landauer's bound. The point is not that it's impossible to build molecular-scale computers (which would be absurd), but that they will have to dissipate much more heat than Landauer suggests. I won't pretend this settles the matter, but I do have a lecture to prepare...
Posted by crshalizi at March 15, 2014 11:25 | permanent link
Attention conservation notice: Late notice of an academic talk in Pittsburgh. Only of interest if you care about the places where the kind of statistical theory that leans on concepts like "the graphical Markov property" merges with the kind of analytical metaphysics which tries to count the number of possibly fat men not currently standing in my doorway.
A great division in the field of causal inference in statistics is between those who like to think of everything in terms of "potential outcomes", and those who like to think of everything in terms of graphical models. More exactly, while partisans of potential outcomes tend to denigrate graphical models (*), those of us who like the latter tend to presume that potential outcomes can be read off from graphs, and hope someone will get around to showing some sort of formal equivalence.
That somebody appears to have arrived.
As always, the talk is free and open to the public, whether the public follows their arrows or not.
*: I myself have heard Donald Rubin assert that graphical models cannot handle counterfactuals, or non-additive interactions between variables (particularly that they cannot handle non-additive treatments), and that their study leads to neglecting analysis-of-design questions. (This was during his talk at the CMU workshop "Statistical and Machine Learning Approaches to Network Experimentation", 22 April 2013.) This does not diminish Rubin's massive contributions to statistics in general, and to causal inference in particular, but does not exactly indicate a thorough knowledge of a literature which goes rather beyond "playing with arrows".
Posted by crshalizi at March 04, 2014 17:50 | permanent link
Attention conservation notice: I have no taste. To exemplify this, the theme for the month was finally getting a tablet, and so indulging in a taste for not very sophisticated comic books.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Cthulhiana; Writing for Antiquity; The Commonwealth of Letters; Physics; The Eternal Silence of These Infinite Spaces
Posted by crshalizi at February 28, 2014 23:59 | permanent link
Attention conservation notice: Navel-gazing by a middle-aged academic.
I got tenure a few weeks ago. (Technically it takes effect in July.) The feedback from the department and university which accompanied the decision was gratifyingly positive, and I'm pleased that blogging didn't hurt me at all, and perhaps even helped. I got here with the help of a lot of mentors, colleagues, and friends (not a few of whom I met through this blog), and I feel some vindication on their behalf. For myself, I feel — relieved, even pleased.
Relieved and pleased, but not triumphant. I benefited from a huge number of lucky breaks. I know too many people who would be at least as good in this sort of job as I am, and would like such a job, but instead have ones which are far less good for them. If a few job applications, grant decisions, or choices about what to work on when had turned out a bit different, I could have been just as knowledgeable, had ideas just as good, worked on them as obsessively, etc., and still been in their positions, at best. Since my tenure decision came through, I've had two papers and a grant proposal rejected, and another paper idea I've been working on for more than a year scooped. A month like that at the wrong point earlier on might well have sunk my academic career. You don't get tenure at a major university without being a productive scholar (not for the most part anyway), but you also don't get it without being crazily lucky, and I can't feel triumphant about luck.
It's also hard for me to feel triumph because, by the time I get tenure, I will have been at CMU for nine years and change. Doing anything for that long marks you, or at least it marks me, and I'm not sure I like the marks. The point of tenure is security, and I hope to broaden my work, to follow some interests which are more speculative and risky and seem like they will take longer to pay off, if they ever do. But I have acquired habits and made commitments which will be very hard to shift. One of those habits is to think of my future in terms of what sort of scholarly work I'm going to be doing, and presuming that I will be working all the time, with only weak separation between work and the rest of life. I even have some fear that this has deformed my character, making some ordinary kinds of happiness insanely difficult. But maybe "deformed" is the wrong word; maybe I stuck with this job because I was already that kind of person. I can't bring myself to wish I wasn't so academic in my interests, or that I hadn't pursued the career I have, or that I had been less lucky in it. But I worry about what I have given up for it, and how those choices will look in another nine years, or twenty-nine.
Sometime in the future, I may write about what I think about tenure as an institution. But today is a beautiful winter's day here in Pittsburgh, cold but clear, with the sky a brilliant pale blue right now. It's my 40th birthday. I'm going outside to take a walk, and then probably going back to work.
Posted by crshalizi at February 28, 2014 15:52 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; Writing for Antiquity; The Great Transformation; The Collective Use and Evolution of Concepts; The Continuing Crises; The Beloved Republic
Posted by crshalizi at January 31, 2014 23:59 | permanent link
This was not one of my better performances as a teacher.
I felt disorganized and unmotivated, which is a bit perverse, since it's the third time I've taught the class, and I know the material very well by now. The labs were too long, and my attempts to shove the excess parts of the labs into revised homework assignments did not go over very well. The final projects were decent, but on average not as good as in the previous two years.
I have two ideas about what went wrong. One is of course about kids these days (i.e., blaming the victims), and the other is about my own defects of character.
First, in retrospect, previous iterations of the course benefited from the fact that there hadn't been an undergraduate course here in statistical computing. This meant there was a large pool of advanced statistics majors who wanted to take it, but already knew a lot of the background material and skills; the modal student was also more academically mature generally. That supply of over-trained students is now exhausted, and it's not coming back, either: the class is going to become a requirement for the statistics major. (As it should.) So I need to adjust my expectations of what they know and can do on their own downward in a major way. More exactly, if I want them to know how to do something, I have to make sure I teach it to them, and cut other things from the curriculum to make room. This, I signally failed to do.
Second, I think the fact that this was the third time I have taught basically the same content was in fact part of the problem. It made me feel too familiar with everything, and gave me an excuse to put off devising new material until the last moment, which meant I didn't have everything at my fingertips, and frankly I wasn't as excited about it either.
Putting these together suggests that a better idea for next time would be something like the following.
All of this will be a lot of work for me, but that's part of the point. Hopefully, I will make the time to do this, and it will help.
Posted by crshalizi at January 02, 2014 18:01 | permanent link
Attention conservation notice: Navel-gazing.
Paper manuscripts completed: 4
Papers accepted: 3
Papers rejected: 4 (fools! we'll show you all!)
Papers in revise-and-resubmit purgatory: 2
Papers in refereeing limbo: 1
Papers with co-authors waiting for me to revise: 7
Other papers in progress: I won't look in that directory and you can't make me
Grant proposals submitted: 5
Grant proposals funded: 1
Grant proposals rejected: 3 (fools! we'll show you all!)
Grant proposals in refereeing limbo: 2
Grant proposals in progress for next year: 1
Grant proposals refereed: 2
Talks given and conferences attended: 17, in 10 cities
Classes taught: 2 [i, ii]
New classes taught: 0
Summer school classes taught: 1
New summer school classes taught: 0
Pages of new course material written: not that much
Manuscripts refereed: 21
Number of times I was asked to referee my own manuscript: 0
Manuscripts waiting for me to referee: 5
Manuscripts for which I was the responsible associate editor at Annals of Applied Statistics: 4
Book proposals reviewed: 1
Book proposals submitted: 0
Book outlines made and torn up: 3
Book manuscripts completed: 0
Book manuscripts due soon: 1
Students who completed their dissertations: 0
Students who completed their dissertation proposals: 0
Students preparing to propose in the coming year: 4
Letters of recommendation sent: 60+
Dissertations at other universities for which I was an external examiner: 2 (i, ii)
Promotions received: 0
Tenure packets submitted: 1
Days until final decision on tenure: < 30
Book reviews published on dead trees: 0
Weblog posts: 93
Substantive posts: 17, counting algal growths
Incomplete posts in the drafts folder: 39
Incomplete posts transferred to the papers-in-progress folder: 1
Books acquired: 260
Books begun: 104
Books finished: 76
Books given up: 3
Books sold: 28
Books donated: 0
Major life transitions: 1
Posted by crshalizi at January 01, 2014 00:01 | permanent link