July 26, 2010

"Generalization Error Bounds for State Space Models: With an Application to Economic Forecasting"

Attention conservation notice: 500 words on a student's thesis proposal, combining all the thrills of macroeconomic forecasting with the stylish vivacity of statistical learning theory. Even if you care, why not check back in a few years when the work is further along?

Daniel McDonald is writing his thesis, under the joint supervision of Mark Schervish and myself. I can use the present participle, because on Thursday he successfully defended his proposal:

"Generalization Error Bounds for State Space Models: With an Application to Economic Forecasting" [PDF]
Abstract: In this thesis, I propose to derive entirely data dependent generalization error bounds for state space models. These results can characterize the out-of-sample accuracy of many types of forecasting methods. The bounds currently available for time series data rely both on a quantity describing the dependence properties of the data generating process known as the mixing rate and on a quantification of the complexity of the model space. I will derive methods for estimating the mixing behavior from data and characterize the complexity of state space models. The resulting risk bounds will be useful for empirical researchers at the forefront of economic forecasting as well as for economic policy makers. The bounds can also be applied in other situations where state space models are employed.

Some of you may prefer the slides (note that Daniel is using DeLong's reduction of DSGEs to D2 normal form), or an even more compact visual summary:

Most macroeconomic forecasting models are, or can be turned into, "state-space models". There's an underlying state variable or variables, which evolves according to a nice Markov process, and then what we actually measure is a noisy function of the state; given the current state, future states and current observations are independent. (Some people like to draw a distinction between "state-space models" and "hidden Markov models", but I've never seen why.) The calculations can be hairy, especially once you allow for nonlinearities, but one can show that, asymptotically, maximum likelihood estimation, as well as various regularizations, have all the nice asymptotic properties one could want.
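
To make the setup concrete, here is a minimal sketch (in Python, with invented parameter values, and no connection to Daniel's actual models) of the simplest case, a scalar linear-Gaussian state space model: the state follows an AR(1) Markov process, and what we record is the state plus measurement noise.

    # Toy linear-Gaussian state space model: hypothetical parameters, just to
    # illustrate "Markov state + noisy observation of the state".
    import numpy as np

    rng = np.random.default_rng(42)
    n, phi, q, r = 200, 0.9, 1.0, 2.0   # length, state coefficient, state and observation noise s.d.

    x = np.zeros(n)                     # latent states
    y = np.zeros(n)                     # what we actually measure
    for t in range(1, n):
        x[t] = phi * x[t - 1] + q * rng.normal()   # state evolves as a Markov process
        y[t] = x[t] + r * rng.normal()             # given the current state, the observation
                                                   # is independent of future states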

Asymptotic statistical theory is, of course, useless for macroeconomics. Or rather: if our methods weren't consistent even with infinite data, we'd know we should just give up. But if the methods only begin to give usably precise answers when the number of data points gets over 10^24, we should give up too. Knowing that things could work with infinite data doesn't help when we really have 252 data points, and serial dependence shrinks the effective sample size to about 12 or 15. The wonderful thing about modern statistical learning theory is that it gives non-asymptotic results, especially risk bounds that hold at finite sample sizes. This is, of course, the reason why ergodic theorems, and the correlation time of US GDP growth rates, have been on my mind recently. In particular, this is why we are thinking about ergodic theorems which give not just finite-sample bounds (like the toy theorem I posted about), but can be made to do so uniformly over whole classes of functions, e.g., the loss functions of different macro forecasting models and their parameterizations.

Anyone wanting to know how to deal with non-stationarity is reminded that Daniel is proposing a dissertation in statistics, and not a solution to the problem of induction.

Enigmas of Chance; The Dismal Science; Incestuous Amplification

Posted by crshalizi at July 26, 2010 15:30 | permanent link

Social Carbon Banditry (Dept. of Modest Proposals for Keeping Civilization from Suffocating In Its Own Waste)

Attention conservation notice: A consideration of social banditry as a tool of climate-change policy. Sadly, this mockery apparently has about as much chance of actually helping as does action by the world's leading democracy.

Only on Unfogged would the comments on a post about visual penis jokes turn to a discussion of what, if anything, civil disobedience could do about climate change; but they did.

One of the goals of classic civil disobedience is to make maintaining an unjust institution costly, though I'm not sure how often it is put in these terms. Ordinarily, those who are disadvantaged or subordinated by a prevailing institution go along with it: they follow its norms and conventions without having to be forced — whether because they accept those norms, or because they reasonably fear the retaliation that would come if they flouted them makes little difference. This makes maintaining the injustice a good deal for the oppressors: not only do they get the immediate benefits of the institution, they don't have to expend a lot of effort maintaining it. Mass civil disobedience disrupts this state of affairs. Even if the oppressors can live with seeing that they are, in fact, the kind of people who will engage in brutality to retain their privileges, the time policemen spend working over Sunday-school teachers, etc., is time they do not spend patrolling the streets, catching burglars, etc. Mass civil disobedience, especially if prolonged, raises the cost of perpetuating injustice. The implicit challenge to Pharaoh is: "Are you really willing to pay what it takes to keep us in bondage?"

What does this suggest when it comes to climate change? Burning fossil fuels is not an act with any intrinsic moral significance. The trouble with it is that my burning those fuels inflicts costs on everyone else, and there is no mechanism, yet, for bringing those costs home to me, the burner. The issue is not one of unjust institutions, but of an unpriced externality. The corresponding direct action, therefore, is not making oppressors actually enforce their institutions, but internalizing the externality. I envisage people descending on oil refineries, coal mines, etc., and forcing the operators to hand over sums proportional to the greenhouse-gas contribution of their sales. What happened to the money afterwards would be a secondary consideration at best (though I wouldn't recommend setting it on fire). The situation calls not for civil disobedience but for social carbon banditry.

Of course, to really be effective, the banditry would need to be persistent, universal, and uniform. Which is to say, the banditry has to become a form of government again, if not necessarily a part of the state.

Modest Proposals; The Dismal Science; The Continuing Crises

Posted by crshalizi at July 26, 2010 14:30 | permanent link

July 09, 2010

"Inferring Hierarchical Structure in Networks and Predicting Missing Links" (Next Week at the [Special Summer Bonus] Statistics Seminar)

Attention conservation notice: Only of interest if you are (1) in Pittsburgh next Tuesday, and (2) care about statistical network modeling and community discovery. Also, the guest is a friend, collaborator and mentor; but, despite his undiscriminating taste in acquaintances, an excellent speaker and scientist.

Usually, during the summer the CMU statistics seminar finds a shaded corner and drowses through the heat, with no more activity than an occasional twitch of its tail. Next week, however, it rouses itself for an exceptional visitor:

Cristopher Moore, "Inferring Hierarchical Structure in Networks and Predicting Missing Links"
Abstract: Given the large amounts of data that are now becoming available on social and biological networks, we need automated tools to extract important structural features from this data. Moreover, for many networks, observing their links is a costly and imperfect process — food webs require field work, protein networks require combining pairs of proteins in the laboratory, and so on. Based on the part of the network we have seen so far, we would like to make good guesses about what pairs of vertices are likely to be connected, so we can focus limited resources on those pairs.
I will present a Bayesian approach to this problem, where we try to infer the hierarchical structure of the network, with communities and subcommunities at multiple levels of organization. We start with a rich model of random networks of this type, and then use a Monte Carlo Markov Chain to explore the space of these models. This approach performs quite well on real networks, often outperforming simple heuristics such as assuming that two vertices with neighbors in common are likely to be connected. In particular, it can handle both "assortative" behavior like that seen in many social networks, and "disassortative" behavior as in food webs.
Joint work with Aaron Clauset and Mark Newman.
Place and time: Tuesday, July 13, 2010, 4:00--5:00 p.m. in Porter Hall 125B

As usual, the seminar is free and open to the public.
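
For anyone curious what the "simple heuristics" mentioned in the abstract look like, here is a toy sketch (mine, not the speaker's) of the common-neighbors baseline: score each unlinked pair of vertices by how many neighbors they share, and guess that the high-scoring pairs are the missing links.

    # Common-neighbors link prediction on a made-up toy graph: the baseline
    # heuristic the talk's method is compared against, not the method itself.
    from itertools import combinations

    graph = {                       # undirected graph as an adjacency dict
        "a": {"b", "c"},
        "b": {"a", "c", "d"},
        "c": {"a", "b", "d"},
        "d": {"b", "c", "e"},
        "e": {"d"},
    }

    scores = {
        (u, v): len(graph[u] & graph[v])       # number of shared neighbors
        for u, v in combinations(sorted(graph), 2)
        if v not in graph[u]                   # only score pairs with no observed link
    }

    for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(pair, score)                     # ('a', 'd') comes out on top here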

Networks; Enigmas of Chance; Incestuous Amplification

Posted by crshalizi at July 09, 2010 14:33 | permanent link

July 03, 2010

Variations on a Patriotic Theme

"They'd ask me, 'Raf, what abut this Revolution of yours? What kind of world are you really trying to give us?' I've had a long time to consider that question."

"And?"

"Did you ever hear the Jimi Hendrix Rendition of 'The Star-Spangled Banner'?"

Starlitz blinked. "Are you kidding? That cut still moves major product off the back catalog."

"Next time, really listen to that piece of music. Try to imagine a country where that music truly was the national anthem. Not weird, not far-out, not hip, not a parody, not a protest against some war, not for young Yankees stoned on some stupid farm in New York. Where music like that was social reality. That is how I want people to live...."

[Bruce Sterling, A Good Old-Fashioned Future, pp. 104--105]

"I wasn't born in America. In point of fact, I wasn't even born. But I work for our government because I believe in America. I happen to believe that this is a unique society. We have a unique role in the world."

Oscar whacked the lab table with an open hand. "We invented the future! We built it! And if they could design or market it a little better than we could, then we just invented something else more amazing yet. If it took imagination, we always had that. If it took enterprise, we always had it. If it took daring and even ruthlessness, we had it — we not only built the atomic bomb, we used it! We're not some crowd of pious, sniveling, red-green Europeans trying to make the world safe for boutiques! We're not some swarm of Confucian social engineers who would love to watch the masses chop cotton for the next two millennia! We are a nation of hands-on cosmic mechanics!"

"And yet we're broke," Greta said.

[Bruce Sterling, Distraction, p. 90]

The Beloved Republic

Posted by crshalizi at July 03, 2010 22:30 | permanent link

July 02, 2010

The World's Simplest Ergodic Theorem

Attention conservation notice: Equation-filled attempt at a teaching note on some theorems in mathematical probability and their statistical application. (Plus an oblique swipe at macroeconomists.)

The "law of large numbers" says that averages of measurements calculated over increasingly large random samples converge on the averages calculated over the whole probability distribution; since that's a vague statement, there are actually several laws of large numbers, from the various ways of making this precise. As traditionally stated, they assume that the measurements are all independent of each other. Successive observations from a dynamical system or stochastic process are generally dependent on each other, so the laws of large numbers don't, strictly, apply, but they have analogs, called "ergodic theorems". (Blame Boltzmann.) Laws of large numbers and ergodic theorems are the foundations of statistics; they say that sufficiently large samples are representative of the underlying process, and so let us generalize from training data to future or currently-unobserved occurrences.

Here is the simplest route I know to such a theorem; I can't remember if I learned it from Prof. A. V. Chubukov's statistical mechanics class, or from Uriel Frisch's marvellous Turbulence. Start with a sequence of random variables X_1, X_2, ... X_n. Assume that they all have the same (finite) mean m and the same (finite) variance v; also assume that the covariance, E[X_t X_{t+h}] - E[X_t] E[X_{t+h}], depends only on the difference in times h and not on the starting time t. (These assumptions together comprise "second-order" or "weak" or "wide-sense" stationarity. Stationarity is not actually needed for ergodic theorems, one can get away with what's called "asymptotic mean stationarity", but stationarity simplifies the presentation here.) Call this covariance c_h. We contemplate the arithmetic mean of the first n values in X, called the "time average":

\[ 
A_n = \frac{1}{n}\sum_{t=1}^{n}{X_t} 
 \]

What is the expectation value of the time average? Taking expectations is a linear operator, so

\[ 
\mathbf{E}[A_n] = \frac{1}{n}\sum_{t=1}^{n}{\mathbf{E}[X_t]} = \frac{n}{n}m = m 
 \]
which is re-assuring: the expectation of the time average is the common expectation. What we need for an ergodic theorem is to show that as n grows, A_n tends, in some sense, to get closer and closer to its expectation value.

The most obvious sense we could try is for the variance of A_n to shrink as n grows. Let's work out that variance, remembering that for any random variable Y, Var[Y] = E[Y^2] - (E[Y])^2.


\begin{eqnarray*} 
\mathrm{Var}[A_n] & = & \mathbf{E}[A_n^2] - m^2\\ 
& = & \frac{1}{n^2}\mathbf{E}\left[{\left(\sum_{t=1}^{n}{X_t}\right)}^2\right] - m^2\\ 
& = & \frac{1}{n^2}\mathbf{E}\left[\sum_{t=1}^{n}{\sum_{s=1}^{n}{X_t X_s}}\right] - m^2\\ 
& = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{\mathbf{E}\left[X_t X_s\right]}} - m^2\\ 
& = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{ c_{s-t} + m^2}} - m^2\\ 
& = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{s=1}^{n}{ c_{s-t}}}\\ 
& = & \frac{1}{n^2}\sum_{t=1}^{n}{\sum_{h=1-t}^{n-t}{ c_h}} 
\end{eqnarray*}

This used the linearity of expectations, and the definition of the covariances c_h. Imagine that we write out all the covariances in an n × n matrix, and average them together; that's the variance of A_n. The entries on the diagonal of the matrix are all c_0 = v, and the off-diagonal entries are symmetric, because (check this!) c_{-h} = c_h. So the sum over the whole matrix is the sum on the diagonal, plus twice the sum of what's above the diagonal.

\[ 
\mathrm{Var}[A_n] = \frac{v}{n} + \frac{2}{n^2}\sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{c_{h}}} 
 \]

If the X_t were uncorrelated, we'd have c_h = 0 for all h > 0, so the variance of the time average would be O(1/n). Since independent random variables are necessarily uncorrelated (but not vice versa), we have just recovered a form of the law of large numbers for independent data. How can we make the remaining part, the sum over the upper triangle of the covariance matrix, go to zero as well?

We need to recognize that it won't automatically do so. The assumptions we've made so far are compatible with a process where X_1 is chosen randomly, and then all subsequent observations are copies of it, so that the variance of the time average is v, no matter how long the time series; this is the famous problem of checking a newspaper story by reading another copy of the same paper. (More formally, in this situation c_h = v for all h, and you can check that plugging this in to the equations above gives v for the variance of A_n for all n.) So if we want an ergodic theorem, we will have to impose some assumption on the covariances, one weaker than "they are all zero" but strong enough to exclude the sequence of identical copies.
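
To spell that check out (it uses nothing beyond the formula for Var[A_n] above): when c_h = v for every h, the inner sum contributes (n-t) copies of v for each t, and $ \sum_{t=1}^{n-1}{(n-t)} = n(n-1)/2 $, so

\[ 
\mathrm{Var}[A_n] = \frac{v}{n} + \frac{2}{n^2}\cdot\frac{n(n-1)}{2}\,v = \frac{v}{n} + \frac{(n-1)\,v}{n} = v 
 \]

for every n, confirming that averaging identical copies of a single observation buys nothing.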

Using two inequalities to put upper bounds on the variance of the time average suggests a natural and useful assumption which will give us our ergodic theorem.


\begin{eqnarray*} 
\sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{c_{h}}} & \leq & \sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{|c_h|}}\\ 
& \leq & \sum_{t=1}^{n-1}{\sum_{h=1}^{\infty}{|c_h|}} 
\end{eqnarray*}
Covariances can be negative, so we upper-bound the sum of the actual covariances by the sum of their magnitudes. (There is no approximation here if all covariances are positive.) Then we extend the inner sum so it covers all lags. This might of course be infinite, and would be for the sequence-of-identical-copies. Our assumption then is
\[ 
\sum_{h=1}^{\infty}{|c_h|} < \infty 
 \]
This is a sum of covariances over time, so let's write it in a way which reflects those units: $ \sum_{h=1}^{\infty}{|c_h|} = v T $ , where T is called the "(auto)covariance time", "integrated (auto)covariance time" or "(auto)correlation time". We are assuming a finite correlation time. (Exercise: Suppose that $ c_h = v e^{-h \tau} $ , as would be the case for a first-order linear autoregressive model, and find T. This confirms, by the way, that the assumption of finite correlation time can be satisfied by processes with non-zero correlations.)
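
If one would rather estimate the correlation time than assume it, here is a rough sketch (my own, with a hand-picked truncation lag rather than any principled choice, and a simulated AR(1) standing in for real data) of the obvious plug-in estimate: sum the absolute sample autocovariances out to some lag M and divide by the sample variance.

    # Crude plug-in estimate of the integrated autocovariance time
    # T = (1/v) * sum_{h >= 1} |c_h|, truncating the sum at a hand-picked lag M.
    import numpy as np

    rng = np.random.default_rng(1)
    n, phi = 10_000, 0.8
    x = np.zeros(n)                     # simulated AR(1) as a stand-in for data
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()

    def autocov(x, h):
        """Sample autocovariance of x at lag h."""
        xc = x - x.mean()
        return np.dot(xc[:len(x) - h], xc[h:]) / len(x)

    v_hat = autocov(x, 0)
    M = 100                             # truncation lag, chosen by eye; real data need more care
    T_hat = sum(abs(autocov(x, h)) for h in range(1, M + 1)) / v_hat
    print(T_hat)                        # should land near phi / (1 - phi) = 4 for this process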

Returning to the variance of the time average,


\begin{eqnarray*} 
\mathrm{Var}[A_n] & = & \frac{v}{n} + \frac{2}{n^2}\sum_{t=1}^{n-1}{\sum_{h=1}^{n-t}{c_{h}}}\\ 
& \leq & \frac{v}{n} + \frac{2}{n^2}\sum_{t=1}^{n-1}{v T}\\ 
& = & \frac{v}{n} + \frac{2(n-1) vT}{n^2}\\ 
& \leq & \frac{v}{n} + \frac{2 vT}{n}\\ 
& = & \frac{v}{n}(1+ 2T) 
\end{eqnarray*}
So, if we can assume the correlation time is finite, the variance of the time average is O(1/n), just as if the data were independent. However, the convergence is slower than for independent data by an over-all factor which depends only on T. As T shrinks to zero, we recover the result for uncorrelated data, an indication that our approximations were not too crude.
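
As a numerical sanity check (a sketch only, using an AR(1) with invented parameters, for which c_h = v φ^h is available in closed form), one can compute the exact variance of A_n from the double sum over covariances and compare it with the naive v/n and with the (v/n)(1+2T) bound:

    # Exact Var[A_n] from the covariances, versus the i.i.d. rate v/n and the
    # bound (v/n)(1 + 2T), for an AR(1) with made-up parameters.
    import numpy as np

    phi, v, n = 0.8, 1.0, 200
    c = v * phi ** np.arange(n)                    # c_0, c_1, ..., c_{n-1}
    H = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))   # |s - t| for every pair (s, t)
    var_exact = c[H].sum() / n**2                  # (1/n^2) * sum over the whole covariance matrix

    T = phi / (1 - phi)                            # integrated autocovariance time for this AR(1)
    print(var_exact, v / n, (v / n) * (1 + 2 * T)) # exact value exceeds v/n but stays below the bound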

From knowing the variance, we can get rather tight bounds on the probability of A_n's deviations from m if we assume that the fluctuations are Gaussian. Unfortunately, none of our assumptions so far entitle us to assume that. For independent data, we get Gaussian fluctuations of averages via the central limit theorem, and these results, too, can be extended to dependent data. But the assumptions needed for dependent central limit theorems are much stronger than merely a finite correlation time. What needs to happen, roughly speaking, is that if I take (nearly) arbitrary functions f and g, the correlation between f(X_t) and g(X_{t+h}) must go to zero as h grows. (This idea is quantified as "mixing" or "weak dependence".)

However, even without the Gaussian assumption, we can put some bounds on deviation probabilities by bounding the variance (as we have) and using Chebyshev's inequality:

\[ 
\mathrm{Pr}\left(|A_n - m| > \epsilon\right) \leq \frac{\mathrm{Var}[A_n]}{\epsilon^2} \leq \frac{v}{\epsilon^2} \frac{2T+1}{n} 
 \]
which goes to zero as n grows. So we have just proved convergence "in mean square" and "in probability" of time averages on their stationary expectation values, i.e., the mean square and weak ergodic theorems, under the assumptions that the data are weakly stationary and the correlation time is finite. There were a couple of steps in our argument where we used not very tight inequalities, and it turns out we can weaken the assumption of finite correlation time. The necessary and sufficient condition for the mean-square ergodic theorem turns out to be that, as one might hope,
\[ 
\lim_{n\rightarrow\infty}{\frac{1}{n}\sum_{h=1}^{n}{c_h}} = 0 
 \]
though I don't know of any way of proving it rigorously without using Fourier analysis (which is linked to the autocovariance via the Wiener-Khinchin theorem; see chapters 19 and 21 of Almost None of the Theory of Stochastic Processes).

Reverting to the case of finite correlation time T, observe that we have the same variance from n dependent samples as we would from n/(1+2T) independent ones. One way to think of this is that the dependence shrinks the effective sample size by a factor of 2T+1. Another, which is related to the name of "correlation time", is to imagine dividing the time series up into blocks of that length, i.e., a central point and its T neighbors in either direction, and use only the central points in our averages. Those are, in a sense, effectively uncorrelated. Non-trivial correlations extend about T time-steps in either direction. Knowing T can be very important in figuring out how much actual information is contained in your data set.

To give an illustration not entirely at random, quantitative macroeconomic modeling is usually based on official statistics, like GDP, which come out quarterly. For the US, which is the main but not exclusive focus of these efforts, the data effectively start in 1947, as what national income accounts exist before then are generally thought too noisy to use. Taking the GDP growth rate series from 1947 to the beginning of 2010, 252 quarters in all, de-trending, I calculate a correlation time of just over ten quarters. (This granting the economists their usual, but absurd, assumption that economic fluctuations are stationary.) So macroeconomic modelers effectively have 11 or 12 independent data points to argue over.

Constructively, this idea leads to the mathematical trick of "blocking". To extend a result about independent random sequences to dependent ones, divide the dependent sequence up into contiguous blocks, but with gaps between them, long enough that the blocks are nearly independent of each other. One then has the IID result for the blocks, plus a correction which depends on how much residual dependence remains despite the filler. Picking an appropriate combination of block length and spacing between blocks keeps the correction small, or at least controllable. This idea is used extensively in ergodic theory (including the simplest possible proof of the strong ergodic theorem) and information theory (see Almost None again), in proving convergence results for weakly dependent processes, in bootstrapping time series, and in statistical learning theory under dependence.
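
Here is the crudest version of the blocking idea in code (a sketch, not the careful block-plus-gap constructions the proofs use): carve a simulated dependent series into blocks several correlation times long, and treat the block means as roughly independent when estimating the uncertainty of the overall time average.

    # "Blocking" at its simplest (batch means): block averages of a dependent series,
    # with blocks much longer than the correlation time, are nearly uncorrelated, so
    # their spread gives a more honest standard error for the time average than the
    # i.i.d. formula does.  AR(1) toy example with made-up parameters.
    import numpy as np

    rng = np.random.default_rng(7)
    n, phi = 100_000, 0.9
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()

    block_len = 200                                # comfortably many correlation times
    block_means = x[: n - n % block_len].reshape(-1, block_len).mean(axis=1)

    naive_se = x.std(ddof=1) / np.sqrt(n)          # pretends the observations are independent
    block_se = block_means.std(ddof=1) / np.sqrt(len(block_means))
    print(naive_se, block_se)                      # the blocked estimate is substantially larger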

Manual trackback: An Ergodic Walk (fittingly enough); Thoughts on Economics

Enigmas of Chance

Posted by crshalizi at July 02, 2010 13:40 | permanent link

July 01, 2010

Posed While the Algae Grow in Their Fur

The last post was really negative; to cleanse the palate, look at the Sloth Sanctuary of Costa Rica, dedicated to rescuing orphaned and imperiled sloths.

(Via Environmental Graffiti, via Matthew Berryman, and with thanks to John Emerson)

Linkage

Posted by crshalizi at July 01, 2010 14:40 | permanent link
