The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi   138

Simulation-Based Econometric Methods

by Christian Gouriéroux and Alain Monfort

Oxford University Press, 1996

By Indirection Find Direction Out

Statistical modeling has two parts: one is devising stochastic models of phenomena we care about; the other is relating those models to data, i.e., statistical inference. These stochastic models have unknown parameters, and one of the most important parts of statistics is estimating those parameters from data. The classical approach is to extract some prediction about what the data will look like from the model, and then use as one's estimate the parameter value whose prediction most nearly comes true. This book is about what to do when the model one wants to use is so complicated that one cannot calculate its predictions exactly, but can run simulations from it. The authors' most important idea is that of "indirect inference", where one introduces an "auxiliary" model which is easily fit to the data. One then fits the auxiliary both to the real data, and to simulated data from different settings of the primary model's parameters, and chooses the value of the latter where the fit to simulations best matches the fit to reality. In other words, the primary model is predicting the estimate of the auxiliary model's parameters! This sounds paradoxical, but it can work and work well, even if the auxiliary model isn't even remotely accurate, so long as it's easy to fit.

Indirect inference is, I think, a really important methodological advance, one which opens the door to doing a lot of useful statistics on models of complex systems. However, Gouriéroux and Monfort write for a reader who is very familiar with theoretical statistics, in particular with concepts such as the likelihood and maximum likelihood estimation, Fisher information, the score, consistency and efficiency, and so forth, though no measure theory. (Say, Wasserman's All of Statistics.) No special knowledge of econometrics is really needed, though the last three chapters may seem under-motivated to those not committed to standard econometric models. All this being the case, in the rest of my review I will presume the reader has at least some recollection of the basic ideas of probability, expectation, etc.

Let me start by giving some concrete examples of what I mean by "what the model predicts for different parameters". Typically, predictions will depend not just on the parameters, $ \theta $, but also on some external or "exogenous" variables, which the model doesn't attempt to predict, $ z $. Different methods of estimation can then be based on different predictions about the "endogenous" variables $ y $.

In the "generalized method of moments", one picks a number of functions of the data y and the exogenous variables, say $ K_i(y,z) $, with i here just being an index for these "generalized moments". One would then calculate both the expected or predicted value of the moments (a function of the parameter) \[ \mathbf{E}_{\theta}[K_i(y,z)] \equiv k_i(\theta,z) \] and the empirical or realized value of the moments (a function of the data) \[ \frac{1}{T}\sum_{t}{K_i(y_t,z)} \equiv \hat{k}_i(z) \] with the sum running over all the data points. One's guess for the parameter, $ \hat{\theta}_{GMM} $, is the value of $ \theta $ which makes the expectations as close to the realization as possible. Provided some law of large numbers or ergodic theorem holds, \[ \hat{k}_i(z) \rightarrow k_i(\theta_0,z) \] where $ \theta_0 $ is the true parameter value, so the estimator is "consistent", i.e., \[ \hat{\theta}_{GMM} \rightarrow \theta_0 \] if the mapping from $ \theta$ to generalized moments $k_i(\theta,z)$ is invertible. (There are actually some minor "regularity" conditions needed for consistency, over and above the law of large numbers, but let's let that slide here.)
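The recipe above can be sketched in a few lines of Python. This is my own toy illustration, not an example from the book: I take an exponential model with rate $ \theta $, for which the first two moments are known in closed form ($ k_1(\theta) = 1/\theta $, $ k_2(\theta) = 2/\theta^2 $), and pick the $ \theta $ whose predicted moments come closest to the realized ones. (A crude grid search stands in for a proper optimizer.)

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0                                  # true rate (illustrative)
y = rng.exponential(1 / theta0, size=5000)    # observed data

# Generalized moments K_1(y) = y and K_2(y) = y^2
k_hat = np.array([y.mean(), (y ** 2).mean()])  # realized moments

def k(theta):
    # predicted moments, known analytically for the exponential model
    return np.array([1 / theta, 2 / theta ** 2])

# Choose theta making the predicted moments closest to the realized ones
grid = np.linspace(0.5, 5.0, 2000)
theta_gmm = grid[np.argmin([np.sum((k(t) - k_hat) ** 2) for t in grid])]
```

With 5000 observations the estimate lands close to the true rate; the law of large numbers is doing the work, exactly as in the consistency argument above.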

The method of least squares works similarly. We assume that \[ \mathbf{E}_{\theta}[Y_t|y_1^{t-1},z] = f(y_1^{t-1},z;\theta) \] where we know the functional form f (and where $ y_1^{t-1} $ means "all the observations from time 1 to time t-1"). The mean squared prediction error at a given $ \theta $ is then \[ \frac{1}{T}\sum_{t}{{\left(y_t - f(y_1^{t-1},z;\theta)\right)}^2} \] which we minimize over $ \theta $. (This can be seen as a version of the method of moments, with a different "moment" for each observation.)
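Again a toy sketch of my own (not the book's): take a linear autoregression, $ f(y_1^{t-1},z;\theta) = \theta y_{t-1} $, and minimize the mean squared one-step prediction error over $ \theta $.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, T = 0.6, 4000                 # true AR(1) coefficient (illustrative)
y = np.zeros(T)
for t in range(1, T):                 # y_t = theta0 * y_{t-1} + noise
    y[t] = theta0 * y[t - 1] + rng.normal()

def mse(theta):
    # mean squared one-step prediction error at this theta
    return np.mean((y[1:] - theta * y[:-1]) ** 2)

grid = np.linspace(-0.99, 0.99, 1000)
theta_ls = grid[np.argmin([mse(t) for t in grid])]
```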

Finally, the method of maximum likelihood asks "how often should we expect to see data like this, under this model?", and tries to maximize that probability: \[ L(\theta,z) = \sum_{t}{\log{p_{\theta}(y_t|y_1^{t-1},z)}} \] where $ p_{\theta}(y_t|y_1^{t-1},z) $ is the probability density. (Bayesian estimation is a likelihood-based method, in which the impact of facts and experience is blunted and smoothed by prejudice.)
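In the simplest cases the log-likelihood can be written down and maximized directly; here is a minimal sketch (mine, not the book's) for a Gaussian model with unknown mean and unit variance, where the log-density is known in closed form up to an additive constant.

```python
import numpy as np

rng = np.random.default_rng(2)
mu0 = 1.5                                   # true mean (illustrative)
y = rng.normal(mu0, 1.0, size=3000)

def loglik(mu):
    # log-likelihood of an N(mu, 1) model, up to an additive constant
    return np.sum(-0.5 * (y - mu) ** 2)

grid = np.linspace(0.0, 3.0, 1500)
mu_mle = grid[np.argmax([loglik(m) for m in grid])]
```

The whole point of the book is what to do when no such closed-form density is available to plug into `loglik`.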

Originally, all of these methods of estimation were practical only if one could derive a simple formula for the best-fitting parameter values as a function of the data. Later, with the rise of numerical optimization on cheap, fast computers, one could get away from needing an exact formula, provided it was possible to say precisely what the model predicted --- most often, what the likelihood function was.

This sounds like it ought to be easy, but there are many models which are very natural from a scientific view-point (because they nicely represent mechanisms we guess are at work) for which exact expressions for the likelihood, or indeed for other predictions, just are not available. In modeling dynamics, for example, if what we observe is not the full state of the system, but rather only part of it (and generally a part distorted by noise and nonlinearity at that), it becomes exceedingly difficult to calculate the probability of seeing a given sequence of observations. Or, again, if one's model is specified in terms of the behavior of large numbers of interacting entities (like molecules or economic agents), each possibly with an unobserved internal state, finding an exact likelihood function is pretty much hopeless. If we nonetheless want to connect our models to reality, and estimate parameters, what then should we do?

Gouriéroux and Monfort's answer turns on the fact that many interesting models can be simulated even when they can't be solved. That is, one can fairly quickly and cheaply "run them forward" to generate examples of the kind of behavior they say should happen, if necessary making many simulation runs to get many samples of the behavior they predict. One can then use those samples for estimation, and this in two ways, "direct" and "indirect".

The "direct" method of simulation-based inference is older and more straightforward; just use the sample of simulation runs as an approximation to the probability distribution generated by the model. In the formulas where one would want to use the theoretical probabilities to calculate expectations, likelihoods, etc., substitute the appropriate average over simulations. The easiest way to see how this works is with the method of moments. The actual expectations $ k_i(\theta,z) $ can be very hard to calculate analytically. In the "method of simulated moments" (chapter 2), one doesn't even try, but rather fixes $ \theta $ and runs the simulator S times, each run being the same size as the data, giving simulated values $ y^{(s,\theta)}_{t} $. One then treats the simulated mean, \[ \hat{k}^{S}_i(\theta,z) = \frac{1}{S}\sum_{s}{\frac{1}{T}\sum_{t}{K_i(y^{(s,\theta)}_t,z)}} \] as though it were the exact mean. This introduces extra error into the estimate of $ \theta $, of course, but this error will shrink as the number of simulation runs (S) grows. (Gouriéroux and Monfort consider some clever tricks for re-using the same set of random number draws for multiple $ \theta $, which reduces the computational load.) Some care is needed to preserve convergence to the truth as the data size (T) grows, but this can still be arranged. The other classical estimation methods work similarly (chapter 3). If one can draw from the predictive density, $ p_{\theta}(y_t|y_1^{t-1},z) $, then the average of several such draws is an estimate of the conditional expectation, $ \mathbf{E}_{\theta}[Y_t|y_1^{t-1},z] $, and can be used in the method of simulated least squares. Only slightly more exotic, if $ p_{\theta}(y_t|y_1^{t-1},z) $ can't itself be drawn from, but one can generate a random variable whose expectation is equal to the conditional density, one can then employ the method of simulated maximum likelihood.
Remarkably, this retains (approximately) many of the nice properties of actual maximum likelihood estimation, at least if the number of simulation runs is large enough compared to the data size.
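To make the method of simulated moments concrete, here is a toy sketch of my own, continuing the exponential example but now pretending the moments $ k_i(\theta) $ are unknown, so they must be approximated by averaging over S simulated samples. It also illustrates the random-draw-recycling trick the authors mention: one fixed set of uniform draws is pushed through the inverse CDF for every candidate $ \theta $, so the simulated moments are a smooth function of $ \theta $.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, T, S = 2.0, 1000, 20
y = rng.exponential(1 / theta0, size=T)        # "real" data
k_hat = np.array([y.mean(), (y ** 2).mean()])  # realized moments

# One common set of uniform draws, re-used for every candidate theta
u = rng.random((S, T))

def simulated_moments(theta):
    # inverse-CDF transform turns the common draws into exponential samples
    sims = -np.log(1 - u) / theta
    return np.array([sims.mean(), (sims ** 2).mean()])

grid = np.linspace(0.5, 5.0, 200)
theta_msm = grid[np.argmin([np.sum((simulated_moments(t) - k_hat) ** 2)
                            for t in grid])]
```

Here, of course, the exact moments are actually available, which is what makes this a sandbox; in a model worth the trouble, `simulated_moments` would be all one had.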

The "principle of indirect inference" (ch. 4) is more subtle, and to me much more exciting. In this approach, one introduces an "auxiliary" or "instrumental" model, which is not in general expected to be correct, but is supposed to be something which is easy to fit to the data. One then fits the auxiliary model both to the data, getting auxiliary parameter values $ \hat{\beta}(\mathrm{data}) $, and to simulations from the primary model for various values of the latter's parameters, getting auxiliary parameter values $ \hat{\beta}(\mathrm{sim},\theta) $. The indirect estimate of $ \theta $ is then the parameter setting where $ \hat{\beta}(\mathrm{sim},\theta) $ comes closest to $ \hat{\beta}(\mathrm{data}) $. In effect, one is still comparing the model's predictions to the data, but the prediction is now "what will the auxiliary model look like?", rather than some more direct feature of the data.

For this to work, there are essentially two requirements. The first requirement is that, if we feed in larger and larger samples from the primary model, with its parameters held to $ \theta $, then the estimates of the auxiliary parameters will converge, $ \hat{\beta}(\mathrm{sim},\theta) \rightarrow b(\theta) $. The second requirement is that $ b(\theta) $ be invertible. If these assumptions hold, the indirect estimate will be consistent, that is, it will converge on the true value of $ \theta $. (Gouriéroux and Monfort actually [p. 85] prove consistency under a stronger set of assumptions, which entail these, but these are the ones which actually do the work.) Under somewhat stronger assumptions, they are also able to say something about the limiting distribution of indirect estimates around the truth, and even to derive a version of the Cramér-Rao inequality.

The first assumption, convergence of auxiliary parameter estimates, is very weak, though not altogether trivial. The second assumption basically demands that the auxiliary model be rich enough to distinguish between different versions of the primary model. Typically, but not necessarily always, this will entail there being at least as many auxiliary parameters as there are primary ones, though these needn't correspond in any useful or comprehensible way. The distributional and Cramér-Rao-style results are of the kind one would expect: the indirect estimates will be more precise when the auxiliary parameters can be precisely estimated from the data, and when small differences in the auxiliary parameters correspond to large differences in the primary parameters.
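The whole procedure fits in a short script. Here is a toy sketch of my own, using a textbook pairing rather than anything from the book: the primary model is a moving-average process, $ y_t = e_t + \phi e_{t-1} $, and the auxiliary model is a first-order autoregression fit by least squares. The auxiliary model is wrong --- an MA(1) is not an AR(1) --- but its estimated slope converges to $ b(\phi) = \phi/(1+\phi^2) $, which is invertible for $ 0 \le \phi < 1 $, so both requirements hold.

```python
import numpy as np

rng = np.random.default_rng(4)
phi0, T, S = 0.5, 5000, 10            # true MA(1) coefficient (illustrative)

def ma1(phi, shocks):
    # primary model: y_t = e_t + phi * e_{t-1}
    return shocks[1:] + phi * shocks[:-1]

def fit_ar1(y):
    # auxiliary model: AR(1) slope by least squares (easy to fit, not correct)
    return np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)

y_data = ma1(phi0, rng.normal(size=T + 1))
beta_data = fit_ar1(y_data)           # auxiliary fit to the real data

shocks = rng.normal(size=(S, T + 1))  # common shocks, re-used for every phi
def beta_sim(phi):
    # auxiliary fit averaged over S simulations from the primary model
    return np.mean([fit_ar1(ma1(phi, e)) for e in shocks])

# Indirect estimate: the phi whose simulations best reproduce beta_data
grid = np.linspace(0.0, 0.99, 200)
phi_ii = grid[np.argmin([(beta_sim(p) - beta_data) ** 2 for p in grid])]
```

The MA(1) likelihood is in fact tractable, so again this is only a sandbox; the point is that nothing in the script used it, only the ability to simulate the primary model and fit the auxiliary one.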

Chapters 5, 6 and 7 apply direct and indirect simulation inference to a range of popular models from econometrics, comparing the results to those of other estimation methods on both simulated and real-world data. Some of these are extremely impressive — in particular some of the results on complicated time-series models are simply astonishing — but these chapters will frankly be very hard going for anyone who has not seen these econometric models before. (Chapter 5, in particular, includes an awful lot on how to simulate discrete choice models.) Other applications will readily suggest themselves to any reader who has worked with simulation models.

x+174 pp., bibliography, line figures, index (spotty)

Economics / Probability and Statistics

In print as a hardback, ISBN 0-19-877475-3

24 November 2007; thanks to Stephen Ellner, Linqiao Zhao and Mark Schervish

Updated 16 March 2012: small typo fixes, switched to using MathJax