Indirect inference is, I think, a really important methodological advance,
one which opens the door to doing a *lot* of useful statistics on models
of complex systems. However, Gouriéroux and Monfort write for a reader
who is very familiar with theoretical statistics, in particular with concepts
such as the likelihood and maximum likelihood estimation, Fisher information,
the score, consistency and efficiency, and so forth, though no measure theory.
(Say, Wasserman's All of
Statistics.) No special knowledge of econometrics is really needed,
though the last three chapters may seem under-motivated to those not committed
to standard econometric models. All this being the case, in the rest of my
review I will presume the reader has at least some recollection of the basic
ideas of probability, expectation, etc.

Let me start by giving some concrete examples of what I mean by "what the
model predicts for different parameters". Typically, predictions will depend
not just on the parameters, $ \theta $, but also on some external or
"exogenous" variables, which the model doesn't attempt to predict, *z*.
Different methods of estimation can then be based on different predictions
about the "endogenous" variables *y*.

In the "generalized method of moments", one picks a number of functions of
the data *y* and the exogenous variables, say $ K_i(y,z) $,
with *i* here just being an index for these "generalized moments". One
would then calculate both the expected or predicted value of the moments (a
function of the parameter)
\[
\mathbf{E}_{\theta}[K_i(y,z)] \equiv k_i(\theta,z)
\]
and the empirical or realized value of the moments (a function of the data)
\[
\frac{1}{T}\sum_{t}{K_i(y_t,z)} \equiv \hat{k}_i(z)
\]
with the sum running over all the data points. One's guess for the parameter,
$ \hat{\theta}_{GMM} $, is the value of $ \theta $ which makes the expectations
as close to the realization as possible. Provided some law of large numbers
or ergodic theorem holds,
\[
\hat{k}_i(z) \rightarrow k_i(\theta_0,z)
\]
where $ \theta_0 $ is the true parameter value, so the estimator is
"consistent", i.e.,
\[
\hat{\theta}_{GMM} \rightarrow \theta_0
\]
if the mapping from $ \theta$ to generalized moments $k_i(\theta,z)$ is
invertible.
(There are actually some minor "regularity" conditions needed for consistency,
over and above the law of large numbers, but let's let that slide here.)
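As a concrete (and entirely hypothetical) illustration, here is a minimal sketch of GMM in Python. The model, an exponential distribution with rate $ \theta $, and the single moment $ K(y) = y $, so that $ k(\theta) = 1/\theta $, are my choices for illustration, not an example from the book:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: exponential with rate theta, so the single
# generalized moment K(y) = y has expectation k(theta) = 1/theta.
theta_true = 2.0
T = 20_000
y = rng.exponential(scale=1 / theta_true, size=T)

k_hat = y.mean()  # empirical moment, (1/T) sum_t K(y_t)

# GMM: pick the theta whose predicted moment is closest to the realized one.
grid = np.linspace(0.1, 10, 991)
theta_gmm = grid[np.argmin((1 / grid - k_hat) ** 2)]
```

With one moment per parameter and an invertible moment map, this just recovers $ \hat{\theta}_{GMM} = 1/\hat{k} $, up to the grid resolution.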

The method of least squares works similarly. We assume that
\[
\mathbf{E}_{\theta}[Y_t|y_1^{t-1},z] = f(y_1^{t-1},z;\theta)
\]
where we know the functional form *f* (and where $ y_1^{t-1} $
means "all the observations from time 1 to time *t*-1"). The mean squared
prediction error at a given $ \theta $ is then
\[
\frac{1}{T}\sum_{t}{{\left(y_t - f(y_1^{t-1},z;\theta)\right)}^2}
\]
which we minimize over $ \theta $. (This can be seen as a version of
the method of moments, with a different "moment" for each observation.)
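A least-squares sketch along the same lines, for an assumed AR(1) model in which $ f(y_1^{t-1};\theta) = \theta y_{t-1} $ (again my toy example, not the book's):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model: AR(1), where the conditional mean of Y_t given the
# past is f(y_1^{t-1}; theta) = theta * y_{t-1}.
theta_true = 0.6
T = 10_000
y = np.zeros(T)
for t in range(1, T):
    y[t] = theta_true * y[t - 1] + rng.normal()

def mspe(theta):
    # mean squared one-step-ahead prediction error at this theta
    return np.mean((y[1:] - theta * y[:-1]) ** 2)

grid = np.linspace(-0.99, 0.99, 199)
theta_ls = grid[np.argmin([mspe(th) for th in grid])]
```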

Finally, the method of maximum likelihood asks "how often should we expect
to see data like this, under this model?", and tries to maximize that
probability:
\[
L(\theta,z) = \sum_{t}{\log{p_{\theta}(y_t|y_1^{t-1},z)}}
\]
where $ p_{\theta}(y_t|y_1^{t-1},z) $ is the probability density. (Bayesian
estimation is a likelihood-based method, in which the impact of facts and
experience is blunted and smoothed by prejudice.)
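In code, maximizing the log-likelihood for an assumed Poisson model (once more a toy of my own choosing) looks like:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical model: Poisson with rate theta.  Up to terms constant in
# theta, the log-likelihood is  sum_t y_t * log(theta) - T * theta.
theta_true = 3.0
T = 5_000
y = rng.poisson(theta_true, size=T)

def log_lik(theta):
    return y.sum() * np.log(theta) - T * theta

grid = np.linspace(0.5, 10, 951)
theta_mle = grid[np.argmax([log_lik(th) for th in grid])]
```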

Originally, all of these methods of estimation were practical only if one could derive a simple formula for the best-fitting parameter values as a function of the data. Later, with the rise of numerical optimization on cheap, fast computers, one could get away from needing an exact formula, provided it was possible to say precisely what the model predicted --- most often, what the likelihood function was.

This sounds like it ought to be easy, but there are many models which are very natural from a scientific view-point (because they nicely represent mechanisms we guess are at work) for which exact expressions for the likelihood, or indeed for other predictions, just are not available. In modeling dynamics, for example, if what we observe is not the full state of the system, but rather only part of it (and generally a part distorted by noise and nonlinearity at that), it becomes exceedingly difficult to calculate the probability of seeing a given sequence of observations. Or, again, if one's model is specified in terms of the behavior of large numbers of interacting entities (like molecules or economic agents), each possibly with an unobserved internal state, finding an exact likelihood function is pretty much hopeless. If we nonetheless want to connect our models to reality, and estimate parameters, what then should we do?

Gouriéroux and Monfort's answer turns on the fact that many interesting models can be simulated even when they can't be solved. That is, one can fairly quickly and cheaply "run them forward" to generate examples of the kind of behavior they say should happen, if necessary making many simulation runs to get many samples of the behavior they predict. One can then use those samples for estimation, in two ways, "direct" and "indirect".

The "direct" method of simulation-based inference is older and more
straightforward; just use the sample of simulation runs as an approximation to
the probability distribution generated by the model. In the formulas where one
would want to use the theoretical probabilities to calculate expectations,
likelihoods, etc., substitute the appropriate average over simulations. The
easiest way to see how this works is with the method of moments. The actual
expectations $ k_i(\theta,z) $ can be very hard to calculate
analytically. In the "method of simulated moments" (chapter 2), one doesn't
even try, but rather fixes $ \theta $ and runs the simulator *S*
times, each run being the same size as the data, giving simulated
values $ y^{(s,\theta)}_{t} $. One then treats the simulated mean,
\[
\hat{k}^{S}_i(\theta,z) = \frac{1}{S}\sum_{s}{\frac{1}{T}\sum_{t}{K_i(y^{(s,\theta)}_t,z)}}
\]
as though it were the exact mean. This introduces extra error into the
estimate of $ \theta $, of course, but this error will shrink as the
number of simulation runs (*S*) grows. (Gouriéroux and Monfort
consider some clever tricks for re-using the same set of random number draws
for multiple $ \theta $, which reduces the computational load.) Some
care is needed to preserve convergence to the truth as the data size
(*T*) grows, but this can still be arranged.
The other classical estimation methods work similarly (chapter 3). If one can
draw from the predictive density, $ p_{\theta}(y_t|y_1^{t-1},z) $, then the
average of several such draws is an estimate of the conditional expectation, $
\mathbf{E}_{\theta}[Y_t|y_1^{t-1},z] $, and can be used in the method of
simulated least squares. Only slightly more exotically: if $
p_{\theta}(y_t|y_1^{t-1},z) $ can't itself be drawn from, but one can generate
a random variable whose *expectation* equals the conditional
density, one can employ the method of simulated maximum likelihood.
Remarkably, this retains (approximately) many of the nice properties of actual
maximum likelihood estimation, at least if the number of simulation runs is
large enough compared to the data size.
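A sketch of the method of simulated moments, for a toy simulator of my own devising: $ Y = e^{\theta Z} $ with $ Z $ standard Gaussian, pretending its expectation were unavailable analytically. It also illustrates the common-random-numbers trick of re-using one fixed set of draws across candidate values of $ \theta $:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical simulator: Y = exp(theta * Z), Z ~ N(0, 1); pretend the
# expectation of Y has no closed form.
def simulate(theta, z):
    return np.exp(theta * z)

theta_true = 0.5
T, S = 5_000, 20
y_data = simulate(theta_true, rng.normal(size=T))
k_data = y_data.mean()  # empirical moment

# Common random numbers: one fixed set of draws, re-used for every
# candidate theta, so the simulated moment varies smoothly in theta.
z_sim = rng.normal(size=(S, T))

def k_sim(theta):
    # average over S simulation runs, each of the same size T as the data
    return simulate(theta, z_sim).mean()

grid = np.linspace(0.01, 1.5, 300)
theta_msm = grid[np.argmin([(k_sim(th) - k_data) ** 2 for th in grid])]
```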

The "principle of indirect inference" (ch. 4) is more subtle, and to me much more exciting. In this approach, one introduces an "auxiliary" or "instrumental" model, which is not in general expected to be correct, but is supposed to be something which is easy to fit to the data. One then fits the auxiliary model both to the data, getting auxiliary parameter values $ \hat{\beta}(\mathrm{data}) $, and to simulations from the primary model for various values of the latter's parameters, getting auxiliary parameter values $ \hat{\beta}(\mathrm{sim},\theta) $. The indirect estimate of $ \theta $ is then the parameter setting where $ \hat{\beta}(\mathrm{sim},\theta) $ comes closest to $ \hat{\beta}(\mathrm{data}) $. In effect, one is still comparing the model's predictions to the data, but the prediction is now "what will the auxiliary model look like?", rather than some more direct feature of the data.

For this to work, there are essentially two requirements. The first requirement is that, if we feed in larger and larger samples from the primary model, with its parameters held to $ \theta $, then the estimates of the auxiliary parameters will converge, $ \hat{\beta}(\mathrm{sim},\theta) \rightarrow b(\theta) $. The second requirement is that $ b(\theta) $ be invertible. Provided both requirements hold, the indirect estimate will be consistent, that is, it will converge on the true value of $ \theta $. (Gouriéroux and Monfort actually [p. 85] prove consistency under a stronger set of assumptions, which entail these, but these are the ones which actually do the work.) Under somewhat stronger assumptions, they are also able to say something about the limiting distribution of indirect estimates around the truth, and even to derive a version of the Cramér-Rao inequality.

The first assumption, convergence of auxiliary parameter estimates, is very weak, though not altogether trivial. The second assumption basically demands that the auxiliary model be rich enough to distinguish between different versions of the primary model. Typically, though not necessarily always, this will entail there being at least as many auxiliary parameters as there are primary ones, though these needn't correspond in any useful or comprehensible way. The distributional and Cramér-Rao-style results are of the kind one would expect: the indirect estimates will be more precise when the auxiliary parameters can be precisely estimated from the data, and when small differences in the auxiliary parameters correspond to large differences in the primary parameters.
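As a toy sketch of indirect inference (the model choices here are mine, not the book's): take an MA(1) process as the primary model, which we will only simulate, and an AR(1) regression as the easily-fit auxiliary model. The lag-1 autocorrelation of an MA(1) is $ \theta/(1+\theta^2) $, which is monotone in $ \theta $ on $ [0,1] $, so the binding function is invertible there:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical primary model: MA(1), y_t = e_t + theta * e_{t-1}.
def simulate_ma1(theta, e):
    return e[1:] + theta * e[:-1]

# Auxiliary model: AR(1), fit by least squares -- easy to fit, and not
# believed to be correct.
def fit_ar1(y):
    return np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)

theta_true = 0.5
T, S = 5_000, 20
y_data = simulate_ma1(theta_true, rng.normal(size=T + 1))
beta_data = fit_ar1(y_data)  # auxiliary parameters from the data

# Common random numbers, re-used across candidate thetas.
e_sim = rng.normal(size=(S, T + 1))

def beta_sim(theta):
    # auxiliary fit, averaged over S simulations of the primary model
    return np.mean([fit_ar1(simulate_ma1(theta, e)) for e in e_sim])

grid = np.linspace(0.0, 0.99, 100)
theta_ii = grid[np.argmin([(beta_sim(th) - beta_data) ** 2 for th in grid])]
```

The auxiliary slope here has no direct interpretation in the primary model; it is only a convenient statistic through which the data and the simulations are compared.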

Chapters 5, 6 and 7 apply direct and indirect simulation inference to a
range of popular models from econometrics, comparing the results to those of
other estimation methods on both simulated and real-world data. Some of these
are extremely impressive — in particular some of the results on
complicated time-series models are simply astonishing — but these
chapters will frankly be very hard going for anyone who has not seen these
econometric models before. (Chapter 5, in particular, includes *an awful
lot* on how to simulate discrete choice models.) Other applications will
readily suggest themselves to any reader who has worked with simulation models.

x+174 pp., bibliography, line figures, index (spotty)

Economics / Probability and Statistics

In print as a hardback, ISBN 0-19-877475-3

24 November 2007; thanks to Stephen Ellner, Linqiao Zhao and Mark Schervish

Updated 16 March 2012: small typo fixes, switched to using MathJax