November 25, 2013

Simulation V: Matching Simulation Models to Data (Introduction to Statistical Computing)

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \DeclareMathOperator*{\argmin}{argmin} \]

(My notes for this lecture are too incomplete to be worth typing up, so here's the sketch.)

Methods, Models, Simulations

Statistical methods try to draw inferences about unknown realities from data. They always involve some more or less tightly specified stochastic model, some story (myth) about how the data were generated, and so how the data are linked to the world. We want to know how well our methods work, which is usually hard to assess by just running them on real data. Sometimes, we can push through analytic probability calculations under our model assumptions, but often not. Then we can get some baseline idea of how well our methods should work by simulating the models, running the simulation through the methods, and comparing the results to what we know the truth was in the simulation world. (That is, we see if we can recover the myth.) We can also see how robust our methods are by simulating processes which break the modeling assumptions more or less grossly.

If your method, or your implementation of a common method, doesn't work under simulations, then you had better have a very good argument for why it should be trusted with real data. Always go back and forth between simulations and method development: \[ \mathrm{simulation}:\mathrm{methods}::\mathrm{unit\ tests}:\mathrm{functions} \]

Adjusting Simulation Models to Match Data

Many (most?) useful probability models are easily specified as step-by-step recipes for simulation, but have ugly, complicated distributions for the data. (Nonlinear dependencies and latent variables both tend to lead to complicated distributions.) Since the models have adjustable parameters, we would like to adjust them so that they match the data we have about the world. If the distributions were straightforward, we could just calculate the likelihood and maximize it (or use Bayesian updating, etc.), but that's not really possible when the distribution is ugly. What to do?

Method of Moments and Friends

Moments (\( \Expect{X}, \Expect{X^2}, \ldots \)) are functionals of the probability distribution. If our model has a \( k \)-dimensional parameter vector \( \theta \), then we should generally be able to find \( k \) moments which characterize or identify the parameter, meaning that there are functions \( f_1, f_2, \ldots f_k \) such that (i) \( \Expect{X} = f_1(\theta), \Expect{X^2} = f_2(\theta), \ldots \Expect{X^k} = f_k(\theta) \), and (ii) if \( \theta_1 \neq \theta_2 \), then at least one of the \( k \) moments differs, \( f_j(\theta_1) \neq f_j(\theta_2) \) for some \( j \in 1:k \).

If we knew the functions \( f_1, \ldots f_k \) and the population moments, we could solve for the matching \( \theta \). (We could also test the model, as opposed to the parameter value, by seeing whether that \( \theta \) let us match other population moments; and of course the model itself would be in trouble if there were no solution for \( \theta \) at all.) Since we only see sample moments, in the method of moments we try to solve for the parameters which match them: \( \hat{\theta}_{MM} \) is the \( \theta \) where \[ f_i(\theta) = \overline{x^i} ~, ~ i \in 1:k \] (We can then test this by looking at whether we can match other, mathematically-independent functionals.) Since it's annoying to have to keep the subscripts and superscripts \( i \) around, let's say that \( f(\theta) \) is the vector-valued function of theoretical moments, and \( m \) is the vector of sample moments. Our earlier assumption (ii) means that \( f^{-1} \) exists, so \[ \hat{\theta}_{MM} = f^{-1}(m) \] By the law of large numbers, we expect \( m \) to approach the population moments as we see more data. For \( n \) not-too-dependent observations, the error should be \( O(1/\sqrt{n}) \), and then, if \( f^{-1} \) is a reasonably smooth function, propagation of error tells us that \( \|\hat{\theta}_{MM} - \theta \| = O(1/\sqrt{n}) \) as well.
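
To make this concrete, here is a minimal sketch in Python (the model, a gamma with unknown shape and scale, and all function names are my illustrative choices, not from the notes), for a case where \( f \) can be inverted in closed form from the first two moments:

import numpy as np

def gamma_mm(x):
    # plain method of moments for Gamma(shape, scale):
    #   E[X] = shape*scale,  Var[X] = shape*scale^2
    m1 = np.mean(x)               # first sample moment
    var = np.mean(x**2) - m1**2   # implied sample variance
    return m1**2 / var, var / m1  # (shape, scale) = f^{-1}(m)

rng = np.random.default_rng(42)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)
print(gamma_mm(x))  # should be near (2.0, 3.0), with O(1/sqrt(n)) error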

It might happen, due to sampling fluctuations, that the observed moments \( m \) are not ones which could ever be theoretical moments for the model --- that \( m \) is not in the range of \( f(\theta) \), no matter how we adjust \( \theta \). The usual approach then is to minimize: \[ \hat{\theta}_{MM} = \argmin_{\theta}{\| f(\theta) - m \|} \] (While I've written this for the ordinary Euclidean norm, it is often better to minimize a weighted norm, giving more emphasis to the moments which have smaller standard errors. This refinement can come later.)
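
When \( f \) is known but awkward to invert, or \( m \) falls outside its range, one can hand the discrepancy to a numerical optimizer. A sketch under the same illustrative gamma setup as above (again my choice, not from the notes), with an unweighted Euclidean norm:

import numpy as np
from scipy.optimize import minimize

def theoretical_moments(theta):
    shape, scale = theta
    return np.array([shape * scale,                     # E[X]
                     shape * (shape + 1) * scale**2])   # E[X^2]

def mm_by_minimization(x):
    m = np.array([np.mean(x), np.mean(x**2)])           # sample moments
    obj = lambda theta: np.sum((theoretical_moments(theta) - m)**2)
    return minimize(obj, x0=[1.0, 1.0], method="Nelder-Mead").x

rng = np.random.default_rng(7)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)
print(mm_by_minimization(x))   # again close to (2.0, 3.0)

A weighted norm would simply rescale the entries of theoretical_moments(theta) - m by (estimates of) their standard errors before squaring.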

So far I've said nothing about simulation. I've presumed that we can find the moments as explicit functions of the parameters --- that we know \( f \). However, if we can simulate our model at our favorite \( \theta \), we can get arbitrarily accurate approximations to \( f \). Specifically, if we do \( b \) separate runs of the model at \( \theta \) and average the output, we get an approximation of \( f(\theta) \), say \( \tilde{f}(\theta) \), whose error is \( O(1/\sqrt{b}) \). We can thus take \[ \hat{\theta}_{MSM} = \argmin_{\theta}{\| \tilde{f}(\theta) - m \|} \] as our method of simulated moments estimate. There are now two sources of error: simulation error (shrinking with \( b \), and under our control), and statistical error (shrinking with \( n \), and out of our control). Again, it can be better to use a weighted norm, emphasizing the moments which can be estimated more precisely.
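
A sketch of the simulated version, continuing the illustrative gamma example (my choice, not from the notes). Fixing the random seed inside the objective ("common random numbers") keeps \( \tilde{f} \) stable across evaluations, which generic optimizers appreciate; that detail is my addition:

import numpy as np
from scipy.optimize import minimize

def simulated_moments(theta, b=200, n=1000, seed=0):
    shape, scale = theta
    if shape <= 0 or scale <= 0:
        return np.array([np.inf, np.inf])        # rule out impossible parameters
    rng = np.random.default_rng(seed)            # common random numbers
    sims = rng.gamma(shape=shape, scale=scale, size=(b, n))
    return np.array([sims.mean(), (sims**2).mean()])   # \tilde{f}(theta)

def msm(x, b=200):
    m = np.array([np.mean(x), np.mean(x**2)])
    obj = lambda theta: np.sum((simulated_moments(theta, b=b) - m)**2)
    return minimize(obj, x0=[1.0, 1.0], method="Nelder-Mead").x

rng = np.random.default_rng(7)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)
print(msm(x))   # simulation error shrinks with b, statistical error with n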

There is nothing magic about moments --- all we need is a set of functionals of the distribution which characterize the parameters. The foregoing carries over without modification to the generalized method of moments, and the generalized method of simulated moments. It is often particularly useful to make the "generalized moments" parameters in some other statistical model, which is mis-specified but very easy to estimate; then we have indirect inference.
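
Indirect inference in miniature, again under my illustrative gamma setup (not from the notes): the "generalized moments" are the parameters of an auxiliary model which is wrong but trivially easy to fit (here, just a Gaussian's mean and log standard deviation), matched between the data and the simulations:

import numpy as np
from scipy.optimize import minimize

def auxiliary(x):
    # mis-specified but easily estimated auxiliary model: Gaussian fit
    return np.array([np.mean(x), np.log(np.std(x))])

def indirect_inference(x, b=200, seed=0):
    target = auxiliary(x)
    def obj(theta):
        shape, scale = theta
        if shape <= 0 or scale <= 0:
            return np.inf
        rng = np.random.default_rng(seed)
        sims = rng.gamma(shape=shape, scale=scale, size=(b, len(x)))
        return np.sum((auxiliary(sims.ravel()) - target)**2)
    return minimize(obj, x0=[1.0, 1.0], method="Nelder-Mead").x

rng = np.random.default_rng(7)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)
print(indirect_inference(x))   # auxiliary parameters stand in for the moments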

A Brute-Force Approach to Simulated Maximum Likelihood

Observe data \( x \). Pick an initial \( \theta \). Do \( b \) independent runs at that parameter value. Use your favorite non-parametric density estimator to get \( \tilde{p}_{\theta} \). (As \( b \rightarrow \infty \), the density estimate converges on the true density, \( \tilde{p}_{\theta} \rightarrow p_{\theta} \).) Now evaluate that estimated density at \( x \), and treat \( \tilde{p}_{\theta}(x) \) as the likelihood. Adjust \( \theta \) to maximize \( \tilde{p}_{\theta}(x) \).
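
A sketch of this brute-force simulated likelihood for one-dimensional i.i.d. observations, using the same illustrative gamma model as above and scipy's Gaussian kernel density estimator (the example setup is mine, not from the notes):

import numpy as np
from scipy.optimize import minimize
from scipy.stats import gaussian_kde

def neg_sim_loglik(theta, x, b=5000, seed=0):
    shape, scale = theta
    if shape <= 0 or scale <= 0:
        return np.inf
    rng = np.random.default_rng(seed)
    sims = rng.gamma(shape=shape, scale=scale, size=b)  # b runs of the model
    p_tilde = gaussian_kde(sims)                        # non-parametric \tilde{p}_theta
    return -np.sum(np.log(p_tilde(x)))                  # treat \tilde{p}_theta(x) as the likelihood

rng = np.random.default_rng(7)
x = rng.gamma(shape=2.0, scale=3.0, size=500)
fit = minimize(neg_sim_loglik, x0=[2.5, 2.5], args=(x,), method="Nelder-Mead")
print(fit.x)   # simulated maximum likelihood estimate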

This is not very efficient --- as the dimension of \( x \) grows, \( b \) has to grow exponentially to maintain a constant accuracy. Moreover, our representation of \( \tilde{p}_{\theta} \) is going to be horridly redundant, using a huge number of degrees of freedom when we only need \( k \). There are ways one could begin to get around these issues, but that's beyond the scope of these notes.

Checking the Simulation Model Against Data

Because we can simulate, we can look at what our model predicts for arbitrary aspects of the distribution of data. Basically any statistic of the data can thus be a test statistic. What makes a good test statistic?

Obviously, if we adjusted the parameters to match a certain statistic, that is not a good test. (Or at least, I hope it's obvious that if we adjusted the parameters to match the sample variance, we learn nothing by seeing that the simulations reproduce something close to the sample variance.) A little more subtly, nothing which depends mathematically on the statistics we matched is a good test. If we matched the first \( k \) moments, no function of those moments is suitable. (If we matched \( \overline{x} \) and \( \overline{x^2} \), we learn nothing by seeing if we can also match the sample variance.) We might turn to seeing if we can match higher moments, but those tend to be rather unstable, so that doesn't have much power.

The best recommendation I can make is to actually understand the process you're studying, and pick out interesting, relevant features which are logically independent of the moments you matched. This however requires subject-matter knowledge, rather than statistical or computational skill. In the absence of understanding, we can try to check the model using summary statistics other than the moments we used to estimate parameters --- e.g., if we matched the mean, how well do we do on the median? Another useful fall-back is to try re-running exploratory data analyses on simulation output, and seeing how well that matches the EDA on the data.
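
For instance (a sketch under my illustrative gamma setup, not from the notes): after fitting by matching the first two moments, simulate from the fitted model and see whether statistics we did not match, like the median or an upper quantile, fall where the data's do:

import numpy as np

def check_unmatched_statistics(x, shape_hat, scale_hat, b=200, seed=1):
    rng = np.random.default_rng(seed)
    sims = rng.gamma(shape=shape_hat, scale=scale_hat, size=(b, len(x)))
    for name, stat in [("median", np.median),
                       ("90th percentile", lambda z: np.quantile(z, 0.9))]:
        sim_stats = np.apply_along_axis(stat, 1, sims)   # statistic in each simulation
        lo, hi = np.quantile(sim_stats, [0.025, 0.975])  # simulated 95% band
        print(name, "| data:", round(float(stat(x)), 2),
              "| simulations:", round(float(lo), 2), "to", round(float(hi), 2))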

Over-identifying: If \( \theta \) is \( k \)-dimensional but we try to match more than \( k \) moments, it's not guaranteed that we can match all of them at once. The over-all discrepancy \( \| \tilde{f}(\theta) - m \| \) should still shrink as we get more data if the model is right. Under some circumstances, one can work out analytically what the distribution of that discrepancy should be under the model (e.g., \( \chi^2 \) or related), but you can also just simulate.
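
A sketch of the simulation route, once more in the illustrative gamma setup (my example, not from the notes): fit from the first two moments, measure how badly the third moment is missed, and compare that to the same discrepancy computed on data sets the fitted model generates about itself:

import numpy as np

def fit_two_moments(x):
    # closed-form gamma method-of-moments fit from the first two moments
    m1, var = np.mean(x), np.var(x)
    return m1**2 / var, var / m1                     # (shape, scale)

def third_moment_gap(x, shape, scale):
    # over-identifying discrepancy: sample vs. model third raw moment
    f3 = shape * (shape + 1) * (shape + 2) * scale**3
    return abs(np.mean(x**3) - f3)

def overidentification_check(x, b=500, seed=3):
    shape, scale = fit_two_moments(x)
    observed = third_moment_gap(x, shape, scale)
    rng = np.random.default_rng(seed)
    null = np.array([third_moment_gap(fake, *fit_two_moments(fake))
                     for fake in rng.gamma(shape=shape, scale=scale, size=(b, len(x)))])
    return np.mean(null >= observed)                 # simulated p-value

rng = np.random.default_rng(7)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)
print(overidentification_check(x))   # small values signal a misfit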

Breaking models: If you want to know how robust some model prediction is, you can try to deliberately search for parameters which will not make that prediction. Again, for reasons of space I won't go into details, but see Miller's paper. This isn't, strictly speaking, about whether the model fits the data, though it could be adapted to that.

Readings

(Woefully incomplete, merely alphabetical order)
