The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi   95

A Solution to the Ecological Inference Problem

Reconstructing Individual Behavior from Aggregate Data

by Gary King

Princeton University Press, 1997
An ecological inference problem is one like this. In a certain election, we know how many votes were cast for the Democratic and the Republican candidates, and we know how many voters were men and how many were women. Can we say anything reasonable about whether men are more likely to vote Republican than are women? More generally and abstractly, given aggregates (number of Republican voters, number of men), we want to know things about individuals, or at least sub-aggregates (if not how Juanita Q. Public voted, then the proportion of women voting Republican). In particular, as King emphasizes in this book, in the USA questions about whether people belonging to different races have different political propensities are extremely important in implementing the Voting Rights Act and its sequelae, but, owing to the secret ballot, we have no direct knowledge of how people of different races vote.

Until recently, the leading approach to the ecological inference problem was a method called ``Goodman's regression''; its use in voting rights cases has been endorsed by the Supreme Court. This was a Nice Try (King explains it in ch. 3, so I shan't), but it has some serious flaws. The most obvious one is that it doesn't realize probabilities must lie between 0 and 1, and so is perfectly happy to tell us that, in a given district in a given election, 110% of blacks voted Democratic and the like. (Of course, in cities whose dead people are very civic minded, like Baltimore and Chicago, this may be true...) King quotes a large number of such impossibilities from legal documents. Clearly, something better is needed, and it's provided here.
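
To get a concrete feel for how such impossibilities arise, here is a minimal sketch of Goodman's regression in its usual textbook form: fit a straight line to each precinct's Democratic share against its black share, then read the intercept as the white rate and intercept plus slope as the black rate. The five precincts are invented purely for illustration (they are not from the book), and the function name is hypothetical. The data are rigged so that whites vote more Democratic where there are more blacks, which is all it takes to break the method.

```python
def goodman_regression(x, t):
    """Ordinary least squares of t on x; returns (white_rate, black_rate)."""
    n = len(x)
    x_bar = sum(x) / n
    t_bar = sum(t) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxt = sum((xi - x_bar) * (ti - t_bar) for xi, ti in zip(x, t))
    slope = sxt / sxx
    intercept = t_bar - slope * x_bar
    # Intercept is the fitted share at x = 0 (all white);
    # intercept + slope is the fitted share at x = 1 (all black).
    return intercept, intercept + slope

# Invented precincts (an assumption, not real data): black share of voters.
black_share = [0.0, 0.1, 0.2, 0.3, 0.4]
true_black_rate = 0.9
# The true white rate rises with the black share: 0.2 + 0.6 * x.
dem_share = [true_black_rate * x + (0.2 + 0.6 * x) * (1 - x) for x in black_share]

white_hat, black_hat = goodman_regression(black_share, dem_share)
print(f"estimated white rate: {white_hat:.3f}")
print(f"estimated black rate: {black_hat:.3f}")
```

On these made-up numbers the estimated black rate comes out at about 1.27, i.e. 127% of blacks voting Democratic, even though the rate used to build the data was 0.9.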

He begins by observing that we can definitely (``deterministically'') put some limits on individual behavior from the aggregate data. In the two-race, two-party case, for instance, we'd be done if we knew the proportion of blacks voting Democratic and the proportion of whites voting Democratic. We can show the two proportions together with a point in the unit square --- and we can do better than just saying it's someplace in that square. The total number of Democratic votes is just the proportion of blacks voting Democratic times the number of blacks plus the proportion of whites voting Democratic times the number of whites. Since the number of Democratic votes is fixed, the true proportions must fall somewhere on a downward-sloping line, whose slope is determined by the ratio of blacks to whites, and whose intercepts will depend on the total number of votes as well.
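
The line and the limits it implies can be sketched in a few lines of code. This version works in proportions rather than counts (x is the black share of voters, t the Democratic share of the vote), which is an equivalent restatement of the accounting identity above; the function names are hypothetical, and it assumes x lies strictly between 0 and 1.

```python
def tomography_line(x, t):
    """Return (intercept, slope) of the white rate as a function of the black rate.

    From t = beta_b * x + beta_w * (1 - x), solving for beta_w.
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    return t / (1 - x), -x / (1 - x)

def deterministic_bounds(x, t):
    """Return ((lo_b, hi_b), (lo_w, hi_w)): where the line crosses the unit square."""
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    lo_b = max(0.0, (t - (1 - x)) / x)
    hi_b = min(1.0, t / x)
    lo_w = max(0.0, (t - x) / (1 - x))
    hi_w = min(1.0, t / (1 - x))
    return (lo_b, hi_b), (lo_w, hi_w)

# A precinct that is half black and voted 75% Democratic.
print(tomography_line(0.5, 0.75))       # (1.5, -1.0)
print(deterministic_bounds(0.5, 0.75))  # ((0.5, 1.0), (0.5, 1.0))
```

In that example at least half of each group must have voted Democratic, whatever else is true, and this is knowledge we get for free before any statistics enters.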

If we had only one set of aggregates, this would be as far as we could go, but in general we're interested in situations where we have lots of aggregate data-points, spread out over space (all the polling precincts in New Jersey, say) or time (the same precinct over many elections) or both. Each set of aggregates gives us a line; King decorates the unit square with the set of all such lines, and calls the result a ``tomography plot.'' (The name comes from an analogy to medical imaging, explored in some detail in ch. 6.) Unfortunately (as I sometimes had to explain to my students in intro. physics) this is not enough to get the solution. As King reasonably says, there isn't any exact way of recovering the sub-totals. What we can do, however, is guess, i.e., try to use statistics.

The simplest assumption to start with is that the true proportions for each precinct are picked at random from some distribution, but that all the precincts are drawn from the same population. (Precincts vary, but not systematically.) King focuses on the case where that distribution is a Gaussian, truncated so that all proportions are between 0 and 1. The question then is how to get the parameters for the Gaussian. The trick here is to use the lines from the tomography plot. Most of them should pass near the mean of the Gaussian, forming a dark blob on the plot. Therefore, we guess at the distribution by using the parameters which maximize the probability of a sample falling on the observed tomography lines. (King has written an entire book to extol maximizing the likelihood, but it seems reliable in this case.)
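
A deliberately simplified sketch of the likelihood idea follows. It is not King's estimator: it drops the truncation to the unit square (a full treatment would have to account for it), holds the spreads fixed at assumed values, and replaces a proper optimizer with a coarse grid search over the two means. If the two rates are jointly Gaussian, each precinct's Democratic share is Gaussian with a mean and variance that depend on its black share x, and that is what gets scored. All names and the simulated data are hypothetical.

```python
import math
import random

def log_likelihood(params, data):
    """Log-likelihood of (x, t) pairs under an UNTRUNCATED bivariate Gaussian."""
    mu_b, mu_w, sd_b, sd_w, rho = params
    total = 0.0
    for x, t in data:
        mean = mu_b * x + mu_w * (1 - x)
        var = ((sd_b * x) ** 2 + (sd_w * (1 - x)) ** 2
               + 2 * rho * sd_b * sd_w * x * (1 - x))
        total += -0.5 * math.log(2 * math.pi * var) - (t - mean) ** 2 / (2 * var)
    return total

def fit_means_on_grid(data, sd_b, sd_w, rho=0.0, steps=20):
    """Grid search over the two means only; the spreads are assumed known."""
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1):
            mu_b, mu_w = i / steps, j / steps
            ll = log_likelihood((mu_b, mu_w, sd_b, sd_w, rho), data)
            if best is None or ll > best[0]:
                best = (ll, mu_b, mu_w)
    return best[1], best[2]

def simulate_precincts(n, mu_b, mu_w, sd, seed=0):
    """Synthetic precincts with independent Gaussian rates (an assumption)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0.1, 0.9)
        beta_b = rng.gauss(mu_b, sd)
        beta_w = rng.gauss(mu_w, sd)
        data.append((x, beta_b * x + beta_w * (1 - x)))
    return data

data = simulate_precincts(400, mu_b=0.8, mu_w=0.3, sd=0.05, seed=1)
mu_b_hat, mu_w_hat = fit_means_on_grid(data, sd_b=0.05, sd_w=0.05)
print("estimated means:", mu_b_hat, mu_w_hat)
```

The grid search is only there to keep the example dependency-free; any numerical optimizer over all five parameters would be the more natural choice in practice.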

Once we've constructed a global distribution, for each precinct we can estimate the true proportions by sampling from the constructed distribution, restricted to the tomography line. Getting an exact closed form for that sampling distribution is too much to hope for, but we can always use Monte Carlo simulation to approximate it, and form point estimates and confidence intervals from it in the usual way. (Oddly, King thinks that Monte Carlo has a special affinity to Bayesian methods.) At that point, essentially, we're done, since we have sub-aggregates for each precinct, and can then answer questions like whether, say, blacks are systematically more likely to vote Republican than are whites.
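
The precinct-level step can be sketched as follows. This is one simple Monte Carlo scheme (uniform proposals along the line, weighted by the global density, then resampled), not necessarily the one King uses. It assumes the two rates are uncorrelated in the global distribution, and the parameter values in the example are made up; the function names are hypothetical.

```python
import math
import random

def sample_precinct(x, t, mu_b, mu_w, sd_b, sd_w, n_draws=5000, seed=0):
    """Draw black-rate values along one precinct's tomography line.

    Proposals are uniform over the deterministic bounds, so every draw
    already respects the truncation to the unit square.
    """
    rng = random.Random(seed)
    lo = max(0.0, (t - (1 - x)) / x)
    hi = min(1.0, t / x)
    proposals = [rng.uniform(lo, hi) for _ in range(n_draws)]
    scores = []
    for b in proposals:
        w = (t - x * b) / (1 - x)  # white rate implied by the line
        scores.append(((b - mu_b) / sd_b) ** 2 + ((w - mu_w) / sd_w) ** 2)
    best = min(scores)  # shift before exponentiating to avoid underflow
    weights = [math.exp(-0.5 * (s - best)) for s in scores]
    return sorted(rng.choices(proposals, weights=weights, k=n_draws))

def summarize(sorted_draws, level=0.8):
    """Return (mean, lower, upper) for a central interval at the given level."""
    n = len(sorted_draws)
    tail = (1 - level) / 2
    lower = sorted_draws[int(tail * n)]
    upper = sorted_draws[min(n - 1, int((1 - tail) * n))]
    return sum(sorted_draws) / n, lower, upper

# Made-up example: a half-black precinct that voted 55% Democratic.
draws = sample_precinct(0.5, 0.55, mu_b=0.8, mu_w=0.3, sd_b=0.1, sd_w=0.1)
mean, lower, upper = summarize(draws)
print(f"black rate: {mean:.2f}, 80% interval ({lower:.2f}, {upper:.2f})")
```

The white rate for the same precinct follows from each draw by the line equation, so one set of draws gives estimates and intervals for both groups.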

After King has lavished several chapters on this method, it is a great relief to see that he actually tries it out on real data-sets, ones where the individual-level answers are known and so provide a check on his results. Not surprisingly, perhaps (since he does report them!), the agreement between his constructions and reality is very good, e.g., about eighty percent of all districts fall within his eighty percent confidence bounds. Certainly his estimates are at once much more detailed and much more accurate than those delivered by other estimation procedures, including Goodman's regression. I find this reassuring, because his procedure looks reasonable to me, and it seems to have been adopted by the professionals with some enthusiasm.
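
The kind of check described here, counting how often the known truth lands inside the stated bounds, is simple enough to sketch. The numbers below are invented for illustration and are not King's results; the function name is hypothetical.

```python
def empirical_coverage(truths, intervals):
    """Fraction of known true values that fall inside their (lower, upper) interval."""
    if len(truths) != len(intervals):
        raise ValueError("need one interval per true value")
    hits = sum(1 for truth, (lower, upper) in zip(truths, intervals)
               if lower <= truth <= upper)
    return hits / len(truths)

# Invented districts: known rates and nominal 80% intervals.
truths = [0.62, 0.35, 0.80, 0.51, 0.44]
intervals = [(0.55, 0.70), (0.30, 0.45), (0.60, 0.75), (0.45, 0.60), (0.40, 0.50)]
print(empirical_coverage(truths, intervals))  # 0.8
```

If the intervals are honest, the printed fraction should sit near the nominal level; much lower means the method is overconfident, much higher that it is too cautious.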

King goes through a number of extensions to his basic procedure (going from the 2x2 case to the RxC case; allowing certain kinds of systematic spatial variation among precincts; handling aggregation bias). He also has a good discussion of how one can tell when the assumptions which are built into his method are wrong, and of the extent to which the procedure is sensitive to small errors in the assumptions. He does not calculate error probabilities, or confidence intervals for the global parameters, but those are at least asymptotically normal, so work could be done along these lines (probably involving more Monte Carlo runs). One key assumption (not discussed by King, that I noticed) is that the aggregate is the sum of individual-level properties. (Indeed, some of his machinery is specialized to the case of summing proportions, but that's easily generalized.) This is fine when you're counting heads (as he is), or even totaling up money, but it's very far from the general case. Analogous methods could probably be developed for uglier forms of aggregation, particularly nonlinear ones, but I suspect these methods would themselves be pretty ugly. Nonetheless, this problem really should be taken up by those of us working on complex systems, unless we want to just explain our own models to each other forever.

A Solution is clearly written, if relentlessly dry; it definitely requires a good grounding in statistics, say an intermediate-level undergraduate course. The book is also very narrowly focused on this particular method and making it work. On the one hand, King says nothing substantive about society at all. (He doesn't even guess at how many voting rights cases have been mis-handled because Goodman's regression gave bad results.) On the other hand, he only looks at methods and ideas from within the quantitative social sciences and the more traditional branches of statistics: nothing about systems identification, or data-mining, or other computational techniques for squeezing some meaning out of large data sets, not even a mention of geographic information systems. Still, King is a political scientist, and I suppose we're lucky that he even looked as far afield as econometrics. Despite this narrow vision, it should definitely be read by social scientists who use aggregated data, by marketers trying to improve their demographic aim, and probably by ecologists, complex systems wallahs, and computer scientists interested in techniques for manipulating large data sets.

xxii + 342 pp., 53 black and white figures (maps and graphs), 18 tables, bibliography, author and subject index, useful glossary of notation
Economics / Geography / Politics and Political Thought / Probability and Statistics / Sociology
Currently in print as a hardback, ISBN 0-691-01241-5, US$62.50, and as a paperback, ISBN 0-691-01240-7, US$17.95. LoC JA71.7 K55 1997
6 October 1999; thanks to Bill Tozier