A Solution to the Ecological Inference Problem

<title>Gary King, A Solution to the Ecological Inference Problem</title>

<cite><a href="../">The Bactra Review: Occasional and eclectic book reviews by Cosma Shalizi</a></cite> &nbsp; <strong>95</strong>

<h1>A Solution to the Ecological Inference Problem</h1>

<h2>Reconstructing Individual Behavior from Aggregate Data</h2>

<h2><em>by</em> <a href="../authors.html#gary-king">Gary King</a></h2>

Princeton University Press, 1997

<hr>

An ecological inference problem is one like this.  In a certain election, we
know how many votes were cast for the Democratic and the Republican candidates,
and we know how many voters were men and how many were women.  Can we say
anything reasonable about whether men are more likely to vote Republican than
are women?  More generally and abstractly, given aggregates (number of
Republican voters, number of men), we want to know things about individuals, or
at least sub-aggregates (if not how Juanita Q. Public voted, then the
proportion of women voting Republican). In particular, as King emphasizes in
this book, in the USA questions about whether people belonging to different
races have different political propensities are extremely important in
implementing the Voting Rights Act and its sequelae, but, owing to the secret
ballot, we have no direct knowledge of how people of different races vote.

<P>Until recently, the leading approach to the ecological inference problem was
a method called ``Goodman's regression''; its use in voting rights cases has
been endorsed by the Supreme Court.  This was a Nice Try (King explains it in
ch. 3, so I shan't), but it has some serious flaws.  The most obvious one is
that it doesn't realize probabilities must lie between 0 and 1, and so is
perfectly happy to tell us that, in a given district in a given election, 110%
of blacks voted Democratic and the like.  (Of course, in cities whose dead
people are very civic minded, like Baltimore and Chicago, this may be true...)
King quotes a large number of such impossibilities from legal documents.
Clearly, something better is needed, and it's provided here.

<P>He begins by observing that we can definitely (``deterministically'') put
some limits on individual behavior from the aggregate data.  In the two-race,
two-party case, for instance, we'd be done if we knew the proportion of blacks
voting Democratic and the proportion of whites voting Democratic.  We can show
the two proportions together with a point in the unit square --- and we can do
better than just saying it's someplace in that square.  The total number of
Democratic votes is just the proportion of blacks voting Democratic times the
number of blacks plus the proportion of whites voting Democratic times the
number of whites.  Since the number of Democratic votes is fixed, the true
proportions must fall somewhere on a downward-sloping line, whose slope is
determined by the ratio of blacks to whites, and whose intercepts will depend
on the total number of votes as well.
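<P>To make the arithmetic concrete, here is a minimal Python sketch of one precinct's line and the deterministic bounds that follow from it. The function names and the example counts are my own, made up for illustration, and are not taken from King; the sketch assumes both groups are non-empty and that every voter chose one of the two parties.

```python
def tomography_line(n_black, n_white, dem_votes):
    """Line of feasible (b, w) pairs, written as w = intercept + slope * b.

    b and w are the (unknown) proportions of blacks and of whites voting
    Democratic.  The line comes from the accounting identity
    dem_votes = b * n_black + w * n_white.
    """
    slope = -n_black / n_white        # set by the ratio of blacks to whites
    intercept = dem_votes / n_white   # also depends on the Democratic total
    return slope, intercept


def rate_bounds(n_group, n_other, dem_votes):
    """Deterministic bounds on one group's Democratic proportion.

    Lower bound: assume everyone in the other group voted Democratic.
    Upper bound: assume nobody in the other group did.
    Both are clipped to the interval [0, 1].
    """
    lower = max(0.0, (dem_votes - n_other) / n_group)
    upper = min(1.0, dem_votes / n_group)
    return lower, upper


# Hypothetical precinct: 800 black voters, 200 white voters, 700 Democratic votes.
slope, intercept = tomography_line(800, 200, 700)
black_bounds = rate_bounds(800, 200, 700)
white_bounds = rate_bounds(200, 800, 700)
```

In this invented precinct the black Democratic proportion is pinned between 0.625 and 0.875, while the white proportion could be anything from 0 to 1: the larger group is the one the aggregate constrains.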

<P>If we had only one set of aggregates, this would be as far as we could go,
but in general we're interested in situations where we have lots of aggregate
data-points, spread out over space (all the polling precincts in New Jersey,
say) or time (the same precinct over many elections) or both.  Each set of
aggregates gives us a line; King decorates the unit square with the set of all
such lines, and calls the result a ``tomography plot.''  (The name comes from
an analogy to medical imaging, explored in some detail in ch. 6.)
Unfortunately (as I sometimes had to explain to my students in intro.  physics)
this is <em>not</em> enough to get the solution.  As King reasonably says,
there isn't any exact way of recovering the sub-totals.  What we can do,
however, is guess, i.e., try to use statistics.

<P>The simplest assumption to start with is that the true proportions for each
precinct are picked at random from some distribution, but that all the precincts
are drawn from the <em>same</em> population.  (Precincts vary, but not
systematically.)  King focuses on the case where that distribution is a
Gaussian, truncated so that all proportions are between 0 and 1.  The question
then is how to get the parameters for the Gaussian.  The trick here is to use
the lines from the tomography plot.  Most of them should pass near the mean of
the Gaussian, forming a dark blob on the plot.  Therefore, we guess at the
distribution by using the parameters which maximize the probability of a sample
falling on the observed tomography lines.  (King has written an entire book to
extol maximizing the likelihood, but it seems reliable in this case.)
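<P>A rough Python sketch of the quantity being maximized may help. It is mine, not King's code, and it simplifies: I assume the two proportions are independent (King's truncated Gaussian allows them to be correlated), I describe each precinct by x, the share of voters who are black, and t, the share of votes that are Democratic, with 0 &lt; x &lt; 1, and I integrate along each tomography line by a crude midpoint rule. The synthetic-data helper exists only so the sketch can be exercised; all names are hypothetical.

```python
import math
import random


def normal_pdf(v, mean, sd):
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(v, mean, sd):
    return 0.5 * (1.0 + math.erf((v - mean) / (sd * math.sqrt(2.0))))


def line_density(x, t, mu_b, sd_b, mu_w, sd_w, steps=200):
    """Density of the aggregate t, given x, under the truncated Gaussian.

    Integrates the (assumed independent) Gaussian density along the part of
    the line t = x*b + (1-x)*w that lies inside the unit square.
    """
    lo = max(0.0, (t - (1.0 - x)) / x)
    hi = min(1.0, t / x)
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        b = lo + (k + 0.5) * width
        w = (t - x * b) / (1.0 - x)
        total += normal_pdf(b, mu_b, sd_b) * normal_pdf(w, mu_w, sd_w)
    # Probability mass of the untruncated Gaussian inside the unit square.
    mass = ((normal_cdf(1.0, mu_b, sd_b) - normal_cdf(0.0, mu_b, sd_b))
            * (normal_cdf(1.0, mu_w, sd_w) - normal_cdf(0.0, mu_w, sd_w)))
    # The 1/(1-x) factor is the Jacobian of changing variables from w to t.
    return total * width / ((1.0 - x) * mass)


def log_likelihood(params, precincts):
    """Sum of log densities over (x, t) pairs; params = (mu_b, sd_b, mu_w, sd_w)."""
    mu_b, sd_b, mu_w, sd_w = params
    return sum(
        math.log(max(line_density(x, t, mu_b, sd_b, mu_w, sd_w), 1e-300))
        for x, t in precincts
    )


def simulate_precincts(n, mu_b, sd_b, mu_w, sd_w, seed=0):
    """Synthetic (x, t) pairs drawn from the assumed model, for testing only."""
    rng = random.Random(seed)

    def draw(mean, sd):
        # Rejection sampling from a Gaussian truncated to [0, 1].
        while True:
            v = rng.gauss(mean, sd)
            if 0.0 <= v <= 1.0:
                return v

    data = []
    for _ in range(n):
        x = rng.uniform(0.1, 0.9)
        b, w = draw(mu_b, sd_b), draw(mu_w, sd_w)
        data.append((x, x * b + (1.0 - x) * w))
    return data
```

One would then hand <code>log_likelihood</code> to whatever optimizer is convenient and search over the four parameters; on simulated precincts the generating parameters should score well above badly wrong ones.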

<P>Once we've constructed a global distribution, for each precinct we can
estimate the true proportions by sampling from the constructed distribution,
restricted to the tomography line.  Getting an exact closed form for that
sampling distribution is too much to hope for, but we can always use Monte
Carlo simulation to approximate it, and form point estimates and confidence
intervals from it in the usual way.  (Oddly, King thinks that Monte Carlo has a
special affinity to Bayesian methods.)  At that point, essentially, we're done,
since we have sub-aggregates for each precinct, and can then answer questions
like whether, say, blacks are systematically more likely to vote Republican
than are whites.
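<P>The per-precinct step can be sketched in a few lines of Python. Again this is my own illustration under stated assumptions, not King's implementation: the global parameters are taken as already fitted, the two proportions are treated as independent, x is the share of voters who are black and t the share of votes that are Democratic (with 0 &lt; x &lt; 1), and the distribution along the line is approximated on a grid rather than sampled exactly. All names and numbers are hypothetical.

```python
import math
import random


def normal_pdf(v, mean, sd):
    z = (v - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def sample_black_rate(x, t, mu_b, sd_b, mu_w, sd_w,
                      draws=5000, grid=400, seed=0):
    """Monte Carlo draws of b, restricted to the precinct's tomography line.

    Each grid point b on the feasible segment is weighted by the global
    density evaluated at (b, w), where w is whatever the line forces it to be.
    Returns the draws in sorted order.
    """
    rng = random.Random(seed)
    lo = max(0.0, (t - (1.0 - x)) / x)
    hi = min(1.0, t / x)
    width = (hi - lo) / grid
    points = [lo + (k + 0.5) * width for k in range(grid)]
    weights = [
        normal_pdf(b, mu_b, sd_b)
        * normal_pdf((t - x * b) / (1.0 - x), mu_w, sd_w)
        for b in points
    ]
    return sorted(rng.choices(points, weights=weights, k=draws))


def summarize(sorted_draws, level=0.8):
    """Point estimate (mean) and a central interval from sorted draws."""
    n = len(sorted_draws)
    tail = (1.0 - level) / 2.0
    lower = sorted_draws[int(tail * n)]
    upper = sorted_draws[min(n - 1, int((1.0 - tail) * n))]
    return sum(sorted_draws) / n, lower, upper


# Hypothetical precinct (80% black, 70% Democratic) and hypothetical fit.
draws = sample_black_rate(0.8, 0.7, mu_b=0.8, sd_b=0.1, mu_w=0.3, sd_w=0.1)
estimate, lower, upper = summarize(draws, level=0.8)
```

The white proportion for the same precinct needs no separate simulation, since each draw of b fixes w through the line.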

<P>After King has lavished several chapters on this method, it is a great relief
to see that he actually tries it out on real data-sets, ones where the
individual-level answers are known and so provide a check on his results.  Not
surprisingly, perhaps (since he <em>does</em> report them!), the agreement
between his constructions and reality is very good, e.g., about eighty percent
of all districts fall within his eighty percent confidence bounds.  Certainly
his estimates are at once much more detailed and much more accurate than those
delivered by other estimation procedures, including Goodman's regression.  I
find this reassuring, because his procedure looks reasonable to me, and it
seems to have been <a href="professionals.html">adopted by the
professionals</a> with some enthusiasm.

<P>King goes through a number of extensions to his basic procedure (going from
the 2x2 case to the RxC case; allowing certain kinds of systematic spatial
variation among precincts; handling aggregation bias).  He also has a good
discussion of how one can tell when the assumptions which are built into his
method are wrong, and the extent to which the procedure is sensitive to small
errors in the assumptions.  He does not calculate error probabilities, or
confidence intervals for the global parameters, but those are at least
asymptotically normal, so work could be done along these lines (probably
involving more Monte Carlo runs).  One key assumption (not discussed by King,
that I noticed) is that the aggregate is the sum of individual-level
properties.  (Indeed, some of his machinery is specialized to the case of
summing proportions, but that's easily generalized.)  This is fine when you're
counting heads (as he is), or even totaling up money, but it's very far from
the general case.  Analogous methods could probably be developed for uglier
forms of aggregation, particularly nonlinear ones, but I suspect these methods
would themselves be pretty ugly.  Nonetheless, this problem really should be taken
up by those of us working on complex systems, unless we want to just explain
our own models to each other forever.

<P><cite>A Solution</cite> is clearly written, if relentlessly dry; it
definitely requires a good grounding in statistics, say an intermediate-level
undergraduate course.  The book is also very narrowly focused on this
particular method and making it work.  On the one hand, King says nothing
substantive about society at all.  (He doesn't even guess at how many voting
rights cases have been mis-handled because Goodman's regression gave bad
results.)  On the other hand, he only looks at methods and ideas from within
the quantitative social sciences and the more traditional branches of
statistics: nothing about systems identification, or data-mining, or other
computational techniques for squeezing some meaning out of large data sets, not
even a mention of geographic information systems.  Still, King <em>is</em> a
political scientist, and I suppose we're lucky that he even looked as far
afield as econometrics.  Despite this narrow vision, it should definitely be
read by social scientists who use aggregated data, by marketers trying to
improve their demographic aim, and probably by <a
href="ecology.html">ecologists</a>, complex systems wallahs, and computer
scientists interested in techniques for manipulating large data sets.

<hr>xxii + 342 pp., 53 black and white figures (maps and graphs), 18
tables, bibliography, author and subject index, useful glossary of notation

<br><a href="../subjects/economics.html">Economics</a> /
	<a href="../subjects/geography.html">Geography</a> /
	<a href="../subjects/politics.html">Politics and Political Thought</a> /
	<a href="../subjects/probability.html">Probability and Statistics</a> /
	<a href="../subjects/sociology.html">Sociology</a>

<br>Currently in print as a hardback, ISBN 0-691-01241-5, US$62.50, and as a
paperback, ISBN 0-691-01240-7, US$17.95.  LoC JA71.7 K55 1997

<hr>6 October 1999; thanks to <a href="http://www.santafe.edu/~tozier/">Bill
Tozier</a>