## Consistency (in statistics) or "Probably Approximately Correct"

*02 Jun 2018 19:38*

In statistical jargon, an estimator of some quantity is "consistent" when it converges (in probability) on the true value as it gets more and more data. This is a misleading name for an important idea.

Suppose we have some quantity, canonically $\theta$, which we are trying to
estimate from $n$ data points, $ X_1, \ldots X_n \equiv X_{1:n} $. Our
data are random, so any estimate based on the data, $\hat{\theta}_n$ will
also be random. We say that $\hat{\theta}_n$ is "consistent" when it converges
in probability on $\theta$, whatever that true value might happen to be. As
you know, Babur, "converges in probability", means that you can set any margin
of approximation $\epsilon > 0$, and any probability of error $\delta > 0$, and
I can find an $N(\epsilon, \delta) < \infty$ which guarantees that, for $n \geq
N(\epsilon, \delta)$, we have an $\epsilon$-good approximation with probability
at least $1-\delta$:
\[
\mathbb{P}\left(|\theta - \hat{\theta}_n| \geq \epsilon\right) \leq \delta
\]
(If you are the sort of person who immediately asks "What about data which
doesn't arrive in the form of a sequence, but an increasingly filled-in spatial
grid?" or "What if I have only one measurement, but it's increasingly
precise?", you can amuse yourself by adjusting the definition accordingly.)
Turned around, convergence in probability means that you can pick any
$\epsilon > 0$ and any $n$, and I can not only find a $\delta(\epsilon, n) < 1$
such that
\[
\mathbb{P}\left(|\theta-\hat{\theta}_n| \geq \epsilon\right) \leq \delta(\epsilon, n)
\]
but also guarantee that, for every *fixed* $\epsilon$,
\[
\delta(\epsilon, n) \rightarrow 0
\]
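To see this definition in action, here is a minimal simulation. I take $\hat{\theta}_n$ to be the sample mean of $n$ i.i.d. Uniform$(0,1)$ draws, so the true value is $\theta = 1/2$; the function name `delta_hat` and the particular choices of `eps` and `trials` are my own, purely for illustration. Monte Carlo estimates of $\delta(\epsilon, n)$ shrink as $n$ grows, which is just consistency of the sample mean:

```python
import random

def delta_hat(n, eps=0.1, theta=0.5, trials=2000, seed=0):
    """Monte Carlo estimate of delta(eps, n) = P(|theta - theta_hat_n| >= eps),
    where theta_hat_n is the sample mean of n Uniform(0,1) draws
    (so the true mean is theta = 0.5)."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        theta_hat = sum(rng.random() for _ in range(n)) / n
        if abs(theta - theta_hat) >= eps:
            misses += 1
    return misses / trials

for n in (10, 40, 160, 640):
    # the estimated error probability falls toward 0 as n grows
    print(n, delta_hat(n))
```

Any consistent estimator would show the same qualitative picture; the sample mean is just the easiest case to simulate.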

For doing inference, this is
a pretty important notion, albeit a limited one. A more ambitious
goal would require that the estimator eventually give us the truth;
consistency doesn't require that. Or, we might ask that the estimator
always come arbitrarily close to the truth, with enough data; consistency
doesn't even require that. (*) All it requires is that we can have
arbitrarily high (probabilistic) confidence of eventually getting
arbitrarily close. But *this* is often attainable.

When computer scientists re-invented this notion in the 1980s, they gave it the much more transparent name of "probably approximately correct" (PAC). Though, being computer scientists, they often add the restriction that the number of samples required, $N$, should grow only polynomially in $1/\epsilon$ and $1/\delta$, and sometimes the further restriction that $\hat{\theta}_n$ be computable from $ X_{1:n} $ in polynomial time.
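For one concrete case where the required sample size is well behaved: if the $X_i$ are i.i.d. and bounded in $[0,1]$, and $\hat{\theta}_n$ is their sample mean, Hoeffding's inequality bounds the error probability by $2 e^{-2 n \epsilon^2}$, so an $N(\epsilon, \delta)$ that is polynomial in $1/\epsilon$ and only logarithmic in $1/\delta$ suffices. A sketch under those assumptions (`hoeffding_N` is a name of my own invention, not a standard library function):

```python
import math

def hoeffding_N(eps, delta):
    """Sample size sufficient, by Hoeffding's inequality, for the sample
    mean of iid [0,1]-valued draws to satisfy
        P(|theta - theta_hat_n| >= eps) <= 2*exp(-2*n*eps**2) <= delta.
    Solving the right-hand inequality for n gives
        n >= log(2/delta) / (2*eps**2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(hoeffding_N(0.05, 0.01))  # -> 1060
```

Note that halving $\epsilon$ quadruples the bound, while halving $\delta$ only adds a constant; this is the sense in which high confidence is cheap and high accuracy is expensive.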

I am not *sure* of how the name "consistency" came about, but my
impression, backed by
"Earliest Known Uses of Some of the
Words of Mathematics", s.v. "Consistency", is as follows. Fisher (at least
at one point) thought of estimators and other statistical procedures as
involving calculating the sample or "empirical" counter-parts of population
quantities, like means, medians, or correlation coefficients. For him, a
procedure would be "consistent" if it gave the right answer when the population
values were plugged in. This, I think we can agree, is a sensible use of the
word "consistent". Fisher-consistency in turn *suggests* that the
procedure will converge on the true value as the sample gets larger and larger,
since we know (from the Glivenko-Cantelli theorem) that the empirical
distribution converges on the true, population distribution. But
Fisher-consistency isn't enough for convergence to the truth; there would need
to be some form of continuity as well. It is also not necessary, since many
procedures don't take the form of calculating empirical counter-parts of
population quantities. So "consistency" came to mean convergence to the truth,
which is what really matters for inference.

*: To be fair, we might ask to come arbitrarily close to the truth *with
probability 1*, i.e., let the estimator mess up on bizarre, probability-0
exceptions. This "almost sure" convergence is often called "strong
consistency". Interestingly,
large deviations theory often lets us show
that the error probabilities $\delta(\epsilon, n)$ I mentioned above are
exponentially small in $n$, so that their sum $\sum_{n}{\delta(\epsilon, n)}
< \infty$. Then
the Borel-Cantelli
lemma tells us that $\hat{\theta}_n$ is more than $\epsilon$ away from
$\theta$ only finitely often, and so that $\hat{\theta}_n \rightarrow \theta$
with probability 1. (Indeed, the same argument works for
slower-than-exponential decay of error probabilities, e.g., $\delta =
O(n^{-2})$, so long as the deviation probabilities are summable.)
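To make the last step of that argument explicit: for each fixed $\epsilon > 0$, summability
\[
\sum_{n}{\mathbb{P}\left(|\theta - \hat{\theta}_n| \geq \epsilon\right)} < \infty
\]
plus Borel-Cantelli gives $\mathbb{P}\left(|\theta - \hat{\theta}_n| \geq \epsilon \text{ for infinitely many } n\right) = 0$. Running $\epsilon$ down a countable sequence, say $\epsilon = 1/k$, and taking the union of the countably many probability-0 exceptional sets (still probability 0) shows that $\hat{\theta}_n \rightarrow \theta$ with probability 1.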