## Consistency (in statistics) or "Probably Approximately Correct"

*10 Nov 2017 15:51*

In statistical jargon, an estimator of some quantity is "consistent" when it converges (in probability) on the true value as it gets more and more data. This is a mis-leading name for an important idea.

Suppose we have some quantity, canonically $\theta$, which we are trying to
estimate from data $n$ data points, , \ldots X_n = X_{1:n}$. Our data are random, so any estimate based on the data,
$\hat{\theta}_n$ will also be random. We say that $\hat{\theta}_n$ is
"consistent" when it converges in probability on $\theta$, whatever that
true value might happen to be. As you know, Babur, "converges in probability",
means that you can set any margin of approximation $\epsilon > 0$,
and any probability of error $\delta > 0$, and I can find an
$N(\epsilon, \delta) < \infty$ which guarantees that, for $n \geq N(\epsilon, \delta)$,
we have an $\epsilon$-good approximation with probability at least $1-\delta$:
\[
\Pr{(|\theta - \hat{\theta}_n| \geq \epsilon) \leq \delta
\]
(If you are the sort of person who immediately asks "What about data which
doesn't arrive in the form of a sequence, but an increasingly filled-in spatial
grid?" or "What if I have only one measurement, but it's increasingly
precise?", you can amuse yourself by adjusting the definition accordingly.)
Turned around, convergence in probability means that you can pick any
$\epsilon > 0$ and any $n$, and I can not only find a $\delta(\epsilon, n) < 1$
such that
\[
\Prob{(|\theta-\hat{\theta}_n| \geq \epsilon) \leq \delta(\epsilon, n)
\]
but also that, for every *fixed* $\epsilon$,
\[
\delta(\epsilon, n) \rightarrow 0
\]

For doing inference, this is
a pretty important notion, albeit a limited one. A more ambitious
goal would require that the estimator eventually give us the truth;
consistency doesn't require that. Or, we might ask that the estimator
always come arbitrarily close to the truth, with enough data; consistency
doesn't even require that. (*) All it requires is that we can have
arbitrarily high (probabilistic) confidence of eventually getting
arbitrarily close. But *this* is often attainable.

When computer scientists re-invented this notion in the 1980s, they gave it the much more transparent name of "probably approximately correct" (PAC). (Though, being computer scientists, they often add the restriction that the number of samples required, $N$, should grow only polynomially in $1/\epsilon$ and $1/\delta$, and sometimes the further restriction that $\hat{\theta}_n$ be computable from {1:n}$ in polynomial time.)

I am not *sure* of how the name "consistency" came about, but my
impression, backed by
"Earliest Known Uses of Some of the
Words of Mathematics", s.v. "Consistency", is as follows. Fisher (at least
at one point) thought of estimators and other statistical procedures as
involving calculating the sample counter-parts of population quantities, like
means, medians, or correlation coefficients. For him, a procedure would be
"consistent" if it gave the right answer when the population values were
plugged in. This, I think we can agree, is a sensible use of the word
"consistent". Fisher-consistency in turn *suggests* that the procedure
will converge on the true value as the sample gets larger and larger. But
Fisher-consistency isn't sufficient for convergence to the truth;
there would need to be some form of continuity as well.
It is also not necessary, since many
procedures don't take the form of calculating sample counter-parts of
population quantities. So "consistency" came to mean convergence to the truth,
which is what really matters for inference.

*: To be fair, we might merely ask to come arbitrarily close to
the truth *with probability 1*, i.e., let the estimator
mess up on bizarre, probability-0 exceptions. This "almost sure"
convergence is often called "strong consistency". Interestingly,
large deviations theory often lets us show
that the error probabilities $\delta(\epsilon, n)$ I mentioned above are
exponentially small in $n$, therefore their sum $\sum_{n}{\delta(\epsilon, n)}
< \infty$. Then
the Borel-Cantelli
lemma tells us that $\hat{\theta}_n$ is more than $\epsilon$ away from
$\theta$ only finitely often, and so that $\hat{\theta}_n \rightarrow \theta$
with probability 1. (Indeed, the same argument works for
slower-than-exponential decay iof error probabilities, e.g., $\delta =
O(n^{-2})$, so long as they are summable.)