Notebooks
http://bactra.org/notebooks
Cosma's Notebooks
Confidence Sets, Confidence Intervals
http://bactra.org/notebooks/2024/03/10#confidence-sets
<P>This is, to my mind, one of the more beautiful and useful ideas
in <a href="statistics.html">statistics</a>, but also one of the more tricky.
(I might admire the idea more <em>because</em> of the trickiness.)
<P>We have some parameter of a stochastic model we want to learn about,
proverbially \( \theta \), which lives in the parameter space \( \Theta \). We
observe random data, say \( X \). The distribution of \( X \) changes with \(
\theta \), so the probability law is \( P_{\theta} \). Our game is one of
"statistical inference", i.e., we look at \( X \) and make a guess about \(
\theta \) on that basis. One type of guess would be an exact value for \(
\theta \), a <em>point estimate</em>. But we'd basically never expect any
point estimate to be exactly right, and we'd like to be able to say something
about the uncertainty. A <strong>level \( \alpha \) confidence set</strong> is
a <em>random</em> set of parameter values \( C_{\alpha} \subseteq \Theta \)
which contains the true parameter value, <em>whatever it might happen to
be</em>, with probability \( \alpha \) (at least):
\[
\min_{\theta \in \Theta}{P_{\theta}(\theta \in C_{\alpha})} \geq \alpha
\]
We say that \( C_{\alpha} \) has <strong>coverage level</strong> \( \alpha \).
<P>Quibbles:
<ul>
<li> It's (pragmatically) implied that the coverage probability is \( =\alpha \) for at least some \( \theta \); if the probability is \( > \alpha \) for all \( \theta \), we say the confidence set is "conservative".
<li> If you know enough to quibble about "min" vs. "inf", you also know what I
meant.
<li> \( C_{\alpha} \) is really \( C_{\alpha}(X) \), a (measurable) function of the data, but I am trying to keep the notation under control.
<li> In many situations there will be other ("nuisance") parameters we <em>don't</em> care about, canonically \( \psi \), and then we have to consider the worst case over both \( \theta \) and \( \psi \) simultaneously, even if we really only want to draw inference about \( \theta \).
</ul>
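<P>As a quick numerical check on the definition, here is a minimal sketch that simulates the coverage of one familiar confidence set. Everything specific in it is an assumption made for illustration: Gaussian data with known standard deviation 1, the usual z-interval for the mean, and the hypothetical function names <code>z_interval</code> and <code>coverage</code>. Following the convention above, \( \alpha \) is the coverage level (e.g., 0.95), not the error rate.

```python
import random
from statistics import NormalDist, fmean

def z_interval(xs, alpha=0.95, sigma=1.0):
    """Level-alpha interval for a Gaussian mean with known sigma.

    alpha is the coverage level, as in the text (not the error rate).
    """
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    half_width = z * sigma / len(xs) ** 0.5
    center = fmean(xs)
    return center - half_width, center + half_width

def coverage(theta, n=20, alpha=0.95, reps=4000, seed=0):
    """Monte Carlo estimate of P_theta(theta in C_alpha)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(theta, 1.0) for _ in range(n)]
        lo, hi = z_interval(xs, alpha)
        hits += lo <= theta <= hi
    return hits / reps

# The definition takes the worst case over theta, so check several values.
rates = {theta: coverage(theta) for theta in (-3.0, 0.0, 10.0)}
worst_case = min(rates.values())
print(rates, worst_case)
```

A simulation can only probe finitely many values of \( \theta \), so this is a sanity check rather than a proof of coverage; for this particular interval the coverage does not depend on \( \theta \) at all.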
<h4>Either the confidence set contains the truth, or we were really unlucky</h4>
<P>Now, confidence sets are notoriously hard for learners to wrap their minds
around, but I have a way of explaining them which <em>seems</em> to work when I
teach, and so I might as well share.
<P>When I construct a confidence set from our data, I am offering you, the
reader, a dilemma: <em>Either</em>
<ol>
<li> the true parameter value is in the confidence set \( C_{\alpha} \), <em>or</em>
<li> we were very unlucky, and we got data that was very improbable (\( P \leq
1-\alpha \)) and unrepresentative under <em>all</em> values of the parameter.
</ol>
The second fork of the dilemma obtains because the event \( \theta \not\in
C_{\alpha} \) clearly has probability at most \( 1-\alpha \), <em>regardless</em> of \( \theta \).
<P>(More strictly there is really a <em>tri</em>-lemma here:
<ol>
<li> the true parameter value is in the confidence set \( C_{\alpha} \), <em>or</em>
<li> we were very unlucky, and we got data that was very improbable (\( P \leq
1-\alpha \)) and unrepresentative under <em>all</em> values of the parameter, <em>or</em>
<li> the model we're using to calculate probabilities is wrong.
</ol>
But even interpreting parameters in mis-specified models is hard, and I don't
want to pursue the third fork [tine?] of the trilemma here.)
<h4>The confidence set is every parameter value we can't reject</h4>
<P>At this point a very reasonable question is to ask how on Earth we're
supposed to find such a set. Here is one very general procedure. Suppose that
we can statistically test whether \( \theta = \theta_0 \). That is, we have
some function \( T(X;\theta_0) \) which returns 0 if \( X \) looks like it
could have come from \( \theta=\theta_0 \), and returns 1 otherwise. More
concretely, \( P_{\theta_0}{(T(X;\theta_0) = 1)} \leq 1-\alpha \), so the "false
positive" rate or "false rejection" rate is at most \( 1-\alpha \). (That is,
the "size" of the test is at most \( 1-\alpha \), over all parameter values.)
Now building \( C_{\alpha} \) is very easy:
\[
C_{\alpha}(X) = \left\{ \theta \in \Theta ~ : ~ T(X;\theta) = 0 \right\}
\]
(Here I am being explicit that \( C_{\alpha} \) is a function of the data \( X \), which I otherwise suppress in the notation.)
<P>In words: the confidence set consists of all the parameter values
compatible with the data, i.e., all the parameter values we can't reject (at
an acceptably low error rate \( 1-\alpha \)).
<P>This construction is called "inverting the hypothesis test". Clearly, any
hypothesis test gives us a confidence set, by inversion. Equally clearly, any
confidence set can be used to give a hypothesis test: to test whether \( \theta
= \theta_0 \), see whether \( \theta_0 \in C_{\alpha} \); the false-rejection
rate of this test is, by construction, \( \leq 1-\alpha \).
<P>It is a little less clear that <em>every</em> confidence set can be
constructed by inverting <em>some</em> test, but it's nonetheless true, and
a textbook result (see, e.g., Casella and Berger, or Schervish). This is called the "duality between hypothesis tests and confidence sets".
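<P>The inversion construction can be sketched in a few lines. This is an illustration under stated assumptions, not a general-purpose routine: the data are the number of successes \( x \) in \( n \) Bernoulli trials, the test is an equal-tailed exact binomial test, and the parameter space is replaced by a finite grid. The names <code>reject</code> and <code>confidence_set</code> are hypothetical.

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(X = k) for X ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

def reject(x, n, theta0, alpha=0.95):
    """T(x; theta0): 1 if x falls in either tail under theta0, else 0.

    Each tail gets probability at most (1 - alpha)/2, so the
    false-rejection rate is at most 1 - alpha.
    """
    lower_tail = sum(binom_pmf(k, n, theta0) for k in range(0, x + 1))
    upper_tail = sum(binom_pmf(k, n, theta0) for k in range(x, n + 1))
    tail = (1 - alpha) / 2
    return int(lower_tail < tail or upper_tail < tail)

def confidence_set(x, n, alpha=0.95, grid_size=1001):
    """All grid values of theta that the test does not reject."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return [theta for theta in grid if reject(x, n, theta, alpha) == 0]

kept = confidence_set(x=7, n=20)
print(min(kept), max(kept))
```

Going the other way, testing \( \theta = \theta_0 \) amounts to checking whether \( \theta_0 \) is among the retained values. The grid is only a computational convenience; a finer grid (or a root-finder on the tail probabilities) would locate the end-points more precisely.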
<h4>Consistency and Evidence</h4>
<P>Now at this point you might feel we're done, because we've got a range of
parameter values which we know is right with high probability. Of course you
might worry about what probability means about any <em>particular</em> case,
but there's no <em>special</em> difficulty about that here, as opposed to (say)
predicting the risk of rain tomorrow. But there is an additional
wrinkle here, which has to do with <a href="consistency-pac.html">consistency</a>, or <em>convergence</em> to the truth.
<P>Suppose we get larger and larger data sets, \( X_n \) with \( n \rightarrow
\infty \). For each one, we construct a confidence set \( C_{\alpha}(X_n) \).
What we would <em>like</em> to have happen is for these sets to get smaller and
smaller, and to converge on the true value, \( C_{\alpha} \rightarrow \theta
\). That is, if the true \( \theta \neq \theta_0 \), we'd like \(
P_{\theta}(\theta_0 \not\in C_{\alpha}(X_n)) \rightarrow 1 \) as \( n \rightarrow
\infty \). If we think about things in terms of the hypothesis test, we'd like
the probability of <em>correctly</em> rejecting the <em>wrong</em> parameter
values to go to 1 as we get more and more data (at constant false-rejection
probability). So: inverting a consistent hypothesis test gives us a consistent
confidence set (one which converges on the truth), and vice versa.
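<P>To see consistency at work in the simplest case, here is a small sketch that computes, in closed form, the probability that a fixed <em>wrong</em> value \( \theta_0 \) is still inside the confidence set as \( n \) grows. The setting is an assumption chosen for convenience (Gaussian data with known \( \sigma \), the z-interval for the mean), and the function name is hypothetical.

```python
from statistics import NormalDist

def prob_wrong_value_retained(n, theta=0.0, theta0=0.5, alpha=0.95, sigma=1.0):
    """P_theta(theta0 in C_alpha(X_n)) for the known-sigma z-interval.

    The sample mean is N(theta, sigma^2 / n), and theta0 is retained
    iff |mean - theta0| <= z * sigma / sqrt(n).
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf((1 + alpha) / 2)
    shift = (theta0 - theta) * n**0.5 / sigma
    return std_normal.cdf(shift + z) - std_normal.cdf(shift - z)

sample_sizes = (10, 40, 160, 640)
probs = [prob_wrong_value_retained(n) for n in sample_sizes]
for n, p in zip(sample_sizes, probs):
    print(n, p)
```

The retention probability for the wrong value shrinks toward zero, i.e., the rejection probability goes to one, while the retention probability for the true value stays fixed at \( \alpha \) for every \( n \).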
<P>If we have a consistent confidence set, then, I claim, we've
got <em>evidence</em> that the true parameter value is in the set.
<P>(When a parameter is only partially identified, then inverting consistent
tests will give confidence regions converging to the set of observationally-equivalent parameter values, rather than to a single point.)
<h4>Confidence Intervals</h4>
I have written about confidence "sets" because the basic logic is very abstract
and doesn't rely on any geometric properties of the parameter space. But in
many situations the parameters we're interested in are real numbers, and the
test functions \( T(X;\theta) \) are piece-wise constant in \( \theta \). This
is the sort of situation where the confidence set we'll get by inverting a test
is an interval. In a few Euclidean dimensions, we might get a ball or box,
or anyway some sort of compact, connected region. But in many of the situations
I'm interested in, the parameter of interest is something like a function
or a network, and "interval" just isn't going to cut it.
<h4><a name="confidence-sets-for-model-selection">Confidence Sets for Model Selection</a></h4>
There are many problems which are basically forms
of <a href="model-selection.html">model selection</a> where it would be very
nice to quantify uncertainty in the form of confidence sets. Examples include:
the number of clusters in a <a href="mixture-models.html">mixture model</a>;
the number of factors in a <a href="factor-models.html">factor model</a>;
which <a href="variable-selection.html">variables</a> to include in a
regression; the order of a <a href="inference-markov.html">Markov chain</a>;
the <a href="inference-markov.html">context tree of a variable-length Markov
chain</a>; the <a href="graphical-causal-models.html">directed acyclic graph in
a graphical causal model</a>. Unfortunately it seems to me that we will
usually only be able to give <em>one-sided</em> confidence sets, saying, in
effect, "the process must have at least this much structure, but it could have
infinitely more".
<P>To see the issue, take mixture models. Suppose the data really did come
from a \( k \)-cluster mixture model, \( f(x) = \sum_{i=1}^{k}{\alpha_i
f(x;\theta_i)} \). I can approximate this arbitrarily closely using an \( m
\)-cluster mixture, for any \( m > k \). The trick is just to reduce the \(
\alpha_1, \ldots, \alpha_k \) very slightly, and make \( \alpha_{k+1}, \ldots,
\alpha_m \) very close to zero --- so close that it'd be unlikely to have
actually drawn data from any of those clusters in the first \( n \) samples.
Thus any \( k \)-cluster distribution is actually arbitrarily close to
infinitely many distributions with arbitrarily more clusters. This is true
for any sensible and relevant sense of distance between distributions --- <a href="information-theory.html">Kullback-Leibler divergence</a>, anything from
<a href="info-geo.html">information geometry</a>, total variation, etc.
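<P>To make "arbitrarily close" concrete for total variation, here is a sketch of the calculation; the symbols \( \epsilon \) and \( g \) are introduced only for this illustration. Let \( g \) be any mixture of the \( m-k \) extra components, and shrink the original weights by a common factor \( 1-\epsilon \):
\[
f_{\epsilon}(x) = (1-\epsilon) f(x) + \epsilon g(x)
\]
This is an \( m \)-cluster mixture, and its total variation distance from \( f \) is
\[
\frac{1}{2}\int{\left| f_{\epsilon}(x) - f(x) \right| dx} = \frac{\epsilon}{2}\int{\left| g(x) - f(x) \right| dx} \leq \epsilon
\]
With \( n \) independent samples from \( f_{\epsilon} \), all of them come from the \( f \) part with probability \( (1-\epsilon)^n \), so the total variation distance between the two joint distributions is at most \( 1 - (1-\epsilon)^n \leq n\epsilon \). The rejection probability of any test therefore differs by at most \( n\epsilon \) between the two distributions, and this can be made as small as we like, at any fixed \( n \), by shrinking \( \epsilon \).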
<P>Similarly, for factor models, I just make the loadings on the extra factors
extremely small (but not zero). For Markov models, I make the conditional
dependence on the remote past extremely small (but not zero). For variable
selection in regression, I make the slope on the extra regressors extremely
small (but not zero). For graphical causal models, I make the extra causal
links extremely weak (but not zero). Because, in all these cases, the
distribution of observables changes smoothly as these parameters are varied,
but the model structure changes abruptly when various of these parameters hit
zero, I don't think we ever get to <em>rule out</em> very complicated
structures. (More exactly: there's no way to rule them out with any
statistical power; we could always use <a href="gygax-tests.html">Gygax
tests</a>.) We can rule out structures which are too simple to account for the
data, and we can say that we have no <em>need</em> for the complicated ones,
yet, but that's a one-sided confidence set. This moves us
towards <a href="occams-razor.html">Occam's razor</a> (particularly Kevin
Kelly's version of it --- follow the link).
<P>Formally: divide the over-all parameter vector \( \theta \) into the
discrete, model-structure part \( \kappa \) and the remaining bits \( \eta \).
Say that the true parameter \( \theta^* = (\kappa^*, \eta^* ) \). What I've
argued above is that for <em>any</em> \( \kappa^*, \eta^* \), we can find \(
\kappa^{\prime}, \eta^{\prime} \), with \( \kappa^{\prime} > \kappa^* \), such that, at
any sample size \( n \), the distance between the distribution generated by \(
(\kappa^{\prime}, \eta^{\prime}) \) and that generated by \( (\kappa^*, \eta^*) \)
is arbitrarily small. Hence no test can have any power to reject \(
(\kappa^{\prime}, \eta^{\prime}) \), hence no confidence set (with any power) can
exclude it. Thus we cannot exclude \( \kappa=\kappa^{\prime} \), because it's
compatible with the data for <em>some</em> value of \( \eta \). It's vital
here that when we increase \( \kappa \) to \( \kappa^{\prime} \), we can make
compensating changes to \( \eta \) to stay close in distribution to \(
(\kappa^*, \eta^*) \). If that's ruled out for some reason, we're back in
business.
<P>Now, for variable selection, there is an apparent way out, which is the use
of the lasso or similar. But the trick there is the <em>assumption</em> that
the true regression coefficient vector is sparse. If we're sure that at most
\( s \) of the \( p \) regressors have non-zero coefficients, \( s \ll p \),
and we have in fact detected, say, \( \approx s \) non-zero coefficients, we
can indeed be pretty sure that there aren't many more of them lurking around.
(Or, at least, we can transfer some confidence from our sparsity assumption to
our conclusion about which variables matter.) Exactly what would take the
place of sparsity for other model-selection problems is something I should
think through.
<P>One final note on this sub-topic: Suppose that the true distribution really
is a \( k \)-cluster mixture. Then we should be able to reject distributions
where there are (say) 100 extra clusters <em>of large weight</em>, and we will
become more and more able to reject them as we get more data. So those very
complicated models with lots of extra structure will tend to become ones where
the extra structure does less and less work. (Again, this gets us back towards
Kelly-style Occam's razors.) If we try to form confidence sets for
not <em>just</em> the model structure (here, number of clusters), but the whole
model, <em>those</em> sets will work properly.
<ul>See also:
<li><a href="bootstrap.html">Bootstrapping, and Other Resampling Methods</a> (for one particularly useful way of building confidence sets)
<li><a href="conformal-prediction.html">Conformal Prediction</a>
<li><a href="gygax-tests.html">Gygax Tests</a>
<li><a href="nonparametric-confidence-sets.html">Nonparametric Confidence Sets for Functions</a>
<li><a href="partial-identification.html">Partial Identification</a>
</ul>
<ul>Recommended, big picture but textbook treatments:
<li>George Casella and R. L. Berger, <cite>Statistical Inference</cite>
<li>Mark J. Schervish, <cite>Theory of Statistics</cite>
</ul>
<ul>Recommended, close-ups:
<li>Don Fraser, "Is Bayes posterior just quick and dirty confidence?",
<cite>Statistical Science</cite> <strong>26</strong> (2011): 299--316,
<a href="http://arxiv.org/abs/1112.5582">arxiv:1112.5582</a> [See also the
discussions by others, and Fraser's reply. My answer to the question posed in
Fraser's title is "yes", or rather "YES!"]
<li>Tore Schweder and Nils Lid Hjort, <cite><a href="http://cambridge.org/9780521861601">Confidence, Likelihood, Probability:
Statistical Inference with Confidence Distributions</a></cite> [I need to think very hard about the meaning and utility of their "confidence distributions"]
</ul>
<ul>Recommended, big picture, historical:
<li>Trygve Haavelmo, "The Probability Approach in Econometrics", <a href="https://doi.org/10.2307/1906935"><cite>Econometrica</cite> <strong>12</strong> supplement (1944): iii--115</a>
<li>Jerzy Neyman, "Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability", <a href="https://doi.org/10.1098/rsta.1937.0005"><cite>Philosophical Transactions of the Royal Society of London</cite> A <strong>236</strong> (1937): 333--380</a>
</ul>
<ul>Recommended with some reservations:
<li>Min-ge Xie, Peng Wang, "Repro Samples Method for Finite- and Large-Sample Inferences", <a href="http://arxiv.org/abs/2206.06421">arxiv:2206.06421</a> [Comments in their <a href="notes-on-repro-samples.html">own notebook</a>.]
</ul>
<ul>To read:
<li>Heng Lian, "Empirical Likelihood Confidence Intervals for Nonparametric Functional Data Analysis", <a href="http://arxiv.org/abs/0904.0843">arxiv:0904.0843</a>
<li>Jana Jankova, Sara van de Geer, "Confidence intervals for high-dimensional inverse covariance estimation", <a href="http://arxiv.org/abs/1403.6752">arxiv:1403.6752</a>
<li>Stephen M. S. Lee, "Hybrid confidence regions based on data depth",
<a href="http://dx.doi.org/10.1111/j.1467-9868.2011.01006.x"><cite>Journal of the Royal Statistical Society</cite> B <strong>74</strong> (2012): 91--109</a>
<li>Kesar Singh, Minge Xie, William E. Strawderman, "Confidence
distribution (CD) -- distribution estimator of a
parameter", <a href="http://projecteuclid.org/euclid.lnms/1196794948">pp. 132--150
in Regina Liu, William Strawderman and Cun-Hui Zhang (eds.), <cite>Complex
Datasets and Inverse Problems: Tomography, Networks and Beyond</cite></a>
<li>Amy Willis, "Confidence sets for phylogenetic trees",
<a href="https://doi.org/10.1080/01621459.2017.1395342"><cite>Journal of the American Statistical Association</cite> <strong>114</strong> (2019): 235--244</a>, <a href="http://arxiv.org/abs/1607.08288">arxiv:1607.08288</a>
<li>Amy Willis and Rayna Bell, "Uncertainty in Phylogenetic Tree Estimates", <a href="https://doi.org/10.1080/10618600.2017.1391697"><cite>Journal of Computational and Graphical Statistics</cite> <strong>27</strong> (2018): 542--552</a>, <a href="http://arxiv.org/abs/1611.03456">arxiv:1611.03456</a>
</ul>