Notes on "Repro Samples Method for Finite- and Large-Sample Inferences"

05 Mar 2024 11:33

Attention conservation notice: Somewhat critical notes on a technical contribution to the statistical literature I actually found very interesting, and compelling enough to study very closely. Not run past the authors so unusually likely to contain mistakes, even above my high baseline error rate. Also, full of unexplained jargon.

Min-ge Xie, Peng Wang, "Repro Samples Method for Finite- and Large-Sample Inferences", arxiv:2206.06421
Abstract: This article presents a novel, general, and effective simulation-inspired approach, called repro samples method, to conduct statistical inference. The approach studies the performance of artificial samples, referred to as repro samples, obtained by mimicking the true observed sample to achieve uncertainty quantification and construct confidence sets for parameters of interest with guaranteed coverage rates. Both exact and asymptotic inferences are developed. An attractive feature of the general framework developed is that it does not rely on the large sample central limit theorem and is likelihood-free. As such, it is thus effective for complicated inference problems which we can not solve using the large sample central limit theorem. The proposed method is applicable to a wide range of problems, including many open questions where solutions were previously unavailable, for example, those involving discrete or non-numerical parameters. To reduce the large computational cost of such inference problems, we develop a unique matching scheme to obtain a data-driven candidate set. Moreover, we show the advantages of the proposed framework over the classical Neyman-Pearson framework. We demonstrate the effectiveness of the proposed approach on various models throughout the paper and provide a case study that addresses an open inference question on how to quantify the uncertainty for the unknown number of components in a normal mixture model. To evaluate the empirical performance of our repro samples method, we conduct simulations and study real data examples with comparisons to existing approaches. Although the development pertains to the settings where the large sample central limit theorem does not apply, it also has direct extensions to the cases where the central limit theorem does hold.

I heard Prof. Xie give a talk based on this paper at CMU in the fall of 2023, and it was genuinely inspiring. It led to me studying this paper extremely carefully, which left me with the conclusion that their methods really do work, and should be broadly applicable. But it also led me to the conclusion that the authors' understandable enthusiasm for their real innovations has led them astray in some particulars.

Notation: When I write a function that takes a point as its argument, say $f(a)$, and instead I write a set as its argument, say $f(A)$, I mean the image of the set under the function, $f(A) \equiv \left\{ y: \exists a \in A : y=f(a) \right\}$. Similarly if I have a multi-argument function and I supply sets in place of one (or more) of its arguments.

  1. The paper considers models of the form $Y = G(\theta, U)$, where we observe a $\mathcal{Y}$-valued random variable $Y$, and where $G$ is a fixed function (known to the modeler), and $U$ is a noise source whose distribution is the same regardless of the parameter $\theta$, and also known to the modeler. This lets them define a Borel set \( B_{\alpha} \) where $P(U \in B_{\alpha}) = \alpha$.
    Let me be clear that I have absolutely no problem with this. Lots of models look like this! I'd even argue that models which can't be put in this form are strange. On the one hand, measure-theoretic probability tells us that it's a very weird model where we can't just say that $U$ is uniformly distributed on $[0,1]$. (In that case it's customary to write $\omega$ instead of $U$.) On the other hand, information theory tells us that if we've got the right model we can use it to compress the data reversibly to uniform noise. In any event, reading the paper should convince the skeptic that lots of our favorite statistical models have this form.
  2. The most basic form of their confidence sets, on observing $Y=y$, is $\Gamma_{\alpha}(y) = \left\{ \theta: \exists u \in B_{\alpha} : y = G(\theta, u) \right\}$. Now I claim that this just is a form of Neyman inversion. For fixed $\theta$, the image set $G(\theta, B_{\alpha})$ is a measurable subset of $\mathcal{Y}$, say \( c_{\alpha}(\theta) \). (I use a lower-case $c$ to emphasize that this is a non-random set.) So, for each $\theta$, \[ P_{\theta}(Y \in c_{\alpha}(\theta)) \geq \alpha ~, \] and \[ \Gamma_{\alpha}(y) = \left\{ \theta: y \in c_{\alpha}(\theta) \right\} ~. \] This is a construction Neyman would've recognized: we're testing the hypothesis that $\theta=\theta_0$ by seeing whether or not $Y \in c_{\alpha}(\theta_0)$, and the size of this test is indeed at most $1-\alpha$.
    In words, the test is something like "Can the data we observed be reproduced by drawing the noise in the generative process from a fixed high-probability set, when $\theta=\theta_0$?" If yes, we accept $\theta_0$; if no, we reject it. Whether this kind of test is one Uncle Jerzy would have liked, I'm less sure. He would no doubt have asked pointed questions about its power. But if the generative process has any sort of ergodic / concentration of measure properties, it's tending to put probability 1 on a very particular set of outcomes, and any (identifiable) change to $\theta$ is concentrating on a different set of outcomes, so that doesn't sound too worrisome. Indeed, the notion of "typical sets" from information theory / large deviations theory might give us a way to approach the choice of \( B_{\alpha} \).
  3. The paper then introduces a (typically real-valued) "nuclear mapping" $T(u,\theta)$, and \( \theta \)-dependent Borel sets \( B_{\alpha}(\theta) \) in the range of $T$, with the property that \[ P_{\theta}(T(U,\theta) \in B_{\alpha}(\theta)) \geq \alpha ~. \] We then have confidence sets of the form \[ \Gamma_{\alpha}(y) = \left\{ \theta: \exists u: y=G(\theta, u) ~ \text{and} ~ T(u,\theta) \in B_{\alpha}(\theta) \right\} ~. \] That is, there's some value of the noise which exactly reproduces the observed data, and some function of noise-and-parameter falls into a high-probability set that can vary depending on the parameter.
    I claim that this, too, is Neyman inversion. For a given $\theta$, there is some set of $u$ such that $T(u,\theta) \in B_{\alpha}(\theta)$. Let us call this $\tau_{\alpha}(\theta)$. This is a Borel set in the space of $U$. Similarly, the image set \[ d_{\alpha}(\theta) \equiv G(\theta, \tau_{\alpha}(\theta)) \] is a measurable subset of $\mathcal{Y}$. If $y \in d_{\alpha}(\theta)$ then there exists some $u \in \tau_{\alpha}(\theta)$ such that $y=G(\theta, u)$, and because that $u \in \tau_{\alpha}(\theta)$, we have that $T(u, \theta) \in B_\alpha(\theta)$. Moreover, since $U \in \tau_{\alpha}(\theta)$ implies $Y = G(\theta, U) \in d_{\alpha}(\theta)$, \[ P_{\theta}(Y \in d_{\alpha}(\theta)) \geq P(U \in \tau_{\alpha}(\theta)) \geq \alpha ~. \] So define \[ \Delta_{\alpha}(y) \equiv \left\{ \theta: ~ y \in d_{\alpha}(\theta) \right\} ~. \] Clearly $\Delta_{\alpha}(y) \subseteq \Gamma_{\alpha}(y)$, but equally clearly $\Delta_{\alpha}$ is a Neyman-inversion-style confidence set. To see that the reverse inclusion holds, i.e., that $\Gamma_{\alpha}(y) \subseteq \Delta_{\alpha}(y)$, notice that for each $\theta \in \Gamma_{\alpha}(y)$, (i) $y = G(\theta, u)$ for some $u$, and (ii) $T(u,\theta) \in B_\alpha(\theta)$ for that $u$. But (ii) implies that the postulated $u$ must be in $\tau_{\alpha}(\theta)$. Hence $y \in G(\theta, \tau_{\alpha}(\theta))$, and so $y \in d_{\alpha}(\theta)$. So, finally, $\Gamma_{\alpha}(y) = \Delta_{\alpha}(y)$, and we can just say \[ \Gamma_{\alpha}(y) = \left\{ \theta: y \in d_{\alpha}(\theta) \right\} ~. \]
  4. The manuscript calculates finite confidence intervals for the number of components in Gaussian mixture models. But classic arguments seem to show that one can, at best, give one-sided confidence intervals for the number of mixture components, in effect saying "you need at least so many components". The reason being that it's trivial to add very small, low-probability components, or to split a component into two components with centers arbitrarily close to each other, and to have vanishing power to detect those alterations. (Cf.) Now, carefully examining what this manuscript does, I note two things: (i) there is an imposed upper limit on the number of mixture components, and (ii) all the confidence intervals include this upper limit. I conclude that the real work is being done by the assumption (i), that there are at most so many mixture components...
If this were a more constructive note, I would at this point include some worked examples of how the repro method can make it straightforward to construct confidence sets, but I will instead refer you to the paper.
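
Still, a toy sketch (mine, not taken from the paper) may help fix ideas for points 2 and 3. I assume a binomial model written in generative form: $U$ is a vector of $n$ independent uniforms on $[0,1]$, and $G(\theta, u)$ counts how many coordinates fall at or below $\theta$. I take the nuclear mapping to be $T(u,\theta) = G(\theta,u)$ and let $B_{\alpha}(\theta)$ be an equal-tailed range of counts, so that $d_{\alpha}(\theta)$ is just that range; every function name below is my own invention, and the equal-tailed choice is only one of many that would do. The confidence set is then computed purely by inverting the acceptance regions, which is the whole point of the argument above.

```python
import math
import random


def G(theta, u):
    """Generative map (assumed toy model): count of uniforms at or below theta."""
    return sum(ui <= theta for ui in u)


def binom_pmf(k, n, theta):
    """P_theta(Y = k) when Y = G(theta, U) and U has n i.i.d. uniform coordinates."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)


def acceptance_region(n, theta, alpha):
    """Equal-tailed (lo, hi) with P_theta(lo <= Y <= hi) >= alpha.

    This plays the role of d_alpha(theta); trimming each tail by at most
    (1 - alpha) / 2 is an illustrative choice, not the paper's prescription.
    """
    tail = (1 - alpha) / 2
    pmf = [binom_pmf(k, n, theta) for k in range(n + 1)]
    lo, mass = 0, 0.0
    while mass + pmf[lo] <= tail:  # drop low counts while the lower tail stays small
        mass += pmf[lo]
        lo += 1
    hi, mass = n, 0.0
    while mass + pmf[hi] <= tail:  # drop high counts while the upper tail stays small
        mass += pmf[hi]
        hi -= 1
    return lo, hi


def accepts(y, n, theta, alpha):
    """The test of theta: is y inside d_alpha(theta)?"""
    lo, hi = acceptance_region(n, theta, alpha)
    return lo <= y <= hi


def confidence_set(y, n, alpha, grid):
    """Gamma_alpha(y), restricted to a finite grid of candidate theta values."""
    return [t for t in grid if accepts(y, n, t, alpha)]


def exact_coverage(n, theta0, alpha):
    """P_theta0(theta0 in Gamma_alpha(Y)), summed exactly over all outcomes."""
    return sum(
        binom_pmf(y, n, theta0) for y in range(n + 1) if accepts(y, n, theta0, alpha)
    )


n, alpha, theta0 = 20, 0.95, 0.3
rng = random.Random(1)
u = [rng.random() for _ in range(n)]        # the noise
y = G(theta0, u)                            # the observed data
grid = [i / 200 for i in range(1, 200)]     # candidate parameter values
gamma = confidence_set(y, n, alpha, grid)
print("observed y:", y)
print("confidence set spans:", min(gamma), "to", max(gamma))
print("exact coverage at theta0:", exact_coverage(n, theta0, alpha))
```

Because the outcome is discrete, the exact coverage will generally come out strictly above $\alpha$, which is why the probability statements in the construction are inequalities rather than equalities. In this toy case the inversion reproduces a familiar exact binomial interval, restricted to the grid.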

Update, 5 March 2024: Very shortly after writing the above, I saw that Profs. Xie and Wang have a new preprint (arxiv:2402.15004) on the method. I have not had a chance to read beyond the abstract, but that, at least, repeats the claims which I criticized above. I re-iterate that I haven't had a chance to read their new manuscript; when I do, I'll update this note. (And perhaps withdraw my objections, we'll see.)