## Some Notes on Influence Functions and Targeted Learning

*19 Jan 2024 09:51*

Attention conservation notice: Written for a reading group circa 2016, largely to help myself understand, transcribed here to help me throw away the paper. Unusually likely, even for me, to contain stupid errors. If you're actually interested you should probably start with Fisher and Kennedy (2021) and go on from there.

#### The targeted learning idea (take 1)

Our \( n \) data points come iidly from a probability measure \( P \). (The empirical measure is \( P_n \).) We care about some functional \( \Psi(P) \), not all of \( P \).

We get a reasonable initial estimate \( M \) of \( P \).

We then re-shape \( M \) to pay more attention to the aspects of the distribution which matter for \( \Psi(P) \), the target of our inference, and try to ignore other stuff.

In particular, we set up and solve an estimating equation...

#### Abbreviated de Finetti notation for integrals and expectations

When \( \mu \) is a measure and \( f \) is a function it would make sense to integrate (real- or vector-valued, etc.), \[ \mu f \equiv \int{f(x) \mu(dx)} \] When \( \mu \) is a probability measure, this is the expected value of \( f(X) \), with \( X \sim \mu \).
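
In particular, when \( \mu \) is the empirical measure \( P_n \), \( P_n f \) is just the sample average of \( f \). A minimal numeric sketch of that special case (the sample and function here are my own, made up for illustration):

```python
import random

# De Finetti notation: mu f = integral of f with respect to mu.
# When mu is the empirical measure P_n, this is the sample average of f.
random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def f(x):
    return x * x

# P_n f: a Monte Carlo estimate of P f = E[X^2] = 1 for a standard Gaussian.
P_n_f = sum(f(x) for x in sample) / len(sample)
print(P_n_f)
```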

(I don't know if de Finetti really invented this notation, but that's what everyone calls it nowadays.)

#### Warmup: Directional derivatives vs. gradients

Start with a function \( f: \mathbb{R}^d \mapsto \mathbb{R} \). We can define its derivative in the direction \( h \in \mathbb{R}^d \) as \[ Df(x,h) \equiv \lim_{\epsilon \rightarrow 0}{\frac{f(x+\epsilon h) - f(x)}{\epsilon}} \]

Against this, the gradient is the vector of partial derivatives, \[ \nabla f(x) \equiv \left[ \begin{array}{c} \frac{\partial f}{\partial x_1}(x) \\ \vdots \\ \frac{\partial f}{\partial x_d}(x) \end{array} \right ] \] so (assuming the regularity gods are kind) \[ Df(x,h) = h \cdot \nabla f(x) \]
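
A quick numeric sanity check of this identity, on a smooth \( f \) of my own choosing (finite differences standing in for the limit):

```python
# Check Df(x, h) = h . grad f(x) for f(x, y) = x^2 y + 3 y,
# whose gradient is (2 x y, x^2 + 3).

def f(v):
    x, y = v
    return x * x * y + 3.0 * y

def grad_f(v):
    x, y = v
    return [2.0 * x * y, x * x + 3.0]  # partial derivatives by hand

x = [1.0, 2.0]
h = [0.5, -1.0]
eps = 1e-6

# Finite-difference version of Df(x, h) = lim (f(x + eps h) - f(x)) / eps
fd = (f([x[0] + eps * h[0], x[1] + eps * h[1]]) - f(x)) / eps

# Gradient version: h . grad f(x)
dot = sum(hi * gi for hi, gi in zip(h, grad_f(x)))
print(fd, dot)
```

Both come out to \( h \cdot \nabla f(x) = 0.5 \times 4 + (-1) \times 4 = -2 \).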

#### Derivatives of functionals of probability measures

We'd like to extend this idea to real-valued functionals of probability measures, \( \Psi: \mathcal{P} \mapsto \mathbb{R} \) (where \( \mathcal{P} \) is the set of all probability measures on some base space, or maybe a restriction of that). But when we make a small change to the measure, we need to make sure we stay on the manifold of probability measures, rather than launching ourselves into the larger space of signed and/or unnormalized measures. So we define the derivative of \( \Psi \) at the probability measure \( P \) in the direction of the probability measure \( H \) via a convex combination: \[ D \Psi(P, H) = \lim_{\epsilon\rightarrow 0}{\frac{\Psi((1-\epsilon) P + \epsilon H) - \Psi(P)}{\epsilon}} \] Now we introduce the equivalent of the gradient as the **influence
function**:
\[
\Psi^{\prime}_{P}(x) \equiv D \Psi(P, \delta_x)
\]
with \( \delta_x \) being the Dirac (point-mass) measure at \( x \). In words, this is asking "How quickly does \( \Psi \) shift as we move a little probability mass to the point \( x \)?"

It should be plausible, though we won't prove it here, that \[ D \Psi(P,H) = H \Psi^{\prime}_P = \int{\Psi^{\prime}_P(x) H(dx)} \] (The second equality is just a reminder of what the de Finetti notation means.) This is the equivalent of saying that the gradient lets us calculate all the directional derivatives.

**Exercise**: Prove
\[
P \Psi^{\prime}_P = 0
\]
Notice: If we pretend that we have an Oracle which can evaluate expectations
under \( P \), we can imagine using this as an estimating equation: keep
tuning the \( Q \) in \( P \Psi^{\prime}_Q \) until we hit zero, and then
(presumably) we've got \( Q = P \).
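
To make the definition and the exercise concrete, here's a numeric sketch on a functional I've picked for illustration, \( \Psi(P) = P(X > t) \): mixing in a point mass at \( x \) and differentiating recovers \( \Psi^{\prime}_P(x) = \mathbf{1}\{x > t\} - \Psi(P) \), which visibly averages to zero under \( P \).

```python
# Illustrative example (my choice, not from the text): Psi(P) = P(X > t)
# on a discrete P.  Mixing in a point mass delta_x gives
#   Psi((1 - eps) P + eps delta_x) = (1 - eps) Psi(P) + eps * 1{x > t},
# so the influence function is Psi'_P(x) = 1{x > t} - Psi(P).

t = 1.0
support = [0.0, 1.0, 2.0, 3.0]
probs = [0.4, 0.3, 0.2, 0.1]           # a discrete P; here Psi(P) = 0.3

def Psi(prob):
    return sum(p for p, x in zip(prob, support) if x > t)

def influence(x):
    return (1.0 if x > t else 0.0) - Psi(probs)

# Finite-eps version of the directional derivative D Psi(P, delta_x), at x = 2.0:
eps = 1e-8
mixed = [(1 - eps) * p for p in probs]
mixed[2] += eps                         # add eps worth of a point mass at 2.0
numeric = (Psi(mixed) - Psi(probs)) / eps

# Exercise check: P Psi'_P = 0 -- the influence function averages to zero
# under P itself.
P_inf = sum(p * influence(x) for p, x in zip(probs, support))
print(numeric, influence(2.0), P_inf)
```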

#### von Mises Expansions and asymptotic linearity

If we plug in an empirical distribution \( P_n \), the **von Mises expansion** is a first-order Taylor approximation around the true distribution: \[ \begin{eqnarray} \Psi(P_n) & \approx & \Psi(P) + (P_n - P) \Psi^{\prime}_P\\ & = & \Psi(P) + P_n \Psi^{\prime}_P - P \Psi^{\prime}_P\\ \Psi(P) & \approx & \Psi(P_n) - P_n \Psi^{\prime}_P\\ \sqrt{n}( \Psi(P_n) - \Psi(P) ) & \rightsquigarrow & \mathcal{N}(0, P(\Psi^{\prime}_P)^2) ~ \text{(CLT)}\\ \end{eqnarray} \] (The variance in the last line simplifies to \( P(\Psi^{\prime}_P)^2 \) because \( P \Psi^{\prime}_P = 0 \), from the exercise above.) The third line is a sort of one-step adjustment to the plug-in estimator.
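
To see the first-order term at work, here's a deterministic check on a functional of my own choosing, \( \Psi(P) = (P x)^2 \): the linear term \( (Q - P)\Psi^{\prime}_P \) captures the change in \( \Psi \) between two nearby measures, up to an error that is quadratic in the perturbation.

```python
# Illustrative check (my example): for Psi(P) = (P x)^2, the influence
# function is Psi'_P(x) = 2 mu (x - mu) with mu = P x.  The von Mises
# expansion says Psi(Q) - Psi(P) ~ (Q - P) Psi'_P, with second-order error.

support = [0.0, 1.0, 2.0]
P = [0.5, 0.3, 0.2]
Q = [0.48, 0.33, 0.19]                 # a nearby "empirical" measure

def mean(prob):
    return sum(p * x for p, x in zip(prob, support))

def Psi(prob):
    return mean(prob) ** 2

mu = mean(P)

def influence(x):
    return 2.0 * mu * (x - mu)

exact = Psi(Q) - Psi(P)                # true change in the functional
linear = sum((q - p) * influence(x) for q, p, x in zip(Q, P, support))

# For this particular functional the gap is exactly (mean(Q) - mean(P))^2:
# second order in the perturbation, as the expansion promises.
gap = exact - linear
print(exact, linear, gap)
```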

An estimator \( \hat{\Psi} \) is **asymptotically linear**, with influence function \( \hat{\Psi}^{\prime}_P \), when
\[
\hat{\Psi}(P_n) = \Psi(P) + P_n \hat{\Psi}^{\prime}_P + \mathrm{small}
\]
(where "small" means \( o_P(n^{-1/2}) \)), which implies, by the central limit theorem, that
\[
\sqrt{n}(\hat{\Psi}(P_n) - \Psi(P)) \rightsquigarrow \mathcal{N}(0, \mathrm{Var}_{P}[\hat{\Psi}^{\prime}_P])
\]

#### The targeted learning idea (take 2)

We know that \[ P \Psi^{\prime}_P = 0 \] so we hope that \[ P_n \Psi^{\prime}_P \approx 0 \] We could try to use that as an estimating equation, but solving for \( P \), over all possible probability measures \( \mathcal{P} \), is just too hard. We replace it with an easier problem. Start with a reasonable initial estimate of \( P \) based on the data, say \( M \). We now define a one-parameter exponential family of distributions "centered" on \( M \), with Radon-Nikodym derivatives \[ \frac{dM(\epsilon)}{dM}(x) = \frac{1}{z(\epsilon)}\exp{\left\{ \epsilon \Psi^{\prime}_M(x) \right\}} \] Now we solve \[ P_n \Psi^{\prime}_{M(\epsilon)} = 0 \] for \( \epsilon \), and use \( \Psi(M(\epsilon)) \) as our estimate of \( \Psi(P) \).
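
Here's a minimal sketch of the whole recipe for the simplest possible target, \( \Psi(P) = P x \) (the mean), with \( \Psi^{\prime}_P(x) = x - \Psi(P) \). Everything concrete below (the Gaussian initial guess \( M \), the grid, the simulated data) is my own choice for illustration. For this target the answer is known in advance, which makes it a good sanity check: the tilted estimate \( \Psi(M(\epsilon)) \) should land exactly on the sample mean.

```python
import math
import random

# Sketch of the targeted-learning update for Psi(P) = P x (the mean),
# whose influence function is Psi'_P(x) = x - Psi(P).  The initial
# estimate M is a discretized Gaussian N(0.5, 1) (an arbitrary, slightly
# wrong guess); the fluctuation is dM(eps)/dM proportional to
# exp(eps * Psi'_M(x)).

random.seed(0)
data = [random.gauss(1.0, 1.0) for _ in range(500)]      # true mean = 1

grid = [i / 100 for i in range(-500, 501)]               # support of M
base = [math.exp(-0.5 * (x - 0.5) ** 2) for x in grid]   # unnormalized M

mu_M = sum(w * x for w, x in zip(base, grid)) / sum(base)  # Psi(M)

def tilted_mean(eps):
    """Mean of M(eps), the exponential tilt of M along Psi'_M."""
    w = [b * math.exp(eps * (x - mu_M)) for b, x in zip(base, grid)]
    return sum(wi * x for wi, x in zip(w, grid)) / sum(w)

def estimating_eq(eps):
    """P_n Psi'_{M(eps)} = sample mean minus the mean of M(eps)."""
    mu = tilted_mean(eps)
    return sum(x - mu for x in data) / len(data)

# Solve P_n Psi'_{M(eps)} = 0 for eps by bisection; the equation is
# decreasing in eps, since larger eps shifts M(eps) to the right.
lo, hi = -5.0, 5.0
for _ in range(60):
    mid = (lo + hi) / 2
    if estimating_eq(mid) > 0:
        lo = mid
    else:
        hi = mid
eps_hat = (lo + hi) / 2

# Sanity check: for the mean, Psi(M(eps_hat)) coincides with the plain
# sample mean, so the machinery reproduces the obvious estimator.
sample_mean = sum(data) / len(data)
print(tilted_mean(eps_hat), sample_mean)
```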

Note: the solution isn't automatically \( \epsilon = 0 \), because in general \( P_n \Psi^{\prime}_M \neq 0 \); we didn't build our initial estimate of \( P \) with this goal in mind.

So: We start with a guess \( M \) at \( P \). We find the influence function \( \Psi^{\prime}_M(x) \). We then adjust \( \epsilon \) until the sample average of \( \Psi^{\prime}_{M(\epsilon)} \) is zero. This is an asymptotically linear estimator.

This is also very similar to a maximum-entropy estimation idea: find the distribution \( R \) which minimizes the relative entropy \( D(R \| M) \), subject to the constraint \( R \Psi^{\prime}_M = 0 \).
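
A quick Lagrangian sketch (my addition, taking the divergence in the direction \( D(R \| M) \), which is the direction that produces an exponential family) of why the constrained KL problem lands on exactly the tilted family used above:

```latex
% Writing r = dR/dM, the Lagrangian for
%   min_R  D(R \| M)   s.t.  R \Psi'_M = 0  and  R  a probability measure
% is
L(r) = \int r \log r \, dM
     - \epsilon \int r \, \Psi^{\prime}_M \, dM
     - \lambda \left( \int r \, dM - 1 \right)
% Setting the variation in r to zero gives
%   \log r + 1 - \epsilon \Psi^{\prime}_M - \lambda = 0,
% so
\frac{dR}{dM}(x) = \frac{1}{z(\epsilon)}
                   \exp\left\{ \epsilon \Psi^{\prime}_M(x) \right\}
% which is the one-parameter family M(\epsilon) defined earlier, with
% \epsilon tuned to satisfy the constraint R \Psi'_M = 0.
```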