Expressive Power vs. Learnability
Last update: 07 Dec 2024 23:58
First version: 31 August 2024
(Writing a thought down to get it out of my head and come back to later)
Not too long ago as I write (31 August 2024), I ran across a paper or blog
post (or something) making the point, IIRC in passing, that it's one
thing for a model class to be able to express a wide range of
functions, and quite another for that model class (+ some training procedure)
to be able to learn those functions. Thus since the 1990s we've
known, and often been very taken with, the fact that neural networks with 3+
layers are universal function approximators, i.e., any reasonable function \(
f: \mathbb{R}^{in} \mapsto \mathbb{R}^{out} \) can be approximated arbitrarily
closely by some set of weights, if the hidden/middle layer is wide
enough. But this doesn't mean that if we're given a big collection of
input/output pairs from \( f \) that a randomly initialized network will
converge on it through gradient descent (a.k.a. "back-propagation").
In fact I want to pick up my camel-hair brush and draw some finer distinctions.
- An exact realization of the function exists in the space.
- The function can be approximated arbitrarily closely (in some suitable metric) within the space. ("Universal approximation" results live here.)
Up to this point, these are just mathematical questions about function spaces. Now introduce some probabilistic considerations:
- In a distribution where the function is exactly correct, the function is the (unique) minimizer of the expected loss (risk) over the function space. (This is "Fisher consistency", and requires a choice of loss function.)
- There is no minimum of the risk within the function space, but any sequence of functions within the space whose risk approaches the infimum will approach the true function arbitrarily closely. (This combines universal approximation with Fisher consistency; it requires both a metric over functions and loss function.)
Now move from probability to statistical theory:
- With finite data, the sequence of minimizers of the average loss (= empirical risk) converges in probability on the true function. (This is "consistency".) --- Optionally: regularized empirical risk minimization converges, etc., etc.
Up to this point, computational considerations have not really entered in to the picture. (To repeat a joke from my classes, "argmin" is a theoretical statistician's way of saying "that's not my problem".) But now:
- With finite data, some algorithm (usually a form of optimization) will deliver a sequence of estimates which converge in probability on the true function.
- With finite data, some tractable algorithm will deliver a sequence of estimates which converge in probability on the true function.
- With finite data, this nigh-universally-used optimization (gradient descent, stochastic gradient descent, ....) algorithm will deliver a sequence of estimates which converge in probability on the true function.
These last items suggest, at least to me, some sort of dynamical system on the function space...
I have no profound points to make here or even references to list. I may or may not come back to this.