Attention conservation notice: Puffery about a paper in statistical learning theory. Again, if you care, why not let the referees sort it out, and check back later?

Now this one I *really* do not have time to expound on (see:
talk in Chicago on Thursday), but since it is related to
the subject of that talk, I tell myself that doing so will help me with my
patter.

- Daniel J. McDonald, CRS, and Mark Schervish, "Estimated VC dimension for risk bounds", arxiv:1111.3404
*Abstract:* Vapnik-Chervonenkis (VC) dimension is a fundamental measure of the generalization capacity of learning algorithms. However, apart from a few special cases, it is hard or impossible to calculate analytically. Vapnik et al. [10] proposed a technique for estimating the VC dimension empirically. While their approach behaves well in simulations, it could not be used to bound the generalization risk of classifiers, because there were no bounds for the estimation error of the VC dimension itself. We rectify this omission, providing high probability concentration results for the proposed estimator and deriving corresponding generalization bounds.

VC dimension is one of the many ways of measuring how flexible a class of
models is, or its capacity to match data. Specifically, it is the largest
number of data-points which the class can always (seem to) match perfectly, no
matter how the observations turn out, by judiciously picking one model or
another from the class. It is called a "dimension" because, through some
clever combinatorics, this turns out to control the rate at which the number
of *distinguishable* models grows with the number of observations, just
as Euclidean dimension governs the rate at which the measure of a geometrical
body grows as its linear size expands. Knowing the number of effectively-distinct
models, in turn, tells us about over-fitting. The true risk of a fitted model
will generally be higher than its in-sample risk, precisely because it was
fitted to the data and so tuned to take advantage of noise. High-capacity
model classes can do more such tuning. One can actually bound the true risk in
terms of the in-sample risk and the effective number of models, and so in terms
of the VC dimension.
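The shattering idea is concrete enough to check by brute force, at least for toy classes. A minimal sketch, using the simplest example I know, one-dimensional threshold classifiers of the form 1[x >= t]: a point set is shattered when every one of the 2^n labelings is achievable, and thresholds turn out to shatter one point but never two.

```python
def threshold_labelings(points):
    """All labelings of the points achievable by classifiers 1[x >= t]."""
    xs = sorted(points)
    # one candidate threshold per "gap": below all points, between
    # consecutive points, and above all points
    cands = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(int(x >= t) for x in points) for t in cands}

def shatters(points):
    """True if thresholds realize every one of the 2^n labelings."""
    return len(threshold_labelings(points)) == 2 ** len(points)

print(shatters([0.0]))       # True: a single point can get either label
print(shatters([0.0, 1.0]))  # False: (1, 0) is unreachable, so VC dim = 1
```

The same brute-force check works for any finite point set and any class you can enumerate over, which is fine for intuition and hopeless for the "awkward and fractious" classes below.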

How then does one find the VC dimension? Well, the traditional route was through yet more clever combinatorics. As someone who has never quite gotten the point of the birthday problem, I find this unappealing, especially when the models are awkward and fractious, as the interesting ones generally are.

An alternative, due to Vapnik, Levin and LeCun, is to replace math with experiment. Roughly, the idea is this: make up simulated data, fit the model class to it, see how variable the fit is from run to run, and then plug this average discrepancy into a formula relating it to the VC dimension and the simulated sample size. Simulating at a couple of sample sizes and doing some nonlinear least squares then yields an estimate of the VC dimension, which is consistent in the right limits. (If you really want details, see the papers.)
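The fitting step can be sketched schematically. In the sketch below I *fake* the measured discrepancies from a VC-style deviation curve with noise, rather than actually refitting a model class to simulated data, and the functional form `phi` is a stand-in in the spirit of the VC deviation bound, not the exact formula from the papers; the point is only the shape of the procedure: measure discrepancies at several sample sizes, then recover the dimension by nonlinear least squares.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def phi(n, h):
    # schematic VC-style deviation curve in sample size n and dimension h;
    # a stand-in for the exact formula of Vapnik, Levin and LeCun
    return np.sqrt(h * (np.log(2 * n / h) + 1) / n)

true_h = 10.0
ns = np.array([50, 100, 200, 400, 800, 1600], dtype=float)
# pretend these came from the simulation step: average maximal
# discrepancy between fits, one measurement per sample size, plus noise
xi = phi(ns, true_h) + rng.normal(0, 0.01, size=ns.shape)

(h_hat,), _ = curve_fit(phi, ns, xi, p0=[5.0])
print(h_hat)  # recovers something near true_h
```

In the real procedure the `xi` values are the expensive part, since each one means repeatedly fitting the model class to randomly generated data.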

The problem with the experimental approach is that it doesn't tell you how
to *use* the estimated VC dimension in a risk bound, which is after all
what you want it for. The estimate is not perfectly precise, and
how is one to account for that imprecision in the bound?

This turns out to be an eminently solvable problem. One can use the
estimated VC dimension, plus a comparatively-small-and-shrinking safety margin,
and plug it into the usual risk bound, with just a small hit to the confidence
level. Showing this qualitatively relies on the results in van de Geer's
Empirical
Processes in M-Estimation, which, pleasingly, was one of the first
books I read on statistical learning theory lo these many years ago. Less
pleasingly, getting everything needed for a bound we can *calculate*
(and weakening some assumptions) meant re-proving many of those results,
excavating awkward constants previously buried in big *C*'s and *K*'s
and little *o_P*'s.
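To make the recipe concrete: below is the *textbook* Vapnik-style bound, not the paper's actual bound or constants, with the estimated dimension plus a safety margin plugged in where the true dimension would go. The numbers for `h_hat` and `margin` are made up for illustration.

```python
import math

def vc_risk_bound(emp_risk, n, h, delta):
    """Classic Vapnik-style bound: with probability at least 1 - delta,
    true risk <= emp_risk + sqrt((h*(ln(2n/h) + 1) + ln(4/delta)) / n)."""
    slack = math.sqrt((h * (math.log(2 * n / h) + 1) + math.log(4 / delta)) / n)
    return emp_risk + slack

n, delta = 10_000, 0.05
h_hat, margin = 12.3, 1.5   # estimated VC dim and safety margin (illustrative)
# the recipe, schematically: use h_hat + margin in place of the true
# dimension, paying a small extra hit to the confidence level
print(vc_risk_bound(0.08, n, h_hat + margin, delta))
```

The margin shrinks as the simulations used to estimate the VC dimension grow, so asymptotically one pays almost nothing for not knowing the dimension exactly.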

In the end, however, all is as one would hope: estimated VC dimension concentrates (at a calculable rate) around the true VC dimension, and the ultimate risk bound is very close to the standard one. As someone who likes learning theory and experimental mathematics, I find this pleasing. It is also a step in Cunning Plan, which I will not belabor, as it will be obvious to readers who go all the way through the paper.

**Update**, 22 November: The title is a catch-phrase of my
mother's; I believe she got it from one of her biochemistry teachers.

Posted at November 15, 2011 21:25 | permanent link