Attention conservation notice: Puffery about a paper in statistical learning theory. Again, if you care, why not let the referees sort it out, and check back later?
Now this one I really do not have time to expound on (see: talk in Chicago on Thursday), but since it is related to the subject of that talk, I tell myself that doing so will help me with my patter.
VC dimension is one of the many ways of measuring how flexible a class of models is, or its capacity to match data. Specifically, it is the largest number of data-points which the class can always (seem to) match perfectly, no matter how the observations turn out, by judiciously picking one model or another from the class. It is called a "dimension" because, through some clever combinatorics, this turns out to control the rate at which the number of distinguishable models grows with the number of observations, just as Euclidean dimension governs the rate at which the measure of a geometrical body grows as its length expands. Knowing the number of effectively-distinct models, in turn, tells us about over-fitting. The true risk of a fitted model will generally be higher than its in-sample risk, precisely because it was fitted to the data and so tuned to take advantage of noise. High-capacity model classes can do more such tuning. One can actually bound the true risk in terms of the in-sample risk and the effective number of models, and so in terms of the VC dimension.
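To give the flavor, here is one textbook form of such a bound (a common statement of Vapnik's inequality for 0-1 loss; the constants vary from source to source, and this is not necessarily the exact version used in the paper): with probability at least \(1-\delta\), uniformly over all models \(h\) in a class of VC dimension \(d\),
\[
R(h) \;\le\; \hat{R}_n(h) \;+\; \sqrt{\frac{d\left(\ln\tfrac{2n}{d}+1\right)+\ln\tfrac{4}{\delta}}{n}},
\]
where \(\hat{R}_n(h)\) is the in-sample risk on \(n\) observations and \(R(h)\) the true risk. The point to notice is that the only property of the model class which enters the bound is \(d\).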
How then does one find the VC dimension? Well, the traditional route was through yet more clever combinatorics. As someone who has never quite gotten the point of the birthday problem, I find this unappealing, especially when the models are awkward and fractious, as the interesting ones generally are.
An alternative, due to Vapnik, Levin and LeCun, is to replace math with experiment. Roughly, the idea is this: make up simulated data, fit the model class to it, see how variable the fit is from run to run, and then plug this average discrepancy into a formula relating it to the VC dimension and the simulated sample size. Simulating at a couple of sample sizes and doing some nonlinear least squares then yields an estimate of the VC dimension, which is consistent in the right limits. (If you really want details, see the papers.)
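Here is a toy version of the recipe in code, just to make the logic concrete. This is emphatically not the actual Vapnik-Levin-LeCun procedure: their discrepancy statistic and fitting formula are more elaborate, and everything below (plain logistic regression as the black-box learner, the gap between in-sample accuracy and chance on random labels as the discrepancy, and a VC-bound-shaped curve for the nonlinear-least-squares step) is my own simplification for illustration.

```python
# Toy sketch of "estimate capacity by experiment": fit a model class to
# random-label data at several sample sizes, record how much it can
# (apparently) explain pure noise, then fit a capacity parameter to the
# resulting curve.  NOT the exact Vapnik-Levin-LeCun statistic or formula.

import numpy as np
from scipy.optimize import curve_fit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def random_label_gap(n, p, n_reps=30):
    """Average excess of in-sample accuracy over chance (0.5) when a
    linear classifier is fit to n points with purely random labels."""
    gaps = []
    for _ in range(n_reps):
        X = rng.normal(size=(n, p))
        y = rng.integers(0, 2, size=n)
        clf = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
        gaps.append(clf.score(X, y) - 0.5)
    return float(np.mean(gaps))

def phi(n, d, c):
    """Crude stand-in curve: the gap should scale roughly like the
    VC deviation term sqrt(d*(log(2n/d)+1)/n), capped at 1/2."""
    return np.minimum(0.5, c * np.sqrt(d * (np.log(2 * n / d) + 1) / n))

sizes = np.array([20, 40, 80, 160, 320, 640])
p = 5  # linear classifiers (with intercept) in 5 dimensions: VC dim = 6
gaps = np.array([random_label_gap(n, p) for n in sizes])

# Nonlinear least squares over (d, c); d_hat is the estimated capacity.
(d_hat, c_hat), _ = curve_fit(phi, sizes, gaps, p0=[2.0, 1.0],
                              bounds=([1.0, 0.01], [50.0, 10.0]))
print(f"estimated capacity: {d_hat:.1f} "
      f"(linear classifiers in {p} dims have VC dim {p + 1})")
```

The structure is the thing to take away: the only inputs are the ability to simulate data and to run the fitting procedure as a black box, which is why the method works even when the combinatorics would be hopeless.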
The problem with the experimental approach is that it doesn't tell you how to use the estimated VC dimension in a risk bound, which is after all what you want it for. The estimate is not perfectly precise, and how is one to account for that imprecision in the bound?
This turns out to be an eminently solvable problem. One can use the estimated VC dimension, plus a comparatively-small-and-shrinking safety margin, and plug it into the usual risk bound, with just a small hit to the confidence level. Showing this qualitatively relies on the results in van de Geer's Empirical Processes in M-Estimation, which, pleasingly, was one of the first books I read on statistical learning theory lo these many years ago. Less pleasingly, getting everything needed for a bound we can calculate (and weakening some assumptions) meant re-proving many of those results, excavating awkward constants previously buried in big C's and K's and little o_P's.
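Schematically (my notation and gloss, not necessarily the paper's): suppose the simulation-based estimate \(\hat{d}_n\) is computed from simulated data independent of the training sample, and satisfies \(\Pr\left(d \le \hat{d}_n + \epsilon_n\right) \ge 1-\eta_n\) for some shrinking margin \(\epsilon_n\). Then a union bound gives, with probability at least \(1-\delta-\eta_n\),
\[
R(h) \;\le\; \hat{R}_n(h) \;+\; \sqrt{\frac{(\hat{d}_n+\epsilon_n)\left(\ln\tfrac{2n}{\hat{d}_n+\epsilon_n}+1\right)+\ln\tfrac{4}{\delta}}{n}},
\]
since the deviation term is increasing in the dimension (so long as the dimension stays below \(2n\)). The real work lies in establishing the concentration of \(\hat{d}_n\) with explicit \(\epsilon_n\) and \(\eta_n\), which is where the re-proving comes in.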
In the end, however, all is as one would hope: estimated VC dimension concentrates (at a calculable rate) around the true VC dimension, and the ultimate risk bound is very close to the standard one. As someone who likes learning theory and experimental mathematics, I find this pleasing. It is also a step in the Cunning Plan, which I will not belabor, as it will be obvious to readers who go all the way through the paper.
Update, 22 November: The title is a catch-phrase of my mother's; I believe she got it from one of her biochemistry teachers.
Posted at November 15, 2011 21:25 | permanent link