VC Dimension and Evolving Cellular Automata

27 Feb 2017 16:30

In the Evolving Cellular Automata Project (and related work elsewhere, but it never hurts to plug places you used to work), one-dimensional CA rules of fixed radius were evaluated on their ability to correctly classify the density of their initial strings --- i.e., how often they went to all zeroes when the initial condition was mostly zeroes, and to all ones when it was mostly ones. (The lattice size was always odd, so the majority was always well-defined.) Every initial condition was statistically independent of every other, and all were identically distributed. A genetic algorithm then selected the higher-fitness rules, and introduced variations in the hope of finding yet fitter ones.
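A minimal sketch of the density-classification task (not the EvCA code itself; radius 1 and the local-majority rule are chosen just to keep the sketch short, and that rule is known to be a poor density classifier):

```python
import random

def step(config, rule_table, r=1):
    """One synchronous update of a 1-D binary CA with periodic boundaries."""
    n = len(config)
    new = []
    for i in range(n):
        # Read the (2r+1)-cell neighborhood as a binary number, index the table.
        idx = 0
        for j in range(-r, r + 1):
            idx = (idx << 1) | config[(i + j) % n]
        new.append(rule_table[idx])
    return new

def classifies_density(rule_table, config, r=1, max_steps=200):
    """Run the CA; return True iff it settles to the correct uniform state."""
    target = 1 if sum(config) > len(config) // 2 else 0
    for _ in range(max_steps):
        config = step(config, rule_table, r)
        if all(c == config[0] for c in config):
            return config[0] == target
    return False

# Local-majority rule for radius 1: output the majority bit of the
# 3-cell neighborhood.
majority = [1 if bin(i).count("1") >= 2 else 0 for i in range(8)]

random.seed(0)
n_cells, n_trials = 25, 200   # odd lattice, IID random initial conditions
hits = sum(classifies_density(majority,
                              [random.randint(0, 1) for _ in range(n_cells)])
           for _ in range(n_trials))
print(hits / n_trials)        # empirical classification accuracy
```

The fraction printed at the end is exactly the in-sample risk (well, one minus it) that the GA's fitness function estimates.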

Vapnik-Chervonenkis learning theory gives results which say how many IID samples are needed to guarantee, at a certain level of confidence, that the model which minimizes the in-sample risk comes within a certain distance of the actual minimum risk. The results hold good regardless of the actual sample-generating distribution, provided it's IID. In the case where risk is the error rate in binary classification, the bounds are particularly simple, and simplify yet further if one has only a finite hypothesis space, as is the case for CA rules of fixed radius. (It might be interesting to try to calculate the VC dimension of the whole set of binary 1D CA rules, regarded as classifiers, but I'm not feeling ambitious.)
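For the finite-class case, Hoeffding's inequality plus a union bound gives the usual guarantee: with probability at least 1 - δ, every rule's true error rate is within sqrt((ln|H| + ln(2/δ)) / 2n) of its error rate on n IID initial conditions. A quick sketch of how that bound shrinks for radius-3 rules, where |H| = 2^(2^7) = 2^128:

```python
import math

def finite_class_bound(n, log_h, delta=0.05):
    """Two-sided Hoeffding + union bound for a finite hypothesis class:
    with prob. >= 1 - delta, sup over the class of
    |true error - empirical error| is at most this."""
    return math.sqrt((log_h + math.log(2 / delta)) / (2 * n))

# Radius-r binary 1-D CA rules: 2^(2r+1) neighborhoods, hence 2^(2^(2r+1))
# rules, so ln|H| = 2^(2r+1) * ln 2.  Radius 3, as in the EvCA work:
r = 3
log_h = 2 ** (2 * r + 1) * math.log(2)   # = 128 ln 2, about 88.7

for n in (10**2, 10**3, 10**4, 10**5):
    print(n, round(finite_class_bound(n, log_h), 4))
```

So roughly 10^4 initial conditions suffice to pin down every radius-3 rule's error rate to within about 7 percentage points, at 95% confidence, whatever the IC distribution.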

Query: How does the rate of improvement of the fittest member of the population compare to the rate of improvement of the VC bounds, as a function of the number of initial conditions evaluated? I could see the GA being faster, because the VC bounds are distribution-free, and the actual distribution of examples might be easier to learn than the hardest distribution. Or the GA could be slower, because evolution really is a very messy and redundant process. Or the GA could start out slower but overtake the VC bounds, as the population becomes, in some sense, adapted to the distribution of training examples. And if the GA starts out faster but falls behind, well, that'd just be bizarre and really need explanation. So there's an interesting result here no matter what!

If I remember the EvCA code correctly, each member of the population got a different, randomly generated set of initial conditions, which might mess with the VC results, but it'd only be a small tweak to make all the members of the population in a given generation use the same set of initial conditions.
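The tweak might look something like the following; `dummy_score` is a hypothetical stand-in for actually running the CA and checking the classification, included only so the sketch is self-contained:

```python
import random

rng = random.Random(0)
N_CELLS, N_ICS = 25, 100

def random_ic():
    return [rng.randint(0, 1) for _ in range(N_CELLS)]

def evaluate(score, population, ics):
    """Score every rule on the SAME IID sample of initial conditions,
    so the finite-class uniform-convergence bounds apply directly."""
    return [sum(score(rule, ic) for ic in ics) / len(ics)
            for rule in population]

# Hypothetical scoring function: a rule is just its lookup table, and we
# score it by an arbitrary dummy criterion rather than the full CA run.
def dummy_score(rule_table, ic):
    return int(rule_table[sum(ic) % len(rule_table)] == (sum(ic) > len(ic) // 2))

population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(5)]
shared_ics = [random_ic() for _ in range(N_ICS)]  # one sample per generation
print(evaluate(dummy_score, population, shared_ics))
```

The point is simply that `shared_ics` is drawn once per generation and passed to every member, instead of drawing a fresh sample inside the fitness function.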

(After conversation with Bill Rand, 2004 or 2005?)

Update, 24 March 2010: Xiaojin Zhu et al. ("Human Rademacher Complexity", NIPS 2009; [my commentary]) did something analogous for actual human subjects, using Rademacher complexity rather than shattering coefficients. Roughly speaking, the Rademacher complexity of a set of functions is their expected maximal correlation to pure noise sequences of a certain type. It's actually related to VC dimension, but easier to estimate, and leads to similar bounds on over-fitting. Zhu et al. found that their human subjects respected the bounds (of course!), but actually did rather better than that, though there was some correlation between the actual over-fitting and the Rademacher complexity.
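Concretely, the empirical Rademacher complexity of a finite class can be estimated by Monte Carlo: draw random ±1 labels for the sample, record the best correlation any function in the class achieves with them, and average over draws. A toy sketch with threshold classifiers (not Zhu et al.'s stimuli):

```python
import random

def empirical_rademacher(preds, n_draws=2000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite function class, given each function's +/-1 predictions on a
    fixed sample:  E_sigma[ max_f (1/n) sum_i sigma_i f(x_i) ]."""
    rng = rng or random.Random(0)
    n = len(preds[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * p for s, p in zip(sigma, f)) / n for f in preds)
    return total / n_draws

# Toy class: threshold classifiers f_t(x) = sign(x - t) on 40 sample points.
xs = [i / 40 for i in range(40)]
thresholds = [i / 8 for i in range(9)]
preds = [[1 if x >= t else -1 for x in xs] for t in thresholds]

print(round(empirical_rademacher(preds), 3))
```

A richer class (more thresholds, or more wiggle room generally) gets a higher number, i.e., it can chase noise better, which is exactly what the over-fitting bounds penalize.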