### The True Price of Models Pulling Themselves Up by Their Bootstraps

For a project I just finished, I produced this figure:
I don't want to give away too much about the project (**update**, 19 April: it's now public), but the black curve is a smoothing spline which is trying to predict the random variable *R*_{t+1} from *R*_{t}; the thin blue lines are 800 additional splines, fit to 800 bootstrap resamplings of the original data; and the thicker blue lines are the resulting 95% confidence bands for the regression curve [1]. (The tick marks on the horizontal axis show the actual data values.) Making this took about ten minutes on my laptop, using the boot and mgcv packages in R.
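The recipe (fit a smoothing spline, resample the data pairs with replacement, refit, take pointwise quantiles) is simple enough to sketch. The post used R's boot and mgcv; what follows is a minimal Python analogue using scipy's smoothing splines, on made-up toy data. The data-generating process, smoothing parameter, and grid are all illustrative assumptions, not the project's actual setup.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)

# Toy stand-in for the real data: pairs (R_t, R_{t+1}).
n = 300
r_t = np.sort(rng.uniform(-1, 1, n))
r_next = 0.5 * r_t**2 + rng.normal(scale=0.1, size=n)

grid = np.linspace(-1, 1, 101)

def fit_spline(x, y):
    # Smoothing spline of R_{t+1} on R_t, evaluated on a fixed grid.
    # A tiny jitter keeps x strictly increasing after a bootstrap
    # resample introduces duplicate values.
    x = x + rng.normal(scale=1e-9, size=x.shape)
    order = np.argsort(x)
    return UnivariateSpline(x[order], y[order], s=n * 0.01)(grid)

main_fit = fit_spline(r_t, r_next)

# Bootstrap: resample (R_t, R_{t+1}) pairs with replacement, refit.
B = 800
boot_fits = np.empty((B, grid.size))
for b in range(B):
    idx = rng.integers(0, n, n)
    boot_fits[b] = fit_spline(r_t[idx], r_next[idx])

# Pointwise 95% confidence band from the bootstrap quantiles.
lo, hi = np.quantile(boot_fits, [0.025, 0.975], axis=0)
```

Each row of `boot_fits` corresponds to one of the thin blue curves in the figure; `lo` and `hi` are the thicker band edges. (This is the simple percentile band; fancier bootstrap intervals exist, but the picture is the same.)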

The project gave me an excuse to finally read Efron's original paper on the bootstrap, where my eye was caught by "Remark A" on p. 19 (my linkage):

> Method 2, the straightforward calculation of the bootstrap distribution by repeated Monte Carlo sampling, is remarkably easy to implement on the computer. Given the original algorithm for computing *R*, only minor modifications are necessary to produce bootstrap replications *R*^{\*1}, *R*^{\*2}, ..., *R*^{\*N}. The amount of computer time required is just about *N* times that for the original computations. For the discriminant analysis problem reported in Table 2, each trial of *N* = 100 replications, [sample size] *m* = *n* = 20, took about 0.15 seconds and cost about 40 cents on Stanford's 370/168 computer. For a single real data set with *m* = *n* = 20, we might have taken *N* = 1000, at a cost of \$4.00.

My bootstrapping used *N* = 800, *n* = 2527. Ignoring the differences between fitting Efron's linear classifier and my smoothing spline, and scaling his 40 cents per hundred replications linearly in both *N* and *n*, creating my figure would have cost \$404.32 in 1977, or \$1436.90 in today's dollars (using the consumer price index). But I just paid about \$2400 for my laptop, which will have a useful life of (conservatively) three years, a ten-minute pro rata share of which comes to 1.5 cents.
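Spelled out, the arithmetic is just linear scaling plus a price deflator. The CPI ratio below is the one implied by the two dollar figures in the text, not an independently looked-up index value:

```python
# Efron's reported cost: $0.40 per N = 100 replications at n = 20,
# assumed to scale linearly in both N and n.
cost_1977 = 0.40 * (800 / 100) * (2527 / 20)  # 404.32 dollars

# Deflate 1977 dollars to today's, using the CPI ratio implied
# by the figures quoted in the text (about 3.55).
cpi_ratio = 1436.90 / 404.32
cost_today = cost_1977 * cpi_ratio            # 1436.90 dollars

# Pro-rata laptop cost: $2400 over three years, for ten minutes.
minutes_in_3_years = 3 * 365 * 24 * 60
cost_now = 2400 / minutes_in_3_years * 10     # about 1.5 cents

ratio = cost_today / cost_now                 # on the order of 100,000
```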

The inexorable
economic logic of the price mechanism forces me to conclude that
bootstrapping is about 100,000 times less valuable for me now than it was for
Efron in 1977.

*Update*: Thanks to D.R. for catching a typo.

[1]: Yes, yes, unless the real regression function is
a smooth piecewise cubic there's some approximation bias from using splines, so
this is really a confidence band for the optimal spline approximation to the
true regression curve. I hope you are as scrupulous when people talk about
confidence bands for "the" slope of their linear regression models. (Added 7
March to placate quibblers.)

Enigmas of Chance

Posted at March 04, 2010 13:35 | permanent link