*Incorporated by
reference:* Workshop
on Foundations for Ockham's Razor; Ockham Workshop, Day
1; "Tlön,
Uqbar, Orbis Tertius"

How could one do other than submit to Tlön, to the minute and vast evidence of an orderly plan?

The theme of the morning was that Ockham's razor is *not* a useful
principle because the truth is inherently simple, or because the truth is apt
to be simple, or because simplicity is more plausible, or even because simple
models predict better. Rather, the Razor helps us get to the
truth *faster* than if we multiplied complexities without necessity,
even when the truth is actually rather complex.

There are no nouns in Tlön's conjectural *Ursprache*, from which the "present" languages and the dialects are derived: there are impersonal verbs, modified by monosyllabic suffixes (or prefixes) with an adverbial value. For example: there is no word corresponding to the word "moon," but there is a verb which in English would be "to moon" or "to moonate." "The moon rose above the river" is *hlör u fang axaxaxas mlö*, or literally: "upward behind the onstreaming it mooned."

Oliver Schulte talked about devising learning algorithms which minimize the number of "mind changes", times when the algorithm, working incrementally through data, changes the hypothesis it outputs. (The algorithm is allowed to say "situation cloudy, try again later", but forced to converge eventually.) To do this, say that hypothesis A is simpler than hypothesis B when there are observations which will falsify A but not B, and none which falsify B but not A. For instance, "all swans are edible" is simpler than "some swans are poisonous to all mammals". (Exercise: is "some swans are poisonous to all mammals" more or less simple than "some swans are poisonous to all hedgehogs"?) Now an Ockham learner is one which outputs the simplest hypothesis compatible with the data so far, or "situation cloudy" if there are multiple viable simplest hypotheses. It's now a theorem that an Ockham learner incurs no more worst-case mind changes than any other consistent strategy. Schulte illustrated this with an example of trying to learn conserved quantities from positive examples of observed reactions among elementary particles, along these lines.
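To make the definitions concrete, here is a toy Ockham learner in a few lines of Python. This is my own illustration, not Schulte's code, and the hypothesis space (swan colors) is invented; but the "simpler than" relation is the falsifiability ordering just described, and the learner answers "cloudy" when several minimal hypotheses remain viable.

```python
# Observation alphabet and hypotheses: each hypothesis is the set of
# observations it permits; it is falsified by anything outside that set.
OBS = {"white", "black"}
HYPOTHESES = {
    "all swans are white": {"white"},
    "all swans are black": {"black"},
    "swans come in both colors": {"white", "black"},
}

def falsifiers(permitted):
    return OBS - permitted

def simpler(a, b):
    """a is simpler than b iff some observation falsifies a but not b,
    and none falsifies b but not a."""
    fa = falsifiers(HYPOTHESES[a])
    fb = falsifiers(HYPOTHESES[b])
    return bool(fa - fb) and not (fb - fa)

def ockham_learner(data):
    """Output the unique simplest hypothesis consistent with the data so
    far, or 'cloudy' when several minimal ones remain."""
    consistent = [h for h, p in HYPOTHESES.items() if set(data) <= p]
    minimal = [h for h in consistent
               if not any(simpler(g, h) for g in consistent if g != h)]
    return minimal[0] if len(minimal) == 1 else "cloudy"

stream = ["white", "white", "black"]
outputs = [ockham_learner(stream[:i]) for i in range(len(stream) + 1)]
# outputs == ["cloudy", "all swans are white", "all swans are white",
#             "swans come in both colors"]: one mind change once the
# black swan shows up.
```

Note that "all swans are white" and "all swans are black" are incomparable under this order (each has a falsifier the other lacks), which is exactly why the learner starts out "cloudy".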

A lively discussion followed as to whether the hypothesis that "all emeralds are green" is simpler, in this sense, than "all emeralds are grue" (consensus: yes, if the data are described in the green/blue language; but no, in the grue/bleen language), and whether minimum description length approaches would fare any better (consensus: yes, but no).

The preceding applies to the languages of the southern hemisphere. In those of the northern hemisphere ... the prime unit is not the verb, but the monosyllabic adjective. The noun is formed by an accumulation of adjectives. They do not say "moon," but rather "round airy-light on dark" or "pale-orange-of-the-sky" or any other such combination. In the example selected the mass of adjectives refers to a real object, but this is purely fortuitous.

Kevin Kelly is (so far as I know) the originator of this efficient-convergence approach to the Razor, including the notion of simplicity which Schulte employed. He talked about new work on refining and extending this simplicity measure, or rather setting out a system of axioms for the "simpler than" relation. This then leads to proofs that Ockham strategies are worst-case optimal in a wide range of situations, under an array of new loss functions (beyond mind-changes). I will be honest and say that I did not follow his new axiomatic description of simplicity well enough to be able to reproduce it on a blog. It's claimed that in the new axiom system, the ordering of graphical models by simplicity exactly matches the ordering among Markov equivalence classes by inclusion of conditional independence relations, which is very interesting, but it remains unclear to me how to make the whole scheme work in a really statistical set-up (as opposed to, say, interval-valued data).

...ideal objects, which are convoked and dissolved in a moment, according to poetic needs. At times they are determined by mere simultaneity. There are objects composed of two terms, one of visual and another of auditory character: the color of the rising sun and the faraway cry of a bird. There are objects of many terms: the sun and the water on a swimmer's chest, the vague tremulous rose color we see with our eyes closed, the sensation of being carried along by a river and also by sleep. These second-degree objects can be combined with others; through the use of certain abbreviations, the process is practically infinite.

The first talk after lunch was a confused mass of equivocations as to the status of latent states (inferred? constructed?); it only tenuously invoked Occam's Razor, and was not especially well connected either to philosophy or to data analysis, but rather to one of the most conspicuously useless areas of science.

the universe is comparable to those cryptographs in which not all the symbols are valid and that only what happens every three hundred nights is true

Hannes Leeb
talked about aspects of statistical inference for regression models
where *which* model we use is picked by model selection, instead of
being given to us; he stuck to linear regression with Gaussian noise for
simplicity. The
first part of the talk considered the situation where the number of
predictor variables is constant but the sample size \( n \) goes to infinity.
Here the interesting situation is when the coefficients \( \beta \) are small,
confined within a ball of radius \( O(1/\sqrt{n}) \) around the origin, since
this is precisely when model selection might be most useful. (If the
coefficients are much bigger than that, very crude statistics tells you what
they are.) What he proved is that there are always some values of \( \beta \)
inside this ball which will frustrate any attempt to even approximate the
sampling distribution of post-selection estimates, for basically any
model-selection procedure. Thus even trying to get confidence sets for \(
\widehat{\beta} \) is hopeless.
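To get the flavor of why post-selection inference is hard in this regime, here is a small simulation of my own devising (a toy, not Leeb's construction): one predictor, true coefficient \( \beta = 1/\sqrt{n} \), and a naive select-then-estimate rule which keeps the predictor only when its t-statistic clears 1.96. The sampling distribution of the reported coefficient is a point mass at zero plus two truncated tails, which no Gaussian approximation will capture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 5000
beta = 1.0 / np.sqrt(n)          # a "small" coefficient, O(1/sqrt(n))

estimates = np.empty(reps)
for r in range(reps):
    x = rng.standard_normal(n)
    y = beta * x + rng.standard_normal(n)
    bhat = (x @ y) / (x @ x)                      # OLS slope, no intercept
    resid = y - bhat * x
    se = np.sqrt((resid @ resid) / (n - 1) / (x @ x))
    # Select-then-estimate: keep the predictor only if it looks significant
    estimates[r] = bhat if abs(bhat / se) > 1.96 else 0.0

print("fraction of replications reporting 0:", (estimates == 0.0).mean())
print("smallest surviving |estimate|:", np.abs(estimates[estimates != 0]).min())
```

Most replications report exactly zero, and every estimate that survives selection is biased away from the true value of 0.1 — the mixture is nothing like the usual normal sampling distribution.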

On the other hand, he was able
to give strong guarantees on
the *predictive performance* (not parameter estimates) of reasonable
model-selection estimators in the case where the number of relevant predictors
is actually infinite, but \( n \rightarrow \infty \). The trick, as I understand
it (there was again quite a bit of discussion), is that the bad behavior in the
fixed-dimension case doesn't happen for every value of \( \beta \), but rather
a set which depends on the design matrix, and becomes in some sense thinner and
thinner as the dimension grows. In the infinite-dimensional case, if we just
want to make predictions and don't really care about parameter estimates, it
turns out to be possible to say an awful lot about the distribution of
prediction errors.

The high-dimensional results rested on assuming that all predictor variables, and the response, are drawn from a single multivariate Gaussian distribution. Gaussianity was needed for two things: tight tail bounds, and getting to believe that conditional expectations are linear and conditional variances are constant. Trickier empirical process theory can substitute for Gaussian distributions as far as tail bounds go, but what about linearity and homoskedasticity?

Think of a \( d \)-dimensional random vector \( Z \) with mean zero, identity covariance, bounded densities, and independent entries, and take \( d \rightarrow \infty \). Take projections \( \alpha \cdot Z \) and \( \beta \cdot Z \). Look at the conditional mean and variance of \( Y \equiv \alpha \cdot Z \) given \( X \equiv \beta \cdot Z \). Hold \( \alpha \) fixed and ask whether \( \beta \) is such that (i) \( \mathbb{E}[Y|X] \) is linear in \( X \), and (ii) \( \mathbb{V}[Y|X] \) is constant in \( X \), so that we can write \( Y = bX + U \), with \( U \) uncorrelated and homoskedastic noise, even though the true data-generating process is \( Y = \alpha \cdot Z \). It turns out that for each \( \alpha \), if we pick \( \beta \) uniformly over the unit sphere, the probability that (i) and (ii) hold, at least to within \( \pm \epsilon \), tends to 1 as \( d \rightarrow \infty \), no matter how small we make \( \epsilon \). This extends to multiple projections on the "right-hand side", i.e., \( X_1 = \beta_1 \cdot Z, X_2 = \beta_2 \cdot Z, \ldots \). This suggests that the assumptions needed for the high-dimensional model-selection results to work well are actually fairly generic, and not really tied to Gaussianity.
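This is easy to check by simulation. The sketch below uses toy choices of my own (uniform coordinates, \( d = 500 \), twenty equal-count bins): \( Z \) has independent, decidedly non-Gaussian entries scaled to mean 0 and variance 1, \( \alpha \) is held fixed, \( \beta \) is drawn from the unit sphere, and we compare the binned conditional mean of \( Y \) given \( X \) with the best linear fit.

```python
import numpy as np

rng = np.random.default_rng(42)
d, n, chunk = 500, 100_000, 10_000

alpha = np.ones(d) / np.sqrt(d)          # fixed unit vector
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)             # uniform direction on the sphere

# Draw Z in chunks to keep memory modest; entries are uniform on
# [-sqrt(3), sqrt(3)], hence mean 0, variance 1, and non-Gaussian.
X, Y = np.empty(n), np.empty(n)
for s in range(0, n, chunk):
    Z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(chunk, d))
    Y[s:s + chunk] = Z @ alpha
    X[s:s + chunk] = Z @ beta

b = (X @ Y) / (X @ X)                    # best linear predictor coefficient
edges = np.quantile(X, np.linspace(0, 1, 21))
idx = np.digitize(X, edges[1:-1])        # 20 equal-count bins
cond_mean = np.array([Y[idx == k].mean() for k in range(20)])
lin_pred = b * np.array([X[idx == k].mean() for k in range(20)])
max_dev = np.abs(cond_mean - lin_pred).max()
print("max deviation of binned E[Y|X] from the linear fit:", max_dev)
```

Even though nothing here is Gaussian, the binned conditional means hug the linear fit; rerunning with small \( d \) (say 2 or 3) shows visible curvature, as the result leads one to expect.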

I shall venture to request a few minutes to expound its concept of the universe...

Malcolm Forster tried to explicate some historical uses of Occam's Razor (not called that then) from the Copernican Revolution. He began with the observation that essentially arbitrary curves can be approximated to arbitrary precision by piling enough epicycles together, which is just Fourier analysis (or perhaps, in light of priority, Tusi analysis). But Copernicus was just as much about piling epicycles on the basic circle as any Ptolemaic model, so why (if at all?) was Copernicus better?
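The epicycles-as-Fourier point can be illustrated in a few lines (a toy of my own, approximating a square wave rather than a planetary orbit): each Fourier term is one "epicycle", a circle traversed at a fixed frequency, and stacking more of them drives the approximation error down.

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
target = np.sign(np.sin(t))              # the "orbit" to be matched

def epicycle_sum(n_epicycles):
    """Partial Fourier sum of the square wave: each term is one epicycle,
    a circle traversed at a fixed (odd-harmonic) frequency."""
    out = np.zeros_like(t)
    for k in range(n_epicycles):
        m = 2 * k + 1
        out += (4 / np.pi) * np.sin(m * t) / m
    return out

errors = [np.mean((target - epicycle_sum(n)) ** 2) for n in (1, 5, 50)]
print(errors)    # mean squared error shrinks as epicycles are piled on
```

Which is exactly why epicycle *counting* alone cannot settle the Ptolemy-vs.-Copernicus question: both schemes can fit anything, given enough circles.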

One way was not just epicycle counting, but epicycle-coincidence counting.
Take your favorite Copernican, heliocentric model, with some finite number of
epicycles. For each planet, there is a Ptolemaic, geocentric sub-model which
can match the predicted motion of that planet (to arbitrary accuracy).
However, if the same Ptolemaic model is to match the behavior
of *multiple* planets, the Ptolemaic model is going to demand more
epicycles. Worse, it is going to demand an otherwise-pointless coincidence
among the epicycles of different planets. If we take a Copernican model with
the Earth, Mars and Jupiter all going in circles around the Sun, it can be
matched by a Ptolemaic model in which the Sun circles the Earth, and Mars and
Jupiter circle the Earth with one epicycle each. But the period of the Sun's
circle around the Earth must match those of the epicycles of Mars and Jupiter
--- basically, every planet must get an epicycle which replicates the Earth's
orbit. (If I followed Forster right, this point was made by Kepler, not
Copernicus.) Forster imagined a "data-driven Ptolemaic astronomer, 'Ptolemy+'" — though surely that should be "Ptolemy++"? — who, on
model-selection grounds (e.g., AIC), preferred using a geocentric model where
those three circles *had to* have the same period. Ptolemy++ would give
the same predictions for the positions of the planets in the sky as Copernicus,
but would have no *physical* explanation for the parameter coincidence,
or any ability to *explain* why (as was known at the time of Copernicus)
the outer planets seem to go into retrograde motion only when opposed to the
Sun, or why the inner planets show phases (which wasn't known until the
telescope). Taking the Copernican model realistically, rather than the
instrumentalist Ptolemy++, would have been (was?) more *fruitful*.

I will not attempt to summarize what Forster said about the contrast between
Copernican epicycles and Kepler's ellipses, or Newton's unification of Kepler
and Galileo. I was struck by the observation that when Newton talked about
"causes" and "effects", he didn't mean *events* or
even *variables*, but something more like (what we would
call) *mechanisms* (for causes) and *phenomena* (for effects).
Also, there's evidently a passage in the Principia which is a near-quotation of Occam's
own *Frustra fit per
plura quod potest fieri per pauciora*, "it is vain to do with more what
could be done with less".

There followed another heated discussion, of whether machine learning algorithms could have discovered heliocentricity.

it has been established that all works are the creation of one author, who is atemporal and anonymous. The critics often invent authors: they select two dissimilar works — the Tao Te Ching and the 1001 Nights, say — attribute them to the same writer and then determine most scrupulously the psychology of this interesting *homme de lettres*...

The theory of giving driving directions; whether an independent and identically-distributed sequence is simpler than a homogeneous Markov chain, or vice versa; whether a physical theory that makes many very detailed assertions is simple or complex; parsimony as an argument for panpsychism; the lasso's grue problem; Galileo's surprisingly persuasive thought experiment for spherical inertia; why Kepler had noisier (but not smaller) residuals than Ptolemy; whether Ptolemy++ should have used AIC or the lasso.

Posted at June 23, 2012 23:59 | permanent link