Attention conservation notice: 1000 words on a short probability problem. Several hobby-horses get to take turns around the yard.

Wolfgang at Mostly Harmless poses a problem (I've lightly tweaked the notation):

Consider a random process \( X(t) \) which generates a series of 0s and 1s, but many more 0s, because the probability for \( X(t) = 1 \) decreases with \( t \) as \( 2^{-t} \). Now assume that we encounter this process not knowing 'how far we are already', in other words we don't know the value of \( t \). The question is: "What is the probability of getting a 1?"

Unfortunately there are two ways to answer this question. The first calculates the 'expectation value', as a physicist would call it, or 'the mean' as a statistician would put it, which is zero.

In other words, we sum over all possible \( t \) with equal weight and have to consider \( s = \sum_{t}{2^{-t}} \) with \( t = 1, 2, \ldots, N \). It is not difficult to see that \( s = 1/2 + 1/4 + \ldots \) equals 1.

The answer is therefore \( \mathrm{Pr}(X=1) = s/N = 1/N \) and because \( N \) is infinite (the process never ends) we get \( \mathrm{Pr}(X=1) = 0 \).
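The arithmetic behind the first answer is easy to check numerically; here is a minimal sketch (of the computation only, not an endorsement of the reasoning):

```python
# Partial sums of s = sum_{t=1}^{N} 2^{-t}, and the first answer's
# ratio s/N, for growing N.  The sum s tends to 1 while s/N tends to 0.
for N in [10, 100, 1000]:
    s = sum(2.0 ** -t for t in range(1, N + 1))
    print(N, s, s / N)
```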

The second answer simply looks at the definition of the process and points out that \( \mathrm{Pr}(X=1) = 2^{-T} \), where \( T \) is the current value of \( t \). Although we don't know \( T \) it must have some finite value and it is obvious that \( \mathrm{Pr}(X=1) > 0 \).

So which one is it, \( \mathrm{Pr}(X=1) = 0 \) or \( \mathrm{Pr}(X=1) > 0 \)?

*Fatwa*: The second
answer is correct, and the first is wrong.

*Discussion*: This is a cute example of the weirdness which results
when we attempt to put uniform distributions on infinite spaces, even in the
simplest possible case of the positive integers. The first way of proceeding
assumes that the notion of a uniform probability distribution on the natural
numbers makes sense, and that it obeys the same rules as an ordinary
probability distribution. Unfortunately, these two requirements are
incompatible. This is because ordinary probability distributions
are **countably additive**. We are all familiar with the fact
that probability adds across disjoint events: \( \mathrm{Pr}(X= 0 ~ \mathrm{or}
~ 1) = \mathrm{Pr}(X=0)+\mathrm{Pr}(X=1) \). Moreover, we are all comfortable
with the idea that this holds for more than two events. The probability that
\( X \) first equals 1 by time 3 is the sum of the probabilities of its first
being 1 at \( t=1 \), at \( t=2 \), and at \( t=3 \). Carrying this out to any
finite collection of disjoint events is
called **finite additivity**. However, as I said, probability
measures are ordinarily required to be *countably* additive, meaning
that this holds even for a countable infinity of disjoint events.

And here we have trouble. The natural numbers are (by definition!)
countable, so the probability of all integers is the sum of the probability of
each integer,
\[
\mathrm{Pr}(T ~ \mathrm{an} ~ \mathrm{integer}) = \sum_{t \in \mathbb{Z}}{\mathrm{Pr}(T=t)}
\]

The left-hand side must be 1. For a uniform distribution, we expect that
all the terms in the sum on the right-hand side must be equal, otherwise it's
not "uniform". But either all the terms are equal and positive, in which case
the right-hand side is infinite, or all the terms are equal and zero, in which
case the right-hand side is zero. Hence, there is no countably-additive
uniform probability measure on the integers, and the first approach, which
leads to the conclusion that \( \mathrm{Pr}(X(T)=1)=0 \), is mathematically
incoherent.

Now, there are such things as finitely-additive probability measures, but they are rather weird beasts. To specify one of them on the integers, for example, it's not enough to give the probability of each integer (as it is for a countably-additive measure); that only pins down the probability of finite sets, and of sets whose complements are finite. It does not, for example, specify the probability of the even numbers. There turn out to be several different ways of defining uniform distributions on the natural numbers, which are not equivalent. Under all of them, however, any finite set must have probability zero, and so at a random time \( T \), it is almost certain that \( \mathrm{Pr}(X(T)=1) \) is less than any positive number you care to name. Hence, the expectation value of this random probability is indeed zero.
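One (partial) way of making "uniform on the naturals" precise is natural density, the limiting fraction of \( \{1, \ldots, N\} \) falling in a set. A quick sketch shows the features just described: finite sets get density zero, while the even numbers get density 1/2, a value you could never recover by summing the (zero) densities of its points:

```python
# Natural density: |A intersect {1..N}| / N for large N.
# Finite sets have density tending to 0; the evens have density 1/2,
# which is not the sum of the point densities (all zero).
def density(indicator, N):
    return sum(1 for n in range(1, N + 1) if indicator(n)) / N

for N in [100, 10000]:
    print(density(lambda n: n % 2 == 0, N))        # evens, tends to 0.5
    print(density(lambda n: n in {3, 17, 42}, N))  # finite set, tends to 0
```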

(Notice, however, that if I try to calculate the expectation value
of *any* function \( f(t) \) by taking a probability-weighted sum over
values of \( t \), as the first answer does, I will get the answer 0 when \( T
\) follows a uniform finitely-additive measure, even if \( f(t)=1 \) for all \(
t \). The weighted-sum-of-arguments definition of expectation — the one
reminiscent of Riemann
integrals — does not work for these measures. Instead one must use a
Lebesgue-style
definition, where one takes a weighted sum of the *values* of the
function, the weights being the measures of the sets giving those values.
[More exactly, one partitions the range of *f* and takes the limit as the
partition becomes finer and finer.] The equivalence of the summing over the
domain and summing over the range turns on, precisely, countable additivity.
The argument in the previous paragraph shows that here this expectation value
must be less than any positive number, yet not negative, hence zero.)

Finitely-additive probability measures
are profoundly
weird beasts, though some of my colleagues have what I can only consider
a perverse
affection for them. On the other hand, attempts to construct a natural
countably-additive analog of a uniform distribution on infinite sets have
been universally
unsuccessful;
this very much
includes the maximum entropy approach. The project of giving ignorance a
unique representation as a probability measure is, IMSAO, a failure. If one
picks *some* countably-additive prior distribution over the integers,
however, then at least one value of \( t \) must have strictly positive
probability, and the expectation value of \( \mathrm{Pr}(X(T)=1) \) is
positive, though how big it is will depend on the prior distribution.
(As usual, the role of a Bayesian prior distribution is
to introduce bias so as to reduce variance.) Alternately, one simply follows
the second line of reasoning and concludes that, no matter what \( t \) might
be, the probability is positive.
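To sketch the point about countably-additive priors: take, purely for illustration, a geometric prior \( \mathrm{Pr}(T=t) = (1-q)q^{t-1} \) (my choice, not anything in the original problem). Then \( \mathrm{E}[\mathrm{Pr}(X(T)=1)] = \mathrm{E}[2^{-T}] \) is strictly positive, and its size depends on the prior's parameter \( q \):

```python
# E[2^{-T}] under a geometric prior Pr(T = t) = (1-q) q^(t-1), t >= 1.
# Truncating the sum at `terms` is harmless since the tail is geometric.
# Closed form for comparison: (1-q)/(2-q).
def expected_prob(q, terms=200):
    return sum((1 - q) * q ** (t - 1) * 2.0 ** -t
               for t in range(1, terms + 1))

for q in [0.5, 0.9, 0.99]:
    print(q, expected_prob(q))  # positive, shrinking as q grows
```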

**Update**, 30 June 2013: See
also Lost
Causes in Statistics I: Finite Additivity.

Posted at December 08, 2009 10:55 | permanent link