Attention conservation notice: 1000 words on a short probability problem. Several hobby-horses get to take turns around the yard.
Wolfgang at Mostly Harmless poses a problem (I've lightly tweaked the notation):
Consider a random process \( X(t) \) which generates a series of 0s and 1s, but many more 0s because the probability for \( X(t) = 1 \) decreases with \( t \) as \( 2^{-t} \). Now assume that we encounter this process not knowing 'how far we are already', in other words we don't know the value of \( t \). The question is: "What is the probability to get a 1?"
Unfortunately there are two ways to answer this question. The first calculates the 'expectation value', as a physicist would call it, or 'the mean' as a statistician would put it, which is zero.
In other words, we sum over all possible \( t \) with equal weight and have to consider \( s = \sum_{t}{2^{-t}} \) with \( t = 1, 2, \ldots, N \); it is not difficult to see that \( s = 1/2 + 1/4 + \ldots \) equals 1.
The answer is therefore \( \mathrm{Pr}(X=1) = s/N = 1/N \) and because \( N \) is infinite (the process never ends) we get \( \mathrm{Pr}(X=1) = 0 \).
The second answer simply looks at the definition of the process and points out that \( \mathrm{Pr}(X=1) = 2^{-T} \), where \( T \) is the current value of \( t \). Although we don't know \( T \) it must have some finite value and it is obvious that \( \mathrm{Pr}(X=1) > 0 \).
So which one is it, \( \mathrm{Pr}(X=1) = 0 \) or \( \mathrm{Pr}(X=1) > 0 \)?
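(A throwaway numerical sketch of the two calculations — purely illustrative Python, assuming nothing beyond what the problem states:)

```python
# First answer: average 2^-t over t = 1, ..., N with equal weight.
# The sum s tends to 1, so the average s/N behaves like 1/N and shrinks to 0.
for N in (10, 100, 1000):
    s = sum(2.0 ** -t for t in range(1, N + 1))
    print(N, s, s / N)

# Second answer: for any *fixed* current time T, however large,
# the probability of a 1 is 2^-T, which is tiny but strictly positive.
for T in (5, 50, 500):
    print(T, 2.0 ** -T)
```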
Fatwa: The second answer is correct, and the first is wrong.
Discussion: This is a cute example of the weirdness which results when we attempt to put uniform distributions on infinite spaces, even in the simplest possible case of the positive integers. The first way of proceeding assumes that the notion of a uniform probability distribution on the natural numbers makes sense, and that it obeys the same rules as an ordinary probability distribution. Unfortunately, these two requirements are incompatible. This is because ordinary probability distributions are countably additive. We are all familiar with the fact that probability adds across disjoint events: \( \mathrm{Pr}(X= 0 ~ \mathrm{or} ~ 1) = \mathrm{Pr}(X=0)+\mathrm{Pr}(X=1) \). Moreover, we are all comfortable with the idea that this holds for more than two events. The probability that \( X \) first \( =1 \) by time 3 is the sum of the probability of the first one being at time \( t=1 \), plus it first being one at \( t=2 \), plus it first being one at \( t=3 \). Carrying this out to any finite collection of disjoint events is called finite additivity. However, as I said, probability measures are ordinarily required to be countably additive, meaning that this holds even for a countable infinity of disjoint events.
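(To make the finite-additivity bookkeeping concrete, here is a small simulation. It has to assume something the puzzle leaves unstated — that the \( X(t) \) are independent draws with \( \mathrm{Pr}(X(t)=1) = 2^{-t} \) — so take it as an illustration, not as part of the problem.)

```python
import random

def first_one_time(horizon=30):
    """Time of the first 1 in one run of the process, or None if none occurs by `horizon`."""
    for t in range(1, horizon + 1):
        if random.random() < 2.0 ** -t:
            return t
    return None

trials = 200_000
times = [first_one_time() for _ in range(trials)]
sim_by_3 = sum(t is not None and t <= 3 for t in times) / trials

# Exact probabilities of the three disjoint events "first 1 at time k":
p1 = 1 / 2
p2 = (1 - 1 / 2) * (1 / 4)
p3 = (1 - 1 / 2) * (1 - 1 / 4) * (1 / 8)

print(sim_by_3)      # simulated Pr(first 1 at or before time 3)
print(p1 + p2 + p3)  # 43/64: the three disjoint pieces add up, as finite additivity requires
```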
And here we have trouble. The natural numbers are (by definition!) countable, so the probability of all integers is the sum of the probability of each integer,
\[
\mathrm{Pr}(T ~ \mathrm{an} ~ \mathrm{integer}) = \sum_{t \in \mathbb{Z}}{\mathrm{Pr}(T=t)}
\]
The left-hand side must be 1. For a uniform distribution, we expect that all the terms in the sum on the right-hand side must be equal, otherwise it's not "uniform". But either all the terms are equal and positive, in which case the right-hand side is infinite, or all the terms are equal and zero, in which case the right-hand side is zero. Hence, there is no countably-additive uniform probability measure on the integers, and the first approach, which leads to the conclusion that \( \mathrm{Pr}(X(T)=1)=0 \), is mathematically incoherent.
Now, there are such things as finitely-additive probability measures, but they are rather weird beasts. To specify one of them on the integers, for example, it's not enough to give the probability of each integer (as it is for a countably-additive measure); that only pins down the probability of finite sets, and sets whose complements are finite. It does not, for example, specify the probability of the even numbers. There turn out to be several different ways of defining uniform distributions on the natural numbers, which are not equivalent. Under all of them, however, any finite set must have probability zero, and so at a random time \( T \), it is almost certain that \( \mathrm{Pr}(X(T)=1) \) is less than any positive number you care to name. Hence, the expectation value of this random probability is indeed zero.
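(One concrete stand-in, to fix ideas: natural, or asymptotic, density — the limiting fraction of \( \{1, \ldots, N\} \) falling in a set, when that limit exists. It is only finitely additive, it isn't defined for every set, and extending it to all sets is exactly where the inequivalent "uniform" measures come from. The little Python sketch below just truncates at a large \( N \); it's an illustration, not a construction of any particular such measure.)

```python
def density(indicator, N=10 ** 6):
    """Fraction of {1, ..., N} satisfying `indicator` -- a finite-N stand-in for natural density."""
    return sum(indicator(n) for n in range(1, N + 1)) / N

print(density(lambda n: n % 2 == 0))   # the even numbers: about 1/2
print(density(lambda n: n <= 1000))    # any fixed finite set: goes to 0 as N grows
```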
(Notice, however, that if I try to calculate the expectation value of any function \( f(t) \) by taking a probability-weighted sum over values of \( t \), as the first answer does, I will get the answer 0 when \( T \) follows a uniform finitely-additive measure, even if \( f(t)=1 \) for all \( t \). The weighted-sum-of-arguments definition of expectation — the one reminiscent of Riemann integrals — does not work for these measures. Instead one must use a Lebesgue-style definition, where one takes a weighted sum of the values of the function, the weights being the measures of the sets giving those values. [More exactly, one partitions the range of f and takes the limit as the partition becomes finer and finer.] The equivalence of the summing over the domain and summing over the range turns on, precisely, countable additivity. The argument in the previous paragraph shows that here this expectation value must be less than any positive number, yet not negative, hence zero.)
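(Concretely: take \( f(t)=1 \) for every \( t \), and let \( \mu \) be a uniform finitely-additive measure, so \( \mu(\{t\})=0 \) for each \( t \) while \( \mu(\mathbb{N})=1 \). Then the two definitions of expectation come apart:
\[
\sum_{t} f(t)\,\mu(\{t\}) = \sum_{t} 1 \cdot 0 = 0, \qquad \mathrm{but} \qquad \int f \, d\mu = 1 \cdot \mu(\mathbb{N}) = 1 .
\]
Countable additivity is exactly what would force these two numbers to agree.)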
Finitely-additive probability measures are profoundly weird beasts, though some of my colleagues have what I can only consider a perverse affection for them. On the other hand, attempts to construct a natural countably-additive analog of a uniform distribution on unbounded spaces like the integers have been universally unsuccessful; this very much includes the maximum entropy approach. The project of giving ignorance a unique representation as a probability measure is, IMSAO, a failure. If one picks some countably-additive prior distribution over the integers, however, then at least one value of \( t \) must have strictly positive probability, and the expectation value of \( \mathrm{Pr}(X(T)=1) \) is positive, though how big it is will depend on the prior distribution. (As usual, the role of a Bayesian prior distribution is to introduce bias so as to reduce variance.) Alternately, one simply follows the second line of reasoning and concludes that, no matter what \( t \) might be, the probability is positive.
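(For instance — a purely illustrative choice of prior, not one the problem dictates — put a geometric distribution on \( T \), \( \mathrm{Pr}(T=t) = (1-q)q^{t-1} \). Then \( \mathrm{E}[2^{-T}] = (1-q)/(2-q) \), strictly positive but shrinking as the prior puts more weight on large \( t \). A quick check:)

```python
def expected_prob(q, terms=10_000):
    """Truncated sum for E[2^-T] under the (illustrative) geometric prior Pr(T=t) = (1-q) q^(t-1)."""
    return sum((1 - q) * q ** (t - 1) * 2.0 ** -t for t in range(1, terms + 1))

for q in (0.5, 0.9, 0.99):
    print(q, expected_prob(q), (1 - q) / (2 - q))   # numeric sum vs. the closed form
```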
Update, 30 June 2013: See also Lost Causes in Statistics I: Finite Additivity.
Posted at December 08, 2009 10:55 | permanent link