Notebooks

Proper Scoring Rules

Last update: 31 Mar 2026 09:20
First version: 30 March 2026

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \]

Attention conservation notice: Re-purposed teaching material (down to the horizontal rule).

We want to guess at $Y$, a binary random variable that's either 0 or 1, with $ \Prob{Y=1}=p $, but we want to predict a probability $ 0 \leq q \leq 1 $ that $ Y=1 $. After we make our prediction, we incur a loss $ \ell(Y, q) $. Examples include:
  1. the squared error loss, $ \ell(Y, q) = (Y-q)^2 $;
  2. the log probability loss, $ \ell(Y, q) = -Y \log{q} - (1-Y)\log{(1-q)} $;
  3. the absolute loss, $ \ell(Y, q) = |Y - q| $.
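As a concrete sketch (the function names here are my own, purely for illustration), the squared error, log probability, and absolute losses are one-liners:

```python
import math

def squared_error_loss(y, q):
    """Squared error (Brier-style) loss: (y - q)^2."""
    return (y - q) ** 2

def log_loss(y, q):
    """Negative log of the probability the forecast assigned to the observed outcome."""
    return -math.log(q) if y == 1 else -math.log(1 - q)

def absolute_loss(y, q):
    """Absolute loss: |y - q|."""
    return abs(y - q)

# A forecast of q = 0.8 when Y turns out to be 1:
print(squared_error_loss(1, 0.8))  # approx. 0.04
print(log_loss(1, 0.8))            # approx. 0.223
print(absolute_loss(1, 0.8))       # approx. 0.2
```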

A loss function, for this kind of probability prediction, is called a proper scoring function, and leads to a proper scoring rule, when $ \Expect{\ell(Y, q)} $ is uniquely minimized by setting $ q=p $.
Exercises:
  1. Show that the squared error loss is a proper scoring function.
  2. Show that the log probability loss is a proper scoring function. Hint: show that $ \frac{x}{1-x} $ is invertible for $x \in [0,1)$.
  3. Show that the absolute loss is minimized by setting $q=1$ if $p>1/2$ and $q=0$ if $p<1/2$, so it is not a proper scoring function.
  4. Proper scoring and accuracy: Alice and Babur are both asked to predict a sequence of independent and identically distributed $ Y_1, Y_2, \ldots Y_n, \ldots $. Suppose that $ \Prob{Y_i=1} = p $ (for all $i$); that Alice thinks the probability that $ Y_i=1 $ is in fact $p$; and that Babur thinks that the probability that $ Y_i=1 $ is $ q \neq p $. Both Alice and Babur are evaluated using a proper scoring rule --- the same proper scoring rule. Explain why, in the long run, Alice will have a lower (that is, better) score than Babur.
  5. Proper scoring and honesty: Babur truly believes the probability that $ Y=1 $ is $p$, but he is tempted to publicly predict a different probability $q$. Show that, if Babur is evaluated using a proper scoring rule, his expectation of his score is minimized by predicting $ q=p $ (that is, by honestly reporting his true belief, and not something else).
  6. Proper scoring and calibration: A probability forecast is calibrated when the events it claims have probability $q$ actually happen with frequency $q$. (It snows on 90% of the days where the forecast claims a 90% chance of snow, and so forth.) Explain the connection between calibration and proper scoring rules.
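A quick Monte Carlo sketch (mine, not part of the exercises, and of course not a proof) illustrates exercise 4 for the log probability loss: the honest forecaster's average loss settles below the misreporter's.

```python
import math
import random

def avg_log_loss(ys, q):
    """Average log probability loss of a constant forecast q over outcomes ys."""
    return sum(-math.log(q) if y == 1 else -math.log(1 - q) for y in ys) / len(ys)

random.seed(1)
p = 0.7  # true P(Y = 1)
ys = [1 if random.random() < p else 0 for _ in range(100_000)]

alice = avg_log_loss(ys, p)    # Alice honestly forecasts q = p
babur = avg_log_loss(ys, 0.5)  # Babur forecasts q = 0.5 != p
print(alice < babur)           # prints True: Alice's average loss is lower
```

In the long run Alice's average loss tends to the entropy $-p\log{p} - (1-p)\log{(1-p)}$, while Babur pays an extra penalty equal to the Kullback-Leibler divergence between $p$ and his $q$.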
All of this extends to more-than-binary outcomes, and to conditional probabilities, with only some notational overhead.
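For instance, a minimal sketch of the log probability loss for a $k$-way outcome (my own illustration, with the forecast now a probability vector rather than a single number):

```python
import math

def multiclass_log_loss(y, q):
    """Log probability loss for an outcome y in {0, ..., k-1} under a
    forecast distribution q = (q_0, ..., q_{k-1}) summing to 1."""
    return -math.log(q[y])

# Three possible outcomes, forecast (0.2, 0.5, 0.3), outcome 1 observed:
print(multiclass_log_loss(1, [0.2, 0.5, 0.3]))  # approx. 0.693
```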

Embarrassingly, for teaching, I have no earthly recollection of where I learned about this subject.

