Notebooks

Proper Scoring Rules

Last update: 31 Mar 2026 09:20
First version: 30 March 2026

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \]

Attention conservation notice: Re-purposed teaching material (down to the horizontal rule).

We want to guess at $Y$, a binary random variable that's either 0 or 1, with $ \Prob{Y=1}=p $, but we want to predict a probability $ 0 \leq q \leq 1 $ that $ Y=1 $. After we make our prediction, we incur a loss $ \ell(Y, q) $. Examples include:
  1. the squared error loss, $ \ell(Y, q) = (Y-q)^2 $;
  2. the log probability loss, $ \ell(Y, q) = -Y \log{q} - (1-Y)\log{(1-q)} $;
  3. the absolute loss, $ \ell(Y, q) = |Y - q| $.
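As a concrete sketch (the function names here are my own, purely for illustration), the squared error, log probability, and absolute losses are one-liners:

```python
import math

def squared_error_loss(y, q):
    """Squared error (Brier-style) loss: (y - q)^2."""
    return (y - q) ** 2

def log_loss(y, q):
    """Negative log of the probability the forecast assigned to the observed outcome."""
    return -math.log(q) if y == 1 else -math.log(1 - q)

def absolute_loss(y, q):
    """Absolute loss: |y - q|."""
    return abs(y - q)

# A forecast of q = 0.8 when Y turns out to be 1:
print(squared_error_loss(1, 0.8))  # approx. 0.04
print(log_loss(1, 0.8))            # approx. 0.223
print(absolute_loss(1, 0.8))       # approx. 0.2
```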

A loss function, for this kind of probability prediction, is called a proper scoring function, and leads to a proper scoring rule, when $ \Expect{\ell(Y, q)} $ is uniquely minimized by setting $ q=p $.
Exercises:
  1. Show that the squared error loss is a proper scoring function.
  2. Show that the log probability loss is a proper scoring function. Hint: show that $ \frac{x}{1-x} $ is invertible for $x \in [0,1)$.
  3. Show that the absolute loss is minimized by setting $q=1$ if $p>1/2$ and $q=0$ if $p<1/2$, so it is not a proper scoring function.
  4. Proper scoring and accuracy: Alice and Babur are both asked to predict a sequence of independent and identically distributed $ Y_1, Y_2, \ldots Y_n, \ldots $. Suppose that $ \Prob{Y_i=1} = p $ (for all $i$); that Alice thinks the probability that $ Y_i=1 $ is in fact $p$; and that Babur thinks that the probability that $ Y_i=1 $ is $ q \neq p $. Both Alice and Babur are evaluated using a proper scoring rule --- the same proper scoring rule. Explain why, in the long run, Alice will have a lower (that is, better) score than Babur.
  5. Proper scoring and honesty: Babur truly believes the probability that $ Y=1 $ is $p$, but he is tempted to publicly predict a different probability $q$. Show that, if Babur is evaluated using a proper scoring rule, his expectation of his score is minimized by predicting $ q=p $ (that is, by honestly reporting his true belief, and not something else).
  6. Proper scoring and calibration: A probability forecast is calibrated when the events it claims have probability $q$ actually happen with frequency $q$. (It snows on 90% of the days where the forecast claims a 90% chance of snow, and so forth.) Explain the connection between calibration and proper scoring rules.
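A quick Monte Carlo sketch (mine, not part of the exercises, and of course not a proof) illustrates exercise 4 for the log probability loss: the honest forecaster's average loss settles below the misreporter's.

```python
import math
import random

def avg_log_loss(ys, q):
    """Average log probability loss of a constant forecast q over outcomes ys."""
    return sum(-math.log(q) if y == 1 else -math.log(1 - q) for y in ys) / len(ys)

random.seed(1)
p = 0.7  # true P(Y = 1)
ys = [1 if random.random() < p else 0 for _ in range(100_000)]

alice = avg_log_loss(ys, p)    # Alice honestly forecasts q = p
babur = avg_log_loss(ys, 0.5)  # Babur forecasts q = 0.5 != p
print(alice < babur)           # prints True: Alice's average loss is lower
```

In the long run Alice's average loss tends to the entropy $-p\log{p} - (1-p)\log{(1-p)}$, while Babur pays an extra penalty equal to the Kullback-Leibler divergence between $p$ and his $q$.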
All of this extends to more-than-binary outcomes, and to conditional probabilities, with only some notational overhead.
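For instance, a minimal sketch of the log probability loss for a $k$-way outcome (my own illustration, with the forecast now a probability vector rather than a single number):

```python
import math

def multiclass_log_loss(y, q):
    """Log probability loss for an outcome y in {0, ..., k-1} under a
    forecast distribution q = (q_0, ..., q_{k-1}) summing to 1."""
    return -math.log(q[y])

# Three possible outcomes, forecast (0.2, 0.5, 0.3), outcome 1 observed:
print(multiclass_log_loss(1, [0.2, 0.5, 0.3]))  # approx. 0.693
```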

Embarrassingly, for teaching, I have no earthly recollection of where I learned about this subject.

