When 80% Should Mean 80%
Probability calibration is one of those topics that matter a great deal in machine learning yet remain under-covered and rarely applied, even though the tooling is readily available. In my experience, many data scientists, ML teams, and team leads ignore this powerful tool, and that usually causes a misalignment with business metrics.
A probability is deemed calibrated if the predicted value matches the observed frequency of the event. In other words, if a model predicts a probability of 80% for a certain class/event, then in the long run, approximately 80% of the instances with this prediction should belong to that class or event.1
Let’s take churn prediction as an example. In most churn scenarios, the model’s output feeds directly into a budget allocation for customer retention: the business cares about how much money to spend on saving customers who are likely to churn.
Let’s imagine that the marketing department has a limited budget and a clear rule:
We only offer a 50 Euro discount voucher if the expected loss from a customer leaving is higher than the cost of the voucher.
In this case the expected loss for each customer is:
\[\text{Expected Loss} = \text{Customer LTV} \times P(\text{churn})\]
where LTV is the customer’s Lifetime Value.
An example flow (implemented in the sketch after this list) may look as follows:
- The model assigns a customer a score of 0.8, i.e. an 80% churn risk
- The business estimates that the lifetime value of this customer is 100 Euros
- The expected loss is then 100 × 0.8 = 80 Euros
- Send the customer a 50 Euro voucher because the expected loss (80 Euros) is higher than the cost of the voucher (50 Euros)
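To make the rule concrete, here is a minimal sketch of the decision logic above. The numbers and the `send_voucher` helper are hypothetical and only mirror the example:

```python
# Minimal sketch of the voucher decision rule from the flow above.
# The LTV, churn score, and voucher cost are illustration values only.

VOUCHER_COST = 50.0  # Euros

def expected_loss(ltv: float, p_churn: float) -> float:
    """Expected loss = customer lifetime value x churn probability."""
    return ltv * p_churn

def send_voucher(ltv: float, p_churn: float) -> bool:
    """Offer the discount only if the expected loss exceeds the voucher cost."""
    return expected_loss(ltv, p_churn) > VOUCHER_COST

# Model score of 0.8 on a 100 Euro customer: 80 Euros > 50 Euros -> send the voucher.
print(send_voucher(ltv=100.0, p_churn=0.8))  # True
```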
Here we notice that if the probability is not properly calibrated, the expected loss is over- or under-estimated, which leads to bad decisions. In this case, the true churn frequency for customers with a score of 0.8 might actually be only 0.3.
So in reality, the last two steps become:
- The expected loss after calibration is 100 × 0.3 = 30 Euros
- Do not send the voucher because the expected loss is lower than the cost of the voucher.
In our simple scenario, the business wasted 50 Euros on a customer who was unlikely to leave, or whose departure would have cost less than the cure.
Multiply this by 100 000 customers and the uncalibrated model, despite its high accuracy, is actively burning cash.
As we see in both figures, the ROC AUC barely changes between the two models. However, there is a big gap between their probability distributions.
This is because calibration affects the probability values, i.e. the confidence of the model, but it often preserves the ranking. We can think of the distribution plot on the right as showing where the model places its bets.
If you are using cross validation, the calibrated model might show a slight improvement in the AUC. This ranking-versus-calibration tradeoff also appears in ranking systems (Kweon, Kang, and Yu 2022), where calibration matters more than raw ordering.
The reliability diagram shows when the model is over-estimating (the curve falls below the diagonal, i.e. the observed frequency is lower than the predicted probability) or under-estimating (the curve sits above the diagonal). A common shape here is the S-curve, meaning that the model under-estimates up to some point and then starts to over-estimate, or vice versa.2
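scikit-learn can produce the points of such a diagram directly via `calibration_curve`. A minimal sketch, using deliberately miscalibrated toy data in place of real churn scores:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.calibration import calibration_curve

# Toy stand-in for (y_test, proba): labels drawn from proba**2, so the model
# systematically over-estimates the churn probability.
rng = np.random.default_rng(0)
proba = rng.uniform(0, 1, 5_000)
y_test = rng.binomial(1, proba**2)

# Observed frequency vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10, strategy="uniform")

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed churn frequency")
plt.legend()
plt.show()
```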
The profit curve shows the aggregated net result of the model’s decisions across all possible decision thresholds. The calibrated model should reach its peak close to the true optimal threshold, while the uncalibrated model peaks earlier or later than it should. This helps find the setting that maximizes ROI.
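A rough way to build such a curve is to sweep the decision threshold and sum the net result of each action. The sketch below assumes the voucher economics from the example (a 50 Euro voucher, recovering the LTV of a saved churner) plus a hypothetical `save_rate` for how often a voucher actually retains a churner; `y_true` and `proba` are assumed to be NumPy arrays of true labels and predicted probabilities:

```python
import numpy as np

def profit_curve(y_true, proba, ltv=100.0, voucher_cost=50.0, save_rate=0.5):
    """Net result of 'send a voucher if proba >= threshold' across thresholds.

    Assumptions (for illustration only): every targeted customer costs
    `voucher_cost`; a targeted true churner is retained with probability
    `save_rate`, recovering their `ltv`.
    """
    thresholds = np.linspace(0.0, 1.0, 101)
    profits = []
    for t in thresholds:
        targeted = proba >= t
        cost = voucher_cost * targeted.sum()
        recovered = ltv * save_rate * (targeted & (y_true == 1)).sum()
        profits.append(recovered - cost)
    return thresholds, np.array(profits)

# Usage:
# thresholds, profits = profit_curve(y_test, proba)
# best_threshold = thresholds[np.argmax(profits)]
```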
There are metrics we can look at to identify these problems during the modeling and exploration phase. One of them is the Expected Calibration Error (ECE) (Nixon et al. 2020). It is roughly the weighted average of gaps between the model’s prediction line in the reliability diagram and the perfect diagonal.
ECE measures how far predicted probabilities (confidence) are from reality (accuracy) (Nixon et al. 2020). You group predictions into probability bins, like 0.8–0.9. For each bin you compare the model’s average confidence to the empirical accuracy in that bin. ECE is the average absolute gap across bins, weighted by how many samples fall in each bin. So an ECE of 0.03 means the model’s stated probabilities are off by about 3 percentage points on average, in this binned, frequency-weighted sense.
If a bin contains predictions around 0.9 confidence but only 0.5 of them are correct, that bin has a 0.4 calibration gap. That only becomes a 0.4 ECE if almost all samples land in that bin. If it is 10% of samples, it contributes 0.1 × 0.4 = 0.04 to ECE.
As the previous example shows, the bin size and the sample size matter for getting a reliable ECE. It is also common to end up with several empty bins, which can lead to under- or over-estimating the real ECE. The metric still gives a good idea of how the model performs and whether it should be calibrated; keep in mind that fewer, larger bins give more stable estimates but less detail.
The ECE is not a single number; it varies with the number of bins. Moreover, depending on the data distribution, some bins can end up empty or hold very few values while a handful of bins contain most of the samples. A well-calibrated model should stay consistently low and flat across bin counts.
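A minimal ECE sketch with uniform bins; the helper name is mine, and `y_true` and `proba` are assumed to be NumPy arrays of true labels and predicted probabilities. The commented loop at the end shows how the number moves with the bin count:

```python
import numpy as np

def expected_calibration_error(y_true, proba, n_bins=10):
    """Weighted average |observed frequency - mean predicted probability| over uniform bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; the clip keeps proba == 1.0 in the last bin.
    bin_ids = np.clip(np.digitize(proba, bins) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:  # empty bins contribute nothing
            continue
        gap = abs(y_true[mask].mean() - proba[mask].mean())
        ece += (mask.sum() / len(proba)) * gap
    return ece

# ECE depends on the binning:
# for n in (5, 10, 15, 20):
#     print(n, expected_calibration_error(y_test, proba, n_bins=n))
```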
Because of this, the Adaptive Calibration Error (ACE) is more useful for our example, where the average churn rate is low (Nixon et al. 2020). With uniform binning, many bins are empty or contain only a few outliers. ACE instead splits the predictions into bins of equal sample size.
Generally, ACE is preferred for business use because it is smoother than ECE and tends to be more sensitive to miscalibration in sparse regions (Błasiok and Nakkiran 2023). It is good at exposing the specific areas where the model is failing, even in low-risk zones.
ACE focuses on reliability. It answers the specific question: when the model says 80%, does it happen 80% of the time? But it ignores the ability to separate classes.
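A minimal ACE-style sketch that follows the equal-sample-bin (quantile) idea rather than any particular library implementation; as before, `y_true` and `proba` are assumed NumPy arrays:

```python
import numpy as np

def adaptive_calibration_error(y_true, proba, n_bins=10):
    """Average |observed frequency - mean predicted probability| over equal-sample bins."""
    order = np.argsort(proba)
    # Split the sorted predictions into bins with (roughly) equal sample counts.
    chunks = np.array_split(order, n_bins)
    gaps = [abs(y_true[idx].mean() - proba[idx].mean()) for idx in chunks if len(idx) > 0]
    return float(np.mean(gaps))

# Usage:
# ace = adaptive_calibration_error(y_test, proba, n_bins=10)
```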
The Brier score is another metric for identifying calibration error. It is the mean squared error (MSE) between the probability forecast and the binary outcome, so it measures on average how far the model’s confidence is from reality; the lower the score, the better. One thing to keep in mind is that the Brier score captures both calibration and refinement (how sharp the probabilities are), so a model can achieve a decent score through sharpness while still being miscalibrated. We should therefore always pair it with a reliability diagram.3
| Model | Brier Score | Log Loss |
|---|---|---|
| Uncalibrated RF | 0.1470 | 0.4609 |
| Calibrated RF (Isotonic) | 0.1316 | 0.4193 |
Another important metric is log loss. It is the most sensitive of the three because it heavily punishes predictions that are both wrong and overconfident, so it focuses on avoiding catastrophic errors.
Isotonic regression can predict exactly 0.0 or 1.0, which can send the log loss to infinity even if the model is wrong only once. A known trick is to clip the probabilities to a narrow range, e.g. [1e-7, 1 − 1e-7], before calculating log loss.4
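Both metrics are one call away in scikit-learn. A minimal sketch with toy inputs, including the clipping trick for log loss:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Toy stand-ins for the true labels and predicted churn probabilities.
y_test = np.array([0, 0, 1, 0, 1, 1, 0, 1])
proba = np.array([0.1, 0.0, 0.9, 0.3, 1.0, 0.8, 0.2, 0.6])

# Brier score: mean squared error between probabilities and outcomes.
brier = brier_score_loss(y_test, proba)

# Isotonic regression can output exactly 0.0 or 1.0, which would send log loss
# to infinity on a single wrong prediction, so clip before scoring.
clipped = np.clip(proba, 1e-7, 1 - 1e-7)
ll = log_loss(y_test, clipped)

print(f"Brier score: {brier:.4f}  log loss: {ll:.4f}")
```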
Calibration is not optional when model probabilities feed into business decisions. Here is what I would recommend:
- Always check the reliability diagram before deploying a model that outputs probabilities. If the curve deviates from the diagonal, your scores are lying to the business.
- Use post-hoc calibration. Isotonic regression works well with enough data. For smaller datasets, Platt scaling (a logistic sigmoid fit) is more stable. Both are one-liners in scikit-learn (see the sketch after this list).
- AUC alone is not enough. A model can rank perfectly and still produce probabilities that are wildly off. If those probabilities drive budget decisions, the ranking does not save you.
- Pick the right calibration metric for your problem. ECE is a good default. ACE is better when your classes are imbalanced. Brier score gives you a single number but mixes calibration with sharpness, so pair it with a reliability diagram.
- Quantify the business impact. Build a profit curve like the one above. Show stakeholders the difference in Euros, not just in abstract metrics. That is how you get buy-in for the extra calibration step.
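As a rough illustration of the post-hoc calibration recommendation, here is a sketch on a synthetic stand-in dataset. `method="isotonic"` versus `method="sigmoid"` is the isotonic-versus-Platt choice from the list, and the numbers it prints are not meant to reproduce the table above:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced churn dataset.
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.85], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

raw = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# Post-hoc calibration: "isotonic" needs more data, "sigmoid" is Platt scaling.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=42),
    method="isotonic",
    cv=5,
).fit(X_train, y_train)

for name, model in [("uncalibrated RF", raw), ("calibrated RF", calibrated)]:
    p = model.predict_proba(X_test)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_test, p):.4f}  Brier={brier_score_loss(y_test, p):.4f}")
```

The ranking metric (AUC) should barely move, while the Brier score improves, which is exactly the ranking-versus-calibration behaviour discussed earlier.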
Further Reading
- (Guo et al. 2017) — showed that modern deep networks are surprisingly miscalibrated despite high accuracy; introduced temperature scaling.
- (Platt 1999) — the original Platt scaling paper; fits a sigmoid to convert SVM outputs into probabilities. Still the go-to for small calibration sets.
- (Nixon et al. 2020) — comprehensive comparison of ECE, ACE, and other calibration metrics with practical guidance on bin selection.
- (Niculescu-Mizil and Caruana 2012) — empirical comparison of calibration across model families (boosting, RF, SVM, Naive Bayes) and post-hoc correction methods.
Footnotes
1. For a foundational treatment of post-hoc calibration methods, see Niculescu-Mizil and Caruana (2012).
2. For reliability-diagram construction and interpretation details, see Błasiok and Nakkiran (2023).
3. Calibration in high-stakes healthcare settings is examined in Rousseau et al. (2025) and Majlatow et al. (2025).
4. Early evidence on calibration effects with boosting and post-hoc correction can be found in Niculescu-Mizil and Caruana (2012).