
Deep Probability Estimation

This website contains results, code, and pre-trained models from Deep Probability Estimation by Sheng Liu*, Aakash Kaku*, Weicheng Zhu*, Matan Leibovich*, Sreyas Mohan*, Boyang Yu, Laure Zanna, Narges Razavian, Carlos Fernandez-Granda [* - Equal Contribution].

What Is Probability Estimation?

Estimating probabilities reliably is of crucial importance in many real-world applications such as weather forecasting, medical prognosis, or collision avoidance in autonomous vehicles. This work investigates how to use deep neural networks to estimate probabilities from high-dimensional data such as climatological radar maps, histopathology images, and dashcam videos.

Probability-estimation models are trained on observed outcomes $y \in \{0, 1\}$ (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities $p$ of the events of interest are typically unknown. The problem is therefore analogous to binary classification, with the important difference that the main objective at inference is to estimate the probability $p$ rather than to predict the specific outcome.
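
More formally, a common way to describe this setup (the notation here is ours) is that each training input $x_i$ comes with an observed label

$$ y_i \mid x_i \sim \mathrm{Bernoulli}(p_i), \qquad p_i = \mathbb{P}(y_i = 1 \mid x_i), $$

and the goal at inference is to estimate $p_i$, not merely the most likely value of $y_i$.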

Early Learning and Memorization in Probability Estimation

Prediction models based on deep learning are typically trained by minimizing the cross entropy between the model output and the training labels. Minimizing this cost function is guaranteed to yield well-calibrated probabilities in an infinite-data regime, as illustrated by the figure below (1st column). Unfortunately, in practice, prediction models are trained on finite data. In this case, we observe that neural networks eventually overfit and memorize the observed outcomes completely, and the estimated probabilities collapse to 0 or 1 (2nd column). However, calibration is preserved during the first stage of training (3rd column), which we call early learning. In our paper we provide a theoretical analysis showing that this is a general phenomenon that occurs even for linear models when the dimension of the input data is large (Theorem 4.1 in the paper). Our proposed method exploits the early-learning phenomenon to obtain an improved model that is still well calibrated (4th column).
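
To make the phenomenon concrete, here is a minimal monitoring sketch (not code from the paper; the model, data loaders, and the 0.01 threshold for "extreme" probabilities are our own placeholders) that tracks validation cross-entropy together with how many predicted probabilities have collapsed toward 0 or 1:

```python
# Sketch: detect the early-learning phase before memorization sets in.
# `model`, `val_loader`, and the 0.01 threshold are illustrative placeholders.
import torch
import torch.nn.functional as F

def prob_extremeness(model, loader, device="cpu"):
    """Fraction of predicted probabilities within 0.01 of 0 or 1 (collapse indicator)."""
    model.eval()
    extreme, total = 0, 0
    with torch.no_grad():
        for x, _ in loader:
            p = torch.sigmoid(model(x.to(device))).squeeze(-1)
            extreme += ((p < 0.01) | (p > 0.99)).sum().item()
            total += p.numel()
    return extreme / max(total, 1)

def validation_ce(model, loader, device="cpu"):
    """Average validation cross-entropy of the current model."""
    model.eval()
    losses = []
    with torch.no_grad():
        for x, y in loader:
            logits = model(x.to(device)).squeeze(-1)
            losses.append(F.binary_cross_entropy_with_logits(logits, y.float().to(device)).item())
    return sum(losses) / len(losses)

# Log both quantities once per epoch: when the validation CE starts rising and the
# extremeness fraction approaches 1, the model has left the early-learning stage.
```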

Proposed Method: Calibrated Probability Estimation (CaPE)

We propose Calibrated Probability Estimation (CaPE). Our starting point is a model obtained via early stopping on the validation cross-entropy loss. CaPE is designed to produce a discriminative model that remains well calibrated. This is achieved by alternately minimizing two loss functions: (1) a discrimination loss that depends on the observed binary outcomes, and (2) a calibration loss, which ensures that the output probabilities remain calibrated. One plausible instantiation of this alternating scheme is sketched below.
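
The sketch below shows one way such an alternating scheme could look in PyTorch, using the empirical outcome frequency within probability bins as the calibration target; the bin count, alternation schedule, and loss choices are illustrative assumptions, not the paper's exact formulation (see the GitHub repository for the official code).

```python
# Illustrative sketch of an alternating discrimination/calibration scheme
# (not the official CaPE implementation; bin count and schedule are assumptions).
import torch
import torch.nn.functional as F

def calibration_targets(probs, labels, n_bins=10):
    """Replace each label by the empirical outcome frequency of its probability bin."""
    bins = torch.clamp((probs * n_bins).long(), max=n_bins - 1)
    targets = probs.clone()
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            targets[mask] = labels[mask].float().mean()
    return targets

def cape_style_epoch(model, loader, optimizer, calibrate, device="cpu"):
    """One epoch minimizing either the discrimination loss or the calibration loss."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.float().to(device)
        logits = model(x).squeeze(-1)
        probs = torch.sigmoid(logits)
        if calibrate:
            # Calibration loss: pull outputs toward bin-wise empirical frequencies.
            target = calibration_targets(probs.detach(), y)
            loss = F.binary_cross_entropy(probs, target)
        else:
            # Discrimination loss: standard cross-entropy with observed outcomes.
            loss = F.binary_cross_entropy_with_logits(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Starting from an early-stopped model, alternate the two losses, e.g.:
# for epoch in range(n_epochs):
#     cape_style_epoch(model, train_loader, optimizer, calibrate=(epoch % 2 == 1))
```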

The following figures show the learning curves of cross-entropy (CE) minimization and CaPE, smoothed with a 5-epoch moving average. After an early-learning stage during which both the training and validation losses decrease, CE minimization overfits (1st and 2nd columns), with disastrous consequences in terms of probability estimation (3rd and 4th columns, which show the mean squared error and Kullback-Leibler divergence with respect to the ground-truth probabilities). In contrast, CaPE prevents overfitting and continues to improve the model while maintaining calibration.

Synthetic dataset - Face-Based Risk Prediction

To benchmark probability-estimation methods, we built a synthetic dataset based on UTKFace (Zhang et al., 2017b), containing face images and associated ages. We use the age of each person to assign them a probability of contracting a disease, and then simulate whether the person actually contracts the illness according to that probability. We use different functions to map age to probability in order to simulate different realistic scenarios. More details are available here.


Examples from the face-based risk prediction dataset (linear scenario: the function used to map age to probability is linear).
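
As a toy illustration of this construction (with a made-up linear mapping, not the exact functions used in the paper), the labels can be simulated as follows:

```python
# Toy simulation of face-based risk prediction labels (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def age_to_probability(age, scenario="linear"):
    """Map age (in years) to a risk probability; only a toy linear scenario is shown."""
    if scenario == "linear":
        return np.clip(age / 100.0, 0.0, 1.0)
    raise ValueError(f"unknown scenario: {scenario}")

ages = rng.integers(1, 100, size=5)      # stand-in for the UTKFace ages
p_true = age_to_probability(ages)        # ground-truth probabilities (known here)
y_obs = rng.binomial(1, p_true)          # observed outcomes used for training
print(list(zip(ages, p_true.round(2), y_obs)))
```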

We use the benchmark dataset to compare our proposed approach with existing methods, showing that it outperforms them across different scenarios.

Evaluation metrics

Probability estimation shares similar target labels and network outputs with binary classification. However, classification accuracy is not an appropriate metric for evaluating probability-estimation models due to the inherent uncertainty of the outcomes.

For our synthetic dataset, we have access to the ground-truth probability labels and can use them to evaluate performance. A reasonable metric in this case is the mean squared error (MSE) between the estimated probabilities and the ground-truth probabilities.

In practice, ground-truth probabilities are not available. In that case, traditional forecasting metrics such as the Brier score, calibration metrics such as ECE, MCE, and KS-error, or classification metrics such as AUC can be used to evaluate the performance of the model. To determine which metric is most appropriate, we use the synthetic dataset to compare the different metrics to the gold standard that uses ground-truth probabilities. The Brier score is found to be highly correlated with the MSE with respect to the ground-truth probabilities, in contrast to the classification metric AUC and the calibration metrics ECE, MCE, and KS-error.
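
For reference, here are minimal implementations of the metrics mentioned above (our own sketches; `p_hat`, `p_true`, and `y` denote predicted probabilities, ground-truth probabilities, and observed binary outcomes):

```python
# Sketches of the evaluation metrics discussed above (illustrative implementations).
import numpy as np

def mse_vs_ground_truth(p_hat, p_true):
    """Mean squared error against ground-truth probabilities (synthetic data only)."""
    return np.mean((p_hat - p_true) ** 2)

def brier_score(p_hat, y):
    """Brier score: mean squared error against the observed binary outcomes."""
    return np.mean((p_hat - y) ** 2)

def expected_calibration_error(p_hat, y, n_bins=10):
    """ECE: bin-weighted gap between empirical frequency and mean predicted probability."""
    bins = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p_hat[mask].mean())
    return ece
```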

Real-world datasets

We evaluate the proposed method on three probability estimation tasks using real-world data.

On all three real-world datasets, CaPE outperforms existing calibration approaches when compared using the Brier score, which was found to capture probability-estimation performance in the absence of ground-truth probabilities.

In addition, the following reliability diagrams show that CaPE produces well-calibrated probabilities for the three real-world datasets.

Video presentation


Slides

Pre-Trained Models and Code

Please visit our GitHub page for data, pre-trained models, code, and instructions on how to use the code.