A Survey of Uncertainty in Neural Networks (Part V): Uncertainty measures and quality

4. Uncertainty measures and quality

This part covers how uncertainty is measured numerically and how the quality of those measurements is assessed.

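The quality of an uncertainty estimate depends on the method used to produce it. For example, different approximations of Bayesian inference (e.g. Gaussian and Laplace approximations) yield different uncertainty estimates.
There is no gold standard (ground truth) for uncertainty. For example, if we define uncertainty as the uncertainty across human subjects, we still have to answer questions such as “How many subjects do we need?” or “How do we choose the subjects?”
There is no unified quantitative evaluation metric. Uncertainty is defined differently in different machine learning tasks (Huang et al. 2019b). In regression tasks, prediction intervals or the standard deviation are used to represent uncertainty, whereas in classification and segmentation tasks the entropy can serve as the measure that captures uncertainty.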

4.1 Evaluating uncertainty in classification tasks

For classification tasks, the network's softmax output already represents a measure of confidence. The raw softmax output, however, is not very reliable (Hendrycks and Gimpel 2017) and cannot represent all sources of uncertainty (Smith and Gal 2018), so further approaches and corresponding measures were developed.

4.1.1 Measuring data uncertainty in classification tasks

In order to evaluate the amount of predicted data uncertainty, one can for example apply the maximal class probability or the entropy measures:

$\text{Maximal probability: }p_{\max}=\max\{p_k\}_{k=1}^K$

$\text{Entropy: }\mathrm{H}(p)=-\sum_{k=1}^Kp_k\log_2(p_k)$

The maximal probability is a direct representation of certainty, while the entropy describes the average level of information in a random variable. Even though a softmax output should represent the data uncertainty, a single prediction does not reveal how much model uncertainty also affects that specific prediction.
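The two measures above can be sketched in a few lines. This is only an illustration on hand-written softmax vectors; the function names and the example vectors are assumptions, not part of the survey.

```python
import math

def max_probability(p):
    """Largest class probability of a softmax vector p."""
    return max(p)

def entropy(p):
    """Shannon entropy in bits; classes with p_k = 0 contribute nothing."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

# Hypothetical softmax outputs for a 4-class problem.
confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

print(max_probability(confident), entropy(confident))
print(max_probability(uniform), entropy(uniform))
```

The confident vector has a high maximal probability and low entropy, while the uniform vector has the lowest possible maximal probability and the highest possible entropy for four classes.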

4.1.2 Measuring model uncertainty in classification tasks

As already discussed in Sect. 3, a single softmax prediction is not a reliable basis for uncertainty quantification: it is often badly calibrated and carries no information about how certain the model itself is about this specific output (Smith and Gal 2018).

An (approximated) posterior distribution $p(\theta|D)$ on the learned model parameters can help to obtain better uncertainty estimates. With this posterior, the softmax output itself becomes a random variable, and its variation, i.e. the uncertainty, can be evaluated.

For simplicity, we also write $p$ for $p(y|x,\theta)$; it will be clear from context whether $p$ depends on $\theta$ or not. The most common measures of the variation of the softmax output are the mutual information (MI), the expected Kullback–Leibler divergence (EKL), and the predictive variance. Basically, all of these measures compute the expected divergence between the (stochastic) softmax output and the expected softmax output

$\hat{p}=\mathbb{E}_{\theta\sim p(\theta|D)}[p(y|x,\theta)]$

Mutual Information

The MI uses entropy to measure the mutual dependence between two variables. In the described case, it compares the information (entropy) of the expected softmax output with the expected information of the individual softmax outputs, i.e.

$\mathrm{MI}(\theta,y|x,D)=\mathrm{H}[\hat{p}]-\mathbb{E}_{\theta\sim p(\theta|D)}\mathrm{H}[p(y|x,\theta)]$

Smith and Gal (2018) pointed out that the MI is minimal when the knowledge about model parameters does not increase the information in the final prediction. Therefore, the MI can be interpreted as a measure of model uncertainty.
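As a rough sketch, the MI can be estimated from $M$ sampled softmax vectors by replacing both expectations with sample averages. The helper names and the toy samples are assumptions made for illustration; entropies are taken in bits.

```python
import math

def entropy(p):
    """Shannon entropy in bits; classes with p_k = 0 contribute nothing."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def mean_softmax(samples):
    """Class-wise average of M softmax vectors (the estimate of p_hat)."""
    m = len(samples)
    return [sum(col) / m for col in zip(*samples)]

def mutual_information(samples):
    """H[p_hat] minus the average entropy of the individual softmax vectors."""
    expected_entropy = sum(entropy(p) for p in samples) / len(samples)
    return entropy(mean_softmax(samples)) - expected_entropy

# Hypothetical samples: the models agree on an ambiguous input ...
agree = [[0.5, 0.5], [0.5, 0.5]]
# ... versus models that are each confident but contradict each other.
disagree = [[1.0, 0.0], [0.0, 1.0]]

print(mutual_information(agree))
print(mutual_information(disagree))
```

In the first case the entropy of the mean prediction is high but the MI is zero, so the uncertainty is attributed to the data. In the second case every single prediction is certain and the MI is large, which points to model uncertainty.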

The Expected KL

The Kullback–Leibler divergence measures the divergence between two given probability distributions. The EKL can be used to measure the (expected) divergence among the possible softmax outputs,

$\mathbb{E}_{\theta\sim p(\theta|D)}[KL(\hat{p}\parallel p)]=\mathbb{E}_{\theta\sim p(\theta|D)}\left[\sum_{i=1}^K\hat{p}_i\log\left(\frac{\hat{p}_i}{p_i}\right)\right]$

which can also be interpreted as a measure of uncertainty on the model’s output and therefore represents the model uncertainty.
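A minimal sketch of a sample-based EKL estimate follows, assuming the posterior expectation is replaced by an average over $M$ sampled softmax vectors. The names, the natural logarithm, and the small `eps` guard against division by zero are choices of this sketch.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps keeps q away from zero (an assumption of this sketch)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def expected_kl(samples):
    """Average of KL(p_hat || p^i) over the M sampled softmax vectors p^i."""
    m = len(samples)
    p_hat = [sum(col) / m for col in zip(*samples)]
    return sum(kl_divergence(p_hat, p) for p in samples) / m

# Hypothetical softmax samples for a 2-class problem.
agree = [[0.5, 0.5], [0.5, 0.5]]
disagree = [[0.9, 0.1], [0.1, 0.9]]

print(expected_kl(agree))
print(expected_kl(disagree))
```

When all samples coincide with their mean the EKL is zero; the more the sampled outputs scatter around the mean, the larger it becomes.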

The predictive variance

The predictive variance evaluates the variance on the (random) softmax outputs, i.e.

$\sigma^2(p)=\mathbb{E}_{\theta\sim p(\theta|D)}[(p-\hat{p})^2]$

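How can $\hat{p}$ be estimated? The expectation is usually approximated by a Monte Carlo average over $M$ parameter samples $\theta_i\sim p(\theta|D)$, with $p^i=p(y|x,\theta_i)$:

$\hat{p}\approx\frac1M\sum_{i=1}^Mp^i$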
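The Monte Carlo estimate of $\hat{p}$ and the predictive variance can be sketched as below. The function `sample_softmax` is a hypothetical stand-in for one stochastic forward pass (one draw of $\theta$); it simply perturbs fixed logits with Gaussian noise, which is not how a real posterior sample would be produced.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_softmax(rng, noise_std):
    """Hypothetical stochastic forward pass: fixed logits plus Gaussian noise."""
    base_logits = [2.0, 0.5, -1.0]
    return softmax([z + rng.gauss(0.0, noise_std) for z in base_logits])

def mc_estimate(samples):
    """Return (p_hat, per-class predictive variance) from M softmax vectors."""
    m = len(samples)
    columns = list(zip(*samples))
    p_hat = [sum(col) / m for col in columns]
    var = [sum((pk - mean) ** 2 for pk in col) / m
           for col, mean in zip(columns, p_hat)]
    return p_hat, var

rng = random.Random(0)
samples = [sample_softmax(rng, noise_std=1.0) for _ in range(200)]
p_hat, var = mc_estimate(samples)
print(p_hat)
print(var)
```

The variance is reported per class here; summing or averaging it over classes gives a single score if one is needed.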

4.1.3 Measuring distributional uncertainty in classification tasks

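Although the uncertainty measures above are widely used to capture the variability among the multiple predictions obtained from BNNs, ensemble methods, or test-time augmentation, they cannot capture distributional shifts in the input data or out-of-distribution (OOD) samples.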

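Consider a scenario in which all predictors assign high probability mass to the same wrong class label. The predictions then agree with each other, the network appears certain about its prediction, and the measured uncertainty becomes very low. For OOD samples, possible remedies are to use EDL or to look directly at the network's output logits: if the network assigns low mass (or low logits) to every class for a given sample, that sample is likely to be OOD.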
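The logit-based idea can be illustrated with a small sketch. Softmax is invariant to shifting all logits by a constant, so two inputs can produce identical class probabilities even though one of them has low logits for every class. The maximum-logit score and the threshold below are assumptions of this sketch; in practice a threshold would have to be chosen on held-out data.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def looks_ood(logits, threshold):
    """Flag a sample as possibly OOD when even its largest logit is low."""
    return max(logits) < threshold

# Hypothetical logits: same softmax output, very different evidence.
in_dist_logits = [8.0, 1.0, 0.0]
low_logits = [-5.0, -12.0, -13.0]

print(softmax(in_dist_logits))
print(softmax(low_logits))
print(looks_ood(in_dist_logits, threshold=0.0), looks_ood(low_logits, threshold=0.0))
```

Both vectors yield the same confident softmax output, so softmax-based measures cannot separate them, while the raw logits can.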

4.2 Evaluating uncertainty in regression tasks

4.2.1 Measuring data uncertainty in regression predictions

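In classification tasks the network outputs a probability distribution over all possible classes. In regression, by contrast, the network only makes a pointwise estimate and provides no information about data uncertainty. As described in Section 3, a common remedy is to let the network predict the parameters of a probability distribution, for example the mean vector $\mu$ and standard deviation $\sigma$ of a normal distribution, which can then be used directly to represent the data uncertainty.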

The prediction of the standard deviation allows an analytical statement that the (unknown) true value lies within a specific region. Under the assumption that the predicted distribution is a correct normal distribution, the central interval that covers the true value with probability $\alpha$ is given by

$\left[\widehat{y}-\Phi^{-1}\left(\frac{1+\alpha}{2}\right)\cdot\sigma;\ \widehat{y}+\Phi^{-1}\left(\frac{1+\alpha}{2}\right)\cdot\sigma\right]$

where $\Phi^{-1}$ is the quantile function of the standard normal distribution, i.e. the inverse of its cumulative distribution function. For $\alpha=0.95$, $\Phi^{-1}(0.975)\approx1.96$, which gives the familiar $\widehat{y}\pm1.96\,\sigma$.
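As a quick check of such an interval, the sketch below computes the central interval of a normal predictive distribution with coverage $\alpha$, using the half-width $\Phi^{-1}\left(\frac{1+\alpha}{2}\right)\cdot\sigma$, and verifies the coverage numerically. It relies on `statistics.NormalDist` from the Python standard library; the function name and the example values are assumptions.

```python
from statistics import NormalDist

def central_interval(y_hat, sigma, alpha):
    """Interval centred at y_hat that holds probability alpha under N(y_hat, sigma^2)."""
    z = NormalDist().inv_cdf((1.0 + alpha) / 2.0)  # standard normal quantile
    return y_hat - z * sigma, y_hat + z * sigma

# Hypothetical network output: predicted mean 10 and standard deviation 2.
lo, hi = central_interval(y_hat=10.0, sigma=2.0, alpha=0.95)
predictive = NormalDist(mu=10.0, sigma=2.0)
coverage = predictive.cdf(hi) - predictive.cdf(lo)

print(lo, hi)
print(coverage)
```

The computed coverage should match the requested $\alpha$ up to numerical precision, which only holds if the predicted normal distribution is itself correct.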

In addition, some works propose to predict a so-called prediction interval (PI) directly,

$PI(x)=[B_l,B_u]$

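This interval gives the range in which the prediction is assumed to lie (inducing a uniform distribution over it rather than a concrete point prediction), and the certainty of such approaches can be measured directly by the width of the interval. Two metrics are introduced here: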

Mean Prediction Interval Width (MPIW)
Prediction Interval Coverage Probability (PICP)

The PICP is the fraction of test samples whose ground truth value falls into the predicted interval and is defined as

$\mathrm{PICP}=\frac cn$

where $n$ is the total number of predictions and $c$ is the number of ground truth values that are actually captured by the predicted intervals.
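Both interval metrics are easy to compute from predicted bounds. In the sketch below, PICP follows the definition $c/n$ above, while MPIW is taken literally from its name as the mean of $B_u-B_l$, since the text does not spell out a formula. The function names and the data are made up for illustration.

```python
def mpiw(lower, upper):
    """Mean Prediction Interval Width: average of (B_u - B_l) over all predictions."""
    return sum(u - l for l, u in zip(lower, upper)) / len(lower)

def picp(lower, upper, y_true):
    """Prediction Interval Coverage Probability: c / n."""
    captured = sum(1 for l, u, y in zip(lower, upper, y_true) if l <= y <= u)
    return captured / len(y_true)

# Hypothetical predicted bounds and ground truth values for four test samples.
lower = [0.0, 1.0, 2.0, 3.0]
upper = [1.0, 3.0, 2.5, 5.0]
y_true = [0.5, 3.5, 2.2, 4.0]

print(mpiw(lower, upper))
print(picp(lower, upper, y_true))
```

The two metrics should be read together: very wide intervals trivially reach a high PICP, so a good model combines high coverage with a small MPIW.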

4.2.2 Measuring model uncertainty in regression predictions

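Model uncertainty in regression does not differ from model uncertainty in classification and can be measured in a similar way. In most cases this is done by approximating an average prediction and measuring the divergence among the single predictions.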
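A minimal sketch of this idea for a single test input: given several point predictions (for example from ensemble members or stochastic forward passes), take their mean as the average prediction and their variance as the measure of divergence. Using the variance is one possible choice, and the numbers are invented.

```python
from statistics import fmean, pvariance

def ensemble_summary(predictions):
    """Return (average prediction, variance of the single predictions around it)."""
    mean = fmean(predictions)
    return mean, pvariance(predictions, mu=mean)

# Hypothetical predictions of four models for the same input.
predictions = [2.0, 2.5, 3.0, 2.5]
mean, variance = ensemble_summary(predictions)

print(mean, variance)
```

A variance close to zero means the models agree on this input; a large variance signals high model uncertainty.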

4.3 Evaluating uncertainty in segmentation tasks

Uncertainty evaluation in segmentation tasks is very similar to the classification case, for example using approximations of Bayesian inference (Nair et al. 2020; Roy et al. 2019; LaBonte et al. 2019; Eaton-Rosen et al. 2018; McClure et al. 2019; Soleimany et al. 2019; Soberanis-Mukul et al. 2020; Seebock et al. 2020) or test-time augmentation (Wang et al. 2019).

In the context of segmentation, the uncertainty in pixel-wise segmentation is measured using confidence intervals (LaBonte et al. 2019; Eaton-Rosen et al. 2018), the predictive variance (Soleimany et al. 2019; Seebock et al. 2020), the predictive entropy (Roy et al. 2019; Wang et al. 2019; McClure et al. 2019; Soberanis-Mukul et al. 2020) or the mutual information (Nair et al. 2020).

The uncertainty in structure (volume) estimation is obtained by averaging over all pixel-wise uncertainty estimates (Seebock et al. 2020; McClure et al. 2019). The quality of volume uncertainties is assessed by evaluating the coefficient of variation, the average Dice score, or the intersection over union (Roy et al. 2019; Wang et al. 2019).
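The sketch below illustrates these quantities for a binary segmentation with $M$ sampled foreground probability maps: a pixel-wise predictive entropy, its average as a structure-level uncertainty, and the coefficient of variation of the segmented volume across samples. Counting pixels above a 0.5 threshold as volume, taking the entropy of the mean map, and the tiny 2×2 maps are all assumptions of this sketch, not the procedure of any cited paper.

```python
import math
from statistics import fmean, pstdev

def binary_entropy(p):
    """Entropy in bits of a foreground probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def pixelwise_entropy(prob_maps):
    """Entropy of the mean foreground probability at every pixel."""
    m = len(prob_maps)
    rows, cols = len(prob_maps[0]), len(prob_maps[0][0])
    return [[binary_entropy(sum(pm[r][c] for pm in prob_maps) / m)
             for c in range(cols)] for r in range(rows)]

def structure_uncertainty(entropy_map):
    """Average of all pixel-wise uncertainty estimates."""
    return fmean([v for row in entropy_map for v in row])

def volume_cv(prob_maps, threshold=0.5):
    """Coefficient of variation of the foreground pixel count across samples.

    Assumes the mean volume is non-zero.
    """
    volumes = [sum(1 for row in pm for p in row if p > threshold) for pm in prob_maps]
    return pstdev(volumes) / fmean(volumes)

# Three hypothetical sampled probability maps for a 2x2 image.
prob_maps = [
    [[0.9, 0.1], [0.8, 0.2]],
    [[0.9, 0.1], [0.4, 0.2]],
    [[0.9, 0.1], [0.6, 0.2]],
]
entropy_map = pixelwise_entropy(prob_maps)

print(entropy_map)
print(structure_uncertainty(entropy_map))
print(volume_cv(prob_maps))
```

The pixel whose samples disagree (bottom left) receives the highest entropy, and the same disagreement shows up as a non-zero coefficient of variation of the volume.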

The papers cited in the two paragraphs above are worth reading.