# Know what you predict: estimating uncertainty with neural networks

In most practical use cases, data scientists are satisfied with machine learning models that simply make predictions. Given an unseen observation $x$, a model predicts an outcome $\hat{y}$. The performance of such a model is usually assessed by comparing the predicted value with the ground truth, whenever available. There is, however, another measure that might be of interest: uncertainty. How uncertain is a model when predicting a particular sample $x$?

As a matter of fact, some aspects of a dataset are learned faster and better than others, and the geometry of the training data can vary considerably from one region of the dataset to another. As a consequence, looking only at the predicted value of a model is not only reductive but can also be misleading. For instance, knowing that a model is quite uncertain about a specific input can improve the overall decision of an agent (artificial or human) who relies on that prediction. Statisticians have long been familiar with the problem of uncertainty.

To be more specific, Bayesian statisticians know that their framework is solid enough to reason natively about uncertainty. A Bayesian neural network, for instance, is characterized by weights that are probability distributions rather than constant values (as in the more traditional case). In such a framework the predicted value is also a distribution, from which one can compute statistical properties such as the mean, the standard deviation, and the shape of the density.

Unfortunately, Bayesian neural networks (and Bayesian models in general) are slow to train, much slower than non-Bayesian models, because they rely on sampling, which can be prohibitive for real-world problems with a high number of dimensions.
In this post I describe an alternative method to calculate model uncertainty by using a relatively deep neural network. A similar approach can be generalized to other types of models too.

Very often, especially in classification, the predictive probabilities obtained at the end of the pipeline (the softmax output) are interpreted as model confidence. This is a common misinterpretation that can lead to inconsistencies, because a model can be uncertain in its predictions even with a high softmax output.
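
To see why a high softmax value alone says little about confidence, consider the minimal sketch below. The two-class linear scorer and its weights are hypothetical, chosen only for illustration. The softmax reflects nothing but the relative size of the logits, so moving an input further out along the same direction drives the output towards 1, whether or not the model has ever seen anything like it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights of a two-class linear scorer (illustration only).
WEIGHTS = [[1.0, -0.5], [-1.0, 0.5]]

def class_probabilities(x):
    """Softmax probabilities of the toy scorer for input x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]
    return softmax(logits)

near = class_probabilities([0.2, 0.1])    # close to the decision boundary
far = class_probabilities([20.0, 10.0])   # same direction, 100 times further out

print(f"near: {max(near):.3f}  far: {max(far):.3f}")
```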

The theoretical framework proposed by Yarin Gal et al. leverages dropout as a tool to evaluate how uncertain a model is about a given sample.
Dropout (which is specific to neural networks) consists in randomly disconnecting some connections in the layers of a neural network. It is usually applied as a mechanism to improve generalization and mitigate overfitting.
By randomly switching off some connections, a model tends to unlearn patterns that are specific to the training set because of outliers or noise. Such patterns are usually absent from the test set, so a model that relies on them under-performs and fails to generalize to unseen observations.
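
The mechanism can be sketched in a few lines of plain Python. The helper `dropout` below is a hypothetical illustration, not a library function, and it acts on unit activations. It also uses the common "inverted dropout" convention, which is an assumption not stated in the text: surviving activations are rescaled by 1/(1 - rate) so that their expected value is unchanged.

```python
import random

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)              # fixed seed so the run is repeatable
layer_output = [0.5, -1.2, 0.8, 2.0, -0.3, 1.1]   # made-up activations

first_pass = dropout(layer_output, rate=0.5, rng=rng)
second_pass = dropout(layer_output, rate=0.5, rng=rng)

# Each call draws a fresh mask, so two passes over the same input usually differ.
print(first_pass)
print(second_pass)
```
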
In deep learning, practitioners typically set dropout rates between 20% and 50%, meaning that, on average, up to half of the connections of a given layer are set to zero at each training step. This mechanism forces back-propagation to tune the parameters in the other layers so that the whole network can cope with the dropped connections. The key finding is that a neural network with dropout applied before every weight layer is mathematically equivalent to an approximation of a probabilistic deep Gaussian process.
Training with such dropout minimizes the Kullback-Leibler divergence between an approximate distribution and the posterior of a deep Gaussian process.
A theoretical explanation that leads to this approximation follows. It can be skipped if only a practical implementation is needed.

### What is a Gaussian process and does it provide a measure of uncertainty?

A deep Gaussian process is a statistical tool that allows one to model distributions over functions. For instance, given the covariance function $K(x,y) = \int p(w)\, p(b)\, \sigma(w^\top x + b)\, \sigma(w^\top y + b)\, dw\, db$ with non-linear elements $\sigma(\cdot)$ and distributions $p(w)$ and $p(b)$, where $w$ and $b$ are the weights and the bias vectors respectively, it is shown that a deep Gaussian process with $L$ layers and covariance function $K(x, y)$ can be approximated by a variational distribution over each component of the spectral decomposition of the GPs’ covariance functions.
If $W_i$ is a random matrix for each layer $i$ and each row of $W_i$ is distributed according to the $p(w)$ above, the predictive distribution is $p(y|x, X, Y) = \int p(y|x, w)\, p(w|X, Y)\, dw$, with likelihood $p(y|x, w) = \mathcal{N}(y; \hat{y}(x,w), \tau^{-1} I_D)$, where $\hat{y}(x, w=\{W_1, \dots, W_L\})$ is the network output and $\tau$ is the model precision. Since the posterior distribution $p(w|X,Y)$ is usually intractable, an approximate distribution $q(w)$ is used.
This $q(w)$ is obtained by randomly setting columns of the matrices to zero, according to a Bernoulli distribution.

Hence, $W_i = M_i \cdot \mathrm{diag}([z_{i,j}]_{j=1}^{K_i})$ with $z_{i,j} \sim \mathrm{Bernoulli}(p_i)$ for $i = 1, \dots, L$ and $j = 1, \dots, K_{i-1}$, given some probabilities $p_i$ and matrices $M_i$ as variational parameters.
If $z_{i,j} = 0$ then the unit $j$ in layer $i-1$ is dropped out. Almost there.
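
As a small illustration of this construction, the sketch below draws one realization of $W_i$ from a matrix $M_i$. The function name `sample_weight_matrix` and the example values are hypothetical. $p_i$ is read here as the probability that $z_{i,j} = 1$, that is, the probability that a column is kept.

```python
import random

def sample_weight_matrix(M, p_keep, rng):
    """Return (W, z) with W = M @ diag(z) and each z_j ~ Bernoulli(p_keep)."""
    n_cols = len(M[0])
    z = [1 if rng.random() < p_keep else 0 for _ in range(n_cols)]
    # Right-multiplying by diag(z) keeps column j when z_j = 1 and zeroes it otherwise.
    W = [[m if z[j] else 0.0 for j, m in enumerate(row)] for row in M]
    return W, z

rng = random.Random(42)             # fixed seed so the run is repeatable
M = [[0.2, -0.7, 1.5],
     [0.9, 0.4, -0.1]]              # made-up values: 2 output units, 3 input units

W, z = sample_weight_matrix(M, p_keep=0.5, rng=rng)
print("z =", z)
print("W =", W)
```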

The approximate predictive distribution is given by $q(y|x) = \int p(y|x,w)\, q(w)\, dw$,
where $w = \{W_i\}_{i=1}^{L}$ is the set of random variables for a model with $L$ layers.

By sampling $N$ realizations of the Bernoulli variables $z_1, \dots, z_L$, each of which yields a realization $W_1^n, \dots, W_L^n$ of the weights, it is possible to estimate $E(y^*|x^*) \approx \frac{1}{N} \sum_{n=1}^{N} \hat{y}^*(x^*, W_1^n, \dots, W_L^n)$.
Essentially, the estimate is a Monte Carlo average, which is why the method is usually referred to as MC dropout.
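
The Monte Carlo estimate can be sketched without any deep learning library. Everything below is a toy under stated assumptions: the matrices `M1` and `M2` are hypothetical fixed values standing in for trained parameters, the network has a single `tanh` hidden layer, and masks are applied without rescaling, following $W_i = M_i \cdot \mathrm{diag}(z_i)$.

```python
import math
import random
import statistics

# Hypothetical fixed parameters M_1 (4x3) and M_2 (1x4), for illustration only.
M1 = [[0.5, -0.3, 0.8],
      [-0.6, 0.1, 0.4],
      [0.2, 0.7, -0.5],
      [0.9, -0.2, 0.3]]
M2 = [[0.7, -1.1, 0.4, 0.6]]

def bernoulli_mask(size, p_keep, rng):
    """Draw z_j ~ Bernoulli(p_keep) for j = 1..size."""
    return [1.0 if rng.random() < p_keep else 0.0 for _ in range(size)]

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def stochastic_forward(x, p_keep, rng):
    """One forward pass with fresh masks; M diag(z) v equals M (z * v)."""
    z1 = bernoulli_mask(len(x), p_keep, rng)
    hidden = [math.tanh(h) for h in matvec(M1, [xi * zi for xi, zi in zip(x, z1)])]
    z2 = bernoulli_mask(len(hidden), p_keep, rng)
    return matvec(M2, [hi * zi for hi, zi in zip(hidden, z2)])[0]

def mc_dropout_estimate(x, n_samples=200, p_keep=0.8, seed=0):
    """Average n_samples stochastic passes; also return their spread."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, p_keep, rng) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.stdev(samples)

mean, std = mc_dropout_estimate([1.0, -0.5, 2.0])
print(f"estimated E[y*|x*] = {mean:.3f}, sample std = {std:.3f}")
```

With `p_keep=1.0` every mask is all ones, so all passes coincide and the spread collapses to zero.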

### How to implement uncertainty-aware neural networks

A stacked LSTM neural network whose dropout layers stay active at prediction time can be implemented in Keras.
At inference time, such a model draws a random dropout mask and performs the forward propagation. To obtain stable results, the network should predict on the same input multiple times. The number of passes depends on the dropout rate and typically varies between 50 and 100.

It turns out that the mean of all predictions is a good approximation of the final prediction, while the standard deviation (of all predictions) is a good measure of the uncertainty of the model for that particular input.
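
A minimal sketch of such a model follows, assuming TensorFlow 2.x with its bundled Keras API and NumPy. The layer sizes, dropout rate, input shape and the helper names `build_model` and `predict_with_uncertainty` are illustrative assumptions, not values from this post. The network is left untrained, so the numbers it produces are meaningless. Passing `training=True` when a `Dropout` layer is called keeps that layer stochastic at prediction time.

```python
import numpy as np
from tensorflow import keras

def build_model(timesteps=10, features=3, units=16, rate=0.3):
    """Stacked LSTM regressor whose dropout layers stay active at inference."""
    inputs = keras.Input(shape=(timesteps, features))
    x = keras.layers.LSTM(units, return_sequences=True)(inputs)
    x = keras.layers.Dropout(rate)(x, training=True)   # always sample a mask
    x = keras.layers.LSTM(units)(x)
    x = keras.layers.Dropout(rate)(x, training=True)
    outputs = keras.layers.Dense(1)(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

def predict_with_uncertainty(model, x, n_passes=50):
    """Run n_passes stochastic forward passes; return their mean and std."""
    passes = np.stack([model.predict(x, verbose=0) for _ in range(n_passes)])
    return passes.mean(axis=0), passes.std(axis=0)

model = build_model()
# Random placeholder inputs: 4 sequences of 10 timesteps with 3 features each.
x_new = np.random.default_rng(0).normal(size=(4, 10, 3)).astype("float32")
mean, std = predict_with_uncertainty(model, x_new)
print(mean.shape, std.shape)   # one mean and one std per input sequence
```

In practice the model would be fitted with `model.fit` before calling `predict_with_uncertainty`. The per-sample `std` is then the uncertainty estimate described above.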

This approach is very similar to ensemble methods, which combine independent predictions for the same input. In fact, it is an ensemble model with shared parameters.
One challenge of this method is choosing the dropout rate empirically, since it depends on the geometry of the training data. A low dropout rate might lead to sampled models that differ very little in their configurations, giving very similar results and low variance. Such a result might be misinterpreted as high confidence in the prediction (small variance, hence small uncertainty). In contrast, a dropout rate that is too high might lead to inaccurate predictions, because too few connections/neurons are available during back-propagation.
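
The low-rate end of this trade-off can be illustrated with a toy experiment. The weights below are hypothetical fixed values standing in for a trained network, and the rescaling of surviving units (inverted dropout) is an assumption. Since nothing is trained here, the sketch says nothing about the loss of accuracy at high rates. It only shows how the spread of the sampled predictions shrinks as the rate approaches zero.

```python
import math
import random
import statistics

# Hypothetical fixed weights standing in for a trained network (illustration only).
HIDDEN_WEIGHTS = [[0.5, -0.3], [-0.6, 0.1], [0.2, 0.7], [0.9, -0.2], [-0.4, 0.8]]
OUTPUT_WEIGHTS = [0.7, -1.1, 0.4, 0.6, -0.9]

def stochastic_prediction(x, rate, rng):
    """One forward pass with a fresh dropout mask on the hidden units."""
    keep = 1.0 - rate
    total = 0.0
    for w_in, w_out in zip(HIDDEN_WEIGHTS, OUTPUT_WEIGHTS):
        if rng.random() < keep:                       # this unit survives
            h = math.tanh(sum(w * xi for w, xi in zip(w_in, x)))
            total += w_out * h / keep                 # inverted-dropout rescaling
    return total

def predictive_spread(x, rate, n_passes=500, seed=0):
    """Standard deviation of n_passes stochastic predictions for input x."""
    rng = random.Random(seed)
    samples = [stochastic_prediction(x, rate, rng) for _ in range(n_passes)]
    return statistics.pstdev(samples)

x = [1.0, -0.5]
for rate in (0.0, 0.05, 0.2, 0.5):
    print(f"dropout rate {rate:.2f} -> predictive std {predictive_spread(x, rate):.3f}")
```

A rate of zero gives a spread of exactly zero, which would look like perfect confidence although it only means that every sampled model is identical.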

As with many other model settings, the dropout rate should be chosen by trial and error, possibly on known datasets first.