Entropy in machine learning

nasa universe

It all starts from physics.

The entropy of an isolated system never decreases… Everyone at school, at some point of his life, learned this in his physics class. That’s the second law of thermodynamics, which is the only low of physics that requires a particular direction for time, also called arrow of time.

Despite being an abstract concept, everyone has an intuitive sense of the effects of entropy and the arrow of time. For instance, imagine to watch the video depicting a glass that falls from a table and breaks in many pieces. Played in reverse, it would show all the pieces that reassemble themselves back into a glass from the ground to the top of the table. Intuitively, we identify that only when played forwards the video makes sense and matches our expectations. That is when the entropy of the scene is increasing.
Now imagine to fill a tank with two different types of gas, say gas A and gas B. Because of the second law of thermodynamics, both gases reach an equilibrium after which they are so mixed up that it is difficult if not impossible to distinguish each molecule from each type of gas.

On the contrary, in a low entropy tank, particles of gas A would be gathered in a specific area, being perfectly distinguishable from the particles of gas B.

By the second law of the thermodynamics, the entropy of an isolated system never decreases.

It should be clear now that entropy is in fact a measure of the disorder of a physical system, and it is related to the uncertainty associated with its physical state.

A similar concept of uncertainty and lack of knowledge can be applied to machine learning. In fact, entropy can also be interpreted as a measure of the expected amount of information. Such a concept plays a key role in machine learning, where it is referred to as Shannon entropy.


Shannon entropy

When estimating a model from the data, one has to assume a certain data generating process. The parameters of such a model will be calculated as the values that maximize the agreement between the model and the observed data. This goes under the name of maximum likelihood principle. By doing so, we assume that the quantities we are predicting are in fact  random variables.

Random variables are quantities whose value is uncertain, because they can either have any values within a continuous range (if they are continuous) or take specific values with certain probabilities (in case of discrete variables).
When an unlikely outcome of a random variable is observed, one would be somehow surprised about it. In technical jargon, we associate that observed value with high surprisal, also referred to as self-information [1]. The reason behind this is based on the fact that by observing such a value one gains additional information about the behavior of a random variable.

Here is an example. Consider a heavily biased coin that always lands on heads. Obviously, the outcome of any coin toss is fully predictable, thus one should never be surprised about its outcome. This in turn means that there is zero information from this experiment, hence its self-information is zero. In the case of unbiased coin where the probability of head is 0.5, the result of the coin toss is totally unpredictable. In such scenario, the self-information is at its maximum.

The entropy of a Bernoulli trial (coin toss) as a function of success probability.

What does all this have to do with the concept of entropy?
Shannon entropy is defined as the expected value of the self information I of a random variable. In case of a discrete random variable, this is just the weighted sum of the self-information associated to each possible outcome, where the weights are the corresponding probabilities of occurrence:

H(X) = \sum_x p(x)\cdot I(x) = -\sum_x p(x)\cdot \log p(x)

For a continuous random variable, the sum is usually replaced by an integral and we have:

\inline H(X) = -\int _x p(x) \cdot\log p(x) \cdot dx.


In this case, variables characterized by a broad distribution have larger entropy compared to variables within a smaller range. From a pair of discrete random variables, X and Y, one can define other entropy measures, such as the ones below:

  • Joint entropy. Entropy of two random variables occurring together
    \inline H(X,Y) = - \sum_x \sum_y p(x,y) \cdot \log p(x,y)
  • Conditional entropy. Entropy of a random variable conditional to the knowledge of another one
    \inline H(Y|X) = -\sum_x \sum_y p(x,y) \cdot \log p(y|x)
    Joint and Conditional entropy are in fact related such that \inline H(X,Y) = H(X) + H(Y|X)
  • Relative entropy. Also known as Kullback–Leibler (KL) divergence or information gain, it measures the distance between two probability mass functions \inline p(x) and \inline q(x), and it is defined as
    \inline D(p(x)||q(x)) = \sum_x p(x)\cdot \log \frac{p(x)}{q(x)}
  • Cross-entropy. It is the expected self-information of a random variable characterized by discrete probability distribution \inline q(x), where the expectation is taken with respect to the probability mass function \inline p(x) of another random variable: \inline H(p(x),q(x)) = -\sum_x p(x)\cdot \log q(x).

As before, in case of continuous variables, the summations are replaced by integrals.

The intriguing fact about entropy is that it has several applications in machine learning. Curious about that?
Let’s see some specific use cases!

Machine learning applications

Exploratory data analysis (EDA)

Statistician John Tukey used to say: “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone”. His statement highlights the importance of organizing, plotting, and summarizing a data set before starting any modeling. This initial exploration allows one to gain intuition about interesting patterns, and definitely guides the next modeling phase.
For instance, suppose one is given a data set comprised of many input variables. The first thing one would like to understand is how such features were related to each other.

Within an EDA context, one would compute the Pearson correlation between each pair of variables and apply a relatively simple reasoning:  variables with high positive or negative correlation should be interpreted as being related to each other. One drawback of the Pearson correlation is that it is a linear measure, that is, it measures the linear dependencies between any two variables.
What if a pair of features is not correlated at all? Does it mean that there isn’t any relationship between them? Certainly not! In order to assess the existence of a non-linear dependence, one can rely upon the mutual information.

Mutual information measures the amount of information that one random variable contains about another one. In other words, it is the reduction in the uncertainty of one variable due to the knowledge of the other. Since it does not assume any property of the dependence between variables, such as linearity or continuity, it can detect dependencies that would otherwise be invisible to the Pearson correlation. Technically, mutual information (MI) is defined as the relative entropy between the joint distribution, \inline p(x,y), and the product distribution \inline p(x) \cdot p(y)

The equation below should clarify this point:

\small H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) = \sum_x \sum_y p(x,y) \cdot \log \frac{p(x,y)}{p(x)\cdot p(y)}


If all this sounds too abstract,  let’s see how one can compute the mutual information (MI) matrix in Python.
Like the correlation matrix, the MI matrix contains all pairwise mutual information values between the input variables.

In the code below, we make use of a standard implementation of mutual information provided by the Python library sklearn.  For a more precise calculation of entropy and mutual information values, refer to [4].

In many scenarios, calculating the dependency between variables via their mutual information is more beneficial than the simple correlation. The snippet above provides an implementation of the mutual information from the sklearn library, applied to a public data set for breast cancer detection.

Feature selection

Feature selection has a crucial role in determining the success of a classification or regression algorithm, especially when the data set under investigation is characterized by a high number of variables. Some variables may be highly correlated, some others may just carry noise and very little signal.

Moreover, higher dimensional vectors look more and more similar as their dimensional space is increased. This phenomenon is known as the curse of dimensionality, and it is very detrimental for algorithms based on computing the distance between observations. In this case, feature selection can be very helpful as it allows to detect relevant features and discard the irrelevant ones. This usually leads to several advantages, such as improved generalization of a learning algorithm, data understanding, reduction of memory and CPU requirements.

Suppose one is given a binary classification problem and she would like to perform feature selection. One viable way would be to compute the KL divergence between the distribution of the first class (label) and the distribution of the other class, for every variable in the data set. High informative features will be the ones with high KL divergence. Such variables better distinguish the observations in the different classes.
Let’s see how one can implement such a concept in Python.

The snippet above implements the KL divergence between two variables from the same breast cancer data set, via the scipy library.


Subset extraction

Another use case where entropy can play a role is subset extraction or sampling. Splitting the original data into training and testing is extremely important and can determine the accuracy and reliability of any machine learning model.
As a matter of fact the training set should be a good representation of the overall distribution of the original data set. After all, this is the only data used to train a machine learning model that will generalize to new unseen observations.
An essential requirement is that training and unseen observations follow the same statistical distribution. Most of the times random sampling works just fine. But there are cases in which a hidden bias in the original data set might lead to inaccurate sampling.

One way to sample data in such difficult scenarios would be by considering the entropy. A sample with entropy similar to the one of the original data set is a good conservative strategy that will not surprise any machine learning model during training and inference. In fact,  the aforementioned sample would retain all the characteristics of the entire data set.



Entropy plays a fundamental role during model design too. Here is how:

Some important concepts that can be encountered in all deep learning and machine learning libraries, rely upon the concepts explained in this post. Below a list of the most common ones.


  • Decision tree learning. A decision tree is a (classification or regression) model based on a set of binary decisions involving the various features that are present in the data matrix. It consists of a hierarchy of nodes, where each node represents a question relative to a given feature, e.g. “Is temperature less than 20 °C? If yes, assign the current observation to class 0, otherwise to class 1″. Under this framework, every node splits the set of observations in two subsets. Building a tree consists in deciding which features to split on and assessing its splitting point (20 °C in the example above).
    Entropy can be used to perform the perfect split. At each stage of such a recursive partitioning, the splitting point that minimizes the combined entropy of the resulting subsets would do the trick.
  • Cross-entropy loss. Every supervised learning algorithm is based on a specific optimization problem, where one tries to find the parameters that minimize the error the model makes. In a binary classification task, like the task of distinguishing spam from non-spam e-mails, the error of an hypothetical model could be computed by comparing the predicted probabilities (of a sample to belong to any particular class) to the true class label.
    In the jargon, this strategy goes under the name of average cross-entropy.
    Given N samples, the predicted probability for the i-th sample as \inline \small \hat{y}_i and the observed probability as \inline \small y_i, the cross-entropy cost function would be \inline \small -\frac{1}{N}\sum_{i=1}^N y_i \cdot \log(\hat{y}_i) + (1-y_i)\cdot \log(1 - \hat{y}_i).
    According to this formula, mis-classifications are penalized in proportion to their higher probability.
  • Comparing partitions. The lack of labels that make a machine learning problem unsupervised, forces a data scientist to discover the hidden geometry of the data. Clustering algorithms are one common solution to such problem. With clustering, a dataset is essentially split into subsets, such that the elements of one subset are similar to the other elements of the same subset. In contrast, elements from different subsets are supposed to be dissimilar.
    How sensitive is a clustering algorithm to small perturbations? Will small changes in the data result in dramatic changes in the produced partitions? Moreover, how similar are the solutions provided by two different algorithms?
    How accurate would the optimal solution be?
    All these questions can be answered by measuring the similarity of different clustering results [3]. Computing the mutual information between the generated partitions is one effective approach to use in this context. In fact, groupings with high mutual information will be more similar to each other. Once more, entropy seems to be an effective strategy.




[1] https://medium.com/swlh/shannon-entropy-in-the-context-of-machine-learning-and-ai-24aee2709e32

[2] J. A. K. Suykens, et al., “Least Squares Support Vector Machines”, Singapore, World Scientific, 2002.

[3]  S. Wagner, et al., “Comparing Clusterings – An Overview”, Technical report, Karlsruher Institut fur Technologie, 2007.

[4] R. A. A. Ince, et al., “Python for Information Theoretic Analysis of Neural Data”, Frontiers in Neuroinformatics 3:4, 2009.

Subscribe to our Newsletter