# Entropy in machine learning

### It all starts from physics

The entropy of an isolated system never decreases…

Everyone at school, at some point of his life, learned this in his physics class. That’s the second law of thermodynamics, which is the only low of physics that requires a particular direction for time, also called arrow of time.
Despite being an abstract concept, everyone has an intuitive sense of the effects of entropy and the arrow of time. For instance, imagine to watch the video depicting a glass that falls from a table and breaks in many pieces. Played in reverse, it would show all the pieces that reassemble themselves back into a glass from the ground to the top of the table. Intuitively, we identify that only when played forwards the video makes sense and matches our expectations. That is when the entropy of the scene is increasing.

### A definition of entropy

Now imagine to fill a tank with two different types of gas, say gas A and gas B. Because of the second law of thermodynamics, both gases reach an equilibrium after which they are so mixed up that it is difficult if not impossible to distinguish each molecule from each type of gas.

On the contrary, in a low entropy tank, particles of gas A would gather in a specific area, being perfectly distinguishable from the particles of gas B.

It should be clear now that entropy is in fact a measure of the disorder of a physical system, and it is related to the uncertainty associated with its physical state. This is the concept behind Boltzmann’s formula for entropy: $S=k_B&space;\cdot&space;ln(W)$
The entropy S grows as the number of possible states in the system W grows.  In plain English, S grows as disorder grows.

A similar concept of uncertainty and lack of knowledge also exists to machine learning. In fact, entropy is also  a measure of the expected amount of information. Such a concept plays a key role in machine learning, where it is referred to as Shannon entropy. Before describing entropy in machine learning let’s introduce this other important concept.

## Shannon entropy

When estimating a model from the data, one has to assume a certain data generating process. The parameters of such a model are the values that maximize the agreement between the model and the observed data. This goes under the name of maximum likelihood principle. By doing so, we assume that the quantities we are predicting are in fact  random variables.

Random variables are quantities whose value is uncertain, because they can either have any values within a continuous range (if they are continuous) or take specific values with certain probabilities (in case of discrete variables).
When you observe an unlikely outcome of a random variable, you would be somehow surprised about it. In technical jargon, we associate that observed value with high surprisal, also referred to as self-information [1]. The reason behind this is simple: by observing such a value one gains additional information about the behavior of a random variable.

### A practical example

In order to understand the importance of entropy in machine learning, consider a heavily biased coin that always lands on heads. Obviously, the outcome of any coin toss is fully predictable, thus its outcome should never surprise you. This in turn means that there is zero information from this experiment, hence its self-information is zero. In the case of unbiased coin where the probability of head is 0.5, the result of the coin toss is totally unpredictable. In such scenario, the self-information is at its maximum.

### Some useful definitions

What does all this have to do with the concept of entropy?
Shannon entropy is the expected value of the self information I of a random variable. In case of a discrete random variable, this is just the weighted sum of the self-information associated to each possible outcome, where the weights are the corresponding probabilities of occurrence (We shall use H to denote Shannon entropy).

$H(X)&space;=&space;\sum_x&space;p(x)\cdot&space;I(x)&space;=&space;-\sum_x&space;p(x)\cdot&space;\log&space;p(x)$

For a continuous random variable, the sum is usually replaced by an integral and we have:

$\inline&space;H(X)&space;=&space;-\int&space;_x&space;p(x)&space;\cdot\log&space;p(x)&space;\cdot&space;dx$.

In this case, variables characterized by a broad distribution have larger entropy compared to variables within a smaller range. From a pair of discrete random variables, X and Y, one can define other entropy measures, such as the ones below:

• Joint entropy. Entropy of two random variables occurring together
$\inline&space;H(X,Y)&space;=&space;-&space;\sum_x&space;\sum_y&space;p(x,y)&space;\cdot&space;\log&space;p(x,y)$
• Conditional entropy. Entropy of a random variable conditional to the knowledge of another one
$\inline&space;H(Y|X)&space;=&space;-\sum_x&space;\sum_y&space;p(x,y)&space;\cdot&space;\log&space;p(y|x)$
Joint and Conditional entropy are in fact related such that $\inline&space;H(X,Y)&space;=&space;H(X)&space;+&space;H(Y|X)$
• Relative entropy. Also known as Kullback–Leibler (KL) divergence or information gain, it measures the distance between two probability mass functions $\inline&space;p(x)$ and $\inline&space;q(x)$, and its definition is
$\inline&space;D(p(x)||q(x))&space;=&space;\sum_x&space;p(x)\cdot&space;\log&space;\frac{p(x)}{q(x)}$
• Cross-entropy. It is the expected self-information of a random variable characterized by discrete probability distribution $\inline&space;q(x)$, where the expectation is taken with respect to the probability mass function $\inline&space;p(x)$ of another random variable: $\inline&space;H(p(x),q(x))&space;=&space;-\sum_x&space;p(x)\cdot&space;\log&space;q(x)$.

As before, in case of continuous variables, integrals replace summations. The intriguing fact about entropy is that it has several applications in machine learning. Curious about that?
Let’s see some use cases!

## Machine learning applications

If you made it until here, you are ready to grasp the real power of entropy in machine learning. Below are some interesting applications that take advantage of such a powerful tool.

### Exploratory data analysis (EDA)

Statistician John Tukey used to say: “Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone”. His statement highlights the importance of organizing, plotting, and summarizing a data set before starting any modeling. This initial exploration allows one to gain intuition about interesting patterns, and definitely guides the next modeling phase.
For instance, suppose you have a data set comprised of many input variables. The first thing you would like to understand is how such features relate to each other.

Within an EDA context, you would compute the Pearson correlation between each pair of variables and apply a relatively simple reasoning:  variables with high positive or negative correlation should be interpreted as being related to each other. One drawback of the Pearson correlation is that it is a linear measure, that is, it measures the linear dependencies between any two variables.
What if a pair of features has no relations at all? Does it mean that there isn’t any relationship between them? Certainly not! In order to assess the existence of a non-linear dependence, you can rely upon the mutual information.

Mutual information measures the amount of information that one random variable contains about another one. In other words, it is the reduction in the uncertainty of one variable due to the knowledge of the other. Since it does not assume any property of the dependence between variables, such as linearity or continuity, it can detect dependencies that would otherwise be invisible to the Pearson correlation. Technically, mutual information (MI) is the relative entropy between the joint distribution, $\inline&space;p(x,y)$, and the product distribution $\inline&space;p(x)&space;\cdot&space;p(y)$

The equation below clarifies this point:

$\small&space;H(X)&space;-&space;H(X|Y)&space;=&space;H(X)&space;+&space;H(Y)&space;-&space;H(X,Y)&space;=&space;\sum_x&space;\sum_y&space;p(x,y)&space;\cdot&space;\log&space;\frac{p(x,y)}{p(x)\cdot&space;p(y)}$

### Computing the mutual information (MI) matrix

Let’s see how you can compute the mutual information (MI) matrix in Python.
Like the correlation matrix, the MI matrix contains all pairwise mutual information values between the input variables.

In the code below, we make use of a standard implementation of mutual information provided by the Python library sklearn.  For a more precise calculation of entropy and mutual information values, refer to [4].

In many scenarios, calculating the dependency between variables via their mutual information is more beneficial than the simple correlation. The snippet above provides an implementation of the mutual information from the sklearn library, applied to a public data set for breast cancer detection.

### Feature selection

Feature selection has a crucial role in determining the success of a classification or regression algorithm, especially when the data set under investigation has a high number of variables. Some variables may be highly correlated, some others may just carry noise and very little signal.

Moreover, higher dimensional vectors look more and more similar as their dimensional space increases. This phenomenon, known as the curse of dimensionality, is very detrimental for algorithms based on computing the distance between observations. In this case, feature selection can be very helpful as it allows to detect relevant features and discard the irrelevant ones. This usually leads to several advantages, such as improved generalization of a learning algorithm, data understanding, reduction of memory and CPU requirements.

Suppose you have binary classification problem and you’d like to perform feature selection. One viable way would be to compute the KL divergence between the distribution of the first class (label) and the distribution of the other class, for every variable in the data set. High informative features will be the ones with high KL divergence. Such variables better distinguish the observations in the different classes.
Let’s see how you can implement such a concept in Python.

The snippet above implements the KL divergence between two variables from the same breast cancer data set, via the scipy library.

### Subset extraction

Another use case where entropy can play a role is subset extraction or sampling. Splitting the original data into training and testing is extremely important and can determine the accuracy and reliability of any machine learning model.
As a matter of fact the training set should be a good representation of the overall distribution of the original data set. After all, this is the only data used to train a machine learning model that will generalize to new unseen observations.
An essential requirement is that training and unseen observations follow the same statistical distribution. Most of the times random sampling works just fine. But there are cases in which a hidden bias in the original data set might lead to inaccurate sampling.

One way to sample data in such difficult scenarios would be by considering the entropy. A sample with entropy similar to the one of the original data set is a good conservative strategy that will not surprise any machine learning model during training and inference. In fact,  the aforementioned sample would retain all the characteristics of the entire data set.

### Modeling

Entropy plays a fundamental role during model design too. Here is how.
Some important concepts in  machine learning libraries rely upon the concepts explained in this post. Below  is a discussion of the most common ones.

#### Decision tree learning.

A decision tree is a (classification or regression) model based on a set of binary decisions involving the various features that are present in the data matrix. It consists of a hierarchy of nodes, where each node represents a question relative to a given feature, e.g. “Is temperature less than 20 °C? If yes, assign the current observation to class 0, otherwise to class 1″. Under this framework, every node splits the set of observations in two subsets. Building a tree consists in deciding which features to split on and assessing its splitting point (20 °C in the example above).
Entropy can be used to perform the perfect split. At each stage of such a recursive partitioning, the splitting point that minimizes the combined entropy of the resulting subsets would do the trick.

#### Cross-entropy loss.

Every supervised learning algorithm relies on a specific optimization strategy, where one tries to find the parameters that minimize the error the model makes. In a binary classification task, like the task of distinguishing spam from non-spam e-mails,  you could compute the error of an hypothetical model by comparing the predicted probabilities (of a sample to belong to any particular class) to the true class label.
In the jargon, this strategy goes under the name of average cross-entropy.
Given N samples, the predicted probability for the i-th sample as $\inline&space;\small&space;\hat{y}_i$ and the observed probability as $\inline&space;\small&space;y_i$, the cross-entropy cost function would be $\inline&space;\small&space;-\frac{1}{N}\sum_{i=1}^N&space;y_i&space;\cdot&space;\log(\hat{y}_i)&space;+&space;(1-y_i)\cdot&space;\log(1&space;-&space;\hat{y}_i)$.
According to this formula, mis-classifications are penalized in proportion to their higher probability.

#### Comparing partitions.

The lack of labels that make a machine learning problem unsupervised, forces a data scientist to discover the hidden geometry of the data. Clustering algorithms are one common solution to such problem. With clustering, a dataset is essentially split into subsets, such that the elements of one subset are similar to the other elements of the same subset. In contrast, elements from different subsets are supposed to be dissimilar.

### Sensitivity Analysis

How sensitive is a clustering algorithm to small perturbations? Will small changes in the data result in dramatic changes in the produced partitions? Moreover, how similar are the solutions provided by two different algorithms?
How accurate would the optimal solution be?
Measuring the similarity of different clustering results provides an answer to all these questions [3].
Computing the mutual information between the generated partitions is one effective approach to use in this context. In fact, groupings with high mutual information will be more similar to each other. Once more, entropy seems to be an effective strategy.

## References

[1] https://medium.com/swlh/shannon-entropy-in-the-context-of-machine-learning-and-ai-24aee2709e32

[2] J. A. K. Suykens, et al., “Least Squares Support Vector Machines”, Singapore, World Scientific, 2002.

[3]  S. Wagner, et al., “Comparing Clusterings – An Overview”, Technical report, Karlsruher Institut fur Technologie, 2007.

[4] R. A. A. Ince, et al., “Python for Information Theoretic Analysis of Neural Data”, Frontiers in Neuroinformatics 3:4, 2009.