The concept of *entropy* is itself confusing and, so far, has high entropy.
When I associate it with the concept of mutual information, its entropy
decreases. Alright, I got my chance to confuse the reader and that was
actually fun.

*Entropy* is one of the most confusing concepts that has been
borrowed by computer scientists from physicists studying thermodynamics.
Just ignore for the next two lines what *entropy* should be. Think about it as
something that measures something else.

Now, imagine a bunch of molecules in a glass at time $t_0$, temperature $T_0$ and pressure $P_0$. The entropy of that system would be $H_0$.

As the temperature decreases, the molecules slow down and tend to settle into fixed positions. As time goes on, the entropy of the system decreases and the information, in the sense of the *“certainty”* about the exact position of each molecule, increases. This can be extended to an extreme case in which the temperature is so low (absolute zero) that all the molecules remain in a position that we can measure exactly. In that case we are not only 100% sure that the measured position is the real one, but also that the system cannot move into a different configuration. The entropy of such a system is at its minimum. No uncertainty. No alternative configurations.



Since we’re not doing physics here, let’s go back to planet earth and do some information theory.
The concept of *entropy* is somehow linked to the **amount of uncertainty** of a system and to the amount of *information* that is present in a random signal.
A signal $x$ emitted with probability $p(x)$ contributes $p(x)\log\frac{1}{p(x)}$ to the entropy $H(X)$ of the source.

If the message can be represented by an alphabet of $M$ symbols, $X = \{x_i\},\ i = 1 \dots M$, and the source emits $x^n$ symbols with $0 < n < \infty$, the entropy at
the source is $H(X) = \sum_{i=1}^{M} p_i \log\frac{1}{p_i}$.
Usually, the term $\log\frac{1}{p_i}$ is referred to as $I(X)$ and called **information**.
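To make the sum concrete, here is a minimal Python sketch of the entropy formula. The function name `entropy`, the choice of base 2 (so the result is in bits) and the example coins are assumptions made for illustration; the text above does not fix a logarithm base.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H(X) = sum_i p_i * log(1 / p_i).

    Base 2 is an assumption here; it gives the result in bits.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p_i == 0 are skipped: p * log(1/p) tends to 0 as p -> 0.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])    # maximum uncertainty for two symbols
biased_coin = entropy([0.9, 0.1])  # more predictable, so lower entropy
certain = entropy([1.0, 0.0])      # no uncertainty at all

print(fair_coin, biased_coin, certain)
```

The fair coin gives exactly 1 bit, the biased coin roughly 0.469 bits, and the source that always emits the same symbol gives 0, which mirrors the frozen molecules of the physics example.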

A quite simple explanation of this is that a very frequent symbol (for which $p_i$ would be high) contains little information; a rare symbol, on the other hand, contains a high amount of information about the overall message. It makes perfect sense to me. Or does it?
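A quick worked example may help here. The two probabilities and the base-2 logarithm are assumptions picked for illustration. A symbol that shows up 90% of the time carries

$$\log_2\frac{1}{0.9} \approx 0.152 \text{ bits},$$

while a symbol that shows up only 1% of the time carries

$$\log_2\frac{1}{0.01} \approx 6.644 \text{ bits}.$$

Seeing the rare symbol tells you far more about the message than seeing the frequent one.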

With this said, let’s jump to the mutual information between two variables, $I(X;Y)$.

This quantity measures the mutual dependence between $X$ and $Y$. It is given by

$$I(X;Y) = H(X) - H(X|Y) = \sum_y \sum_x p(x,y) \log\frac{p(x,y)}{p(x)\,p(y)},$$

which basically translates into “how much does knowing $Y$ reduce the uncertainty about $X$?”

In fact, if $X$ and $Y$ are independent, then $p(x,y) = p(x)\,p(y)$ and $I(X;Y) = 0$.
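The double sum can be sketched in a few lines of Python. The function name `mutual_information`, the nested-list layout `joint[x][y]`, base 2 and the two toy joint tables are all assumptions made for this example.

```python
import math

def mutual_information(joint, base=2.0):
    """I(X;Y) computed from a joint table where joint[x][y] = p(x, y)."""
    px = [sum(row) for row in joint]        # marginal p(x): sum over y
    py = [sum(col) for col in zip(*joint)]  # marginal p(y): sum over x
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                mi += pxy * math.log(pxy / (px[i] * py[j]), base)
    return mi

# Independent case: every cell equals p(x) * p(y).
independent = [[0.25, 0.25], [0.25, 0.25]]
# Fully dependent case: Y is always equal to X.
copy = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_copy = mutual_information(copy)
print(mi_independent, mi_copy)
```

For the independent table every log term is $\log 1 = 0$, so the result is 0. When $Y$ is a copy of a fair binary $X$, the result is 1 bit, the full entropy of $X$.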

This too makes perfect sense to me. There are some properties that make the link between mutual information and entropy even stronger. I will list a few:

1. $H(X) = I(X;X)$ means that the mutual information between $X$ and itself is its entropy. Knowing $X$ removes all the uncertainty about $X$, and that uncertainty is exactly its entropy.

2. $H(X|X) = 0$ means that the amount of uncertainty about $X$ that remains after $X$ is known is 0.

3. More generally, $I(X;X) \geq I(X;Y)$, which means that a variable contains at least as much information about itself as any other variable can provide.

4. Finally, $H(X) \geq H(X|Y)$, which means that, on average, uncertainty never increases as other variables become known (namely, as the system goes towards a fixed, certain state).
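The four properties can be checked numerically on a small example. The sketch below is only an illustration: the helper names, the base-2 logarithm and the joint table `joint_xy` (a binary $X$ observed through a noisy $Y$) are assumptions, and $H(X|Y)$ is computed through the chain rule $H(X|Y) = H(X,Y) - H(Y)$.

```python
import math

def H(probs):
    """Shannon entropy in bits of a flat list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(X|Y) = H(X,Y) - H(Y) for a table joint[x][y] = p(x, y)."""
    flat = [p for row in joint for p in row]
    py = [sum(col) for col in zip(*joint)]
    return H(flat) - H(py)

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y)."""
    px = [sum(row) for row in joint]
    return H(px) - conditional_entropy(joint)

# Hypothetical joint distribution: Y agrees with X 80% of the time.
joint_xy = [[0.4, 0.1], [0.1, 0.4]]
px = [sum(row) for row in joint_xy]
# Joint distribution of X with itself: all mass on the diagonal.
joint_xx = [[px[0], 0.0], [0.0, px[1]]]

h_x = H(px)
i_xx = mutual_information(joint_xx)
h_x_given_x = conditional_entropy(joint_xx)
i_xy = mutual_information(joint_xy)
h_x_given_y = conditional_entropy(joint_xy)

print(h_x, i_xx, h_x_given_x, i_xy, h_x_given_y)
```

With this table, $H(X) = I(X;X) = 1$ bit and $H(X|X) = 0$, while $I(X;Y) \approx 0.278$ bits and $H(X|Y) \approx 0.722$ bits, so both inequalities hold.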


Let’s revisit these concepts in statistics now. One of the most illuminating interpretations of mutual information is the one based on the Kullback-Leibler divergence between distributions.

It represents mutual information as $I(X;Y) = D_{KL}\big(p(x,y) \,\|\, p(x)\,p(y)\big)$,

which I find elegant and amazing at the same time. Let me just add this reconstruction. Since $p(x|y) = \frac{p(x,y)}{p(y)}$, that is $p(x,y) = p(x|y)\,p(y)$, we have

$$I(X;Y) = \sum_y p(y) \sum_x p(x|y) \log\frac{p(x|y)\,p(y)}{p(x)\,p(y)} = \sum_y p(y)\, D_{KL}\big(p(x|y) \,\|\, p(x)\big) = E_y\big[D_{KL}\big(p(x|y) \,\|\, p(x)\big)\big],$$

which means that the more $p(x|y)$ differs from $p(x)$, the higher the amount of “information gain”.
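As a sanity check, the two Kullback-Leibler forms can be compared numerically. This is a minimal sketch under stated assumptions: the helper `kl`, the base-2 logarithm and the joint table are made up for illustration and are not taken from any particular library.

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits for two distributions over the same outcomes."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint table: joint[x][y] = p(x, y).
joint = [[0.4, 0.1], [0.1, 0.4]]
n_x, n_y = len(joint), len(joint[0])
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

# Form 1: D_KL(p(x,y) || p(x)p(y)), taken over all (x, y) cells.
flat_joint = [joint[i][j] for i in range(n_x) for j in range(n_y)]
flat_product = [px[i] * py[j] for i in range(n_x) for j in range(n_y)]
mi_joint_form = kl(flat_joint, flat_product)

# Form 2: E_y[D_KL(p(x|y) || p(x))], averaging over y with weights p(y).
mi_expected_form = 0.0
for j, p_y in enumerate(py):
    p_x_given_y = [joint[i][j] / p_y for i in range(n_x)]
    mi_expected_form += p_y * kl(p_x_given_y, px)

print(mi_joint_form, mi_expected_form)
```

Both forms give about 0.278 bits for this table. Here observing $Y$ moves the belief about $X$ from $(0.5, 0.5)$ to $(0.8, 0.2)$ or $(0.2, 0.8)$, and that shift is the “information gain”.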