The concept of *entropy* is itself confusing and, so far, has high entropy. When I associate it with the concept of mutual information, its entropy decreases. Alright, I got my chance to confuse the reader, and that was actually fun.

*Entropy* is one of the most confusing concepts that computer scientists have borrowed from physicists studying thermodynamics.
For the next few lines, just ignore what *entropy* should be. Think of it as
something that measures something else.

Now, imagine a bunch of molecules in a glass at time $t_0$, temperature $T_0$ and pressure $P_0$. The entropy of that system would be $H_0$.

As the temperature decreases, the molecules slow down and tend to stabilise at a fixed position. As time goes by, the entropy of the system decreases and the information, i.e. the *“certainty”* about the exact position of each molecule, increases. This can be extended to the extreme case in which the temperature is so low (absolute zero) that all the molecules remain in a position that we can measure exactly. In that case we are not only 100% sure that the measured position is the real one, but also that the system cannot move into a different configuration. The entropy of such a system is at its minimum. No uncertainty. No alternative configurations.


Since we’re not doing physics here, let’s go back to planet Earth and do some information theory.
The concept of *entropy* is somehow linked to the **amount of uncertainty** of a system and to the amount of *information* that is present in a random signal.
The entropy of a source that emits a signal $x$ with probability $p(x)$ is given by $H(X) = p(x)\log\left(\frac{1}{p(x)}\right)$.

If the message can be represented by an alphabet of $M$ symbols, $X = \{x_i\},\ i = 1 \dots M$, and the source emits $x_n$ symbols with $0 < n < \infty$, the entropy of
the source is
$$H(X) = \sum_{i=1}^{M} p_i \log\left(\frac{1}{p_i}\right).$$
Usually, the term $\log\frac{1}{p_i}$ is referred to as $I(X)$

and called **information**.

A quite simple explanation of this is that a very frequent symbol (for which $p_i$ would be high) contains little information; a rare symbol, on the other hand, contains a high amount of information about the overall message. It makes perfect sense to me. Or does it?
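To make this concrete, here is a minimal sketch in Python (standard library only; the function names `information` and `entropy` are my own) of the two formulas above:

```python
import math

def information(p):
    """Information content I = log2(1/p) of a symbol with probability p, in bits."""
    return math.log2(1.0 / p)

def entropy(probs):
    """Shannon entropy H(X) = sum_i p_i * log2(1/p_i), in bits."""
    return sum(p * information(p) for p in probs if p > 0)

# A frequent symbol carries little information, a rare one a lot.
print(information(0.9))   # ≈ 0.152 bits
print(information(0.01))  # ≈ 6.644 bits

# Uniform distribution over 4 symbols: maximum uncertainty, H = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))
# Skewed distribution: lower entropy, less uncertainty.
print(entropy([0.9, 0.05, 0.03, 0.02]))
```

Note the base of the logarithm only fixes the unit (base 2 gives bits, base $e$ gives nats); the qualitative behaviour is the same.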

With this said, let’s jump to the mutual information between two variables, $I(X;Y)$.

This quantity measures the mutual dependence between *X* and *Y*. It is given by

$$I(X;Y) = H(X) - H(X|Y) = \sum_y \sum_x p(x,y) \log\left(\frac{p(x,y)}{p(x)\,p(y)}\right),$$
which basically translates into “how much does knowing $Y$ reduce the uncertainty about $X$?”

In fact, if $X$ and $Y$ are independent, then $p(x,y) = p(x)\,p(y)$ and $I(X;Y) = 0$.
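The double sum above is easy to evaluate directly. Here is a small sketch (the helper name `mutual_information` and the toy joint distributions are my own), which also checks the independence case:

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) ), in bits.

    `joint` maps pairs (x, y) to probabilities p(x, y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():        # marginalise to get p(x) and p(y)
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items() if p > 0
    )

# Independent variables: p(x,y) = p(x)p(y), so I(X;Y) = 0.
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(indep))  # 0.0

# Fully dependent variables (Y = X): I(X;Y) = H(X) = 1 bit here.
dep = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(dep))    # 1.0
```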

This too makes perfect sense to me. There are some properties that make the link between mutual information and entropy even stronger. I will list a few:

1. $H(X) = I(X;X)$ means that the mutual information between $X$ and itself is its entropy. The amount of uncertainty that knowing $X$ removes about $X$ itself is indeed its entropy

2. with $H(X|X) = 0$, one means that the amount of uncertainty about $X$ that remains after $X$ is known is 0

3. More generally, $I(X;X) \geq I(X;Y)$, which means that a variable contains at least as much information about itself as the one provided by any other variable.

4. Finally, $H(X) \geq H(X|Y)$, which means that uncertainty never increases as other variables become known (namely, as the system goes towards a fixed, certain state).

One elegant interpretation of mutual information in statistics is via the Kullback–Leibler divergence

Let’s revisit these concepts in statistics now. One of the most explicative interpretations of mutual information is the one that recalls the Kullback–Leibler divergence between distributions.

It represents mutual information as
$$I(X;Y) = D_{KL}\big(p(x,y)\,\|\,p(x)\,p(y)\big),$$

which I find elegant and amazing at the same time. Let me just add this reconstruction:
$$p(x|y) = \frac{p(x,y)}{p(y)}, \qquad p(x,y) = p(x|y)\,p(y)$$
$$I(X;Y) = \sum_y p(y) \sum_x p(x|y) \log\frac{p(x|y)\,p(y)}{p(x)\,p(y)} = \sum_y p(y)\, D_{KL}\big(p(x|y)\,\|\,p(x)\big) = E_y\big[D_{KL}\big(p(x|y)\,\|\,p(x)\big)\big]$$

This means that the more $p(x|y)$ differs from $p(x)$, the higher the amount of “information gain”.
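The expectation-of-KL form can also be evaluated directly. A sketch (the helper `kl` and the toy joint table are my own) that computes $I(X;Y) = E_y[D_{KL}(p(x|y)\,\|\,p(x))]$:

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits, for two distributions given as aligned lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Joint p(x, y) as a table: one row per value of y, columns indexed by x.
joint = {0: [0.4, 0.1],   # p(x, y=0)
         1: [0.1, 0.4]}   # p(x, y=1)
py = {y: sum(row) for y, row in joint.items()}          # marginal p(y)
px = [sum(joint[y][x] for y in joint) for x in range(2)]  # marginal p(x)

# I(X;Y) = E_y[ D_KL( p(x|y) || p(x) ) ]: average, over y, of how much
# the conditional p(x|y) diverges from the marginal p(x).
mi = sum(
    py[y] * kl([p / py[y] for p in joint[y]], px)
    for y in joint
)
print(mi)  # ≈ 0.278 bits: knowing Y noticeably reduces uncertainty about X
```

Each inner term is zero exactly when $p(x|y) = p(x)$, i.e. when knowing $Y$ tells us nothing about $X$, which matches the independence case above.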