“If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning”.
This famous quote from Facebook AI Chief Yann LeCun highlights how humans rely on unsupervised learning. Effective community detection with Markov Clustering. They use such an approach to make sense of the world around them. As a matter of fact, children learn to recognize objects and speak without much supervision. In contrast, sophisticated deep learning algorithms need millions of carefully annotated data. This is necessary to reach human-like performance in tasks such as object detection and language generation.
This post provides you with a great method for effective community detection with Markov Clustering.
In this realm, one can consider clustering as one of the most useful tools to learn interesting patterns from data in an unsupervised fashion.
Whether one categorize emails or performs customer segmentation, clustering is most of the time the way to go. This approach consists of dividing the dataset into groups (or clusters) such that objects belonging to the same cluster are more similar than items belonging to different clusters. It is possible to calculate such similarity according to a specific metric that of course can change with the problem to be solved.
Data scientists use clustering algorithms such as k-means, hierarchical clustering or DBSCAN in exploratory data analysis of tabular data. There is another type of data that is also common: networks. In the network analysis jargon, researchers refer to clustering network data to as community detection. In such a context, the clusters represent the communities. Refer to  and  for more details on community detection and state-of-the-art algorithms for clustering networks, respectively.
The purpose of this post is to unravel some of the most important properties of a specific algorithm referred to as Markov clustering (MCL). Researchers have designed this algorithm primarily to perform community detection. However, one can use MCL to cluster tabular data too. A more technical explanation of the MCL algorithm is provided in reference  and an application of the method to cluster biological sequences is detailed in .
Let’s now refresh some concepts of graph theory that will become useful to understand the working mechanism of the MCL technique.
Graphs and random walks
As mentioned in a previous post, many systems in nature can be represented as graphs. Graphs are nothing more than a set of nodes connected by edges. Social networks probably are the most familiar example of graphs. Nodes represent users while an edge connecting any two nodes represents the fact that the two users are interacting with each other.
One of the common approaches to analyze graphs and derive insights from simulations is the so-called random walk. Here is an example that clarifies what random walk means. Suppose there is a tourist who is starting a trip from his hometown and is going to visit several cities. One can think of the cities as the nodes of a graph. Such nodes are connected by an edge whenever it is possible to directly reach one city from the other. Let’s assume the aforementioned tourist is amazed by surprises. She hasn’t planned her trip in advance and she randomly picks the next city.
A mathematical representation
From a mathematical point of view, the trip of the tourist is a stochastic process known as a random walk.
A natural question that may arise: which cities will be visited next by our tourist in the long term? The answer to such a question depends on how the cities are connected to each other. For example, if cities form well-separated clusters (connected components), and the tourist starts her trip inside one of these clusters, she may never be able to reach any city of the other clusters.
Simulating a random walk on a graph represents precisely the idea behind the MCL algorithm.
The MCL algorithm
The purpose of the MCL algorithm is to find a cluster structure. This is achieved by simulating a random walk on a graph until it reaches equilibrium.
The first step one needs to take is to obtain the aforementioned graph.
- In the case of network data, the only task to perform is to compute the adjacency (or affinity) matrix A. Such a matrix perfectly describes the network. The definition of such a matrix is = 1 if there is an edge from vertex i to vertex j, and 0 otherwise
- In the case of tabular data, one has to construct a similarity matrix. This happens by computing pairwise similarities between all existing records (i.e. the rows of the table). Such similarity matrix represents a weighted graph. The nodes of such a graph represent the observations and the edges have weights corresponding to the similarity score between them.
Expansion and inflation
By properly scaling either the adjacency or the similarity matrix, one can obtain the Markov matrix. This is a matrix of probabilities representing the chances for a node reaching another one it is connected to.
Finally, once one builds the Markov matrix, the random walk on the graph is simulated by alternating two operators called expansion and inflation.
Expansion allows the random walker to take higher length paths. For instance taking a large number of steps from one node to the other.
Inflation changes the transition probabilities by favouring more probable walks over less probable ones.
The combination of expansion and inflation will boost the probabilities of walks inside each cluster. Moreover it will reduce walks between the clusters. This happens because higher length paths are more common within the same clusters than between different clusters.
Such behavior will emerge in the long term. The iteration of expansion and inflation operations leads to the separation of the graph into different connected components. Therefore, unlike many clustering algorithms that need the user to specify the expected number of clusters beforehand, sometimes mistakenly, the MCL algorithm provides a partition of the data into clusters that naturally arises from the graph topology itself.
A practical example in Python
Let’s see how to use the MCL algorithm in practice with Python and how to perform effective community detection with Markov Clustering.
First of all, install the required package that implements the algorithm with
pip install markov_clustering[drawing] and the
networkx package to analyze and visualize graphs.
Then execute the code below
In the above code, we use the MCL algorithm to detect the clusters, also called communities, in Zachary’s karate club, which is a popular benchmark to test community detection algorithms. Zachary’s karate club is a social network of a university karate club. The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club.
During the study, a conflict arose between the administrator (node 0) and instructor (node 33) which led to the split of the club into two subgraphs. Half of the members formed a new club around the instructor, while members from the other part found a new instructor or gave up karate.
As shown in the picture below, the MCL algorithm nicely detects the presence of two distinct communities in the network, comprising the members that mostly had interactions either with the administrator or the instructor:
In this post, we have seen how a community detection algorithm known as Markov clustering can be constructed by combining simple concepts like random walks, graphs, similarity matrix. Moreover, we highlighted how one can build a similarity graph and then run an effective community detection algorithm on such graph (such as Markov clustering or other state-of-the-art methods) to find clusters in tabular data.