“If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning”. This famous sentence from Facebook AI Chief Yann LeCun highlights how humans rely mainly on unsupervised learning to make sense of the world around them. Indeed, unlike sophisticated deep learning algorithms that need millions of carefully annotated data to reach human-like performance in tasks such as object detection and language generation, children learn to recognize objects and speak without much supervision.
In this realm, clustering can be certainly considered one of the most useful tools to learn interesting patterns from data in an unsupervised fashion. Whether one likes to automatically categorize emails, or detect different types of customers via an app, clustering is most of the time the way to go. This approach consists of dividing the entire dataset at hand into groups (or clusters) such that objects belonging to the same cluster are generally more similar than items belonging to different clusters. This similarity is calculated according to a specific metric that of course can change with the problem to be solved.
Clustering algorithms such as k-means, hierarchical clustering or DBSCAN are frequently used by data scientists in exploratory data analysis of tabular data. But another type of data that is very common to deal with is represented by networks. In the network analysis jargon, clustering network data is usually referred to as community detection and the clusters are interpreted as communities. Please refer to  and  for more details on community detection and state-of-the-art algorithms for clustering networks, respectively.
The purpose of this post is to unravel some of the most important properties of a specific algorithm referred to as Markov clustering (MCL). This algorithm has been primarily designed to perform community detection. However, it can be used to cluster tabular data too, as it will be clear along with this post. A more technical explanation of the MCL algorithm is provided in reference  and an application of the method to cluster biological sequences is detailed in .
Let’s now refresh some concepts of graph theory that will become useful to understand the working mechanism of the MCL technique.
Graphs and random walks
As mentioned in a previous post, many systems in nature can be represented as graphs, which are nothing more than as a set of nodes connected by edges. The most familiar example of graphs are, without any doubt, social networks, where users are represented as nodes while an edge connecting any two nodes represents the fact that the two users/nodes are interacting with each other, among other things (e. g. if they are friends on Facebook).
One of the common approaches to analyze graphs and derive insights from simulations is the so-called random walk. Here is an example that clarifies what random walk means. Suppose there is a tourist who is starting a trip from his hometown and is going to visit several cities. One can think of the cities as the nodes of a graph, which are connected by an edge whenever it is possible to directly reach one city from the other. Let’s assume the aforementioned tourist is amazed by surprises, so she hasn’t planned his trip in advance and she randomly picks the next city.
From a mathematical point of view, the trip of the tourist is a stochastic process known as a random walk on the graph.
A natural question that may arise is the following: in the long term, which cities will be visited next by our tourist? The answer to such a question depends on how the cities are connected to each other. For example, if cities form well-separated clusters (also called connected components in graph theory), and the tourist starts her trip inside one of these clusters, she may never be able to reach any city of the other clusters.
Simulating a random walk on a graph represents precisely the idea behind the MCL algorithm.
The MCL algorithm
The purpose of the MCL algorithm is to find a cluster structure by simulating a random walk on a graph until it reaches equilibrium.
The first step one needs to take is to obtain the aforementioned graph.
- In the case of network data, a graph is readily available and the only task to perform is to compute the adjacency (or affinity) matrix A that describes the network. Such a matrix is defined in a way that each element = 1 if there is an edge from vertex i to vertex j, and 0 otherwise
- In the case of tabular data, i.e. data arranged in rows and columns, one has to construct a similarity matrix first by computing all pairwise similarities between all existing records (i.e. the rows of the table). Such similarity matrix represents a weighted graph in which the nodes represents the observations and the edges have weights corresponding to the similarity score between them
By properly scaling either the adjacency or the similarity matrix, one can obtain the so-called Markov matrix, which is a matrix of probabilities representing the chances for every node to reach any other node it is connected to in the graph.
Finally, once the Markov matrix has been built, the random walk on the graph is simulated by alternating two operators called expansion and inflation.
Expansion allows the random walker to take higher length paths, i. e. a large number of steps from one node to the other.
Inflation changes the transition probabilities by favoring more probable walks over less probable ones.
Since higher length paths are more common within the same clusters than between different clusters, the combination of expansion and inflation will have the effect of boosting the probabilities of walks inside each cluster and reduce walks between the clusters. Such behavior will emerge in the long term. The iteration of expansion and inflation operations leads to the separation of the graph into different connected components. Therefore, unlike many clustering algorithms that need the user to specify the expected number of clusters beforehand, sometimes mistakenly, the MCL algorithm provides a partition of the data into clusters that naturally arises from the graph topology itself.
Let’s see how to use the MCL algorithm in practice with Python.
First of all, install the required package that implements the algorithm with
pip install markov_clustering[drawing] and the
networkx package to analyze and visualize graphs.
Then execute the code below
In the above code, we use the MCL algorithm to detect the clusters, also called communities, in Zachary’s karate club, which is a popular benchmark to test community detection algorithms. Zachary’s karate club is a social network of a university karate club. The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club. During the study, a conflict arose between the administrator (node 0) and instructor (node 33) which led to the split of the club into two subgraphs. Half of the members formed a new club around the instructor, while members from the other part found a new instructor or gave up karate.
As shown in the picture below, the MCL algorithm nicely detects the presence of two distinct communities in the network, comprising the members that mostly had interactions either with the administrator or the instructor:
In this post, we have seen how a community detection algorithm known as Markov clustering can be constructed by combining simple concepts like random walks, graphs, similarity matrix. Moreover, we highlighted how one can build a similarity graph and then run a community detection algorithm on such graph (such as Markov clustering or other state-of-the-art methods) to find clusters in tabular data.