AI knows biology. Or does it?


Proteins as sentences

In this post we share some findings on applications of deep learning in biology. The successes of deep learning in text analytics, which we introduced in a recent post about sentiment analysis, are undeniable. Many other NLP tasks have also benefited from the superiority of deep learning methods over more traditional approaches.

Such extraordinary results are possible in part because neural networks learn meaningful character and word embeddings, that is, representation spaces in which semantically similar objects are mapped to nearby vectors. All this relates to a field one might initially find disconnected or off-topic: biology.
Living organisms are characterised by the function of their proteins. Proteins catalyze metabolic reactions, perform DNA replication and transport molecules around, just to name a few of their roles.

Representation of biological components

A natural way to represent a protein is as a sequence of smaller molecules called amino acids. For this reason, it is quite common to see proteins represented as strings whenever computer software is involved.
Amino acids can be thought of as the words of a vocabulary (of only 25 elements/words), and every protein can be thought of as a sentence, just like in English. Building on this analogy, researchers have found NLP models to be versatile enough to be applied to the problem of understanding protein sequences.
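
To make the "protein as a sentence" analogy concrete, here is a minimal sketch of how a protein string could be turned into integer tokens, one per amino acid. The 25-symbol alphabet below (the 20 standard one-letter codes plus five extra codes for rare, ambiguous or unknown residues) is an assumption made for illustration, and the names `VOCAB`, `encode` and `decode` are hypothetical. The paper's exact vocabulary and preprocessing may differ.

```python
# Assumed 25-symbol alphabet: 20 standard amino acids plus five extra codes.
STANDARD = "ACDEFGHIKLMNPQRSTVWY"
EXTRA = "UOBZX"  # U, O: rare residues; B, Z: ambiguous; X: unknown (assumption)

# Each amino acid gets an integer id, like a word index in an NLP vocabulary.
VOCAB = {aa: i for i, aa in enumerate(STANDARD + EXTRA)}
INVERSE_VOCAB = {i: aa for aa, i in VOCAB.items()}


def encode(sequence):
    """Map a protein string to a list of integer token ids."""
    return [VOCAB[aa] for aa in sequence.upper()]


def decode(token_ids):
    """Map a list of token ids back to a protein string."""
    return "".join(INVERSE_VOCAB[i] for i in token_ids)


if __name__ == "__main__":
    protein = "MKTAYIAK"  # made-up fragment, for illustration only
    ids = encode(protein)
    print(ids)
    print(decode(ids))
```

A sequence model never sees the letters themselves, only these ids, which it then maps to learned vectors.
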
In a recent paper titled “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences” [1], Alexander Rives et al. studied proteins with neural networks and found something remarkably interesting.

The researchers borrowed a technique from NLP: predicting the missing word in a sentence. They trained a neural network to predict masked amino acids on 250 million protein sequences, which contain 86 billion amino acids in total. To be more specific, they used a powerful model architecture called the Transformer. Refer to [2] for a more detailed reading.
The Transformer architecture can model long-range dependencies in a sequence. This is possible thanks to its core component, called the attention mechanism or self-attention, which is explained next.

Deep learning in biology: the self-attention mechanism

The self-attention mechanism allows a sequence model to focus on specific parts of the input in order to predict the output more accurately. It is one of the most influential and fascinating ideas in deep learning. The intuition behind it is simple. Imagine a human translating from Italian to English.
Given a relatively long sentence in Italian, a human translator would not read the entire sentence and then translate it in one go. Instead she would read a first chunk, generate part of the translation, then look at the second part, generate a few more words and so on.

While this sounds natural to a human, it is not how traditional NLP machine learning models work. The attention model introduced this way of processing sequences.
From a computational perspective, a machine translation model with an attention mechanism does not weigh the entire Italian sentence equally when producing each output word. It assigns higher weights to the specific words that matter at that step, and those words drive the translation along.
Such weights are computed dynamically by a dedicated neural network, which is trained jointly with the machine translation model. Refer to [3] for a more technical and in-depth explanation of the attention model.
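
The idea of dynamically computed weights can be illustrated with a small sketch of scaled dot-product self-attention, the form used in the Transformer [2]. This is a simplification made for illustration, not the model from the paper. It assumes that queries, keys and values are all equal to the input vectors, whereas a real Transformer first applies learned projection matrices and uses several attention heads. It is also not the additive attention of [3], where a small network scores each pair of positions. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention without learned projections.

    x is a list of n vectors, each a list of d floats.
    Returns (outputs, weights): n output vectors and an n-by-n weight matrix.
    """
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x
        ]
        w = softmax(scores)
        # Each output is a weighted average of all input vectors.
        out = [sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


if __name__ == "__main__":
    toy_input = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy "tokens"
    outputs, weights = self_attention(toy_input)
    for row in weights:
        print([round(value, 3) for value in row])
```

Each row of `weights` says how much one position attends to every other position. Because any position can attend to any other, distance in the sequence does not limit which dependencies the model can capture.
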

Once a self-attention based deep learning model has been trained to perform certain NLP tasks, it has a good grasp of word semantics. The internal layers of the neural network encode this knowledge, which is usually referred to as a word embedding. Read our previous post for a quick refresher. Does a similar process of learning some sort of (biological) semantics also happen in the case of protein sequences? According to Rives and colleagues, the answer is yes!
So let’s see how deep learning in biology works.

Amino acid and protein embedding

After training the Transformer on amino acid sequences, the researchers looked at the embeddings learned by the model. They found that the neural network had built a rich representation of the input sequences, one that reflects their underlying biological properties such as activity, stability, structure and binding. In other words, the deep learning algorithm learned important biochemical properties characterising the different amino acids and proteins, all by itself, without any supervision.

As mentioned, the deep neural network is trained by masking a fraction of the amino acids in the input sequence and predicting the true amino acid at each masked position from the rest of the sequence context. The final hidden representation of the network is a sequence of vectors, one for each position in the input sequence. From this representation, both an amino acid embedding and a protein embedding can be obtained.
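
The masking step described above can be sketched as follows. This only prepares a training example and does not train any model. The mask symbol `#`, the function name `mask_sequence` and the 15% masking fraction are assumptions made for illustration (15% is a common choice in masked language modelling), not details taken from this post.

```python
import random

MASK = "#"  # hypothetical mask symbol, chosen because it is not an amino acid code


def mask_sequence(sequence, mask_fraction=0.15, seed=None):
    """Hide a random fraction of residues in a protein string.

    Returns (masked_sequence, targets), where targets maps each masked
    position to the true amino acid the model would have to predict.
    """
    if not sequence:
        return "", {}
    rng = random.Random(seed)
    n_masked = max(1, round(mask_fraction * len(sequence)))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))

    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # remember the true residue
        tokens[pos] = MASK          # hide it from the model
    return "".join(tokens), targets


if __name__ == "__main__":
    protein = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up fragment, for illustration only
    masked, targets = mask_sequence(protein, seed=0)
    print(masked)
    print(targets)
```

During training, the model receives the masked string and is scored on how well it recovers the residues stored in `targets`. No labels are needed beyond the sequences themselves, which is why the approach counts as unsupervised.
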

Blending deep learning and amino acid embedding

In the amino acid embedding, well-separated clusters are visible. Hydrophobic and polar amino acids are grouped separately, and the aromatic amino acids form a tight group of their own. The negatively charged amino acids and the positively charged ones form two separate groups, and another cluster contains the three amino acids with the smallest molecular weight.

One can obtain the protein embedding by averaging the features across the full length of the output sequence. The result is a compact, fixed-size representation of the input sequence. In this space, each protein is represented as a single point and similar sequences are mapped to nearby points. Interestingly, proteins with similar functions in different species were clustered together. Moreover, the learned features made it possible to predict aspects of a protein's 3-dimensional structure, as well as information about its biological activity, from the raw amino acid sequence alone. Researchers in proteomics and protein structure discovery know how challenging such a task is.
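
The averaging step described above can be sketched in a few lines. The per-position vectors below are made-up numbers standing in for the network's final hidden states, and the function names are hypothetical. Cosine similarity is used here as one common way to measure how close two embeddings are; it is an assumption, not necessarily the measure used by the authors.

```python
import math


def mean_pool(hidden_states):
    """Average per-position vectors into one fixed-size protein embedding.

    hidden_states is a list of n vectors (one per amino acid), each of size d.
    The result has size d regardless of the protein length n.
    """
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy hidden states for two proteins of different lengths (d = 3).
    protein_a = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.2]]
    protein_b = [[0.7, 0.3, 0.1], [0.9, 0.1, 0.1]]

    emb_a = mean_pool(protein_a)
    emb_b = mean_pool(protein_b)
    print(len(emb_a), len(emb_b))
    print(cosine_similarity(emb_a, emb_b))
```

Because the pooled vector has the same size for every protein, sequences of different lengths can be compared directly, which is what makes clustering proteins in a shared space possible.
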

In other words, the neural network could find structures that characterise sequences correctly without human intervention.

Sequence models have clearly found their home in natural language processing (NLP). However, attention-based neural networks are also applicable to data other than text, such as images, audio and generic numerical data.
In conclusion, biology, healthcare, drug discovery and medicine are all fields in which attention-based methods can reach state-of-the-art performance and gain the attention of a wider community. No pun intended, of course.



[1] Rives A., et al., “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences”, biorxiv, doi:

[2] Vaswani A., et al., “Attention is all you need”, Advances in neural information processing systems, pp. 5998–6008, 2017.

[3] Bahdanau D., et al., “Neural machine translation by jointly learning to align and translate”, arXiv,
