AI doesn’t know biology. Or does it?


Proteins as sentences

In this post we talk about some applications of deep learning to biology. The successes of deep learning in text analytics are undeniable (we recently talked about them in this post), and many other NLP tasks have profited from superior deep learning methods. Such results have been possible because the neural network learns meaningful character and word embeddings: a representation space that maps semantically similar objects to nearby vectors.

Text analytics relates deeply to a seemingly distant field: biology. That happens because of proteins. Proteins characterise every living organism. They catalyze metabolic reactions, perform DNA replication and transport molecules around.

A protein is a sequence of smaller molecules called amino acids. We can think of amino acids as the words of a vocabulary (of only 25 elements/words). A protein is then like a sentence in English. For this reason, it is quite common in software to represent proteins as strings.
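To make the "protein as a string" idea concrete, here is a minimal sketch of how such a string could be turned into integer tokens before being fed to a model. The alphabet below (the 20 standard one-letter amino acid codes plus one catch-all index) and the names `AMINO_ACIDS`, `UNKNOWN_ID` and `encode` are assumptions made for illustration; they are not the vocabulary or tokenizer used in [1].

```python
# Assumed alphabet: the 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN_ID = len(AMINO_ACIDS)  # catch-all index for any other character


def encode(sequence: str) -> list[int]:
    """Map a protein string to a list of integer token ids."""
    return [TOKEN_TO_ID.get(aa, UNKNOWN_ID) for aa in sequence.upper()]


# A made-up fragment, used only for illustration.
tokens = encode("MKTAYIAK")
print(tokens)  # [10, 8, 16, 0, 19, 7, 0, 8]
```

This is the same first step an NLP pipeline takes when it maps words or characters to vocabulary indices.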

NLP in biology: The Transformer

Researchers have found that NLP models can be applied to the problem of understanding protein sequences. In a recent paper (see [1]), Alexander Rives et al. studied proteins with neural networks and found something very interesting.

The researchers used techniques similar to those used in NLP to predict a missing word in a sentence. They trained a neural network to predict masked amino acids on 250 million protein sequences, containing 86 billion amino acids in total. To reach their goal, they used a powerful model architecture called the Transformer. Refer to [2] for more details.
The Transformer architecture has the ability to model long-range dependencies in a sequence. This is possible thanks to its core component, called the attention mechanism or self-attention.

Deep learning in biology: the self-attention mechanism

Sequence models that use self-attention focus on specific parts of the input. It is one of the most influential and fascinating ideas in deep learning, and the intuition behind it is simple. Imagine a human who is translating a text from Italian to English.
Given a long sentence in Italian, a human translator would not translate it all at once. Instead, she would read a first chunk, translate it, then look at the second part, translate a few more words and so on.

While this sounds natural to a human, it is not how traditional NLP machine learning models work. The attention model introduces this way of proceeding through sequences.
A machine translation model with an attention mechanism would not weigh the entire Italian sentence equally. At each step it would assign higher weights only to specific words, and those words would drive the translation along.
Such weights are computed dynamically by a specific neural network, trained jointly with the machine translation model. See [3] for a more technical and in-depth explanation of the attention model.
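The idea of dynamically computed weights can be sketched in a few lines. The example below uses the scaled dot-product form of self-attention popularised by the Transformer [2], not the learned alignment network of [3], and it is deliberately simplified: it leaves out the learned query, key and value projections and the multiple heads of a real model. The function names and the toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Each output vector is a weighted average of all input vectors; the
    weights depend on how similar each pair of positions is.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        all_weights.append(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(dim)])
    return outputs, all_weights


# Three toy 2-dimensional "word" vectors (made-up numbers).
sentence = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(sentence)
for row in weights:
    print([round(w, 2) for w in row])
```

Each printed row shows how much one position attends to every other position. In this toy input the first two vectors are similar, so they give each other more weight than they give the third.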

Once training has finished, a self-attention based deep learning model has a good understanding of word semantics. The internal layers of the neural network encode this understanding, which is usually referred to as word embedding. Read our previous post for a quick refresher. Does a similar process of learning some sort of (biological) semantics also happen in the case of protein sequences? According to Rives and pals, the answer is yes!
So let’s see how deep learning in biology works.

Amino acid and protein embedding

After training the Transformer to process amino acid sequences, the researchers looked at the embedding learned by the model. They found that the neural network had built a complex representation of the input sequences, one that reflects their biological properties such as activity, stability, structure and binding. In other words, the deep learning algorithm learned important biochemical properties characterising the different amino acids and proteins, all by itself, without any supervision.

As mentioned, the deep neural network is trained by masking a fraction of the amino acids in the input sequence. The network then has to predict the true amino acid at each masked position from the context provided by the rest of the sequence. The final hidden representation of the network is a sequence of vectors, one for each position in the input sequence. From this representation, both an amino acid embedding and a protein embedding can be obtained.
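The masking step described above can be sketched as follows. The mask symbol, the 15% mask fraction and the example sequence are assumptions chosen for illustration, and the code covers only the data preparation, not the network that makes the predictions.

```python
import random

MASK = "#"  # hypothetical mask symbol, not part of the amino acid alphabet


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random fraction of positions and remember the true residues."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # what the model is asked to predict
        masked[pos] = MASK
    return "".join(masked), targets


# A made-up sequence of 22 amino acids.
masked, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(masked)   # the input the model sees
print(targets)  # position -> true amino acid, used as training labels
```

During training, the model sees the masked string and is scored on how well it recovers the amino acids stored in `targets`.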

Blending deep learning and amino acid embedding

In the amino acid embedding, well-separated clusters are visible. Hydrophobic and polar amino acids end up in separate groups, and the aromatic amino acids form a tight group among them. Likewise, the negatively charged amino acids and the positively charged ones form two separate groups, and another cluster contains the three amino acids with the smallest molecular weight.

One can obtain the protein embedding by averaging features across the full length of the output sequence. This gives a lower-dimensional representation of the input sequence, in which a single point represents each protein and similar sequences map onto nearby points. Interestingly, the model clusters together proteins with similar functions in different species. In addition, the learned features make it possible to predict, from the raw amino acid sequence alone, the 3-dimensional structure of a protein and information about its biological activity. Researchers in proteomics and protein structure discovery know how challenging such a task is.
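As a rough illustration of the averaging step, the sketch below pools per-position vectors into a single protein vector and compares proteins with cosine similarity. The 2-dimensional "hidden states" are made-up numbers standing in for real model outputs, and cosine similarity is just one common way to measure closeness; it is not claimed to be the measure used in [1].

```python
import math


def mean_pool(position_vectors):
    """Average per-position vectors into one fixed-size protein vector."""
    n = len(position_vectors)
    dim = len(position_vectors[0])
    return [sum(vec[d] for vec in position_vectors) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up hidden states: one vector per amino acid position.
protein_a = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]  # 3 positions
protein_b = [[0.9, 0.1], [1.0, 0.0]]              # 2 positions
protein_c = [[0.0, 1.0], [0.2, 0.8]]              # 2 positions

emb_a, emb_b, emb_c = (mean_pool(p) for p in (protein_a, protein_b, protein_c))
print(cosine_similarity(emb_a, emb_b))  # similar proteins: close to 1
print(cosine_similarity(emb_a, emb_c))  # dissimilar proteins: much lower
```

Note that the three proteins have different lengths, yet each ends up as a single vector of the same size, which is what makes them directly comparable.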

In other words, the neural network could find structures that characterise sequences correctly without human intervention.

Clearly, sequence models have found their way into the field of natural language processing (NLP). However, attention-based neural networks are also applicable to data other than text, such as images, audio and generic numerical types.
In conclusion, biology, healthcare, drug discovery and medicine are all fields in which attention-based methods can reach state-of-the-art performance and definitely gain the attention of a wider community. No pun intended, of course.



[1] Rives A., et al., “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences”, biorxiv, doi:

[2] Vaswani A., et al., “Attention is all you need”, Advances in neural information processing systems, pp. 5998–6008, 2017.

[3] Bahdanau D., et al., “Neural machine translation by jointly learning to align and translate”, arXiv,
