AI knows biology. Or does it?

Proteins as sentences

The successes of deep learning in text analytics, introduced in a recent post about sentiment analysis published here, are undeniable. Many other NLP tasks have also benefited from the superiority of deep learning methods over more traditional approaches. Such results have been possible largely because neural networks learn meaningful character and word embeddings, that is, representation spaces in which semantically similar objects are mapped to nearby vectors. All this is closely related to a field one might initially find disconnected or off-topic: biology.
Living organisms depend on proteins, which catalyze metabolic reactions, carry out DNA replication and transport molecules around, just to name a few of their functions.

A protein is most commonly represented as a sequence of smaller molecules called amino acids. For this reason, proteins usually appear as strings whenever computer software is involved. Amino acids can be thought of as the words of a vocabulary (of only 25 elements/words), and every protein can be thought of as a sentence, just like in English. Researchers, who tend to treat similar problems with similar tools, have found NLP models versatile enough to be applied to the problem of understanding protein sequences.
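To make the "protein as a sentence" idea concrete, here is a minimal sketch of how a protein string could be turned into integer tokens, the same first step used for words in NLP. The post does not spell out its 25-symbol vocabulary, so this sketch assumes only the 20 standard one-letter amino acid codes plus a few hypothetical special tokens. The example peptide is made up for illustration.

```python
from typing import Dict, List

# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical special tokens (an assumption of this sketch):
# padding, masking, and X for any unknown residue.
SPECIAL_TOKENS = ["<pad>", "<mask>", "X"]

VOCAB: Dict[str, int] = {
    token: index
    for index, token in enumerate(SPECIAL_TOKENS + list(STANDARD_AMINO_ACIDS))
}


def encode(protein: str) -> List[int]:
    """Map each residue letter to an integer id; unknown letters become X."""
    unknown_id = VOCAB["X"]
    return [VOCAB.get(residue, unknown_id) for residue in protein.upper()]


peptide = "MKTAYIAKQR"  # made-up sequence, for illustration only
print(encode(peptide))
```

Each integer id would then index a row of a learned embedding table, exactly as a word id does in a language model.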
In a recent paper titled “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences” [1], Alexander Rives et al. studied proteins with neural networks and found something remarkably interesting. Similarly to NLP approaches that predict the missing word in a sentence, they trained a deep neural network to predict masked amino acids on 250 million sequences containing 86 billion amino acids in total. More specifically, they used a powerful model architecture called the Transformer (described in detail in [2]). The Transformer can model long-range dependencies within a sequence thanks to its core component, the attention mechanism or self-attention, which is explained next.


The self-attention mechanism

The self-attention mechanism allows a sequence model to focus on specific parts of the input in order to predict the output more accurately. It is one of the most influential and fascinating ideas in deep learning, and the intuition behind it is simple. Imagine a human translating from Italian to English. Given a relatively long sentence in Italian, a human translator would not read the entire sentence, memorise it and only then produce the English translation. Instead, she would read a first chunk, generate part of the translation, then look at the next chunk, generate a few more words, and so on.
While this way of working is natural for a human, traditional NLP machine learning models did not work like this. The attention model introduced this way of processing sequences.
From a computational perspective, when generating a specific English word, a machine translation model with an attention mechanism does not treat every word of the Italian sentence equally. It concentrates on the specific words that are assigned the highest weights, and these drive the translation along. The weights are computed dynamically by a small neural network that is trained jointly with the machine translation model (please refer to [3] for a more technical and in-depth explanation of the attention model).
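The weighting idea can be sketched in a few lines. Note that this is the scaled dot-product self-attention of the Transformer [2], not the additive attention network of [3] described above. As a simplifying assumption, the input vectors serve directly as queries, keys and values, with no learned projections. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(vectors: List[Vector]) -> Tuple[List[Vector], List[List[float]]]:
    """Return, for every position, a weighted average of all input vectors.

    Simplification: queries, keys and values are the inputs themselves.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output = [
            sum(weight * value[j] for weight, value in zip(weights, vectors))
            for j in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


toy_inputs = [[2.0, 0.0], [0.0, 2.0], [1.5, 0.5]]  # made-up vectors
outputs, weights = self_attention(toy_inputs)
for row in weights:
    print([round(weight, 3) for weight in row])
```

Each printed row shows how much one position attends to every other position. In a trained Transformer, learned projection matrices would shape these weights rather than the raw inputs.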

Once a deep learning model based on self-attention, such as the Transformer, is trained to perform certain NLP tasks like language translation, it captures much of the semantics of the words. This knowledge is encoded in the internal layers of the neural network and is known as a word embedding (see our previous post for a quick refresher). Does a similar process of learning some sort of (biological) semantics also happen in the case of protein sequences? According to Rives and pals, the answer is yes!


Amino acid and protein embedding

After training the Transformer on amino acid sequences, the researchers looked at the embeddings learned by the model. They found that the neural network had built a complex representation of the input sequences, reflecting their underlying biological properties such as activity, stability, structure, binding etc. In other words, the deep learning model learned important biochemical properties characterising the different amino acids and proteins, all by itself, without any supervision.

As mentioned in the first section, the deep neural network is trained by noising (i.e. masking) a fraction of the amino acids in the input sequence and having the network predict the true amino acid at the noised positions from the surrounding sequence context. The final hidden representation of the network is a sequence of vectors, one for each position in the input sequence. From this representation, both an amino acid embedding and a protein embedding can be obtained.
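The noising step can be sketched as follows. This illustrates only the data preparation, not the authors' pipeline. The 15% masking fraction, the `<mask>` token and the function name are assumptions of this sketch, and the sequence is made up.

```python
import random
from typing import Dict, List, Tuple

MASK_TOKEN = "<mask>"  # hypothetical mask symbol


def mask_sequence(
    protein: str, mask_fraction: float = 0.15, seed: int = 0
) -> Tuple[List[str], Dict[int, str]]:
    """Hide a random fraction of residues.

    Returns the noised sequence and a dict {position: true residue},
    which is what the network would be trained to predict.
    """
    rng = random.Random(seed)
    residues = list(protein)
    # Mask at least one residue, but never more than the sequence holds.
    n_masked = min(len(residues), max(1, round(mask_fraction * len(residues))))
    positions = sorted(rng.sample(range(len(residues)), n_masked))
    targets = {position: residues[position] for position in positions}
    for position in positions:
        residues[position] = MASK_TOKEN
    return residues, targets


sequence = "MKTAYIAKQRQISFVKSHFS"  # made-up sequence, for illustration only
noised, targets = mask_sequence(sequence)
print(" ".join(noised))
print(targets)
```

During training, the loss would be computed only at the positions stored in `targets`, so the network has to infer each hidden residue from the ones around it.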

In the amino acid embedding, well-separated clusters are visible. Hydrophobic and polar amino acids are grouped separately, including a tight grouping of aromatic amino acids. The negatively charged amino acids and the positively charged ones form two separate groups, and another cluster contains the three amino acids with the smallest molecular weight.

The protein embedding is obtained by averaging features across the full length of the output sequence, and it is a lower-dimensional representation of the input sequence. In this space, each protein is represented as a single point and similar sequences are mapped to nearby points. Interestingly, proteins with similar functions in different species were clustered together. Furthermore, the learned features made it possible to predict, from the raw amino acid sequence alone, the 3-dimensional structure of a protein and information about its biological activity. Researchers in proteomics and protein structure discovery know how challenging such a task is.
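The averaging step is simple enough to sketch directly. The per-residue vectors below are made-up two-dimensional numbers standing in for real hidden states, and cosine similarity is just one common way, assumed here, to measure how close two proteins are in embedding space.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(per_residue_vectors: List[Vector]) -> Vector:
    """Average the per-position vectors into one fixed-size protein vector."""
    count = len(per_residue_vectors)
    dim = len(per_residue_vectors[0])
    return [sum(vector[j] for vector in per_residue_vectors) / count for j in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


# Made-up hidden states: one vector per residue, proteins of different lengths.
protein_a = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
protein_b = [[0.9, 0.1], [1.0, 0.0]]
protein_c = [[0.0, 1.0], [0.1, 0.9]]

embedding_a = mean_pool(protein_a)
embedding_b = mean_pool(protein_b)
embedding_c = mean_pool(protein_c)

print(cosine_similarity(embedding_a, embedding_b))  # close to 1: similar proteins
print(cosine_similarity(embedding_a, embedding_c))  # much lower: dissimilar proteins
```

Because the average has the same size whatever the sequence length, proteins of different lengths can be compared as points in the same space.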

All this means that the neural network could find structures that characterise sequences correctly without human intervention.

Clearly, sequence models have found their way into the field of natural language processing (NLP). However, attention-based neural networks are also applicable to data other than text, such as images, audio and generic numerical data. Biology, healthcare, drug discovery and medicine are all fields in which attention-based methods can reach state-of-the-art performance and definitely gain the attention of a wider community. No pun intended.



[1] Rives A., et al., “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences”, biorxiv, doi:

[2] Vaswani A., et al., “Attention is all you need”, Advances in neural information processing systems, pp. 5998–6008, 2017.

[3] Bahdanau D., et al., “Neural machine translation by jointly learning to align and translate”, arXiv,
