The rapid diffusion of social media (Facebook, Twitter, …) and the massive use of forums (Reddit, Quora, …) produce an impressive amount of text data every day. Some of these conversations reveal key aspects of the products and services people care about.
Over the last five years, many brands have been mining these conversations to gauge customers’ attitudes toward their brand. A few years ago such a task would have been called a privacy violation; today, it goes under the name of sentiment analysis.
Research on sentiment analysis has been ongoing for quite a few years, and scientists have solved crucial problems. The most important one is classifying sentences into classes depending on how positive the expressed attitude sounds. You may choose a basic three-class scheme (‘positive’, ‘neutral’, ‘negative’) or a more refined one.
In order to assess the sentiment of a sentence, there are some challenging issues one has to deal with. In many cases it’s not possible to classify sentences word by word, at least not immediately. A bit of processing is mandatory, because computers do not understand text: they represent text – like any other form of information – as numbers. The first processing task is therefore finding a numerical equivalent for the words in the sentence.
Below we’re going to discuss some viable ways to reach that goal.
The traditional way
The traditional way of representing words in numerical form is by means of one-hot encoding.
Suppose one has a vocabulary of N=10000 words. Each word can be encoded as an N-dimensional vector with all elements equal to zero except at the position corresponding to the index of that word in the vocabulary. That is, if the word “hello” is the 42nd word in the vocabulary, its associated vector consists of 9999 zeros and a single 1 at position 42.
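In code, this encoding is a minimal NumPy sketch (the vocabulary size and word index follow the example above):

```python
import numpy as np

VOCAB_SIZE = 10_000

def one_hot(index: int, size: int = VOCAB_SIZE) -> np.ndarray:
    """Return a vector of `size` zeros with a single 1 at `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

hello = one_hot(41)  # "hello" is the 42nd word -> index 41 (0-based)

# Note: the Euclidean distance between any two distinct one-hot
# vectors is always sqrt(2), regardless of the words' meanings.
dist = np.linalg.norm(one_hot(0) - one_hot(1))
```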
The main drawback of this approach is that the distance between any pair of words is always the same. It would be more appropriate if the distance changed according to similarities in meaning (e.g. synonyms were very close). Linguists usually refer to such a distance as semantic distance. These thoughts bring us to the concept of word embedding.
In a word embedding setting, each word maps onto a point in a particular space, usually with between 50 and 300 dimensions. In such a space, words with similar meaning are closer. For example, the vectors encoding the words apple and orange are closer than the vectors encoding apple and tulip. This means that the dimensions of the vector space are capable of capturing the semantics of each word.
Different dimensions may encode different features: some may encode age, others gender, or word category (food, person, animal, etc.). Models using word embeddings deliver much greater performance than those using one-hot-encoded vectors.
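The apple/orange/tulip intuition can be made concrete with cosine similarity, the usual closeness measure in embedding spaces. The 3-dimensional vectors below are illustrative toy values, not learned embeddings:

```python
import numpy as np

# Toy "embeddings" (hand-picked illustrative values, not learned ones):
# apple and orange point in similar directions, tulip does not.
embeddings = {
    "apple":  np.array([0.9, 0.8, 0.1]),
    "orange": np.array([0.8, 0.9, 0.2]),
    "tulip":  np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# apple vs. orange scores higher than apple vs. tulip
```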
The last few years have seen a surge of NLP applications touching many fields. Examples include machine translation, named entity recognition, speech synthesis and sentiment classification. Probably the most popular application is word2vec, proposed by Mikolov et al. in 2013. Actually, the idea behind word2vec became the basic building block for many other NLP applications. Let’s see how it works.
Word2vec: an example of word embedding
The idea behind word2vec improves on traditional NLP methods by orders of magnitude. Let’s discuss it using an example.
The drawbacks of one-hot encoding
Suppose you have a large dataset of sentences. Your aim is to build a model that predicts the next word of a sentence given the previous M words. With one-hot encoding, the input context would be a 10000 x M matrix (one 10000-dimensional column per context word). Such an approach would be impossible to follow with a large vocabulary (say one million words). Such large vocabularies are common when, besides wordbook entries, you also consider urban language, slang words, etc.
Contrary to a one-hot-encoding representation (left), a word embedding (right) can capture the different meanings of words and their similarities.
The embedding matrix
To mitigate the dimensionality explosion caused by feeding one-hot-encoded vectors to the model, one could instead consider vectors in a lower-dimensional space. This can be achieved by multiplying the input vector by an embedding matrix E that reduces the dimension from 10000 to, say, 300.
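A quick NumPy sketch shows why this works: multiplying a one-hot vector by E is equivalent to selecting one row of E, i.e. a simple table lookup of that word's 300-dimensional representation (the matrix values here are random placeholders for a learned E):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 10_000, 300

# Embedding matrix: one DIM-dimensional row per vocabulary word.
# (Random placeholder values; in practice E is learned by training.)
E = rng.normal(size=(VOCAB, DIM))

one_hot = np.zeros(VOCAB)
one_hot[41] = 1.0            # one-hot vector of the 42nd word

embedded = one_hot @ E       # shape (300,): the word's embedding
# Multiplying by a one-hot vector just picks out row 41 of E.
```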
In a previous post titled Deep feature extraction and transfer learning we explain the notion of embedding in a more general sense.
As a matter of fact, the embedding matrix is just an additional layer of the neural network. Hence, all the machinery already in place (in particular gradient descent and back-propagation) applies seamlessly. Given a large corpus of text, one can repeatedly predict the next word given a context of M words, by means of a rolling window that scans the corpus from beginning to end. This approach artificially creates more training data and forces the algorithm to learn a pretty interesting internal representation of words in the new lower-dimensional space.
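The rolling window can be sketched in a few lines: it turns a token stream into (context, next-word) training pairs, one pair per window position:

```python
def context_target_pairs(tokens, m):
    """Slide a window of m context words over a token list and
    yield (context, next_word) training pairs."""
    for i in range(len(tokens) - m):
        yield tokens[i:i + m], tokens[i + m]

corpus = "the quick brown fox jumps over the lazy dog".split()
pairs = list(context_target_pairs(corpus, 3))
# first pair: (['the', 'quick', 'brown'], 'fox')
```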
In order to allow a more efficient implementation, the details of the word2vec algorithm are slightly different from our description. However, the basic idea, represented in the picture below, is correct.
Schematic representation of how to learn a word embedding matrix E by training a neural network that, given the previous M words, predicts the next word in a sentence.
Word embedding and transfer learning
One noteworthy fact about word embedding is its ability to support accurate NLP models without large text datasets. That happens by exploiting transfer learning: word representations learned from large amounts of unlabeled text can be reused for the task at hand, even when that task comes with a small dataset, or no dataset at all.
Below we provide a snippet that loads a pre-trained embedding matrix built on large corpora via the gensim Python library.
With a pre-trained model constructed from about 100 billion words of Google News text, one can obtain the representation of the word “apple” as a numerical vector, and compute the similarity between two words (e.g. “king” and “queen”).
Character and sub-word level embedding
It is possible to construct numerical representations for single characters and parts of words too. Sub-word-level embeddings have been reported to perform well in the specific case of sentiment classification. In this use case, in fact, the dataset contains many informal words and slang terms with no equivalent in a standard vocabulary. Taking the sub-word into account (e.g. the root, or just a substring of the word) can be more beneficial than considering the entire word.
Let’s try to understand why.
Consider the words ‘coooool’ and ‘woooooow’. Normally, either they would be removed during preprocessing or they would be transformed into the words ‘cool’ and ‘wow’, respectively. Even in the latter case one would lose some information, because ‘coooool’ and ‘woooooow’ have stronger emotional content compared to just ‘cool’ and ‘wow’.
By representing these words at the sub-word level, and encoding ‘co’, ‘ooo’, ‘ol’ and ‘wo’, ‘oooo’, ‘ow’ differently, one in fact enforces a distinction among them. Such informal words are very common in short text messages from social networks like Twitter and Reddit, and our chats too are more and more packed with slang of the same format.
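One common way to obtain such sub-word units is to split each word into character n-grams (the fastText model does something similar; the boundary markers and n-gram sizes below are illustrative choices):

```python
def char_ngrams(word: str, n_min: int = 2, n_max: int = 3) -> list[str]:
    """Return the character n-grams of a word, with '<' and '>'
    marking the word boundaries (fastText-style)."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

# 'coooool' and 'cool' share boundary grams like '<c' and 'l>',
# while the repeated 'oo'/'ooo' grams preserve the extra emphasis.
```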
Remember this for your next NLP project!