The rapid diffusion of social media (Facebook, Twitter…) and the massive use of forums (Reddit, Quora…) produce an impressive amount of text data every day. Some of these conversations reveal key aspects of specific products and services people care about.
Over the last five years, many brands have been using these conversations to identify customers’ attitudes towards their brand. A few years ago such a task would have been called a privacy violation. Today, it goes under the name of sentiment analysis.
Research on sentiment analysis has been ongoing for quite a few years, and scientists have solved crucial problems. The most important one is classifying sentences according to how positive, negative, or neutral the expressed sentiment is. You may choose a basic three-class scheme (‘positive’, ‘neutral’, ‘negative’), or a more refined one.
In order to assess the sentiment of a sentence, one has to deal with some challenging issues. In many cases it is not possible to classify sentences directly by their words. At least not immediately. A bit of processing is mandatory, because computers do not understand text. In fact, computers represent text, as well as any other form of information, as numbers. Therefore the first processing task is finding a numerical equivalent for the words in the sentence.
Below we’re going to discuss some viable ways to reach that goal.
The traditional way
The traditional way of representing words in numerical form is by means of one-hot encoding.
Suppose one has a vocabulary of N=10000 words. Each word can be encoded as an N-dimensional vector with all zero elements except at the position corresponding to the index of that word in the vocabulary. That is, if the word “hello” is the 42nd word in the vocabulary, its associated vector has 10000 elements, all zeros except for a 1 at the 42nd position.
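As a minimal sketch, here is what one-hot encoding looks like with a toy four-word vocabulary in place of the 10000-word one (the vocabulary and words are hypothetical):

```python
# Toy vocabulary standing in for the 10000-word one described above
vocabulary = ["hello", "world", "apple", "orange"]
N = len(vocabulary)

def one_hot(word):
    """Return an N-dimensional vector of zeros with a 1 at the word's index."""
    vector = [0] * N
    vector[vocabulary.index(word)] = 1
    return vector

one_hot("apple")  # [0, 0, 1, 0]
```

With N=10000 the vectors become very long and extremely sparse, which is exactly the issue discussed next.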
The main drawback of this approach is that the distance between any pair of words is always the same. It would be more appropriate if words with similar meanings were separated by a smaller distance than less similar ones. Linguists usually refer to this kind of similarity in meaning as semantics. Which brings us to the concept of word embedding.
In a word embedding setting, each word maps onto a point in a particular space, which usually has between 50 and 300 dimensions. In such a space, words with similar meaning are closer. For example the vectors encoding the words apple and orange are closer than the vectors encoding apple and tulip. This means that the dimensions of the vector space are capable of capturing the semantics of each word.
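A sketch of how such distances are typically measured, using cosine similarity on hypothetical 3-dimensional vectors (real embeddings would have 50 to 300 dimensions and values learned from data):

```python
import math

# Hypothetical toy vectors standing in for real learned embeddings
embeddings = {
    "apple":  [0.9, 0.8, 0.1],
    "orange": [0.85, 0.75, 0.2],
    "tulip":  [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: closer to 1 means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_apple_orange = cosine_similarity(embeddings["apple"], embeddings["orange"])
sim_apple_tulip = cosine_similarity(embeddings["apple"], embeddings["tulip"])
# apple turns out to be more similar to orange than to tulip
```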
Different dimensions may capture different features within the same dataset: some may encode age, others gender, or word category (food, person, animal, etc.). Models using word embeddings deliver much greater performance than the ones using one-hot-encoded vectors.
The last few years saw a surge of NLP applications touching many fields. Examples include machine translation, named entity recognition, speech synthesis and sentiment classification. Probably the most popular application is word2vec, proposed by Mikolov et al. in 2013. Actually, the idea behind word2vec became the basic building block for many other NLP applications. Let’s see how it works.
Word2vec: an example of word embedding
The idea behind word2vec improves on traditional NLP methods by orders of magnitude. Let’s discuss it using an example.
The drawbacks of one-hot encoding
Suppose one has a dataset with many sentences and wants to build a model that predicts the next word from a context of M words. The naive solution would be to build a neural network that takes the context as input and predicts the probability of the next word, for each word in the vocabulary. The input context would be one-hot encoded, meaning that the input will be a <10000 x M> matrix.
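The one-hot-encoded context input described above can be sketched as follows; the vocabulary size is the one used in this article, while the word indices are hypothetical:

```python
import numpy as np

N, M = 10000, 4                      # vocabulary size, context length
context_indices = [42, 7, 1999, 3]   # hypothetical indices of the M context words

# Build the <10000 x M> input matrix: one one-hot column per context word
X = np.zeros((N, M))
for j, idx in enumerate(context_indices):
    X[idx, j] = 1.0
```

Each of the M columns contains a single 1, so the matrix is almost entirely zeros.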
One major issue with such an approach is the fact that extending the vocabulary to, say, one million words would make the problem intractable. A vocabulary of one million words is very common, especially when words from the regular vocabulary, urban language, slang, etc. are considered.
The embedding matrix
To mitigate the problem of variable explosion caused by feeding one-hot-encoded vectors to the model, one could instead consider vectors in a lower dimensional space. This can be achieved by multiplying the input vector by an embedding matrix E that reduces the dimensions from 10000 to, say, 300.
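A sketch of this dimensionality reduction with numpy, using a randomly initialized E in place of a learned one:

```python
import numpy as np

N, d = 10000, 300
rng = np.random.default_rng(0)
E = rng.standard_normal((d, N))   # embedding matrix: 300 x 10000

one_hot_vec = np.zeros(N)
one_hot_vec[42] = 1.0             # one-hot vector for the 42nd word

embedding = E @ one_hot_vec       # 300-dimensional representation

# Multiplying by a one-hot vector simply selects a column of E,
# which is why frameworks implement embeddings as a table lookup
assert np.allclose(embedding, E[:, 42])
```

This equivalence between matrix multiplication and column lookup is what makes embedding layers cheap in practice.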
In a previous post titled Deep feature extraction and transfer learning we explain the notion of embedding in a more generic sense.
As a matter of fact, the embedding matrix is nothing more than an additional layer of the neural network. Hence, all the machinery already in place (in particular gradient descent and back-propagation) would be seamlessly applied. Given a large corpus of text, one can continuously predict the next word given a context of M words, by means of a rolling window that scans the corpus from beginning to end. This approach would artificially create more data and force the algorithm to learn a pretty interesting internal representation of words in the new lower dimensional space.
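The rolling window described above can be sketched as follows, with a toy corpus and M = 3:

```python
# Toy corpus; a real one would contain millions of words
corpus = "the quick brown fox jumps over the lazy dog".split()
M = 3

# Slide a window over the corpus to produce (context, next-word) pairs
pairs = [
    (corpus[i:i + M], corpus[i + M])
    for i in range(len(corpus) - M)
]
# e.g. (['the', 'quick', 'brown'], 'fox'), (['quick', 'brown', 'fox'], 'jumps'), ...
```

A single 9-word sentence already yields 6 training pairs, which is how this approach "artificially" multiplies the amount of training data.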
In order to allow a more efficient implementation, the details of the word2vec algorithm are slightly different from our description. However, the basic idea, represented in the picture below, is correct.
Schematic representation of how to learn a word embedding matrix E by training a neural network that, given the previous M words, predicts the next word in a sentence.
Word embedding and transfer learning
One interesting fact about word embeddings is that they allow one to build accurate natural language processing models even without a large text dataset at hand, by exploiting transfer learning. With transfer learning it is possible to take advantage of the word representations learned from large amounts of non-labeled text. Such representations can then be reused for the task at hand, for which a smaller dataset – or no dataset at all – is available.
Below we provide a snippet to load a pre-trained embedding matrix built on large corpora via the gensim Python library.
In the example above, we loaded a pre-trained model constructed from 100 billion words from a dataset of Google News. Then, we show how one can obtain the representation of the word “apple” as a numerical vector, and how it is possible to compute the similarity between two words (e.g. “king” and “queen”).
Character and sub-word level embedding
It is possible to construct numerical representations for single characters and parts of words too. As suggested in the post titled “The Best Embedding Method for Sentiment Classification”, sub-word level embedding seems to perform best in the specific case of sentiment classification. In this use case, in fact, the dataset contains many informal words and slang, for which there is no equivalent in a standard vocabulary. Taking into account the sub-word (e.g. the root, or just a substring contained in it) can be more beneficial than considering the entire word.
Let’s try to understand why.
Consider the words ‘coooool’ and ‘woooooow’. Normally, either they would be removed during preprocessing or they would be transformed into the words ‘cool’ and ‘wow’, respectively. Even in the latter case one would lose some information, because ‘coooool’ and ‘woooooow’ have stronger emotional content compared to just ‘cool’ and ‘wow’.
By representing these words at the sub-word level, and encoding ‘co’, ‘ooo’, ‘ol’ and ‘wo’, ‘oooo’, ‘ow’ differently, the researcher is in fact enforcing a distinction among them. Such informal words are very common in short text messages from social networks, like Twitter and Reddit, but our everyday chats too are increasingly packed with slang of the same format.
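One common way to obtain such sub-word units is to extract overlapping character n-grams, in the spirit of fastText (the choice of n-gram sizes here is an illustrative assumption):

```python
def char_ngrams(word, n_sizes=(2, 3)):
    """Return the set of overlapping character n-grams of the given sizes."""
    grams = set()
    for n in n_sizes:
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams

# 'coooool' and 'cool' share n-grams like 'co' and 'oo', but the
# elongated form also yields extra grams such as 'ooo' that preserve
# its stronger emotional content.
elongated = char_ngrams("coooool")
plain = char_ngrams("cool")
shared = elongated & plain
```

The two words end up close, but not identical, in the representation space, which is exactly the distinction the paragraph above argues for.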
Remember this for your next NLP project!