The state of sentiment analysis: word, sub-word and character embedding

The rapid diffusion of social media like Facebook and Twitter, and the massive use of forums like Reddit, Quora, etc., are producing an impressive amount of text data every day. Some of these platforms host customer conversations about a brand, revealing key aspects of specific products and services that people care about.
One specific activity that many business owners have been contemplating over the last five years is identifying the social sentiment around their brand by analysing the conversations of their users. A few years ago such a task would have been labelled a privacy violation. Today, it goes under the name of sentiment analysis.
Researchers and NLP practitioners have been working on sentiment analysis for quite a few years already. One of the main tasks tackled by NLP models is classifying sentences into classes depending on how positive, negative or neutral the sentiment is (of course, a larger number of classes can be considered).
Such analysis is usually conducted by examining the individual words that compose a sentence. At Amethix, we have seen and implemented more sophisticated methods that perform a more robust and accurate classification of sentences. Assessing the sentiment of a sentence in English (and this applies to any other language too) raises some challenging issues. In many cases one cannot simply classify sentences word by word, at least not immediately. A bit of preprocessing is required, for the simple reason that computers and algorithms do not understand text: text, like all other forms of information, is internally represented as numerical values. A big chunk of this preprocessing consists of finding the numerical equivalents of the words of one's spoken language.
Below we are going to unveil some viable approaches.


Word embedding

The traditional way of representing words in numerical form is by means of one-hot encoding.
Suppose one has a vocabulary of N=10000 words. Each word can be encoded as an N-dimensional vector whose elements are all zero except at the position corresponding to the index of that word in the vocabulary. That is, if the word “hello” is indexed as the 42nd word in the vocabulary, it is represented by a vector of 10000 zeros with a 1 at the 42nd position.
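A minimal sketch of this encoding in Python, using the vocabulary size and word index from the example above:

import numpy as np

vocab_size = 10000
hello_index = 41          # the 42nd word of the vocabulary, 0-based index 41

one_hot = np.zeros(vocab_size)
one_hot[hello_index] = 1  # all zeros except a single 1 at the word's position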
The main drawback of this approach is that the distance between any pair of words is always the same. It would be more appropriate if words with similar meanings were at a smaller distance from each other than less similar ones. Linguists usually refer to such similarity in meaning as semantics, which brings us to the concept of word embedding.

In a word embedding setting, each word is represented as a point in a particular vector space (usually between 50 and 300 dimensions) such that, for instance, the vectors encoding the words apple and orange are closer to each other than the vectors encoding apple and tulip. This means that the dimensions of the vector space are capable of capturing the semantics of each word.
The individual dimensions can capture very different features: some may encode age, others gender, or more abstract concepts indicating whether a specific word refers to food, a person, an animal, etc. By using word embeddings in place of one-hot-encoded vectors, machine learning algorithms have shown better performance than traditional approaches in the field of natural language processing.
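As a toy illustration, the made-up three-dimensional vectors below (not taken from any real embedding) give a higher cosine similarity to the pair apple/orange than to apple/tulip:

import numpy as np

# Made-up 3-dimensional "embeddings" purely for illustration; real embeddings
# have 50 to 300 dimensions and are learned from data.
apple  = np.array([0.9, 0.8, 0.1])
orange = np.array([0.8, 0.9, 0.2])
tulip  = np.array([0.1, 0.2, 0.9])

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(apple, orange))  # high: similar meaning
print(cosine_similarity(apple, tulip))   # lower: less similar meaning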

While a considerable number of NLP applications have been built in the last few years, touching fields like machine translation, named entity recognition, speech synthesis and sentiment classification, probably the most popular embedding method of all is word2vec, proposed by Mikolov et al. in 2013 [1]. As a matter of fact, this method became the basic building block on top of which many applications have been built.

Contrary to a one-hot-encoding representation (left), a word embedding (right) can capture the different meanings of the words and their similarities.

The idea behind word2vec improves considerably on the more traditional NLP methods. Here is why.
Suppose one has a dataset with many sentences and wants to build a model that predicts the next word from a context of M words. The naive solution to such a challenging task would be to build a neural network that takes the context as input and predicts the probability of the next word, for each word in the vocabulary. The input context would be one-hot encoded, meaning that the input would be a <10000 x M> matrix.
One major issue with such an approach is that extending the vocabulary to, say, 1 million words would make the problem intractable. It is very common to deal with a vocabulary of that size, especially when words from the standard vocabulary, urban language, slang, etc. are all considered.
To mitigate the explosion of parameters caused by feeding one-hot-encoded vectors to the model, one could instead consider vectors in a lower dimensional space. This can be achieved by multiplying the input vector by an embedding matrix E that reduces the dimensionality from 10000 to, say, 300, as illustrated in the sketch below.
In a previous post titled Deep feature extraction and transfer learning we explained the notion of embedding in a more generic sense.
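The following toy sketch (with random values for E, and the sizes used above) shows that this multiplication amounts to selecting one row of the embedding matrix:

import numpy as np

vocab_size, embedding_dim = 10000, 300
E = np.random.randn(vocab_size, embedding_dim)  # embedding matrix (random here, learned in practice)

one_hot = np.zeros(vocab_size)
one_hot[41] = 1                       # one-hot vector of the 42nd word

dense = one_hot @ E                   # 300-dimensional representation
assert np.allclose(dense, E[41])      # equivalent to a simple row lookup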

In practice, the embedding matrix is nothing more than an additional layer of the neural network. Hence, all the machinery already in place (in particular gradient descent and back-propagation) can be applied seamlessly. Given a large corpus of text, one can continuously predict the next word given a context of M words, by means of a rolling window that scans the corpus from beginning to end. This approach artificially creates more data and forces the algorithm to learn a rather interesting internal representation of words in the new lower-dimensional space.
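As a rough sketch of this setup, here is a minimal Keras model trained on a toy corpus of random token ids; the layer sizes are illustrative, and this is a simplification of, not the exact, word2vec training procedure:

import numpy as np
from tensorflow.keras import layers, models

vocab_size, embedding_dim, M = 10000, 300, 4

model = models.Sequential([
    layers.Embedding(vocab_size, embedding_dim),    # the embedding matrix E
    layers.GlobalAveragePooling1D(),                # combine the M context vectors
    layers.Dense(vocab_size, activation='softmax')  # probability of the next word
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Rolling window over a tokenized corpus: each window of M word indices is a
# training example whose target is the word that follows it.
corpus = np.random.randint(0, vocab_size, size=1000)  # placeholder token ids
X = np.array([corpus[i:i + M] for i in range(len(corpus) - M)])
y = corpus[M:]
model.fit(X, y, epochs=1, verbose=0)

E = model.layers[0].get_weights()[0]  # learned embedding matrix, shape (10000, 300)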

For the sake of a more efficient implementation, the actual word2vec algorithm is slightly different from what we have described so far, but its core remains the same and is represented in the schema below.

Schematic representation of how to learn a word embedding matrix E by training a neural network that, given the previous M words, predicts the next word in a sentence.

One interesting fact about word embeddings is that they make it possible to build accurate natural language processing models even without very large text datasets at hand, thanks to transfer learning. With transfer learning one can take advantage of word representations learned from large amounts of unlabeled text. Such representations can then be reused for the task at hand, for which only a smaller dataset (or no dataset at all) is available.

Below we provide a snippet to load a pre-trained embedding matrix built on large corpora via the gensim Python library.
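The file name below refers to the classic GoogleNews-vectors-negative300.bin archive, which is assumed to have been downloaded locally:

from gensim.models import KeyedVectors

# Load 300-dimensional word2vec vectors pre-trained on Google News
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

apple_vector = model['apple']             # numerical representation of "apple"
print(apple_vector.shape)                 # (300,)

print(model.similarity('king', 'queen'))  # similarity between two words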

In the example above, we load a pre-trained model constructed from about 100 billion words of a Google News dataset. We then show how to obtain the representation of the word “apple” as a numerical vector, and how to compute the similarity between two words (e.g. “king” and “queen”).

Character and sub-word level embedding

It is possible to construct numerical representations for single characters and parts of words too. As suggested in the post titled “The Best Embedding Method for Sentiment Classification” [2], sub-word level embedding seems to perform particularly well in the specific case of sentiment classification. In this use case the dataset contains many informal words or slang terms for which there is no equivalent in a standard vocabulary. Taking into account sub-words (e.g. the root of a word, or just a substring contained in it) can be more beneficial than considering the entire word.
Let’s try to understand why.
Consider the words ‘coooool’ and ‘woooooow’. Normally, either they would be removed during preprocessing or they would be transformed into the words ‘cool’ and ‘wow’, respectively. Even in the latter case one would lose some information, because ‘coooool’ and ‘woooooow’ have stronger emotional content compared to just ‘cool’ and ‘wow’.
By representing these words at the sub-word level, and encoding ‘co’, ‘ooo’, ‘ol’ and ‘wo’, ‘oooo’, ‘ow’ separately, one effectively preserves the distinction between them. Such informal words are very common in short text messages from social networks like Twitter and Reddit, and our chats too are increasingly packed with slang of the same format.
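As a minimal illustration of the idea (the fixed n-gram length and the '<', '>' boundary markers follow a FastText-like convention and are just one possible choice):

def char_ngrams(word, n=3):
    """Character n-grams of a word, with '<' and '>' as boundary markers."""
    padded = '<' + word + '>'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams('cool'))     # ['<co', 'coo', 'ool', 'ol>']
print(char_ngrams('coooool'))  # ['<co', 'coo', 'ooo', 'ooo', 'ooo', 'ool', 'ol>']
# The repeated 'ooo' n-grams preserve the stronger emotional tone of 'coooool'.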
Remember this for your next NLP project!

References

[1] Mikolov, T. et al., “Distributed Representations of Words and Phrases and their Compositionality”, Advances in Neural Information Processing Systems 26, pages 3111-3119, 2013.

[2] The Best Embedding Method for Sentiment Classification, https://medium.com/@bramblexu/blog-md-34c5d082a8c5