Word2vec Explained
Word vectors are an exploding area of research that companies such as Google and Facebook have invested in heavily, given their power to encode the semantic and syntactic meaning of individual words.
It's no coincidence that Spark implemented its own version of word2vec; implementations can also be found in Google's TensorFlow library and Facebook's Torch. More recently, Facebook announced a new real-time text-processing engine called DeepText, built on their pre-trained word vectors, showcasing their belief in this technology and the implications it has, or is having, on their business applications.
The inspiration behind word vectors
Traditional NLP approaches need to convert individual words, created through tokenization, into a format a computer algorithm can understand. One common way to do this is to create a TF-IDF matrix, which converts a single review of N tokens into a fixed-length representation. Two crucial things happen in the background here:
- An integer ID is assigned to each individual word. For example, the word friend might be assigned to 39,584, while the word bestie might be assigned to 99,928,472. Cognitively, a friend is very similar to a bestie; however, converting these tokens into integer IDs loses any notion of that similarity (as the short sketch after this list illustrates).
- By converting each token into an integer ID, you consequently lose the context in which the token was used. This is crucial, as you need to understand how the two tokens are used in their respective contexts to grasp the cognitive meaning of words and to train a computer to learn that friend and bestie are similar.
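To make the first point concrete, here is a minimal Python sketch. The vocabulary mapping and the integer IDs are made up purely for illustration:

```python
# A made-up vocabulary mapping tokens to arbitrary integer IDs (the IDs are illustrative).
vocabulary = {"friend": 39584, "bestie": 99928472, "popcorn": 11, "alien": 4071}

review = "my bestie is my best friend".split(" ")
token_ids = [vocabulary.get(token, -1) for token in review]  # -1 marks out-of-vocabulary tokens

print(token_ids)  # [-1, 99928472, -1, -1, -1, 39584]

# Nothing about the numbers 99928472 and 39584 tells an algorithm
# that 'bestie' and 'friend' mean roughly the same thing.
```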
This limited functionality of traditional NLP techniques led researchers such as Tomas Mikolov and others to explore methods that employ neural networks to better encode the meaning of words as a vector of N numbers (for example, vector bestie = [0.574, 0.821, 0.756, ..., 0.156]).
When calculated properly, you'll discover that the vectors for bestie and friend are close in this vector space, where closeness is measured by cosine similarity. It turns out that these vector representations, also known as word embeddings, give you the ability to capture a much richer understanding of text.
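Cosine similarity is easy to compute once you have the vectors. Below is a small Python sketch using hypothetical embeddings for bestie, friend, and soda; the exact numbers are illustrative, not taken from a trained model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: values near 1.0 mean very similar direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings, purely for illustration.
bestie = np.array([0.574, 0.821, 0.756, 0.156])
friend = np.array([0.588, 0.790, 0.743, 0.140])
soda   = np.array([0.012, 0.151, 0.903, 0.644])

print(cosine_similarity(bestie, friend))  # close to 1.0 -> very similar
print(cosine_similarity(bestie, soda))    # noticeably lower -> less related
```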
Introduction to word vectors
Word2vec represents a family of algorithms that try to encode the semantic and syntactic meaning of words as a vector of N numbers (hence, word-to-vector: word2vec).
A word vector, in its simplest form, is merely a one-hot encoding, whereby every element in the vector represents a word in your vocabulary, and the given word is encoded with 1 while all the other elements are encoded with 0.
Suppose your vocabulary only has the following movie terms: Popcorn, Candy, Soda, Tickets, and Blockbuster. Following the logic, you could encode the term Tickets as follows:
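A minimal Python sketch of that encoding, assuming the vocabulary is kept in the fixed order listed above:

```python
# The vocabulary in a fixed order; the position of each word defines its slot in the vector.
vocabulary = ["Popcorn", "Candy", "Soda", "Tickets", "Blockbuster"]

def one_hot(word: str) -> list:
    """Return a vector with a 1 in the word's position and 0 everywhere else."""
    return [1 if token == word else 0 for token in vocabulary]

print(one_hot("Tickets"))  # [0, 0, 0, 1, 0]
```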
Using this simplistic form of encoding, which is what you do when you create a bag-of-words matrix, you cannot make any meaningful comparison between words (for example, is Popcorn related to Soda? Is Candy similar to Tickets?).
Word2vec tries to remedy these limitations via distributed representations of words. Suppose that each word is now represented by a distributed vector of, say, 300 numbers, whereby the meaning of every word in your vocabulary is spread as a distribution of weights across those 300 elements. Now, your picture would drastically change to look something like this:
Now, given this distributed representation of individual words as 300 numeric values, you can make meaningful comparisons between words using cosine similarity. That is, using the vectors for Tickets and Soda, you can determine that the two terms are not closely related, given their vector representations and their cosine similarity to one another.
And that's not all you can do! Mikolov et al. performed arithmetic on word vectors to make some incredible findings in their ground-breaking paper; in particular, the authors posed the following math problem to their word2vec dictionary:
V(King) - V(Man) + V(Woman) ~ V(Queen)
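You can reproduce the mechanics of this arithmetic with a few lines of Python. The vectors below are tiny, hand-made toy values chosen purely for illustration; a real word2vec model would learn hundreds of dimensions from a large corpus:

```python
import numpy as np

# Tiny, hand-made 3-dimensional vectors, purely for illustration.
vectors = {
    "king":  np.array([0.80, 0.70, 0.10]),
    "man":   np.array([0.10, 0.90, 0.20]),
    "woman": np.array([0.10, 0.10, 0.80]),
    "queen": np.array([0.80, 0.00, 0.75]),
    "apple": np.array([0.00, 0.50, 0.50]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# V(King) - V(Man) + V(Woman), then look for the closest word that was not part of the question.
target = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
answer = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(answer)  # 'queen' for these toy vectors
```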
It turns out that these distributed vector representations of words are extremely powerful for comparison questions (for example, is A related to B?), which is all the more remarkable when you consider that this semantic and syntactic learned knowledge comes from observing lots of words and their contexts, with no other information necessary. You did not have to tell your machine that Popcorn is a food, a noun, singular, and so on.
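Since Spark ships its own Word2Vec implementation (as mentioned earlier), here is a minimal PySpark sketch of what training and querying word vectors looks like in practice. The toy corpus below is made up, and with so little text the resulting "synonyms" will not be meaningful; the point is only to show the shape of the API:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("word2vec-demo").getOrCreate()

# Each row is one tokenized document; this toy corpus is far too small to learn good vectors.
docs = spark.createDataFrame([
    ("a film with plenty of smart ideas regarding the impact of alien contact".split(" "),),
    ("my best friend is basically my bestie".split(" "),),
    ("popcorn candy soda and tickets at the blockbuster".split(" "),),
], ["text"])

word2vec = Word2Vec(vectorSize=50, minCount=0, inputCol="text", outputCol="vectors")
model = word2vec.fit(docs)

# Nearest neighbours of 'friend' by cosine similarity of the learned vectors.
model.findSynonyms("friend", 3).show()

spark.stop()
```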
How is this possible? Word2vec employs the power of neural networks in a supervised fashion to learn the vector representation of words (which is an unsupervised task). If that sounds a bit like an oxymoron at first, fear not! Everything will be made clearer with the Continuous Bag-of-Words model, commonly referred to as just the CBOW model.
The CBOW model
First, consider a simple movie review, which will act as your base example in the next few sections:
Now, imagine that you have a window that acts as a slider, which includes the main word currently in focus (highlighted in red in the following image), in addition to the five words before and after the focus word (highlighted in yellow):
The words in yellow form the context surrounding the current focus word, ideas. These context words act as inputs to your feed-forward neural network, which has one hidden layer and one output layer, whereby each context word is encoded via one-hot-encoding (all other elements are zeroed out):
In the preceding diagram, the total size of your vocabulary (that is, post-tokenization) is denoted by a capital C, whereby you perform one-hot-encoding of each word within the context window--in this case, the five words before and after your focus word, ideas.
At this point, you propagate your encoded vectors to the hidden layer via a weighted sum, just like in a normal feed-forward neural network, whereby you specify beforehand the number of units in your hidden layer (which is also the dimensionality of the resulting word vectors).
Finally, a softmax function is applied from the single hidden layer to the output layer, which attempts to predict the current focus word. This is achieved by maximizing the conditional probability of observing the focus word (ideas), given the context of its surrounding words (film, with, plenty, of, smart, regarding, the, impact, of, and alien). Notice that the output layer is the same size as your initial vocabulary, C.
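To tie the pieces together, here is a small numpy sketch of a single CBOW forward pass under simplifying assumptions: a toy vocabulary, untrained random weights, and a plain softmax output (in practice this step is sped up with tricks such as hierarchical softmax):

```python
import numpy as np

rng = np.random.default_rng(0)

vocabulary = ["film", "with", "plenty", "of", "smart", "ideas",
              "regarding", "the", "impact", "alien"]
C = len(vocabulary)   # vocabulary size, as in the diagram
H = 8                 # number of hidden units (the word-vector dimensionality)

W_in = rng.normal(size=(C, H))    # input-to-hidden weights: one row per vocabulary word
W_out = rng.normal(size=(H, C))   # hidden-to-output weights

def one_hot(word):
    v = np.zeros(C)
    v[vocabulary.index(word)] = 1.0
    return v

# The context words surrounding the focus word 'ideas'.
context = ["film", "with", "plenty", "of", "smart", "regarding", "the", "impact", "of", "alien"]

# Hidden layer: average of the projected context words (a weighted sum of the one-hot inputs).
hidden = np.mean([one_hot(w) @ W_in for w in context], axis=0)

# Output layer: softmax over the whole vocabulary, giving P(word | context).
scores = hidden @ W_out
probabilities = np.exp(scores - scores.max())
probabilities /= probabilities.sum()

print(vocabulary[int(np.argmax(probabilities))])  # weights are untrained, so this prediction is essentially random
```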
Herein lies the interesting property of both families of word2vec algorithms: word2vec is an unsupervised learning algorithm at heart, yet it relies on a supervised learning setup to learn individual word vectors.