In this post I will be implementing the Word2vec algorithm with negative sampling from scratch using Python. Implementing it from scratch allows us to get a better grasp of the inner workings of word2vec's skip-gram and negative sampling approach.
For our analysis, we will use the Open American National Corpus (http://www.anc.org/), which consists of roughly 15 million spoken and written words from a variety of sources. Specifically, we will be using the subcorpus consisting of 4531 Slate magazine articles from 1996 to 2000.
How Word2vec works
In this tutorial I will skip over the usual introductory and abstract insights about Word2Vec, and get quickly into the details of the skip-gram neural network model. Word2vec embeddings are trained using the architecture shown in the following schema.
Word2Vec uses a trick often used in deep learning: we train a neural network with a single hidden layer to perform a task, but then the network is not actually used for the task it was trained for! Instead, the goal is just to learn the weights of the hidden layer. These weights are the vector representation that we are looking for, and for the specific network architecture we are using they form the Word2vec embeddings.
Skip-gram vs CBOW
The Skip-gram model learns to predict the probability of context words given a target word. The CBOW model, on the other hand, predicts the probability of a target word given its context. The context is the set of words around the target word within a given window.
Consider the following sentence:
I watched the movie with my friends
A skip-gram model will learn to predict the words watched, the, with and my, given the word movie, while in CBOW we will be predicting movie given the words watched, the, with and my. Both models analyze the sentences based on a context window. The size of this window is a parameter to be fixed before training the neural network.
The following figure shows how a sentence is analyzed with a window of 2.
In the Skip-gram approach, pairs of words from the context are associated with the center word. For example, in the above sentence we start by choosing the first word as the center word and the next two words as the context words. So “I” will be associated first with “watched” and then with “the” in an input/output relationship, as shown in the table below:
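The pair-generation step above can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical helper (`skipgram_pairs` is not part of any library), using naive whitespace tokenization and a window of 2:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs from a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        # every word within `window` positions of the center is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "I watched the movie with my friends".split()
for center, context in skipgram_pairs(sentence)[:4]:
    print(center, "->", context)
# I -> watched
# I -> the
# watched -> I
# watched -> the
```

Note that words near the edges of the sentence simply get fewer context words, which matches the figure's window of 2.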
Assuming that the corpus contains only this one sentence, one-hot encoding gives us the following associations:
Taking the first pair of words as an example, the following figure represents the input/output relation that the neural network will learn.
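The one-hot encoding itself can be sketched as follows. This is a toy illustration assuming a vocabulary built from just our example sentence, with words indexed in order of first appearance:

```python
sentence = "I watched the movie with my friends".split()
vocab = {word: idx for idx, word in enumerate(sentence)}  # 7 unique words

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

print(one_hot("I"))        # [1, 0, 0, 0, 0, 0, 0]  (input for the pair ("I", "watched"))
print(one_hot("watched"))  # [0, 1, 0, 0, 0, 0, 0]  (the corresponding target)
```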
Now, corpora typically contain millions of such sentences, and the length of the one-hot vector equals the number of unique words in the vocabulary. The vocabulary size is typically large, ranging from a few thousand to a few million words.
Assuming that the vocabulary contains 100,000 words and that the embedding dimension is 100 (the size of the hidden layer), each of the two weight matrices contains 10 million weights, i.e. 20 million weights to update over hundreds of millions of sentences. This demands a huge amount of time and memory to process.
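The back-of-the-envelope arithmetic looks like this (the vocabulary size and embedding dimension are the assumed values from the text):

```python
vocab_size = 100_000   # assumed number of unique words in the vocabulary
embedding_dim = 100    # assumed size of the hidden layer

input_weights = vocab_size * embedding_dim    # input -> hidden matrix
output_weights = embedding_dim * vocab_size   # hidden -> output matrix
print(input_weights + output_weights)  # 20000000 weights in total
```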
As we process the corpus, there are two “problems” with words like “the”. These words occur frequently in the corpus, so we make the following observations:
- Very frequent words like “the” appear in a lot of contexts and do not tell us much about the meaning of adjacent words.
- There will be many more samples of these words than we need.
Word2Vec implements “subsampling” to address this: each word in the training corpus may be deleted from training with a probability that depends on its frequency.
The removal of a sampled word is decided for each window, which reduces the number of examples presented to the neural network.
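One common form of this frequency-dependent rule, from the original word2vec paper, keeps each occurrence of a word with probability sqrt(t / f(w)), where f(w) is the word's relative frequency in the corpus and t is a small threshold. The sketch below assumes t = 1e-5; the exact constant and formula vary between the paper and the reference C implementation:

```python
import math

def keep_probability(word_freq, t=1e-5):
    """Probability of keeping one occurrence of a word whose relative
    corpus frequency is `word_freq` (subsampling heuristic, capped at 1)."""
    return min(1.0, math.sqrt(t / word_freq))

# A very frequent word like "the" (say 5% of all tokens) is mostly dropped,
# while a rare word is always kept.
print(keep_probability(0.05))   # ~0.014 -> kept about 1.4% of the time
print(keep_probability(1e-6))   # 1.0    -> always kept
```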
When training a neural network, we adjust all the weights slightly each time a training sample is presented. The size of the vocabulary means that the network has a very large number of weights, all of which would be updated slightly by every one of our training samples! This means that the network may take a long time to train and requires a lot of memory to store the weight matrices. Fortunately, there is a solution: negative sampling.
Negative sampling addresses this by having each training sample modify only a small percentage of the weights, rather than all of them. This is made possible by the fact that the input and output are one-hot encoded.
When training the network on the word pair (“the”, “movie”), only the output neuron corresponding to “movie” is required to be 1; all the other neurons of this large output vector are required to be 0.
In negative sampling, we randomly select a small number of “negative” words and update the weights only for them. We also still update the weights for our “positive” word (the word “movie” in our example).
These “negative samples” are selected using the frequency distribution of the words in the corpus: the probability of being selected as a negative sample depends on the word's frequency, so more frequent words are more likely to be selected.
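A common concrete choice, used in the original word2vec implementation, is to raise each word's count to the 3/4 power before normalizing, which dampens the dominance of very frequent words. A sketch under that assumption (the tiny corpus here is purely illustrative):

```python
import random
from collections import Counter

corpus = "the movie the the a movie watched".split()
counts = Counter(corpus)

# Raise counts to the 3/4 power, then normalize into a distribution.
powered = {w: c ** 0.75 for w, c in counts.items()}
total = sum(powered.values())
probs = {w: p / total for w, p in powered.items()}

def sample_negatives(k, exclude):
    """Draw k negative words from the distribution, skipping the positive word."""
    words = [w for w in probs if w != exclude]
    weights = [probs[w] for w in words]
    return random.choices(words, weights=weights, k=k)

print(sample_negatives(2, exclude="movie"))
```

With k negative samples, each training step only touches the output weights of k + 1 words instead of the full vocabulary.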
In the next blog post I will go into the implementation details of the word2vec methodology.