In the last post I introduced Word2vec and presented it using intuitive concepts. In this post I will implement these ideas in Python.
I will be using the Open American National Corpus (http://www.anc.org/), which consists of roughly 15 million spoken and written words from a variety of sources. Specifically, we will be using the subcorpus that consists of 4531 Slate magazine articles from 1996 to 2000.
I have already downloaded and preprocessed the corpus. You can download the cleaned version here.
import numpy as np
import pandas as pd
import string
import math
import collections
from nltk import tokenize
Let's import the data using the pandas library:
data = pd.read_csv('slate.csv')
corpus = data['text'].tolist()
We will first examine word frequencies, looking at the most frequent terms in this corpus. The top terms are what we can consider stop-words, since they are common across most sentences and do not capture much of the semantic meaning. As we move down the list, words that play a more important role in conveying the meaning within a sentence or document start to appear.
data["tokens"]= [tokenize.word_tokenize(x) for x in corpus] words = [y for x in data["tokens"] for y in x] word_freq = collections.Counter(words) print(word_freq.most_common(30))
Our goal is to train a set of word embeddings for the corpus above.
We will use a skip-gram model with negative sampling.
The next step consists of creating a class that represents a word token. This class will hold the text of the word and its count in the corpus.
class Word:
    def __init__(self, word):
        self.word = word
        self.count = 0
Next, we will build the vocabulary for the whole corpus:
vocabulary = {}
word_count = 0
for line in data['tokens']:
    for token in line:
        if token not in vocabulary:
            vocabulary[token] = Word(token)
        vocabulary[token].count += 1
        word_count += 1
        if word_count % 1000000 == 0:
            print("\rProcessed %d words" % word_count)

print('Total words in corpus: %d' % word_count)
print('Vocabulary size: %d' % len(vocabulary))
We will now train the neural network of the word2vec model. Let’s define our model parameters:
- dim = dimension of the word vectors; typical values are 50, 100 or 300, but we can choose any value we want
- win = context window size (number of tokens inside the window)
- start_alpha = starting learning rate
- neg = number of samples for negative sampling
- min_count = minimum number of mentions for a word to be included in the vocabulary
dim = 100
win = 10
start_alpha = 0.05
neg = 10
min_count = 5
We will filter out rare words that have fewer mentions than our min_count threshold, and map all of these words to a special out-of-vocabulary token.
# Truncate the dictionary and map rare words to the <unk> token
reduced_dic = []
reduced_dic.append(Word('<unk>'))
unk_hash = 0
count_unk = 0
for k, word in vocabulary.items():
    if word.count < min_count:
        count_unk += 1
        reduced_dic[unk_hash].count += word.count
    else:
        reduced_dic.append(word)

# Sort by frequency, most common words first
reduced_dic.sort(key=lambda word: word.count, reverse=True)

# Map each word to its index in the sorted vocabulary
vocab_hash = {}
for i, word in enumerate(reduced_dic):
    vocab_hash[word.word] = i

vocabulary = reduced_dic
vocab_size = len(vocabulary)
print('Words mapped to <unk>: %d' % count_unk)
print('Reduced vocabulary size: %d' % vocab_size)
Negative sampling
As seen in part 1 of this tutorial, the probability of picking a word is equal to the number of times this word appears in the corpus, divided by the total number of words in the corpus. This is expressed by the following equation:
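In symbols, with $f(w_i)$ denoting the number of times word $w_i$ occurs in a corpus with $V$ distinct words:

$$P(w_i) = \frac{f(w_i)}{\sum_{j=1}^{V} f(w_j)}$$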
The original authors of the Word2vec paper tried a number of variations on this equation, and the best variation was to raise the word counts to the 3/4 power:
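With the same notation, the distribution actually used for drawing negative samples becomes:

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{V} f(w_j)^{3/4}}$$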
# Create the table used for negative sampling
exponent = 0.75
normalization_factor = sum([math.pow(t.count, exponent) for t in vocabulary])  # Normalizing constant

table_size = int(1e8)  # Length of the unigram table
table = np.zeros(table_size, dtype=np.int64)

p = 0  # Cumulative probability
i = 0
for j, unigram in enumerate(vocabulary):
    p += float(math.pow(unigram.count, exponent)) / normalization_factor
    while i < table_size and float(i) / table_size < p:
        table[i] = j
        i += 1
The following function implements the roulette-wheel principle mentioned before, which simply selects words at random from the table generated by the code above.
def sample(table, count):
    indices = np.random.randint(low=0, high=len(table), size=count)
    return [table[i] for i in indices]
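As a quick sanity check (the training loop below inlines the same logic rather than calling this helper), we can draw neg indices and look at the corresponding words; frequent words occupy more slots in the table, so they come up more often:

# Draw `neg` word indices from the noise distribution
negative_ids = sample(table, neg)
print([vocabulary[i].word for i in negative_ids])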
Training the Model
We are now ready to train the word2vec model. The approach is to train the two-layer (syn0, syn1) neural network by iterating over the sentences in our corpus and adjusting the weights to maximize the probabilities of context words given a target word (skip-gram) with negative sampling.
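Concretely, for an input word $w_I$ and an observed output word $w_O$, negative sampling maximizes the following objective from the original paper, where $v_w$ denotes the row of syn0 for word $w$, $v'_w$ the corresponding row of syn1, $\sigma$ is the sigmoid function, and the $k$ negative words are drawn from the noise distribution $P_n(w)$ built above:

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) \;+\; \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]$$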
Since the input vector is mostly zeros and negative sampling allows only a small number of weights to be updated at each step, the operations reduce to a vector product for the positive word and one for each negative sample.
We will first initialize the syn0 and syn1 matrices:
# Sigmoid function, clipped for numerical stability
def sigmoid(z):
    if z > 6:
        return 1.0
    elif z < -6:
        return 0.0
    else:
        return 1 / (1 + math.exp(-z))

# Initialize syn0 with a uniform distribution on the interval [-0.5/dim, 0.5/dim]
syn0 = np.random.uniform(low=-0.5/dim, high=0.5/dim, size=(vocab_size, dim))

# Initialize syn1 with zeros
syn1 = np.zeros(shape=(vocab_size, dim))
The code for training the embeddings is below:
current_sent = 0
truncated_vocabulary = [x.word for x in vocabulary]
corpus = data['tokens'].tolist()

while current_sent < len(corpus):
    line = corpus[current_sent]
    # Map each token to its vocabulary index, rare words to <unk>
    sent = [vocab_hash[token] if token in vocab_hash else vocab_hash['<unk>'] for token in line]

    for sent_pos, token in enumerate(sent):
        # Sample a window size between 1 and win
        current_win = np.random.randint(low=1, high=win+1)
        context_start = max(sent_pos - current_win, 0)
        context_end = min(sent_pos + current_win + 1, len(sent))
        context = sent[context_start:sent_pos] + sent[sent_pos+1:context_end]

        for context_word in context:
            embed = np.zeros(dim)
            # One positive example (the centre word) plus `neg` negative samples
            classifiers = [(token, 1)] + [(target, 0) for target in table[np.random.randint(len(table), size=neg)]]
            for target, label in classifiers:
                z = np.dot(syn0[context_word], syn1[target])
                p = sigmoid(z)
                g = start_alpha * (label - p)
                embed += g * syn1[target]               # Accumulate the update for syn0
                syn1[target] += g * syn0[context_word]  # Update the output weights
            syn0[context_word] += embed                 # Update the input weights
        word_count += 1

    current_sent += 1
    if current_sent % 2000 == 0:
        print("\rReading sentence %d" % current_sent)

embedding = dict(zip(truncated_vocabulary, syn0))
print("Trained embeddings")
The following excerpt shows how the weights are updated:
for context_word in context:
    embed = np.zeros(dim)
    classifiers = [(token, 1)] + [(target, 0) for target in table[np.random.randint(len(table), size=neg)]]
    for target, label in classifiers:
        z = np.dot(syn0[context_word], syn1[target])
        p = sigmoid(z)
        g = start_alpha * (label - p)
        embed += g * syn1[target]
        syn1[target] += g * syn0[context_word]
    syn0[context_word] += embed
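In terms of the objective above, the excerpt performs one stochastic gradient step per (context word $c$, target $t$, label) triple, with learning rate $\alpha$ (start_alpha) and $z_t = \mathrm{syn0}[c] \cdot \mathrm{syn1}[t]$:

$$g_t = \alpha\,(\mathrm{label}_t - \sigma(z_t)), \qquad \mathrm{syn1}[t] \leftarrow \mathrm{syn1}[t] + g_t\,\mathrm{syn0}[c], \qquad \mathrm{syn0}[c] \leftarrow \mathrm{syn0}[c] + \sum_{t} g_t\,\mathrm{syn1}[t]$$

The label is 1 for the positive (centre) word and 0 for the negative samples; the update to syn0[c] is accumulated in embed using the pre-update rows of syn1 and applied once per context word.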
To make sure that our embeddings make sense, let’s examine the cosine similarity between two similar words such as king and queen, and two dissimilar words (man, kettle). We would expect the similar words to exhibit higher similarity.
from sklearn.metrics.pairwise import cosine_similarity

print(cosine_similarity([embedding['king']], [embedding['queen']]))
print(cosine_similarity([embedding['man']], [embedding['kettle']]))
And that’s it. This post is only meant to give a better understanding of Word2vec; if you want to train embedding vectors in practice, you are better off using a library such as gensim.
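For reference, here is a minimal gensim sketch (assuming gensim ≥ 4.0, where the dimensionality argument is called vector_size), reusing the hyperparameters defined above:

from gensim.models import Word2Vec

# Skip-gram (sg=1) with negative sampling, same hyperparameters as above
model = Word2Vec(
    sentences=data['tokens'].tolist(),
    vector_size=dim,
    window=win,
    negative=neg,
    min_count=min_count,
    alpha=start_alpha,
    sg=1,
)
print(model.wv.similarity('king', 'queen'))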
The source code as well as the data used in this tutorial can be downloaded from my GitHub repository here.