Word2Vec: an introduction

[This is a practical tutorial about Word2Vec. If you don’t want to copy-paste all the code, you can download the corresponding IPython notebook here]

In the past two years, Word2Vec, developed by Mikolov et al. (2013), a tool for learning continuous word embeddings from raw text has gained a lot of traction. Word2Vec attempts to associate words with points in space. The spatial distance between words then describes the relation (similarity) between these words. Words that are spatially close, are similar. Words are represented by continuous vectors over x dimensions. This example shows the relation between a number of words where each word is represented by a vector of two dimensions:

import seaborn as sb  
import numpy as np

words = ['queen', 'book', 'king', 'magazine', 'car', 'bike']  
vectors = np.array([[0.1,   0.3],  # queen  
                    [-0.5, -0.1],  # book
                    [0.2,   0.2],  # king
                    [-0.3, -0.2],  # magazine
                    [-0.5,  0.4],  # car
                    [-0.45, 0.3]]) # bike

sb.plt.plot(vectors[:,0], vectors[:,1], 'o')  
sb.plt.xlim(-0.6, 0.3)  
sb.plt.ylim(-0.3, 0.5)  
for word, x, y in zip(words, vectors[:,0], vectors[:,1]):  
    sb.plt.annotate(word, (x, y), size=12)

The displacement vector (the vector between two vectors) describes the relation between two words. This makes it possible to compare displacement vectors to find pairs of words that have a similar relation to each other. A famous example given in the original paper is the following analogy relation: queen : king :: woman : man which should be read as queen relates to king in the same way as woman relates to man. In algebraic formulation: v[queen] − v[king] = v[woman] − v[man]. This technique of analogical reasoning can be applied to e.g. question answering.

Word2Vec learns continuous word embeddings from plain text. But how? The model assumes the Distributional Hypothesis that words are characterized by words they hang out with. We can use that idea to estimate the probability of two words occurring near each other, e.g. what is the probability of the following words, given Cinderella, i.e $P(w|Cinderella)$?

import pandas as pd

s = pd.Series([0.1, 0.4, 0.01, 0.2, 0.05],  
              index=["pumpkin", "shoe", "tree", "prince", "luck"])

Softmax Regression

Word2Vec is a very simple neural network with a single hidden layer. Have a look at the following picture. We’ll get into more details in a moment.

The model considers each word wo in turn along with a given context C (e.g. $w_O$ = Cinderella and $C$ = shoe). Now given this context, can we predict what $w_O$ should be? This is essentially a multiclass classification where we have as many labels as our vocabulary size $V$. Using softmax regression, we can compute a probability distribution $\hat{y}$ over the labels. The model attempts to minimize via Stochastic Gradient Descent the difference between the output distribution and the target distribution (which is a one-hot distribution which places all probability mass on the correct word). The difference between the two distribution is measured by the cross-entropy.

The neural network contains two matrics: $W$ and $W′$ of dimensions $V \times N$ and $N \times V$ respectively, where $V$ is the vocabulary size and $N$ the number of dimensions. Let’s make this all a little more concrete with a small example. Say we have a corpus containing the following documents:

sentences = [  
    'the king loves the queen', 
    'the queen loves the king',
    'the dwarf hates the king', 
    'the queen hates the dwarf',
    'the dwarf poisons the king', 
    'the dwarf poisons the queen']

We first transform these documents into bag-of-indices to enable easier computation:

from collections import defaultdict

def Vocabulary():  
    dictionary = defaultdict()
    dictionary.default_factory = lambda: len(dictionary)
    return dictionary

def docs2bow(docs, dictionary):  
    """Transforms a list of strings into a list of lists where 
    each unique item is converted into a unique integer."""
    for doc in docs:
        yield [dictionary[word] for word in doc.split()]

vocabulary = Vocabulary()  
sentences_bow = list(docs2bow(sentences, vocabulary))  

We now construct the two matrices $W$ and $W′$:

import numpy as np

V, N = len(vocabulary), 3  
WI = (np.random.random((V, N)) - 0.5) / N  
WO = (np.random.random((N, V)) - 0.5) / N  

Each row $i$ in $W$ corresponds to word $i$ and each column $j$ corresponds to the $j$-th dimension. Note that $W′$ isn’t simply the transpose of $W$ but a different matrix.

With the two matrices in place we continue with computing the posterior probability of an output word given some input word. Given an input word $w_I$, e.g. dwarf and its corresponding vector $W_I$, what is the probability that the output word $w_O$ is hates? Using the dot product $W_I \cdot W′^T_O$ we compute the distance between the input word dwarf and the output word hates:

print np.dot(WI[vocabulary['dwarf']],  

which should print somethings like -0.0073. Now using softmax regression, we can compute the posterior probability P(w_O|w_I):

$$P(w_O|w_I) = y_i = \frac{exp(W_I \cdot W’^T_O)}{\sum^V_{j=1} exp(W_I \cdot W’^T_j)}$$

p = (np.exp(-np.dot(WI[vocabulary['dwarf']],  
                   WO.T[vocabulary['hates']])) /
         for w in vocabulary))
print p  

which will print something like: 0.14.

Updating the hidden-to-output layer weights

Word2Vec attempts to associate words with points in space. These points in space are represented by the continuous embeddings of the words. All vectors are initialized as random points in space, so we need to learn better positions. The model does so by maximizing the equation above. The corresponding loss function which we try to minimize is $E=−\log P(w_O|w_I)$. First, let’s focus on how to update the hidden-to-output layer weights. Say the target output word is Cinderella. Given the aformentioned one-hot target distribution $t$, the error can be computed as $t_j−y_j=e_j$, where $t_j$ is 1 iff $w_j$ is the actual output word. So, the actual output word is Cinderella and we compute the posterior probability of $P(pumpkin | tree)$, the error will be $0 – P(pumpkin | tree)$, because pumpkin isn’t the actual ouput word.

To obtain the gradient on the hidden-to-output weights, we compute $e_j \cdot h_i$, where $h_i$ is a copy of the vector corresponding to the input word (only holds with a context of a single word). Finally, using stochastic gradient descent, with a learning rate $\nu$ we obtain the weight update equation for the hidden to output layer weights:

$$W’^{T (t)}_j = W’^{T (t-1)}_j – \nu \cdot e_j \cdot h_j$$

Assume the target word is king and the context or input word $C$ is queen. Given this input word we compute for each word in the vocabulary the posterior probability $P(word | queen)$. If the word is our target word, the error will be $1 – P(word | queen)$; otherwise $0 – P(word | queen)$. Finally, using stocastic gradient descent we update the hidden-to-output layer weights:

target_word = 'king'  
input_word = 'queen'  
learning_rate = 1.0

for word in vocabulary:  
    p_word_queen = (
                      WI[vocabulary[input_word]])) / 
            for w in vocabulary))
    t = 1 if word == target_word else 0
    error = t - p_word_queen
    WO.T[vocabulary[word]] = (
        WO.T[vocabulary[word]] - learning_rate * 
        error * WI[vocabulary[input_word]])
print WO  

Updating the input-to-hidden layer weights

Now that we have a way to update the hidden-to-output layer weights, we concentrate on updating the input-to-hidden layer weights. We need to backpropagate the prediction errors to the input-to-hidden weights. We first compute $EH$ which is an $N$ dimensional vector representing the sum of the hidden-to-output vectors for each word in the vocabulary weighted by their prediction error:

$$\sum^V_{j=1} e_j \cdot W’_{i,j} = {EH}_i$$

Again using the learning rate $\nu$ we update the weights using:

$$W^{(t)}_{w_I} = W^{(t-1)}_{w_I} – \nu \cdot EH$$

Let’s see how that works in Python:

WI[vocabulary[input_word]] = WI[vocabulary[input_word]] - learning_rate * WO.sum(1)  

If we now would recompute the probability of each word given the input word queen, we see that the probability of king given queen has gone up:

for word in vocabulary:  
    p = (np.exp(-np.dot(WO.T[vocabulary[word]],
                       WI[vocabulary[input_word]])) / 
             for w in vocabulary))
    print word, p

Multi-word context

The model described above is the CBOW architecture of Word2Vec. However, we assumed that the context $C$ was only a single input word. This allowed us to simply copy the input vector to the hidden layer. If the context $C$ comprises multiple words, instead of copying the input vector we take the mean of their input vectors as our hidden layer:

$$h = \frac{1}{C} (W_1 + W_2 + \ldots + W_C)$$

The update functions remain the same except that for the update of the input vectors, we need to apply the update to each word in the contect $C$:

$$W^{(t)}_{w_I} = W^{(t-1)}_{w_I} – \frac{1}{C} \cdot \nu \cdot EH$$

Let’s see that in action. Again assume the target word is king. The context consists of two words: queen and loves.

target_word = 'king'  
context = ['queen', 'loves']  

We first take the average of the two context vectors:

h = (WI[vocabulary['queen']] + WI[vocabulary['loves']]) / 2  

Then we apply the hidden-to-output layer update:

for word in vocabulary:  
    p = (np.exp(-np.dot(WO.T[vocabulary[word]], h)) / 
         sum(np.exp(-np.dot(WO.T[vocabulary[w]], h)) 
             for w in vocabulary))
    t = 1 if word == target_word else 0
    error = t - p
    WO.T[vocabulary[word]] = (
        WO.T[vocabulary[word]] - learning_rate * 
        error * h)
print WO  

Finally we update the vector of each input word in the context:

for word in context:  
    WI[vocabulary[word]] = (
        WI[vocabulary[word]] - (1. / len(context)) *
        learning_rate * WO.sum(1))

And that’s it! (Well… to really apply this learning algorithm to large collections of text, we need to add some optimization because our implementation will be incredibly slow. Fortunately the paper describes some really clever ways to speed up the computation such as hierarchical soft-max and negative sampling. These topics are however beyond the scope of this introduction.