Word2Vec: an introduction

[This is a practical tutorial about Word2Vec. If you don’t want to copy-paste all the code, you can download the corresponding IPython notebook here]

In the past two years, Word2Vec, developed by Mikolov et al. (2013), a tool for learning continuous word embeddings from raw text has gained a lot of traction. Word2Vec attempts to associate words with points in space. The spatial distance between words then describes the relation (similarity) between these words. Words that are spatially close, are similar. Words are represented by continuous vectors over x dimensions. This example shows the relation between a number of words where each word is represented by a vector of two dimensions:

import seaborn as sb  
import numpy as np

words = ['queen', 'book', 'king', 'magazine', 'car', 'bike']  
vectors = np.array([[0.1,   0.3],  # queen  
                    [-0.5, -0.1],  # book
                    [0.2,   0.2],  # king
                    [-0.3, -0.2],  # magazine
                    [-0.5,  0.4],  # car
                    [-0.45, 0.3]]) # bike

sb.plt.plot(vectors[:,0], vectors[:,1], 'o')  
sb.plt.xlim(-0.6, 0.3)  
sb.plt.ylim(-0.3, 0.5)  
for word, x, y in zip(words, vectors[:,0], vectors[:,1]):  
    sb.plt.annotate(word, (x, y), size=12)

The displacement vector (the vector between two vectors) describes the relation between two words. This makes it possible to compare displacement vectors to find pairs of words that have a similar relation to each other. A famous example given in the original paper is the following analogy relation: queen : king :: woman : man which should be read as queen relates to king in the same way as woman relates to man. In algebraic formulation: v[queen] − v[king] = v[woman] − v[man]. This technique of analogical reasoning can be applied to e.g. question answering.

Word2Vec learns continuous word embeddings from plain text. But how? The model assumes the Distributional Hypothesis that words are characterized by words they hang out with. We can use that idea to estimate the probability of two words occurring near each other, e.g. what is the probability of the following words, given Cinderella, i.e $P(w|Cinderella)$?

import pandas as pd

s = pd.Series([0.1, 0.4, 0.01, 0.2, 0.05],  
              index=["pumpkin", "shoe", "tree", "prince", "luck"])
s.plot(kind='bar')  
sb.plt.ylabel("$P(w|Cinderella)$")  

Softmax Regression

Word2Vec is a very simple neural network with a single hidden layer. Have a look at the following picture. We’ll get into more details in a moment.

The model considers each word wo in turn along with a given context C (e.g. $w_O$ = Cinderella and $C$ = shoe). Now given this context, can we predict what $w_O$ should be? This is essentially a multiclass classification where we have as many labels as our vocabulary size $V$. Using softmax regression, we can compute a probability distribution $\hat{y}$ over the labels. The model attempts to minimize via Stochastic Gradient Descent the difference between the output distribution and the target distribution (which is a one-hot distribution which places all probability mass on the correct word). The difference between the two distribution is measured by the cross-entropy.

The neural network contains two matrics: $W$ and $W′$ of dimensions $V \times N$ and $N \times V$ respectively, where $V$ is the vocabulary size and $N$ the number of dimensions. Let’s make this all a little more concrete with a small example. Say we have a corpus containing the following documents:

sentences = [  
    'the king loves the queen', 
    'the queen loves the king',
    'the dwarf hates the king', 
    'the queen hates the dwarf',
    'the dwarf poisons the king', 
    'the dwarf poisons the queen']

We first transform these documents into bag-of-indices to enable easier computation:

from collections import defaultdict

def Vocabulary():  
    dictionary = defaultdict()
    dictionary.default_factory = lambda: len(dictionary)
    return dictionary

def docs2bow(docs, dictionary):  
    """Transforms a list of strings into a list of lists where 
    each unique item is converted into a unique integer."""
    for doc in docs:
        yield [dictionary[word] for word in doc.split()]

vocabulary = Vocabulary()  
sentences_bow = list(docs2bow(sentences, vocabulary))  

We now construct the two matrices $W$ and $W′$:

import numpy as np

V, N = len(vocabulary), 3  
WI = (np.random.random((V, N)) - 0.5) / N  
WO = (np.random.random((N, V)) - 0.5) / N  

Each row $i$ in $W$ corresponds to word $i$ and each column $j$ corresponds to the $j$-th dimension. Note that $W′$ isn’t simply the transpose of $W$ but a different matrix.

With the two matrices in place we continue with computing the posterior probability of an output word given some input word. Given an input word $w_I$, e.g. dwarf and its corresponding vector $W_I$, what is the probability that the output word $w_O$ is hates? Using the dot product $W_I \cdot W′^T_O$ we compute the distance between the input word dwarf and the output word hates:

print np.dot(WI[vocabulary['dwarf']],  
             WO.T[vocabulary['hates']])

which should print somethings like -0.0073. Now using softmax regression, we can compute the posterior probability P(w_O|w_I):

$$P(w_O|w_I) = y_i = \frac{exp(W_I \cdot W’^T_O)}{\sum^V_{j=1} exp(W_I \cdot W’^T_j)}$$

p = (np.exp(-np.dot(WI[vocabulary['dwarf']],  
                   WO.T[vocabulary['hates']])) /
     sum(np.exp(-np.dot(WI[vocabulary['dwarf']],
                       WO.T[vocabulary[w]])) 
         for w in vocabulary))
print p  

which will print something like: 0.14.

Updating the hidden-to-output layer weights

Word2Vec attempts to associate words with points in space. These points in space are represented by the continuous embeddings of the words. All vectors are initialized as random points in space, so we need to learn better positions. The model does so by maximizing the equation above. The corresponding loss function which we try to minimize is $E=−\log P(w_O|w_I)$. First, let’s focus on how to update the hidden-to-output layer weights. Say the target output word is Cinderella. Given the aformentioned one-hot target distribution $t$, the error can be computed as $t_j−y_j=e_j$, where $t_j$ is 1 iff $w_j$ is the actual output word. So, the actual output word is Cinderella and we compute the posterior probability of $P(pumpkin | tree)$, the error will be $0 – P(pumpkin | tree)$, because pumpkin isn’t the actual ouput word.

To obtain the gradient on the hidden-to-output weights, we compute $e_j \cdot h_i$, where $h_i$ is a copy of the vector corresponding to the input word (only holds with a context of a single word). Finally, using stochastic gradient descent, with a learning rate $\nu$ we obtain the weight update equation for the hidden to output layer weights:

$$W’^{T (t)}_j = W’^{T (t-1)}_j – \nu \cdot e_j \cdot h_j$$

Assume the target word is king and the context or input word $C$ is queen. Given this input word we compute for each word in the vocabulary the posterior probability $P(word | queen)$. If the word is our target word, the error will be $1 – P(word | queen)$; otherwise $0 – P(word | queen)$. Finally, using stocastic gradient descent we update the hidden-to-output layer weights:

target_word = 'king'  
input_word = 'queen'  
learning_rate = 1.0

for word in vocabulary:  
    p_word_queen = (
        np.exp(-np.dot(WO.T[vocabulary[word]],
                      WI[vocabulary[input_word]])) / 
        sum(np.exp(-np.dot(WO.T[vocabulary[w]],
                          WI[vocabulary[input_word]]))
            for w in vocabulary))
    t = 1 if word == target_word else 0
    error = t - p_word_queen
    WO.T[vocabulary[word]] = (
        WO.T[vocabulary[word]] - learning_rate * 
        error * WI[vocabulary[input_word]])
print WO  

Updating the input-to-hidden layer weights

Now that we have a way to update the hidden-to-output layer weights, we concentrate on updating the input-to-hidden layer weights. We need to backpropagate the prediction errors to the input-to-hidden weights. We first compute $EH$ which is an $N$ dimensional vector representing the sum of the hidden-to-output vectors for each word in the vocabulary weighted by their prediction error:

$$\sum^V_{j=1} e_j \cdot W’_{i,j} = {EH}_i$$

Again using the learning rate $\nu$ we update the weights using:

$$W^{(t)}_{w_I} = W^{(t-1)}_{w_I} – \nu \cdot EH$$

Let’s see how that works in Python:

WI[vocabulary[input_word]] = WI[vocabulary[input_word]] - learning_rate * WO.sum(1)  

If we now would recompute the probability of each word given the input word queen, we see that the probability of king given queen has gone up:

for word in vocabulary:  
    p = (np.exp(-np.dot(WO.T[vocabulary[word]],
                       WI[vocabulary[input_word]])) / 
         sum(np.exp(-np.dot(WO.T[vocabulary[w]],
                           WI[vocabulary[input_word]])) 
             for w in vocabulary))
    print word, p

Multi-word context

The model described above is the CBOW architecture of Word2Vec. However, we assumed that the context $C$ was only a single input word. This allowed us to simply copy the input vector to the hidden layer. If the context $C$ comprises multiple words, instead of copying the input vector we take the mean of their input vectors as our hidden layer:

$$h = \frac{1}{C} (W_1 + W_2 + \ldots + W_C)$$

The update functions remain the same except that for the update of the input vectors, we need to apply the update to each word in the contect $C$:

$$W^{(t)}_{w_I} = W^{(t-1)}_{w_I} – \frac{1}{C} \cdot \nu \cdot EH$$

Let’s see that in action. Again assume the target word is king. The context consists of two words: queen and loves.

target_word = 'king'  
context = ['queen', 'loves']  

We first take the average of the two context vectors:

h = (WI[vocabulary['queen']] + WI[vocabulary['loves']]) / 2  

Then we apply the hidden-to-output layer update:

for word in vocabulary:  
    p = (np.exp(-np.dot(WO.T[vocabulary[word]], h)) / 
         sum(np.exp(-np.dot(WO.T[vocabulary[w]], h)) 
             for w in vocabulary))
    t = 1 if word == target_word else 0
    error = t - p
    WO.T[vocabulary[word]] = (
        WO.T[vocabulary[word]] - learning_rate * 
        error * h)
print WO  

Finally we update the vector of each input word in the context:

for word in context:  
    WI[vocabulary[word]] = (
        WI[vocabulary[word]] - (1. / len(context)) *
        learning_rate * WO.sum(1))

And that’s it! (Well… to really apply this learning algorithm to large collections of text, we need to add some optimization because our implementation will be incredibly slow. Fortunately the paper describes some really clever ways to speed up the computation such as hierarchical soft-max and negative sampling. These topics are however beyond the scope of this introduction.

How long is an Arabian Night?

When the next night came, Dinarazad said to her sister Shahrazad: ‘In God’s name, sister, if you are not asleep, then tell us one of your stories!’ Shahrazad answered: ‘With great pleasure! I have heard tell, honoured King, that…’

I’m planning to do a series of blog posts about Alf Laylah Wa Laylah, the Stories of One Thousand and One Nights. This collection of folk tales, collected over many centuries by various authors, translators, and scholars across West, Central and South Asia and North Africa, forms a huge narrative wheel with an overarching plot, created by the frame story of Shahrazad.

The stories begin with the tale of king Shahryar and his brother, who, both deceived by their respective Sultanas, leave their kingdom, only to return when they have found someone who — in their view — was wronged even more. On their journey the two brothers encounter a huge jinn who carries a glass box containing a beautiful young woman. The two brothers hide as quickly as they can in a tree. The jinn lays his head on the girl’s lap and as soon as he is asleep, the girl demands the two kings to make love to her or else she will wake her ‘husband’. They reluctantly give in and the brothers soon discover that the girl has already betrayed the jinn ninety-eight times before. This exemplar of lust and treachery strengthens the Sultan’s opinion that all women are wicked and not to be trusted.

When king Shahryar returns home, his wrath against women has grown to an unprecedented level. To temper his anger, each night the king sleeps with a virgin only to execute her the next morning. In order to make an end to this cruelty and save womanhood from a “virgin scarcity”, Sharazad offers herself as the next king’s bride. On the first night, Sharazad begins to tell the king a story, but she does not end it. The king’s curiosity to know how the story ends, prevents him from executing Shahrazad. The next night Shahrazad finishes her story, and begins a new one. The king, eager to know the ending of this tale as well, postpones her execution once more. Using this strategy for One Thousand and One Nights in a labyrinth of stories-within-stories-within-stories, Shahrazad attempts to gradually move the king’s cynical stance against women towards a politics of love and justice (see Marina Warner’s Stranger Magic (2013)).

The first European version of the Nights was translated into French by Antoine Galland. Many translations (in different languages) followed, such as the (heavily criticized) English translation by Sir Richard Francis Burton entitled The Book of the Thousand and a Night (1885). This version is freely available from the Gutenberg project (see here), and will be the one I will explore here.

I am intrigued by the suspense created by Shahrazad’s story-telling skills, especially the “cliff-hanger” ending each night with she uses to avert her own execution (and possibly that of womanhood). Every night she tells the Sultan a story only to stop at dawn and she picks up the thread the next night. But does it really take the whole night to tell a particular story?

I am not aware of any exact numbers about how many words people speak per minute. Averages seem to fluctuate between 100 and 200 words per minute. Narrators are advised to use approximately 150 words per minute in audiobooks. I suspect that this number is a little lower for live storytelling and assume it lies around 130 words per minute (including pauses) (cf. this). Using this information, we can compute the time it takes to tell a particular story as follows:

$$ST(t) = \frac{\textrm{number of words in t}}{\textrm{words per minute}}$$

So, a story of 4000 words would take approximately 4000 / 130 = 30 minutes. (To be honest, this actually seems quite fast to me.) I took Burton’s translation of Alf Laylah Wa Laylah and computed for each night how long it would take to tell the story. The plot below visualizes this for each story. I add a smoothing curve (in blue) for interpretability. On average, each night only lasts nine minutes. Remarkably short!

Then Shahrazad reached the morning, and fell silent in the telling of her tale…