Orphans of the Arabian Nights

The Arabian Nights is a tremendously rich document containing centuries of cultural thought, ideas and history. The Nights contains stories from an enormous geographical region: from Morocco to Ethiopia to modern Iraq to India. In this context it is no surprise to find an overwhelming number of different topics and subjects in the stories, yet they form a remarkably coherent collection.

In her wonderful Stranger Magic (go read it, if you haven’t), Marina Warner (2012) notes that a number of very famous stories in the Nights (‘Aladdin and the Wonderful Lamp’ and ‘Ali Baba and the Forty Thieves’) were not originally part of the Nights but are ‘orphan’ tales, probably to be attributed to the French translator Galland. Evidence for this claim is found in the first Arabic version of Aladdin, which can be traced back quite directly to Galland’s writings in French. Another sign of his “bricolage”, as Warner calls it, is that the story of Aladdin

“pieces and patches many elements from different tales in the book, especially from ‘the true Aladdin’ (‘Aladdin of the Beautiful Moles’) and ‘Hasan of Basra’. […] Yet the plot of ‘Aladdin’, which upholds the rise of a worthless orphan boy to princely fortune, fame and power, oddly replicates the fate of the book itself, as does the story of Morgiana the plucky slave girl in ‘Ali Baba’, for she too marries up; it is as if Galland were unconsciously confessing his own craft and luck.” (Warner 2012, p. 58).

In this post I want to explore the Nights from a distance. More specifically, using a technique called Topic Modeling, I want to investigate this idea that the “bricolage” of Galland can be observed from the many pieces and patches, or as I will call them, topical connections, between the orphan stories and the rest of the collection.

Over the last 10 years, Topic Modeling has gained a lot of attention in Machine Learning and Information Retrieval. This technique allows researchers to browse a collection, not on the basis of single words, but on the basis of topics, such as ‘love’, ‘despair’, ‘war’ or ‘magic’. Scholars from the Humanities increasingly show interest in using these techniques, although they also show a healthy skepticism towards the meaningfulness of applying them: topic models generally provide information that most scholars are already aware of, because the topics are often of a very general nature. Although I wholeheartedly agree with these objections, I do find that a distant view on a sufficiently large collection can provide insights about the data, and sometimes even evidence for certain hypotheses, that are otherwise hard to obtain.

Constructing the Topic Model

There are quite a few Topic Modeling toolkits available. One of the best, which is also quite easy to use, is the one included in Mallet (MAchine Learning for LanguagE Toolkit). I used the modern English translation of the Nights by Malcolm Lyons (2009) from the Penguin Classics series. Contrary to popular belief, the 1001 nights do not contain 1001 stories: there are about 260 different stories told over 1001 nights. Interestingly, night 261 seems to be missing from the Lyons edition. I constructed a corpus from Lyons’ (2009) edition that consists of 1000 documents (one for each night) plus the orphan stories of ‘Aladdin’ and ‘Ali Baba’.

I then ran the Mallet Topic Model using 300 topics. Choosing the number of topics is always somewhat of a black art, and you need to experiment with a number of settings. The general idea is that if you use only a few topics, the model will provide a very general view on the data. If you choose many topics, you will obtain many fine-grained topical differences. The risk of having too many topics is that you lose generalization. For most corpora, 200 to 400 topics seems to be a good number.
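To make the idea of a topic model a bit more concrete, here is a minimal sketch of LDA (the model behind Mallet's topic modelling) fitted with a toy collapsed Gibbs sampler in plain Python. This is not Mallet's implementation and not the code used for this post: the function names `fit_lda` and `top_words`, the hyperparameter values and the tiny corpus are all assumptions made for illustration.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not Mallet).

    docs is a list of token lists. Returns the per-document topic
    distributions, the topic-word count table and the vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens of word v in topic k
    topic_total = [0] * num_topics                              # all tokens in topic k

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts ...
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # ... and resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed topic distribution for each document.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab


def top_words(topic_word, vocab, topic, n=10):
    """Return the n words with the highest count in the given topic."""
    ranked = sorted(range(len(vocab)), key=lambda v: topic_word[topic][v], reverse=True)
    return [vocab[v] for v in ranked[:n]]


# Made-up miniature corpus; the real corpus has one document per night.
toy_docs = [
    "ship sea captain island shore ship sea".split(),
    "king palace throne city king emirs".split(),
    "sea island ship wind captain".split(),
    "palace king son throne city".split(),
]
theta, topic_word, vocab = fit_lda(toy_docs, num_topics=2)
for topic in range(2):
    print(topic, top_words(topic_word, vocab, topic, n=5))
```

Each row of `theta` is the topic distribution of one document, which is the kind of output the rest of this post works with. With only a handful of topics the model can only draw coarse distinctions; raising `num_topics` allows finer ones at the cost of generalization.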

Here are some of the topics learned by the Topic Model:

  • Topic 222: god, night, pray, men, prayer, grant, pious, blessing, granted, prayers;
  • Topic 209: ship, sea, captain, island, board, shore, sailed, water, city, wind;
  • Topic 155: fish, fisherman, sea, net, baker, water, cast, bread, give, day;
  • Topic 114: men, thousand, muslims, fight, army, riders, battle, killed, swords, infidels.

These topics seem to deal with religion, seafaring, fishing and war. Some topics are rather general and appear throughout the Nights. The following plot visualizes the ten most common topics in the Nights as an area chart. The nights are on the x-axis and the probability of a topic occurring in a particular night is on the y-axis:

  • Topic 59: god, heard, asked, told, replied, don’t, hand, morning, made, night;
  • Topic 143: back, left, put, find, happened, leave, afraid, heard, good, thought;
  • Topic 111: night, morning, hundred, told, continued, heard, broke, king, fortunate, allowed;
  • Topic 121: sight, found, left, started, time, clothes, day, fell, walked, looked;
  • Topic 114: men, thousand, muslims, fight, army, riders, battle, killed, swords, infidels;
  • Topic 48: man, asked, told, back, gave, replied, shop, bring, home, don’t;
  • Topic 72: king, palace, ground, state, city, kissed, ordered, emirs, son, throne;
  • Topic 30: gharib, ajib, sahim, mirdas, brother, hundred, friend, al-kailajan, abraham, replied;
  • Topic 5: great, gave, time, men, brought, filled, honour, taking, joy, provided;
  • Topic 154: tears, left, time, god, recited, life, heart, lines, back, wept.
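How the ‘most common’ topics are selected is not spelled out above. One plausible reading, sketched here as an assumption, is to rank topics by their mean probability over all documents. The function name `most_common_topics` and the small probability table are made up for illustration.

```python
def most_common_topics(doc_topics, n=10):
    """Rank topics by their mean probability across all documents.

    doc_topics holds one row per document (night) and one column per
    topic. Returns a list of (topic index, mean probability) pairs.
    """
    num_docs = len(doc_topics)
    num_topics = len(doc_topics[0])
    mean_prob = [
        sum(row[k] for row in doc_topics) / num_docs
        for k in range(num_topics)
    ]
    ranked = sorted(range(num_topics), key=lambda k: mean_prob[k], reverse=True)
    return [(k, mean_prob[k]) for k in ranked[:n]]


# Made-up distributions for three nights over three topics.
example = [
    [0.7, 0.2, 0.1],
    [0.5, 0.1, 0.4],
    [0.6, 0.1, 0.3],
]
for topic, prob in most_common_topics(example, n=2):
    print(topic, round(prob, 3))
```

Plotting the columns of the selected topics against the night number would then give an area chart like the one described above.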

The most general and common topic in the Nights is about communication. No surprise. Other topics, such as Topic 114, deal with war. The plot nicely visualizes where in the Nights wars are fought. These general topics don’t tell us much, but they do function as a sanity check that our model is capable of finding common topics. Let’s now have a look at some relatively common topics that display a higher degree of granularity:

This mountainous landscape shows some interesting peaks of topic usage in certain nights. For example, around nights 357-370, Topic 204 shows a big burst. This topic has the following top words: prince, king, horse, princess, father, city, persian, palace, sorcerer, roof. In these nights Shahrazad tells the Sultan the story of the Ebony Horse. This story tells of a Persian sage who brings a flying ebony horse to King Sabur. In return the king promises him one of his daughters, but the princess is reluctant to marry this ugly old man. A lot of adventures follow, but the point here is that, as you can see, many of the key words of the story are present in the topic. Topics such as Topic 204 are story-specific topics. Topic 2 is another example, which shows a burst around nights 550-600. During these nights Shahrazad tells the story of Sindbad. The words in Topic 2 provide somewhat of a summary of this story: sindbad, goods, island, voyage, god, large, friends, baghdad, sailor, merchants.
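Bursts like these can also be located without eyeballing the plot. The sketch below, an illustration rather than the procedure used for this post, reports runs of consecutive nights in which a topic's probability stays at or above a threshold. The name `topic_bursts`, the threshold value and the example series are assumptions.

```python
def topic_bursts(probabilities, threshold=0.2):
    """Find runs of consecutive nights where a topic is prominent.

    probabilities[i] is the topic's probability in night i + 1.
    Returns a list of (first night, last night) pairs, 1-indexed.
    """
    bursts = []
    start = None
    for night, p in enumerate(probabilities, start=1):
        if p >= threshold and start is None:
            start = night                       # a burst begins
        elif p < threshold and start is not None:
            bursts.append((start, night - 1))   # the burst ended last night
            start = None
    if start is not None:                       # burst runs to the final night
        bursts.append((start, len(probabilities)))
    return bursts


# Made-up series for six nights.
print(topic_bursts([0.0, 0.1, 0.6, 0.7, 0.1, 0.8], threshold=0.5))
```

A story-specific topic such as Topic 204 should yield one long run covering the nights of its story, whereas a general topic would produce many short runs or a single run spanning most of the collection.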

Aladdin & Ali Baba

According to Warner (2012), the stories of Aladdin and Ali Baba are in a way a reflection of the Nights and share many topics from many stories. Let’s have a look at some of the topics present in these two stories:

The most dominant topic in Ali Baba is Topic 201, which is represented by the following words: ali baba, captain, qasim, oil, wife, gold, jar, city, husain, abdullah, coin. These words beautifully summarize the story. Aladdin’s most probable topic is Topic 266, which contains the following words: aladdin, princess, palace, magician, sultan, mother, aladdin’s, grand, don’t, sultan’s, majesty. Again, the words seem to capture (some of) the essence of the story.

Reflections of the Nights?

How are Aladdin and Ali Baba related in terms of their topics to the rest of the Nights? To find out, we can take the topic distributions of all nights and the alleged orphans and compute the pairwise distances between them. This results in a matrix of distances between all stories. One downside is that it is not straightforward to inspect such a large table. Therefore, I make use of another technique called t-SNE (developed by Van der Maaten & Hinton). This technique allows us to visualize the distances between the stories in a two-dimensional plot. To make a long story short, here it is (right-click and open the image in a new tab if it’s too small):
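Which distance measure to use between topic distributions is left open above. The sketch below assumes the Jensen-Shannon distance, a common choice for comparing probability distributions; it is an illustration and not necessarily the measure behind the plot. The names `js_distance` and `distance_matrix` and the example distributions are made up.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two distributions.

    The result lies between 0 (identical) and 1 (no overlap).
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_i = 0 contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against tiny negatives


def distance_matrix(distributions):
    """Symmetric matrix of pairwise Jensen-Shannon distances."""
    n = len(distributions)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = js_distance(distributions[i], distributions[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Made-up topic distributions for three documents.
docs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
]
for row in distance_matrix(docs):
    print([round(d, 3) for d in row])
```

In this toy example the first two documents end up close to each other and far from the third, which is the kind of structure a two-dimensional projection then tries to preserve.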

The plot displays a number of clusters. Far to the right we have a cluster containing the nights in which the story of ‘Hasan of Basra’ is told. The blue cluster at the top contains the nights in which Shahrazad tells the story of ‘‘Ajib and Gharib’. Exactly why these nights are so far away from the other stories is something I would like to look into another time. For now it is intriguing to see that ‘Aladdin’ and ‘Ali Baba’ are placed right next to each other. What is even more striking is that they occupy the center of the plot, suggesting that the distances between them and the stories in the Nights are relatively small and that they have many topical connections with these other stories. Although my analysis is in no way conclusive, the central position of the orphan stories does make an argument for Warner’s idea that the “bricolage” of Galland can be observed from the many pieces and patches from the rest of the Nights.
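Because t-SNE mainly preserves local neighbourhoods, a position in the middle of the plot is only suggestive. One way to check the ‘central position’ directly on the distance matrix, sketched here as a possible follow-up rather than something done for this post, is to rank documents by their mean distance to all other documents. The function name `rank_by_mean_distance`, the labels and the tiny matrix are assumptions.

```python
def rank_by_mean_distance(matrix, labels):
    """Rank documents from most to least central.

    matrix is a square table of pairwise distances; a document is more
    central when its mean distance to all other documents is smaller.
    """
    means = []
    for i, row in enumerate(matrix):
        others = [d for j, d in enumerate(row) if j != i]
        means.append(sum(others) / len(others))
    order = sorted(range(len(matrix)), key=lambda i: means[i])
    return [(labels[i], means[i]) for i in order]


# Made-up distances between three documents lying on a line: A - B - C.
toy_matrix = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
for label, mean in rank_by_mean_distance(toy_matrix, ["A", "B", "C"]):
    print(label, mean)
```

If the orphan stories really do share topics with much of the collection, ‘Aladdin’ and ‘Ali Baba’ should appear near the top of such a ranking.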