What happens when the CEO of Luminoso comes to my office and asks, "Can we do Arabic?"
In general, when people ask me whether Luminoso's software can handle a language we don't yet support – Estonian, Esperanto, Klingon, what have you – my answer is always "Yes, of course". Admittedly, I follow this up with "That is to say, you can put it into the system and see what happens" ... which is my answer because "handling" a language involves a number of complicated factors. We'd like to have some background knowledge in the language, and we'd like a word frequency list (see our chief science officer Robyn Speer's blog post from earlier this month for more on that topic).
But the thing we need most is software to parse the text: to break it up into words and to give us base forms we can use to represent those words. Without that, analysts are left looking at our software and thinking, "Well, here's what e-book users say about 'reading', and here's what they say about 'read', and here's what they say about 'reads', and ... why are these different concepts?". Of course, they're not different concepts, but if you did put Klingon into our system, it wouldn't know that be'Hom and be'Hompu' are the same concept. (Those mean "girl" and "girls". I had to look them up.) You would still find insights – you'd probably learn that "battle" and "happiness" are closely related in Klingon – they just wouldn't be quite as solid as they would be if we had a parser.
So when the CEO comes to my office and asks, "Can we do Arabic?", I give this explanation, ending with something like "So all we would need is software that can convert plurals to singulars and so forth." At which point she says to me, "Terrific! Get right on that" – and I am reminded that talking to your CEO is different than talking to most other people. (Of course, to be fair, she knew we already have software that would do most of the work; my real task would be evaluating it and working around any idiosyncrasies I found.)
In truth, though, while the project looked daunting, it also looked exciting. Developing Russian for our product was an interesting journey, but in some ways a very familiar one. Russian has a different alphabet, but like English it forms plurals by putting a suffix on a noun, and forms tenses and other verb variations by putting a suffix on a verb, and so forth. All a parser has to do is recognize the word, take some letters off the end, and voilà: a root word that represents the base concept! Arabic doesn't work that way at all.
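For contrast, the suffix-stripping approach that works for English or Russian can be sketched in a few lines. This is a toy illustration only: the suffix list and the `strip_suffix` helper are assumptions made for the example, not the parser Luminoso uses.

```python
# Toy suffix stripper. The suffix list is illustrative, not a real stemmer.
SUFFIXES = ("ing", "s")

def strip_suffix(word: str) -> str:
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["reading", "reads", "read"]
base_forms = {word: strip_suffix(word) for word in words}
print(base_forms)
```

All three forms collapse to the same base form, which is what lets an analyst see one concept instead of three.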
How does Arabic work?
It turns out that there were two basic challenges to parsing Arabic, and its approach to suffixes was only the first one.
Take the Arabic root كتب, which is just the three consonants k, t, and b. It means "write", and interspersing certain vowels will give you the words for "he wrote" (kataba), or "he writes" (yaktubu), or even "he dictates", along with other vowels for the "I" form, the "you" form, and so forth. Add different vowels and you get a slew of related nouns: "book" (kitaab) or "library" (maktaba) or "office" (maktab)...to say nothing of the vowels you would change those to if you wanted a plural like "books" (kutub) or "offices" (makaatib). All of which would be complicated enough, except that outside of the Qur'an, most of the vowels are almost never written, leaving a parser to reconstruct "yaktubu" from just "yktb", and to know that "yktb" is the same concept as the verb "write" but not the noun "book". This bears so little relation to English or French or Russian that I hesitated to even believe anyone could write a parser to handle it.
Fortunately, I didn't have to write the parser; once I had one that worked, I would merely need to offer some guidance, correct it when it went astray, and decide which of its many outputs I wanted (yaktubu? yktb? ktb? something in between?). Unfortunately, the language's rules for word formation were only the first problem; my second problem was that no one speaks Arabic.
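To make the root-and-vowel idea concrete, here is a minimal sketch that works on the Latin transliterations quoted above. It is a simplification under stated assumptions: it drops every vowel, whereas real Arabic script does write long vowels, and the `skeleton` and `contains_root` helpers are invented for this example.

```python
from collections import defaultdict

# Assumption: transliterated vowels are just a, i, u (long vowels are doubled).
VOWELS = set("aiu")

def skeleton(word: str) -> str:
    """Drop the vowels, leaving only the consonants of a transliteration."""
    return "".join(ch for ch in word if ch not in VOWELS)

def contains_root(word: str, root: str = "ktb") -> bool:
    """True if the root consonants occur, in order, within the word."""
    letters = iter(skeleton(word))
    return all(consonant in letters for consonant in root)

# Forms and glosses taken from the text above.
FORMS = {
    "kataba": "he wrote",
    "yaktubu": "he writes",
    "kitaab": "book",
    "maktaba": "library",
    "maktab": "office",
}

# Group the forms by what is left once the vowels are gone.
by_skeleton = defaultdict(list)
for form in FORMS:
    by_skeleton[skeleton(form)].append(form)

for consonants, forms in by_skeleton.items():
    print(consonants, forms)
```

Every form contains k, t, and b in order, yet once the vowels are gone a verb and a noun can land on the same skeleton, which is the ambiguity a real parser has to resolve from context.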
Now, obviously that can't be true; with over 240 million speakers, Arabic is the fifth most spoken language in the world. It turns out, however, that what no one speaks is standard Arabic – that is, Modern Standard Arabic, or MSA. When speaking formally or in an international setting, as at the United Nations or on Al-Jazeera, speakers do indeed use this standard form. Outside of such settings, speakers use their local dialect: Moroccan, Sudanese, Egyptian, Levantine, and many others, and that extends to writing, especially in online forums like Twitter. Often the local written form matches the local spoken form – not unknown in online English, where someone might write "deez" instead of "these", but much more common in written Arabic, and in this case rather than getting a nonsense word from a small variation in the spelling of "these", you get a word meaning "delirious". (Which actually happens.)
Early in the career of a computational linguist, you learn that most language-processing systems are designed to work on standard versions of languages: a French parser may not handle quirks of Québécois French, and an English parser probably used news articles as training data and won't know many of the words it sees on Twitter. Any Arabic parser would similarly be based on Modern Standard Arabic; could it be convinced to handle dialects?
Of course, there was also a third problem I haven't even mentioned: I don't speak Arabic. But here at Luminoso, we don't let minor technicalities stop us, so we contracted a native speaker to help me, I downloaded a few apps to teach me the alphabet, and off we went.
What a parser can (and can't) do
On the bright side, writing a program to parse Arabic wouldn't really be my job; I only needed to evaluate the ones available and build on those. Some initial exploration suggested that pretty good parsers did indeed already exist. All the same, putting Arabic in our system wouldn't be as simple as dropping one into our software and letting it roam free.
Many Arabic parsers are built on the grammatical structures seen in the Qur'an, which is written in language essentially the same as Modern Standard Arabic. Therefore, they may classify the prefix "l-" as ambiguous between the preposition "to" and an indicator of emphasis on the noun, but the latter is only used in literary Arabic (for instance, the Qur'an). We had to tell our software that if the parser categorized anything as "emphatic particle", it should go back and find another option.
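The "go back and find another option" rule can be sketched as a filter over a parser's candidate analyses. The `Analysis` record, its field names, and the tag strings below are hypothetical; a real parser's output format will differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class Analysis:
    """Hypothetical shape of one candidate parse for a word."""
    lemma: str
    prefix_tags: Tuple[str, ...]

# Tags we never want in modern, non-literary text (an assumed label).
REJECTED_TAGS = {"emphatic particle"}

def choose_analysis(candidates: List[Analysis]) -> Optional[Analysis]:
    """Return the first candidate that uses no rejected tag, if any."""
    for candidate in candidates:
        if not REJECTED_TAGS.intersection(candidate.prefix_tags):
            return candidate
    return None

# Two readings of a word carrying the "l-" prefix, best guess first.
candidates = [
    Analysis(lemma="kitaab", prefix_tags=("emphatic particle",)),
    Analysis(lemma="kitaab", prefix_tags=("preposition",)),
]
chosen = choose_analysis(candidates)
print(chosen)
```

Keeping the parser's own ranking and only skipping the rejected readings means the rule changes nothing for words that were never ambiguous in this way.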
But there were other, subtler problems inherent to the nature of Arabic grammar. An "a-" prefix on a verb might indicate a causative form; it's this form that turns "he writes" into "he dictates" (i.e., he causes someone to write), or "to know" into "to inform" (i.e., to cause someone to know something). On the other hand, an "a-" prefix can also indicate that "I" is the subject of the verb. A good Arabic parser may return both alternatives, but we found that we couldn't necessarily rely on our parser to guess which was right in a particular sentence. For this, I had to sit down with our native speaker and simply look at a lot of sentences and their parses, asking for each, "Did the parser return the right result here? What about here? If the result was wrong, was it at least a reasonable interpretation in context, or can we determine which result we wanted?"
In the end, we did have to accept some limitations of the parser. The Arabic word ما ("maa") means "what", but it is also used for negation in some circumstances, and deciding which was which proved too difficult for the computer. You see ambiguity in all languages, of course: in English, "can" might mean "is able to", in which case it's an ignorable common word, or it might mean "metal container", in which case we wouldn't want to ignore it. But most cases are easy to distinguish: you don't even need the whole sentence to know which "can" is which in the phrases "the can" or "can see". In this case, where both meanings are common function words, it became much harder to get reliable results.
The dialect problem never went away, but we did learn to minimize its effects. We included several common dialect spellings of function words on our "words to ignore" list, so that even if the parser thought they were nouns or verbs, we knew to skip them in our analysis. And we found that in an international data set like hotel reviews, there was enough Modern Standard Arabic for us to successfully gain insights from it. I'd want to fine-tune the program before loading, say, thousands of sentences of a single dialect, especially if that dialect varies significantly from the standard (Tunisian Arabic, for example, has influences from several European and African languages), but after the development we've already done, I'd be confident in our ability to do that fine-tuning.
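The "words to ignore" approach might look roughly like the sketch below. The word lists are small illustrative samples chosen for this example, not Luminoso's actual list, and the `(surface, part_of_speech)` token format is an assumption.

```python
from typing import Iterable, List, Tuple

# Illustrative samples only; a real list would be much longer.
STANDARD_STOPWORDS = {"ما", "في", "من"}   # what/not, in, from
DIALECT_STOPWORDS = {"مش", "شو", "ده"}    # common dialect function words
IGNORED = STANDARD_STOPWORDS | DIALECT_STOPWORDS

def content_terms(parsed_tokens: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep surface forms that are not on the ignore list.

    The parser's part-of-speech guess is deliberately not consulted, so a
    dialect function word mislabeled as a noun is still skipped.
    """
    return [surface for surface, _pos in parsed_tokens if surface not in IGNORED]

# A parser that only knows the standard language may mislabel the first token.
tokens = [("مش", "noun"), ("فندق", "noun")]
print(content_terms(tokens))
```

Filtering on the written form rather than the parser's label is the point: the list acts as a safety net for exactly the cases the parser gets wrong.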
A final unexpected challenge came when we looked at the results in our visualizer: many things were backwards! Not the words, fortunately, but arrows would point in the wrong direction, text would align flush against the wrong edge, even quotation marks would appear at the wrong edge of the text. It turns out that many, many programs, including web browsers, simply despair when you mix text that reads left-to-right (like English) with text that reads right-to-left (like Arabic).
It's as confusing as it sounds.
That one turned out to be far easier to fix than we expected: style sheets for web pages allow you to specify that the direction of the text is right-to-left, at which point the browser flips everything to look the way it should.
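Deciding when to apply that setting can be sketched with Python's standard `unicodedata` module, which reports each character's bidirectional class. The majority-vote heuristic and the helper names are assumptions made for this example; the only part taken from the fix described above is the style sheet's `direction` property.

```python
import unicodedata

# Bidirectional classes for right-to-left letters: "R" (e.g. Hebrew)
# and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}

def dominant_direction(text: str) -> str:
    """Return "rtl" if right-to-left letters outnumber left-to-right ones."""
    rtl = sum(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)
    ltr = sum(unicodedata.bidirectional(ch) == "L" for ch in text)
    return "rtl" if rtl > ltr else "ltr"

def inline_style(text: str) -> str:
    """Build a CSS declaration setting the text direction for this string."""
    return f"direction: {dominant_direction(text)};"

print(inline_style("كتاب"))
print(inline_style("book"))
```

Digits, spaces, and punctuation have neutral or weak classes and are ignored by the count, so only actual letters vote on the direction.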
In the end, I'm quite pleased at how well our system handles Arabic. Starting as a task that I knew would be hard and I feared would be simply impossible, this project has ended with the ability to find insights in Arabic text that I'd readily put up against our French or Russian capabilities. I can now tell people that I've taught a computer to understand Arabic, which may be an exaggeration, but it does still understand more Arabic than I do.
Adding Arabic also means that we can now find insights in the language of nearly 40% of the world's population, including all six languages of the United Nations, and that we cover four of the five most spoken languages in the world. Who knows, perhaps Hindi will be next (unless Klingon turns out to have higher demand than I anticipated, in which case, Heghlu'meH QaQ jajvam).