Do you speak Yalunka? Your computer soon might, along with thousands of other rare languages

2017-05-11

Ehsaneddin Asgari and Hinrich Schütze have proposed an interesting-looking new machine translation technique that takes grammar into consideration. Excerpt from a summary on Technology Review:

This data set [the Parallel Bible Corpus] is not big enough for the kind of industrial machine learning that Google and others perform. So Asgari and Schütze have come up with another approach based on the way tenses appear in different languages.

Most languages use specific words or letter combinations to signify tenses. So the new trick is to manually identify these signals in several languages and then use data-mining techniques to hunt through other translations looking for words or strings of letters that play the same role.

For example, in English the present tense is signified by the word “is,” the future tense by the word “will,” and the past tense by the word “was.” Of course, there are other signifiers too.

Asgari and Schütze’s idea is to find all these words in the English translation of the Bible, along with other examples from a handful of other language translations, and then look for words or letter strings that play the same role in other languages. For example, the letter string “-ed” also signifies the past tense in English.

But there is a twist. Asgari and Schütze do not start with English because it is a relatively old language with many exceptions to the rule, which makes it hard to learn.

Instead, they start with a set of Creole languages that have developed from a mixture of other languages. Because they are younger, Creole languages have had less time to develop these linguistic idiosyncrasies. And that means they generally contain better markers of linguistic features such as tense. “Our rationale is that Creole languages are more regular than other languages because they are young and have not accumulated ‘historical baggage’ that may make computational analysis more difficult,” they say.

One of these languages is Seychelles Creole, which uses the word “ti” to signify the past tense. For example, “mon travay” means “I work” in this language, while “mon ti travay” means “I worked” and “mon ti pe travay” means “I was working.” So “ti” is a good signifier of past tense.
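As a minimal sketch of what "a good signifier" means in practice, the three Seychelles Creole examples above can be checked for the marker "ti" with a plain token match. The function name `has_marker` and the whitespace tokenization are assumptions made for illustration, not something taken from the paper.

```python
def has_marker(sentence: str, marker: str) -> bool:
    """Return True if `marker` occurs as a whole whitespace-separated token."""
    return marker in sentence.lower().split()


# The three example sentences quoted in the excerpt, with their English glosses.
examples = {
    "mon travay": "I work",
    "mon ti travay": "I worked",
    "mon ti pe travay": "I was working",
}

# Sentences flagged as past tense by the presence of "ti".
past_sentences = [s for s in examples if has_marker(s, "ti")]

for s in past_sentences:
    print(f"{s!r} -> {examples[s]!r}")
```

Matching whole tokens rather than substrings keeps the check from firing on longer words that merely contain the letters "ti".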

Asgari and Schütze compile a list of past tense signifiers in 10 other languages and then mine the Parallel Bible Corpus for other words and letter strings that perform the same function. They repeat this for the present tense and future tense.
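The mining step can be sketched roughly as follows. Given the verse IDs where a known marker (such as "ti") fires in one translation, the code scores every word and word-final letter string in another translation by how well it lines up with those verses. This is only an illustration under stated assumptions: the F1-style score is a stand-in chosen for simplicity and is not necessarily the statistic used in the paper, the function names `features` and `rank_markers` are hypothetical, and the six English "verses" are toy sentences invented for the example.

```python
from collections import Counter


def features(sentence: str, max_n: int = 3) -> set:
    """Whole words plus word-final letter strings of up to `max_n` letters."""
    feats = set()
    for word in sentence.lower().split():
        word = word.strip('.,;:!?"')
        if not word:
            continue
        feats.add(word)
        for n in range(1, max_n + 1):
            if len(word) > n:
                feats.add("-" + word[-n:])
    return feats


def rank_markers(target_verses: dict, marked_ids: set, min_count: int = 2) -> list:
    """Rank target-language features by an F1-style overlap with `marked_ids`.

    `marked_ids` holds the verse IDs in which the known marker fired
    in the pivot translation (an assumed input for this sketch).
    """
    seen_anywhere = Counter()
    seen_in_marked = Counter()
    for verse_id, text in target_verses.items():
        for feat in features(text):
            seen_anywhere[feat] += 1
            if verse_id in marked_ids:
                seen_in_marked[feat] += 1

    scored = []
    for feat, hits in seen_in_marked.items():
        if hits < min_count:
            continue
        precision = hits / seen_anywhere[feat]
        recall = hits / len(marked_ids)
        f1 = 2 * precision * recall / (precision + recall)
        scored.append((feat, f1))
    # Best score first; on ties prefer the longer (more specific) string.
    return sorted(scored, key=lambda item: (-item[1], -len(item[0]), item[0]))


# Toy data (invented): verses 1-3 are the ones where the pivot marker fired.
toy_english = {
    1: "he walked to the town",
    2: "she talked with them",
    3: "they prayed all night",
    4: "he walks to the town",
    5: "she will talk with them",
    6: "they pray all night",
}
toy_marked = {1, 2, 3}

ranking = rank_markers(toy_english, toy_marked)
print(ranking[:3])
```

On this toy input the top-ranked candidate is the letter string "-ed", which matches the English past tense signifier mentioned in the excerpt. A real run would operate on verse-aligned Bible translations and would need a more careful statistic and tokenization.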

The results make for interesting reading. The technique reveals linguistic constructions related to tense in common languages such as “-ed” in English and “-te” in German, as well as the words and phrases that perform the same functions in much less common languages such as the past tense signifier “den” in the Gourmanchema language from Burkina Faso, and “yi” in Yalunka, spoken in Mali, and so on.

This work allows the researchers to create maps showing how languages using similar tense constructions are related (see diagram).

That’s interesting work. Asgari and Schütze have developed a computational method to analyze the way people use the past, present, and future tense in over 1,000 languages. This is the largest cross-language computational study ever undertaken. Indeed, the number of languages involved is an order of magnitude greater than in other studies.

The full paper is on arXiv here.