July 16, 2006

The Awful Turkish Language

I wrote this about a year and a half ago; I'm cleaning out my drafts folder.

There are reasons why one might think that Turkey should not be admitted to the European Union, but surely the silliest must be that Turkish is not an Indo-European language. Following Phersu, I can just imagine the consequences of taking this seriously. First, the Basque-speaking provinces of France and Spain leave the EU, along with Hungary, Finland, Estonia and Malta. But then, of course, India and Pakistan will submit rival applications to join, closely followed, no doubt, by the Iraqi Kurds. The whole idea is so stupid that I can't believe it was meant seriously, or even guess what Giscard d'Estaing thought "Indo-European" meant.

That said, Turkish does have features which are absent or attenuated in (most) Indo-European languages. (Disclaimer: I do not speak Turkish.) For instance, it's highly agglutinative, forming new words by adding suffixes to roots, and doing so recursively. (German does this too, but to nowhere near the same degree.) This leads to words like yapabilecektiyseniz, "if you were going to be able to do". (Readers may amuse themselves by analyzing this example using the Turkish Suffix Dictionary.) Moreover, these words are not oddities, like "antidisestablishmentarianistic", but are in everyday use. I once heard a talk by a computational linguist specializing in Turkish (Gerjan van Schaaik, who oddly seems to have no web presence) where he mentioned that if one studied the corpus of Turkish daily newspapers, one could easily build a lexicon of 500,000 entries, and still cover only 95% of the words in the corpus. (I can't tell, from my notes, whether van Schaaik was talking about something that had actually been done, or just making a rough estimate.) This property of Turkish becomes very important for a number of technologies, including one without which the modern world would simply grind to a halt: spam filtering.
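To make the coverage figure concrete, here is a minimal sketch of how one might measure it. I am assuming that "cover" means the fraction of running words (tokens) in the corpus that appear in a lexicon built from the most frequent word forms; the function name and the toy corpus are mine, not van Schaaik's.

```python
from collections import Counter


def coverage_by_lexicon_size(tokens, size):
    """Fraction of running tokens covered by the `size` most frequent word forms."""
    if not tokens:
        raise ValueError("corpus is empty")
    counts = Counter(tokens)
    # Tokens accounted for by the `size` most common surface forms.
    covered = sum(count for _, count in counts.most_common(size))
    return covered / len(tokens)


# Toy corpus: one root ("ev", house) showing up under several suffixed forms.
tokens = ["ev", "ev", "ev", "evler", "evler", "evde", "evlerde", "evim"]

for size in (1, 2, 3, 5):
    print(size, coverage_by_lexicon_size(tokens, size))
```

Every distinct suffixed form counts as its own lexicon entry here, which is why a lexicon of whole words needs so many entries in an agglutinative language: a single root already accounts for five of them in this tiny example.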

Levent Özgür, Tunga Güngör and Fikret Gürgen, "Adaptive anti-spam filtering for agglutinative languages: a special case for Turkish", Pattern Recognition Letters 25 (2004): 1819--1831 [PDF reprint via Prof. Güngör]
Abstract: We propose anti-spam filtering methods for agglutinative languages in general and for Turkish in particular. The methods are dynamic and are based on Artificial Neural Networks (ANN) and Bayesian Networks. The developed algorithms are user-specific and adapt themselves with the characteristics of the incoming e-mails. The algorithms have two main components. The first one deals with the morphology of the words and the second one classifies the e-mails by using the roots of the words extracted by the morphological analysis. Two ANN structures, single layer perceptron and multi-layer perceptron, are considered and the inputs to the networks are determined using binary model and probabilistic model. Similarly, for Bayesian classification, three different approaches are employed: binary model, probabilistic model, and advanced probabilistic model. In the experiments, a total of 750 e-mails (410 spam and 340 normal) were used and a success rate of about 90% was achieved.
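The two-stage idea in the abstract (reduce words to roots, then classify on the roots) can be sketched roughly as follows. This is not the authors' code: the suffix list and `crude_root` are a hypothetical stand-in for real morphological analysis, and the classifier is an ordinary Bernoulli naive Bayes with add-one smoothing, which is only my guess at what a "binary model" might look like.

```python
import math
from collections import Counter

# Hypothetical, tiny suffix list for illustration only; a real system would
# use a proper morphological analyzer rather than string matching.
SUFFIXES = ("lerde", "larda", "ler", "lar", "de", "da", "im")


def crude_root(word):
    """Strip the longest matching suffix, keeping at least two characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


class BinaryNaiveBayes:
    """Bernoulli naive Bayes over the presence or absence of each root."""

    def __init__(self):
        self.doc_counts = Counter()  # label -> number of training messages
        self.root_counts = {}        # label -> Counter of messages containing each root
        self.vocab = set()

    def train(self, words, label):
        roots = {crude_root(w) for w in words}
        self.doc_counts[label] += 1
        self.root_counts.setdefault(label, Counter()).update(roots)
        self.vocab |= roots

    def log_score(self, words, label):
        roots = {crude_root(w) for w in words}
        n = self.doc_counts[label]
        score = math.log(n / sum(self.doc_counts.values()))
        for root in self.vocab:
            # Add-one smoothing keeps p strictly between 0 and 1.
            p = (self.root_counts[label][root] + 1) / (n + 2)
            score += math.log(p if root in roots else 1 - p)
        return score

    def classify(self, words):
        return max(self.doc_counts, key=lambda label: self.log_score(words, label))


clf = BinaryNaiveBayes()
clf.train(["paralar", "kazan"], "spam")
clf.train(["para", "bedava"], "spam")
clf.train(["dersler", "kitap"], "normal")
clf.train(["ders", "evde"], "normal")

print(clf.classify(["paralar"]))    # spam
print(clf.classify(["derslerde"]))  # normal
```

The point of the first stage shows up in the last line: "derslerde" never appears in the training messages, but its root "ders" does, so the classifier still has something to go on.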

Özgür et al. do not report on the ability of their classifiers to discriminate between spam and weirdly pseudo-learned pronouncements from former presidents of France.

Enigmas of Chance

Posted at July 16, 2006 04:59

Three-Toed Sloth