A tutorial on Automatic Language Identification - word based

This page deals with automatically classifying a piece of text as being a certain language. A training corpus is assembled which contains examples from each of the languages we wish to identify, then we use the training information to guess what language a set of test sentences is in.

This page will deal with word-based methods of language identification; other methods include e.g. character n-grams. Word-based methods require a large text corpus from which to determine the words that are likely to occur in a language, as well as their relative frequencies.

On this page we will focus on differentiating 5 European languages: Swedish, English, German, French and Danish. The reason we choose European languages is that they are all word-based, i.e. a sentence is composed of a sequence of letters with words separated by spaces. There are, however, languages which do not conform to this convention, e.g. written Chinese. This sort of thing plagues many language identification efforts, since seemingly universal properties of languages turn out not to be universal at all.

Another issue we could encounter is Unicode and character encoding, which in real systems can be quite a problem. For this page all of our text will be in UTF-8, which makes things easier, but if you want to release this algorithm into the wild you will need some pre-processing to get around encoding troubles, or a different approach.

Collecting a Corpus §

We will use the Wortschatz website to get our training material; in this case we will use 3 million sentences from each language to generate word lists along with word counts, from which we will calculate frequencies. A few example sentences from our English corpus:

1   Boozer blocked the path in the lane and West leapt in the air.  
2   After a number of months I may reach Mexico -or I may not.      
3   Crews rescued a man after he reportedly attempted to climb down.

The processing we use involves a Python script to strip out all the punctuation from the sentences and convert everything to uppercase. After this we count the occurrence of each word that appears in each corpus, using Python's split() function to separate the sentences into words (a sketch of this counting step is shown after the list below). The results are shown in the following word count files:

  1. German word counts: german_counts.txt
  2. English word counts: english_counts.txt
  3. French word counts: french_counts.txt
  4. Danish word counts: danish_counts.txt
  5. Swedish word counts: swedish_counts.txt

Words that appeared fewer than 2 times in our training corpus were not included in the word counts.
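For reference, here is a minimal sketch of this counting step, assuming a UTF-8 text file with one sentence per line; the corpus and output file names are placeholders, and the actual script may differ in its details:

    # Strip punctuation, uppercase, split on whitespace and count the words.
    import string
    from collections import Counter

    def count_words(corpus_path):
        counts = Counter()
        # translation table that deletes ASCII punctuation
        strip = str.maketrans('', '', string.punctuation)
        with open(corpus_path, encoding='utf-8') as f:
            for sentence in f:
                counts.update(sentence.translate(strip).upper().split())
        # words that appeared fewer than 2 times are not included
        return {w: c for w, c in counts.items() if c >= 2}

    counts = count_words('english_sentences.txt')
    with open('english_counts.txt', 'w', encoding='utf-8') as out:
        for word, count in sorted(counts.items(), key=lambda item: -item[1]):
            out.write('%s %d\n' % (word, count))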

Computing Probabilities §

Using the word counts we generated in the previous section, we can calculate word frequencies. In the German corpus the word EIN occurs 345032 times, out of a total word count of 46387276. This means EIN occurs with a probability of 345032/46387276 = 0.0074. Alternatively we could say it constitutes around 0.74 percent of all German words. On the other hand, EIN occurs only 9 times out of 57699127 words in the English corpus. Using these probabilities we can calculate the probability that a sentence comes from a specific language.

To compute the total sentence probability, we multiply each of the word probabilities. Notice that when we multiply a lot of small probabilities our final result gets very small; for this reason we usually use log probabilities instead and add them (using the following identity: log(a*b) = log(a) + log(b)). Previously we calculated that EIN has a probability of 0.0074 in German, which corresponds to a log probability of log(0.0074) = -4.91. From here on we will use the log probabilities of the words.

Note that I have used log base e for all the calculations here. It doesn't really matter which base you use: log10 or log2 will work just as well, and though your log likelihoods will be slightly different, the final accuracies will be exactly the same.
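As a concrete sketch, a counts file from the previous section could be turned into a table of natural-log word probabilities as follows (the assumed file format of one "WORD count" pair per line and the function name are just for illustration):

    # Convert a "WORD count" file into natural-log word probabilities.
    import math

    def load_log_probs(counts_path):
        counts = {}
        with open(counts_path, encoding='utf-8') as f:
            for line in f:
                word, count = line.split()
                counts[word] = int(count)
        total = sum(counts.values())
        log_probs = {w: math.log(c / total) for w, c in counts.items()}
        return log_probs, total

    german_logprobs, german_total = load_log_probs('german_counts.txt')
    # german_logprobs['EIN'] should be roughly log(345032/46387276) = -4.91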

An Example: we wish to determine the probability that the sentence 'THE CAT PLAYS' comes from each of our 5 languages. Below are the log probabilities for each of the 3 words from our 5 languages:

            German    English   French    Danish    Swedish
    THE      -9.21     -2.76     -9.01     -8.54     -8.87
    CAT     -13.7     -10.83    -13.27    -12.81    -12.13
    PLAYS   -15.34     -9.41    -16.04    -15.25    -15.56
    total:  -38.26    -23.01    -38.33    -36.61    -36.56

For English the sentence log probability is (-2.76) + (-10.83) + (-9.41) = -23.01, which is much higher than the total for any of the other languages.

How do we calculate the probabilities of words that don't appear in our training corpus? After all, some words are just rare and even though they don't occur in our English training samples they may still be English words. We get around this by simply assigning a count of 1 to unseen words. This prevents our total sentence probability from becoming zero as soon as an unusual word is seen.
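Putting the pieces together, one possible classifier looks like the sketch below. It builds on the illustrative load_log_probs() helper from earlier; the models dictionary and the function names are my own, not the exact script used for the results that follow.

    # Sketch of the classification step: sum log probabilities per language
    # and pick the language with the highest total.
    import math

    models = {lang: load_log_probs(lang + '_counts.txt')
              for lang in ['german', 'english', 'french', 'danish', 'swedish']}

    def sentence_log_prob(words, log_probs, total):
        score = 0.0
        for word in words:
            if word in log_probs:
                score += log_probs[word]
            else:
                # unseen word: pretend it occurred once in the training corpus
                score += math.log(1.0 / total)
        return score

    def classify(sentence, models):
        words = sentence.upper().split()
        scores = {lang: sentence_log_prob(words, lp, total)
                  for lang, (lp, total) in models.items()}
        return max(scores, key=scores.get)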

Testing our Model §

We will use 10,000 sentences (again from Wortschatz) from each of the 5 languages (50,000 sentences total) to test our model. Each of the test sentences is new, i.e. it was not part of our training corpus. For each sentence, the words are split up individually and letters not in the following set are removed: ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅÄÖÜßÀÂÇÉÈÊËÎÏÔŒÙÛŸ. These constitute the allowable letters from our 5 languages.

To decide what language a sentence was in, we calculate the probability each sentence comes from each language, then choose the language with the highest probability as our guessed label. This sort of statistical model always works better with more data, so longer sentences will usually be more reliably classified than short sentences.
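A rough sketch of this evaluation loop, reusing the illustrative classify() and models from above (the test file names are again placeholders):

    # Clean each test sentence, classify it, and tally the results in a
    # confusion matrix keyed by (actual language, guessed language).
    import re
    from collections import defaultdict

    ALLOWED = 'ABCDEFGHIJKLMNOPQRSTUVWXYZÆØÅÄÖÜßÀÂÇÉÈÊËÎÏÔŒÙÛŸ'
    strip_re = re.compile('[^' + ALLOWED + ' ]')   # drop everything else

    languages = ['german', 'english', 'french', 'danish', 'swedish']
    confusion = defaultdict(int)
    for lang in languages:
        with open(lang + '_test_sentences.txt', encoding='utf-8') as f:
            for sentence in f:
                cleaned = strip_re.sub('', sentence.upper())
                confusion[(lang, classify(cleaned, models))] += 1

    correct = sum(confusion[(lang, lang)] for lang in languages)
    total = sum(confusion.values())
    print('accuracy: %.2f%%' % (100.0 * correct / total))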

After testing each of the 50,000 test sentences our final accuracy is 99.84%, or 49,924 out of 50,000 correct. This is pretty good, but why isn't it 100%? Let's look at the confusion matrix:

                 Guessed label        
            de    en    fr    da    sw  
         de 9995  3     1     1     0   
         en 1     9998  1     0     0   
Actual   fr 2     7     9990  0     1   
label    da 2     1     0     9994  3   
         sw 1     2     2     48    9947

The ordering of the columns is German, English, French, Danish, Swedish. This shows most of our errors come from Swedish being misclassified as Danish, which is understandable because the languages are very similar.

Most of the other errors were due to foreign words in the sentences, e.g. one of the French sentences that was classified as English was PLUS OCCASIONNELLEMENT LA DÉNOMINATION MIGRATING PARTIAL EPILEPSY IN INFANCY EST EMPLOYÉE, which contains English words that confuse the classifier. This sort of problem is unavoidable on real datasets.

Problems with this Approach §

We did fairly well, 99.84% correct, but over a very small number of languages. In this example we used 5, but there are several thousand in existence, and our accuracy would be much lower if we were trying to discriminate between more of them. We also had a lot of training material, 3 million sentences from each language, to get reliable word frequencies; with less training material we would expect worse results. One of the major problems with this approach is that many languages cannot be split into words on whitespace, which means it only works for a subset of existing languages.

One final problem: if any of the words in the test sentence are misspelled, the word will probably not be found in any of the word lists, making identification difficult. This problem can be largely overcome by using character n-grams instead of whole words.
