A tutorial on Automatic Language Identification - ngram based

This page deals with automatically classifying a piece of text as being a certain language. A training corpus is assembled which contains examples from each of the languages we wish to identify, then we use the training information to guess what language a set of test sentences is in.

This page deals with n-gram based methods of language identification; other methods include word-based approaches.

This page will be superficially similar to the word-based language recognition page. We will focus on differentiating 5 European languages: Swedish, English, German, French and Danish, which lets us compare the results we get with n-grams to those from a word-based classifier. We will base our n-grams directly on the underlying bytes making up a piece of text. This means we don't have to do any character detection (determining which byte sequences map to which Unicode code points), or any special handling of punctuation. The big plus of this approach is that it can be applied to any language, something our word-based method could not. It also gracefully handles misspelled words, something a word-based classifier can't.

Collecting a Corpus §

We will use the Wortschatz website to get our training material, in this case 3 million sentences from each language, from which we compile byte n-gram counts. From the counts we will calculate frequencies. A few example sentences from our English corpus:

1   Boozer blocked the path in the lane and West leapt in the air.  
2   After a number of months I may reach Mexico -or I may not.      
3   Crews rescued a man after he reportedly attempted to climb down.

No processing is applied to the files except to remove the line number and leading whitespace. Once this is done, byte n-grams are compiled. We will be looking at 1, 2, 3 and 4-grams for each of our languages.

  1. German byte counts: 1, 2, 3, 4
  2. English byte counts: 1, 2, 3, 4
  3. French byte counts: 1, 2, 3, 4
  4. Danish byte counts: 1, 2, 3, 4
  5. Swedish byte counts: 1, 2, 3, 4
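The counting step above can be sketched in a few lines of Python (the sample sentences here are just for illustration):

```python
from collections import Counter

def byte_ngram_counts(sentences, n):
    """Count every byte n-gram appearing in an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        data = sentence.encode("utf-8")  # operate on raw bytes, not characters
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts

# 2-gram counts over a tiny sample
counts = byte_ngram_counts(["The cat sat.", "The dog ran."], 2)
```

Because we slice raw bytes, the same function works unchanged for any language and any encoding quirks in the corpus.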

Computing Probabilities §

Using the byte counts we generated in the previous section, we can calculate n-gram frequencies. For example, using 1-grams (single bytes), the bytes 0x63 (corresponding to lowercase 'c') and 0x6C ('l') occur with the following frequencies in each of our languages:

          German   English  French   Danish   Swedish
0x63 'c'  0.021    0.022    0.024    0.025    0.009
0x6C 'l'  0.028    0.032    0.040    0.041    0.041

To compute the total sentence probability, we multiply together the probabilities of each of its n-grams. Notice that when we multiply a lot of small probabilities our final result gets very small; for this reason we usually use log probabilities instead and add them (using the following identity: log(a*b) = log(a) + log(b)). Previously we calculated that 0x63 has a probability of 0.021 in German, which corresponds to a log probability of log(0.021) ≈ -3.86. From here on we will use the log probabilities of the n-grams.
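Converting counts to log probabilities is a one-liner per n-gram; a minimal sketch (the toy counts below are made up for illustration):

```python
import math

def log_probs(counts):
    """Turn raw n-gram counts into natural-log probabilities."""
    total = sum(counts.values())
    return {gram: math.log(count / total) for gram, count in counts.items()}

# a made-up 1-gram distribution in which b"c" has frequency 21/1000 = 0.021
lp = log_probs({b"c": 21, b"x": 979})
# lp[b"c"] == math.log(0.021), roughly -3.86
```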

An Example: we wish to determine the probability that the sentence 'The cat' comes from each of our 5 languages using 2-grams. Below are the log probabilities for each of the 2-grams in the sentence from our 5 languages:

             German   English  French   Danish   Swedish
0x5468 'Th'  -8.67    -6.37    -9.60    -9.40    -10.00
0x6865 'he'  -4.96    -4.08    -6.51    -6.03    -6.23
0x6520 'e '  -3.94    -3.56    -3.24    -3.71    -4.74
0x2063 ' c'  -10.10   -5.00    -4.69    -7.66    -7.78
0x6361 'ca'  -9.46    -5.89    -6.28    -8.27    -8.47
0x6174 'at'  -5.63    -4.80    -5.37    -5.04    -4.81

For English the sentence log probability is (-6.37) + (-4.08) + (-3.56) + (-5.00) + (-5.89) + (-4.80) = -29.70, which is a much higher probability than for any of the other languages.
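The lookup-and-sum step can be sketched as follows, using the rounded English values from the table:

```python
# English 2-gram log probabilities, taken (rounded) from the table above
english = {b"Th": -6.37, b"he": -4.08, b"e ": -3.56,
           b" c": -5.00, b"ca": -5.89, b"at": -4.80}

def sentence_logprob(sentence, model, n=2):
    """Sum the log probabilities of every byte n-gram in the sentence."""
    data = sentence.encode("utf-8")
    return sum(model[data[i:i + n]] for i in range(len(data) - n + 1))

score = sentence_logprob("The cat", english)  # sums to about -29.7
```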

How do we calculate the probabilities of byte sequences that don't appear in our training corpus? After all, some n-grams are just rare and even though they don't occur in our training samples they may occur in some of our testing samples. We get around this by simply assigning a count of 1 to unseen n-grams. This prevents our total sentence probability from becoming zero as soon as an unusual n-gram is seen.
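A sketch of the count-of-1 trick, assuming we keep the raw counts and a running total around at lookup time:

```python
import math

def smoothed_log_prob(gram, counts, total):
    """Look up an n-gram's log probability, giving unseen n-grams a count of 1."""
    return math.log(counts.get(gram, 1) / total)

counts = {b"th": 90, b"he": 9}
total = sum(counts.values())
seen = smoothed_log_prob(b"th", counts, total)
unseen = smoothed_log_prob(b"xq", counts, total)  # finite, not log(0)
```

An unseen n-gram now contributes a large but finite penalty instead of driving the whole sentence probability to zero.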

Testing our Model §

We will use 10,000 sentences (again from Wortschatz) from each of the 5 languages (50,000 sentences total) to test our model. Each test sentence is new, i.e. it was not part of our training corpus.

To decide what language a sentence is in, we calculate the probability that the sentence comes from each language, then choose the language with the highest probability as our guessed label. This is known as a naive Bayes classifier. This sort of statistical model works better with more data, so longer sentences will usually be classified more reliably than short ones.
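The classification rule is just an argmax over per-language scores. In this sketch the fixed `unseen` penalty is an assumption standing in for the count-of-1 smoothing described earlier, and the toy models are made up:

```python
def classify(sentence, models, n=2, unseen=-15.0):
    """models maps a language name to a dict of {byte n-gram: log probability}.
    Returns the language whose model gives the highest total log probability."""
    data = sentence.encode("utf-8")
    grams = [data[i:i + n] for i in range(len(data) - n + 1)]
    return max(models, key=lambda lang:
               sum(models[lang].get(g, unseen) for g in grams))

# two toy models: 'en' assigns higher probability to these 2-grams than 'de'
models = {"en": {b"Th": -2.0, b"he": -2.0, b"e ": -2.0},
          "de": {b"Th": -8.0, b"he": -6.0, b"e ": -4.0}}
guess = classify("The ", models)  # "en"
```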

We repeat this procedure for each of the 1, 2, 3 and 4-grams to see if the size of our n-grams has an effect on our classification ability. The results are shown below:

n-gram size           1      2      3      4
percentage correct    96.10  99.25  99.83  99.95

The correctness jumps considerably going from 1- to 2-grams, then steadily climbs. After testing each of the 50,000 test sentences our final correctness for 4-grams is 99.95% correct, or 49975 out of 50000 correct. This is pretty good, but why isn't it 100%? Let's look at the confusion matrix:

                   Guessed label         
             de    en    fr    da    sw  
         de  9999  0     1     0     0   
         en  1     9997  2     0     0   
Actual   fr  4     6     9990  0     0   
label    da  0     0     0     9998  2   
         sw  0     2     0     7     9991

The ordering of the rows and columns is German, English, French, Danish, Swedish. This shows our errors are spread fairly evenly across the languages. Incidentally, this result is better than the one we got on the same data using a word-based classifier.
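The 99.95% figure can be checked directly from the confusion matrix: overall accuracy is the diagonal sum divided by the total number of test sentences.

```python
# rows are actual labels, columns guessed labels, in de/en/fr/da/sw order
confusion = [
    [9999,    0,    1,    0,    0],   # de
    [   1, 9997,    2,    0,    0],   # en
    [   4,    6, 9990,    0,    0],   # fr
    [   0,    0,    0, 9998,    2],   # da
    [   0,    2,    0,    7, 9991],   # sw
]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total  # 49975 / 50000 = 0.9995
```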

Reducing the amount of Training Material §

Our experiment above used quite a lot of training material, about 3 million sentences or about 350 megabytes of text. For this example we will use only 10,000 sentences for training (roughly 0.3% of the original training material). As a result we expect to get worse results, but how much worse? The setup and testing are identical to the previous section.

The results using a train set of 10,000 sentences are below:

n-gram size           1      2      3      4
percentage correct    96.13  99.25  99.80  99.92

I was personally surprised to see these results come so close to the results using much more training data. So there you go.
