Word Statistics as a Fitness Measure

Introduction §

Previously we have used quadgrams to see how similar a piece of text is to English. This page deals with measuring how similar a piece of text is to English based on the English words it contains. To calculate word probabilities we need a very large corpus of text, so that we can count the occurrences of each word. The probability of a word is then its count divided by the total number of words encountered. Getting accurate counts requires a truly huge amount of data; fortunately Google has already done this for us with their trillion-word corpus, built from text scraped from the web.
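
As a rough illustration (not the code used on this page), here is how such counts might be turned into log probabilities. It assumes a file of tab-separated word/count pairs, one per line, in the format Norvig distributes (count_1w.txt on his site); log probabilities are stored so that multiplying many small probabilities later does not underflow.

import math

def load_word_logprobs(path='count_1w.txt'):
    # Each line is 'word<TAB>count'; return a dict of log10 word probabilities.
    counts = {}
    for line in open(path):
        word, count = line.split('\t')
        counts[word.upper()] = int(count)
    total = float(sum(counts.values()))
    return dict((w, math.log10(c / total)) for w, c in counts.items())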

To calculate the likelihood of a sentence we extract all the words, then multiply their probabilities together to get the total sentence probability. When trying to break ciphers, though, we often don't know the word divisions, so extracting the words can be difficult. In this case we consider all possible segmentations, of which there are 2^(L-1) for a sentence of L characters, and take the probability of the best segmentation as the probability of the sentence. Fortunately there are efficient algorithms for doing this which don't require us to consider every possible segmentation.
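
To see where the 2^(L-1) figure comes from: each of the L-1 gaps between adjacent letters is either a word boundary or not. A short sketch that enumerates every segmentation of a small string (only practical for very short strings) could look like this:

from itertools import product

def all_segmentations(text):
    # Each of the len(text)-1 gaps is either a split point or not: 2^(L-1) choices.
    segs = []
    for splits in product([False, True], repeat=len(text) - 1):
        words, start = [], 0
        for i, split_here in enumerate(splits, 1):
            if split_here:
                words.append(text[start:i])
                start = i
        words.append(text[start:])
        segs.append(words)
    return segs

print(len(all_segmentations('HELLO')))   # 16 = 2^(5-1)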

Peter Norvig has provided a very elegant solution to this problem in chapter 14 of the book 'Beautiful Data'; I recommend reading it, as it explains things better than I can. The Python code, and the book chapter itself, can be found on Peter Norvig's website. He provides a memoized recursive solution, which can use an awful lot of memory very quickly. Because we'll be calling this function a lot, that memory footprint makes it unsuitable here. I have implemented an iterative version based on the Viterbi decoder that does the same job using very little memory, available below. Note that there is a slight error in Norvig's implementation: it treats e.g. 'how are' and 'How are' as different tokens, when they should be treated as one.
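
As a rough sketch of the recursive idea (this is not Norvig's code, and the tiny probability table below is made up purely for illustration), the function memoizes on the remaining suffix of the text, and it is that cache which grows quickly on long inputs. It also upper-cases tokens before lookup, so differently-cased forms count as the same word.

# Toy log10 probabilities, made up for illustration; a real table would come from the Google counts.
LOGP = {'HELLO': -4.0, 'HOW': -3.0, 'ARE': -2.5, 'YOU': -2.3, 'TODAY': -3.5}

def word_logp(word):
    word = word.upper()  # normalise case so differently-cased forms match
    # Crude length-dependent penalty for unseen words (an assumption, not Norvig's exact formula).
    return LOGP.get(word, -6.0 - len(word))

memo = {}

def segment(text):
    # Returns (best log probability, best word list), memoised on the remaining suffix.
    if not text:
        return 0.0, []
    if text not in memo:
        candidates = []
        for i in range(1, len(text) + 1):
            rest_logp, rest_words = segment(text[i:])
            candidates.append((word_logp(text[:i]) + rest_logp, [text[:i]] + rest_words))
        memo[text] = max(candidates)
    return memo[text]

print(segment('HELLOHOWAREYOUTODAY'))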

An Example §

Let us take the sentence HELLOHOWAREYOUTODAY and try to determine the most likely segmentation. Possible segmentations include:

HELLOHOWA REYOUTODAY
HELLO HOW ARE YOU TODAY
HE LL OH OW AR EY OU TO DA Y
H ELLOH OW AREYOU TOD AY
HELL OHO WARE YOU TODAY

etc. To compute the probability of each segmentation we take the probability of each word individually and multiply them together. If we do this we find that HELLO HOW ARE YOU TODAY is the segmentation with the highest total probability, and it certainly looks like the best segmentation.
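
In log space the multiplication becomes a sum, so comparing the candidate segmentations above might look something like the sketch below (again with a made-up probability table and a crude penalty for words that aren't in it):

# Toy log10 probabilities, made up for illustration.
LOGP = {'HELLO': -4.0, 'HOW': -3.0, 'ARE': -2.5, 'YOU': -2.3, 'TODAY': -3.5}

def segmentation_logp(words):
    # Sum of log probabilities = log of the product of the word probabilities.
    return sum(LOGP.get(w, -6.0 - len(w)) for w in words)

candidates = [
    ['HELLOHOWA', 'REYOUTODAY'],
    ['HELLO', 'HOW', 'ARE', 'YOU', 'TODAY'],
    ['HELL', 'OHO', 'WARE', 'YOU', 'TODAY'],
]
print(max(candidates, key=segmentation_logp))   # ['HELLO', 'HOW', 'ARE', 'YOU', 'TODAY']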

Python code §

The search over all possible segmentations is performed efficiently by the Viterbi algorithm, which is more commonly used for determining the most likely sequence of phonemes in automatic speech recognition. The implementation here is slightly modified, as our states now have varying lengths. Its time complexity is O(LN^2), where L is the length of the sentence and N is the maximum word length under consideration. Example use:

from word_score import word_score
fitness = word_score()
print(fitness.score('HELLOHOWAREYOUTODAY'))

which returns (-15.227537974076597, ['HELLO', 'HOW', 'ARE', 'YOU', 'TODAY']). The text must be in upper case to work correctly. The code here relies on two files that Peter Norvig uses in his solution: the 1st and 2nd order word counts from the Google trillion-word corpus.
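
For the curious, the shape of the iterative approach can be sketched as below. This is an illustration of the Viterbi-style dynamic programme, not the word_score code itself: best[i] holds the best log probability of the first i letters, each position looks back at most N letters (the maximum word length), and a traceback recovers the words. The toy probability table and the penalty for unseen words are assumptions made for the example.

# Toy log10 probabilities; the real table would come from the Google word counts.
LOGP = {'HELLO': -4.0, 'HOW': -3.0, 'ARE': -2.5, 'YOU': -2.3, 'TODAY': -3.5}

def word_logp(word):
    # Length-dependent penalty for words not in the table (an assumption for this sketch).
    return LOGP.get(word, -6.0 - len(word))

def viterbi_segment(text, maxlen=20):
    # best[i] = best log probability of text[:i]; back[i] = start index of the last word used.
    best = [0.0] + [float('-inf')] * len(text)
    back = [0] * (len(text) + 1)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - maxlen), i):
            score = best[j] + word_logp(text[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    # Trace back through the recorded split points to recover the words.
    words, i = [], len(text)
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return best[len(text)], list(reversed(words))

print(viterbi_segment('HELLOHOWAREYOUTODAY'))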
