Cryptanalysis of the Playfair cipher

For a recap of how the playfair cipher works, see here.

The playfair cipher is more complicated than a substitution cipher, but still easy to crack using automated approaches. It is known as a digraphic substitution cipher because pairs of letters are replaced by other pairs of letters. This obliterates any single letter frequency statistics, but the digraph statistics remain unchanged (frequencies of letter pairs). Unfortunately letter pairs have a much 'flatter' distribution than the single letter frequencies, so this complicates matters for solving the cipher using pen and paper methods.

Pen and paper methods of solving the playfair cipher are quite different to computer methods. The first step to a pen and paper method is usually to apply a 'crib', which is a known piece of plaintext to work out some of the key-square. This page will deal with solving plaintexts (around 100 characters or longer) with no crib using Simulated Annealing.

On this page we will be using an approach similar to that used for cryptanalysis of the Simple Substitution cipher. In that case we used a hill-climbing algorithm to find the correct decryption key. To recap, a hill climbing algorithm is shown below:

1. Generate a random key, called the 'parent',  decipher the ciphertext
    using this key. Rate the fitness of the deciphered text, store the result.
2. Change the key slightly (swap two characters in the key at random), 
    measure the fitness of the deciphered text using the new key.
3. If the fitness is higher with the modified key, store it as our current 
    key and discard the parent.
4. Go back to 2, unless no improvement in fitness occurred in the last 1000 
    iterations.

Unfortunately, applying this algorithm to the Playfair cipher does not have much success. The hill-climbing algorithm is always getting stuck in local maxima, where it has not reached the true decryption key, and yet there are no simple changes to the key that can be made to improve the fitness. To get around this we will modify our hill-climbing algorithm slightly to sometimes accept keys that do not increase the fitness. This is known as a 'simulated annealing' (SA) algorithm, but don't be frightened by the name, it is very similar to the hill-climbing algorithm. The SA algorithm is shown below:

1. Generate a random key, called the 'parent', decipher the ciphertext 
    using this key. 
2. Rate the fitness of the deciphered text, store the result.
3. for(TEMP = 10;TEMP >= 0; TEMP = TEMP - STEP) 
      for (count = 50,000; count>0; count--)
         Change the parent key slightly (e.g. swap two characters in the 
           key at random) to get child key, 
         Measure the fitness of the deciphered text using the child key
         set dF = (fitness of child - fitness of parent)
         If dF > 0 (fitness of child is higher than parent key), 
             set parent = child
         If dF < 0 (fitness of child is worse than parent), 
             set parent = child with probability e^(dF/T). 

Compared to hill-climbing there are a few extra steps, but they all relate to the formula e^(dF/T), so I will explain what this formula means. Here is a little table to summarise the behaviour of e^(dF/T) as df and T change (note that dF is always negative):

Variable Value of e^(dF/T)
dF is large in magnitude e^(dF/T) is close to zero
dF is close to zero e^(dF/T) is close to one
T is large e^(dF/T) is close to one
T is small e^(dF/T) is close to zero

Remember that we are treating the value of e^(dF/T) as a probability, if it is close to 1 we will almost certainly accept the unfit child, if the value is close to 0 we will almost certainly reject the unfit child key. 'e' is euler's constant 2.71828. 'dF' is the (fitness of child - fitness of parent). If the child is more fit than the parent, then we always accept it. It is only when the child is less fit than the parent do we consider accepting it. T is known as the 'temperature', which is related to the name of the algorithm 'simulated annealing'. When T is large, as it is when the program starts, the value of e^(dF/T) is closer to one, which means we accept poor children with higher probability. As the program continues, the value of T decreases, and the value of e^(dF/T) approaches zero. This means towards the end of the program we accept poor children very rarely (we still always accept children that are better than the parent). The value of e^(dF/T) is also influenced by dF, so when the child is much poorer than the parent (fitness of child - fitness of parent is large) the value of e^(dF/T) becomes small, which means we do not accept very poor children very often.

The question arises, why would we ever want to accept children that are less fit than their parent? It relates to the number of local maxima as we test many keys. If we are stuck in a local maximum, then the hill-climbing algorithm can never escape - the algorithm terminates at that local maximum. Simulated Annealing allows us to jump from local maxima to find new maxima in different locations. As the SA program continues and the temperature decreases, it becomes less likely that we can jump out of the current local maximum, until when T = 0 the SA algorithm becomes identical to the hill-climbing algorithm.

Measuring Fitness §

The fitness of a piece of text, such as a piece of plaintext, is measured by how similar the piece of text is to english text. This is called rating the 'fitness' of the text. A piece of text very similar to english will get a high score (a high fitness), while a jumble of random characters will get a low score (a low fitness). For this we will use a fitness measure based on quadgram statistics. For a guide on how to generate quadgram statistics, and some python code for rating the fitness of text, see this tutorial. This method works by first determining the statistics of english text, then calculating the probability that the ciphertext comes from the same distribution. An incorrectly deciphered (i.e. using the wrong key) message will probably contain sequences e.g. 'QKPC' which are very rare in normal english. In this way we can rank different decryption keys, the decryption key we want is the one that produces deciphered text with the highest likelyhood.

Algorithm Considerations §

When breaking the Simple Substitution cipher, we modify the parent key by swapping two characters at random. This modification alone is not enough to successfully break Playfair, it will often get stuck close to the final solution. By adding other possible modifications (swapping rows, swapping colums, reversing the key, flipping the key square left to right or top to bottom) it is possible to get to the correct key.

The Simulated Annealing algorithm is slightly more complicated than the hill-climbing algorithm. With this added complexity comes a few tuning parameters: the starting temperature, the temperature step, and the number of iterations per step. In the c code below these parameters are called TEMP, STEP and COUNT respectively. The higher COUNT is the more likely you are to find a solution, but the longer the program will run. Similarly with STEP, the lower it is the more likely a solution will be found but the longer it will take. The starting temperature is something that has to change depending on the length of the ciphertext. For ciphertexts around 100 characters long, a starting TEMP around 10 is good, but if the ciphertext is 300 characters long TEMP will need to start at 20 or so. This is something you should play with.

A good reference for solving the Playfair cipher with simulated annealing algorithms can be found in the paper "Breaking Short Playfair Ciphers with the Simulated Annealing Algorithm" by Cowan, M. Cryptologia, Vol 32, issue 1, 2008.

C Example §

I have gone with a c implementation for this example, because python gets a little bit slow. Our ciphertext is the following string:

XZOGQRWVQWNROKCOAELBXZWGEQYLGDRZXYZRQAEKLRHDUMNUXYXSXYEMXEHDGNXZYNTZON
YELBEUGYSCOREUSWTZRLRYBYCOLZYLEMWNSXFBUSDBORBZCYLQEDMHQRWVQWAEDPGDPOYH
ORXZINNYWPXZGROKCOLCCOCYTZUEUIICERLEVHMVQWLNWPRYXHGNMLEKLRHDUYSUCYRAWP
UYECRYRYXHGNBLUYSCCOUYOHRYUMNUXYXSXYEMXEHDGN

To decipher it use the code below, compile it using:

gcc -O3 -lm playfaircrack.c scoreText.c -o pfc

then run using the command:

./pfc

This will run the program with output that looks something like the following:

Running playfaircrack, this could take a few minutes...
best score so far: -1384.768677, on iteration 1
    Key: 'FZDYTKGSOHBAWUXLICERQMPNV'
    plaintext: 'ATSKVLXPPBVESHESUIBKATASLNFESZITUTTIMBLORESTANEOUTWHUTINURST
                OMATNEYFYENUBKUOOZDWHEUODSYFERETUFESIFFEINUPHWQKWOFWHEAFEDBL
                CYVGVLXPPBUIPCSZNSTOHEATEMENSCATHISHESRIESEDYFOUAELICERCRTQN
                PBEQSCETHTOMQILORESTONOWEDIXSCONCIETETHTOMKBONDWESONSOETANEO
                UTWHUTINURSTOM'
best score so far: -1151.330078, on iteration 2
    Key: 'LQDARSUMBNYIOWEVKPFGTXHZC'
    plaintext: 'THEPLAYFAIRCIPHERWASTHEFIRSTPRACTICALDIGRAPHSUBSTITUTIONCIPH
                ERTHESCHEMEWASINVENTEDINBYCHARLESWHEATSTONEBUTWASNAMEDAFTERL
                ORDPLAYFAIRWHOPROMOTEDTHEUSEOFTHECIPHERTHETECHNIQUEXNCRYPTSP
                AIRSOFLETXERSDIGRAPHSINSTEADOFSINGLELETXERSASINTHESIMPLESUBS
                TITUTIONCIPHER'

This means the correct solution was found on the second iteration. The program will continue to run until it is killed by the user. To put in a different ciphertext, change the char cipher[] variable, you can encipher your own text here. Remember that very short ciphertexts (less than 80 characters) may not be solvable, 100 or so characters is good). The quadgram statistics are stored in the qgr.h file, which is used by scoreText.c. For longer ciphertexts you may need to increase the TEMP variable.

comments powered by Disqus
HZMDOHWFZHH OH FJU MONOFA CH JFZ VOHWZH UJ MONZ, OU OH CHBOFA JUWZYH UJ MONZ CH JFZ VOHWZH UJ MONZ - JHQCY VOMTZ