Cryptanalysis of the Vigenere Cipher, Part 2

For a recap of how the Vigenere Cipher works, see here

The Vigenere cipher was thought to be completely unbreakable for hundreds of years, and indeed, if very long, completely random keys are used the Vigenere cipher can be unbreakable. But if short keys are used, or if we have a lot of ciphertext compared to the key length, the Vigenere cipher is quite solvable.

In part 1 of Cryptanalysis of the Vigenere Cipher, we used the Index of Coincidence to identify the period of the cipher, then the Chi-squared statistic to find the specific key. But, as we found out, these methods are not foolproof. As a result, this page will describe a much simpler technique to find good Vigenere keys: local random search.

This page deals with automated cracking of Vigenere ciphers with no known crib. The only thing we know about the plaintext is that it is English. For cracking these ciphers by hand or with a crib, different techniques can be used.

For the approach described below, we need a way of determining how similar a piece of text is to English text. This is called rating the 'fitness' of the text. A piece of text very similar to English will get a high score (a high fitness), while a jumble of random characters will get a low score (a low fitness). For this we will use a fitness measure based on quadgram statistics. This method works by first determining the statistics of english text, then calculating the likelyhood that the ciphertext comes from the same distribution. An incorrectly deciphered (i.e. using the wrong key) message will probably contain sequences e.g. 'QKPC' which are very rare in normal English. In this way we can rank different decryption keys, the decryption key we want is the one that produces deciphered text with the fewest rare sequences.

To find the key length, we will use this technique starting with key length = 2, then try key length = 3 etc. until we get to 25 or so. This method is fast enough that we can search all key lengths in a fairly short time.

The Algorithm §

This example uses an assumed key length of 7. We start with an initial key, which could be chosen at random, or simply 7 'A's e.g. 'AAAAAAA'. We call this the 'parent' key. Try all possibilities of A-Z in the first key letter e.g. 'AAAAAAA', 'BAAAAAA', 'CAAAAAA',... These are called the 26 'child' keys. If any of the 26 child keys have a higher fitness than the parent, set the parent to the highest scoring child key. If 'FAAAAAA' turned out to be the best scoring child key, this becomes the parent. We now move to the second column and try all possibilities of 'FAAAAAA', 'FBAAAAA', 'FCAAAAA',... The parent is set to the best child once again. This procedure is repeated for all key letters. Once the 7th key letter is reached, start again at the first position and repeat the procedure. Once the algorithm goes through all key positions and the parent is not changed, we have found a local optimum and the algorithm stops.

To test the fitness of particular key, we try decrypting the message with that key, and calculate the fitness of the decrypted text using quadgram statistics. A modification that should be made is to only calculate fitness of the plaintext letters corresponding to the key letters we have searched. For example, if our current key is 'CIPHAAA', i.e. we have only searched 4 of the 7 key letters, (the original text is DEFENDTHEEASTWALLOFTHECASTLE)

current key: CIPHAAACIPHAAACIPHAAACIPHAAA
 ciphertext: FMULRULJMTHWKOCTAVJKZGKPZXCW
  decrypted: DEFERULHEEAWKOALLOJKZECASXCW

we should only calculate fitness from the shaded parts of the decrypted text. This reduces the effect on the total fitness of the garbled text that is present due to unsearched components of the key.

Note that we do not search all possible keys - this would necessitate around 26^N trial keys to decipher, which for N > 5 starts to become a large number indeed. We search each position independent of the others; this cuts our number of trial keys to 26*N which is much more manageable. Note that sometimes the correct key will not be found, in which case you could try a different starting 'parent' and rerun the program. It may simply be that the ciphertext is too short, or contains too many rare quadgrams. This will mean garbled text may score higher than the original English. This is a limitation of any algorithm based on statistical properties of text, including single letter frequencies, bigrams, trigrams etc.

Python Code §

The code here uses pycipher for the Vigenere cipher. It implements the steps described above, using the ngram_score.py file available on the quadgram statistics page.

comments powered by Disqus
GQQ RPIGD GSCUWDE RGJO WDO WT IWTO WA CROEO EOJOD SGPEOE: SRGDSO, DGCPTO, SWIBPQEUWD, RGFUC, TOGEWD, BGEEUWD GDY YOEUTO - GTUECWCQO