Cryptanalysis of the Caesar Cipher
For a recap of how the Caesar cipher works, see here
Being arguably the simplest keyed cipher, the Caesar cipher can be broken in milliseconds using automated tools. Since there are only 25 possible keys (each possible shift of the alphabet), we just try decrypting the ciphertext using each key and determine the fitness of each decryption. This form of solution is known as a 'brute force' solution, and is only possible for the very simplest of ciphers.
The crux of our approach depends on determining the fitness of a piece of decrypted text. How do we determine how 'fit' a piece of text is? We calculate the statistics of the deciphered text, and compare these statistics to those calculated from standard english text. A fitness function will give us a number, the higher the number the more likely the particular key is the correct one. For this example we will be using quadgram statistics, but others are possible.This method works by first determining the statistics of english text, then calculating the likelyhood that the ciphertext comes from the same distribution. An incorrectly deciphered (i.e. using the wrong key) message will probably contain sequences e.g. 'QKPC' which are very rare in normal english. In this way we can rank different decryption keys, the decryption key we want is the one that produces deciphered text with the fewest rare sequences.
An Example §
Our ciphertext is the following:
YMJHFJXFWHNUMJWNXTSJTKYMJJFWQNJXYPSTBSFSIXNRUQJXYHNUMJWX
To find out what the original was, we try decrypting it with each of the 25 possible keys, calculating the fitness for each trial decryption:
key plaintext fitness
------------------------------------------------------
1 : XLIGEIWEVGMTLIVMWSRISJXLIIEVPM... , -442.22
2 : WKHFDHVDUFLSKHULVRQHRIWKHHDUOL... , -495.20
3 : VJGECGUCTEKRJGTKUQPGQHVJGGCTNK... , -484.13
4 : UIFDBFTBSDJQIFSJTPOFPGUIFFBSMJ... , -490.73
5 : THECAESARCIPHERISONEOFTHEEARLI... , -246.02
6 : SGDBZDRZQBHOGDQHRNMDNESGDDZQKH... , -485.69
7 : RFCAYCQYPAGNFCPGQMLCMDRFCCYPJG... , -481.17
8 : QEBZXBPXOZFMEBOFPLKBLCQEBBXOIF... , -478.19
9 : PDAYWAOWNYELDANEOKJAKBPDAAWNHE... , -415.66
10: OCZXVZNVMXDKCZMDNJIZJAOCZZVMGD... , -488.75
11: NBYWUYMULWCJBYLCMIHYIZNBYYULFC... , -490.46
12: MAXVTXLTKVBIAXKBLHGXHYMAXXTKEB... , -490.82
13: LZWUSWKSJUAHZWJAKGFWGXLZWWSJDA... , -483.63
14: KYVTRVJRITZGYVIZJFEVFWKYVVRICZ... , -475.01
15: JXUSQUIQHSYFXUHYIEDUEVJXUUQHBY... , -466.90
16: IWTRPTHPGRXEWTGXHDCTDUIWTTPGAX... , -458.49
17: HVSQOSGOFQWDVSFWGCBSCTHVSSOFZW... , -474.67
18: GURPNRFNEPVCUREVFBARBSGURRNEYV... , -460.86
19: FTQOMQEMDOUBTQDUEAZQARFTQQMDXU... , -467.13
20: ESPNLPDLCNTASPCTDZYPZQESPPLCWT... , -454.29
21: DROMKOCKBMSZROBSCYXOYPDROOKBVS... , -461.91
22: CQNLJNBJALRYQNARBXWNXOCQNNJAUR... , -479.58
23: BPMKIMAIZKQXPMZQAWVMWNBPMMIZTQ... , -473.52
24: AOLJHLZHYJPWOLYPZVULVMAOLLHYSP... , -474.57
25: ZNKIGKYGXIOVNKXOYUTKULZNKKGXRO... , -494.13
The fitness of each of these pieces of text is shown, with the text corresponding to a key of 5 showing a much higher fitness than the other possible keys (it may be confusing because the scores are negative, we choose the least negative score, as it is the highest fitness). We could have also tried reading each sentence to find which one is readable, but this becomes impractical when there are many possible keys. The table above shows the first few characters of the decryption are:
THECAESARCIPHERISONEOFTHEEARLI...
For this example quadgram statistics were used, but the small number of Caesar cipher keys means that almost any scoring technique will work, including e.g. Chi-squared statistic or things like bigram or trigram statistics.
Python Code §
Provided here is python code for breaking the Caesar cipher. The code here uses pycipher for the Caesar cipher itself. It implements the steps described above, using the ngram_score.py file available on the quadgram statistics page.
comments powered by Disqus