Cryptanalysis of the Foursquare Cipher
For a recap of how the Foursquare cipher works, see here.
The Foursquare cipher is more complicated than a substitution cipher, but still easy to crack using automated approaches. It is known as a digraphic substitution cipher because pairs of letters are replaced by other pairs of letters. This obliterates any single letter frequency statistics, but the digraph statistics remain unchanged (frequencies of letter pairs). Unfortunately letter pairs have a much 'flatter' distribution than the single letter frequencies, so this complicates matters for solving the cipher using pen and paper methods.
This page will deal with solving plaintexts (around 100 characters or longer) with no crib using Simulated Annealing. This approach is more fully described on the page cryptanalysis of the playfair cipher, which is broken using the same technique. In short, Simulated Annealing involves making small changes to an initial key and measuring the fitness of the deciphered text. If the fitness improves, we keep the modified key, if not we keep the key with a certain small probability. Simulated Annealing is similar to hill climbing, which we used to break substitution ciphers, but is less likely to get caught in local maxima.
Measuring Fitness §
The fitness of a piece of text, such as a piece of plaintext, is measured by how similar the piece of text is to english text. This is called rating the 'fitness' of the text. A piece of text very similar to english will get a high score (a high fitness), while a jumble of random characters will get a low score (a low fitness). For this we will use a fitness measure based on quadgram statistics. For a guide on how to generate quadgram statistics, and some python code for rating the fitness of text, see this tutorial. This method works by first determining the statistics of english text, then calculating the probability that the ciphertext comes from the same distribution. An incorrectly deciphered (i.e. using the wrong key) message will probably contain sequences e.g. 'QKPC' which are very rare in normal english. In this way we can rank different decryption keys, the decryption key we want is the one that produces deciphered text with the highest likelyhood.
C Example §
I have gone with a c implementation for this example, because python gets a little bit slow. Our ciphertext is the following string:
HMMKEQESDTMDHLAWFWMNKSOSFOMRFNLWLKHNSQGGEKXEOLLVDXNRSQQGARTFKSAVNUDLFNDHESPZGQTW ESAGPGSQSQSTPKUSBBQLQHESAGPGSQSQGXLNAVHTPMHMKKNYGSUGDMTPDGFNKYAVHXLWGEKRILESLZZO FNAVIHRHRKAGHSMYUGEGNSRGAVMVOQPRLNKRXLMYLQPXILESQYBNRHRKAGKYQXDIHMPGPYOERZOLBEZL URFWLWUOLDDPNSQYAGMUQPQWESBEZVEQESDTMDBQLWDIUSHB
To decipher it use the code below, compile it using:
gcc -O3 -lm foursquarecrack.c scoreText.c -o fsc
then run using the command:
./fsc
This will run the program with output that looks something like the following:
Running foursquarecrack, this could take a few minutes... best score so far: -1239.505249, on iteration 1 Key: 'KFMLUGWSQEPOZTNRBHDAVXCIY','UGSVKFIZMOYXPQRWTHLNCABED' plaintext: 'THECIPHERTEXTSQUARESCANBEGENERATEDUSINGAKEYWORDDROPPINGDUPLICAT ELETTERSTHENFILLTHEREMAININGSPACESWITHTHEREMAININGLETTERSOFTHEA LPHABETINORDERALTERNATIVELYTHECIPHERTEXTSQUARESCANBEGENERATEDCO MPLETELYRANDOMLYTHEFOURSQUAREALGORITHMALLOWSFORTWOSEPARATEKEYSO NEFOREACHOFTHETWOCIPHERTEXTMATRICESX'
This means the correct solution was found on the first iteration. The program will continue to run until it is killed by the user. To put in a different ciphertext, change the char cipher[] variable, you can encipher your own text here. Remember that very short ciphertexts (less than 80 characters) may not be solvable, 200 or so characters is good). The quadgram statistics are stored in the qgr.h file, which is used by scoreText.c. For longer ciphertexts you may need to increase the TEMP variable.
- foursquarecrack.c
- scoreText.c
- scoreText.h
- qgr.h.zip Used by scoreText.c (extract to use)