Chi-squared Statistic
The chi-squared statistic is a measure of how similar two categorical probability distributions are. If the two distributions are identical, the chi-squared statistic is 0; the more they differ, the larger the statistic becomes. The formula for the chi-squared statistic is:

$$\chi^2(C, E) = \sum_{i=A}^{Z} \frac{(C_i - E_i)^2}{E_i}$$

where $C_A$ is the count (not the probability) of letter A, and $E_A$ is the expected count of letter A.
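The formula translates almost directly into code. The following is a minimal Python sketch; the function name `chi_squared` and the toy counts are illustrative and not part of the original page. It assumes every expected count is greater than zero, since each term divides by the expected count.

```python
def chi_squared(observed, expected):
    """Sum (C_i - E_i)^2 / E_i over paired observed and expected counts.

    Assumes every expected count is greater than zero.
    """
    return sum((c - e) ** 2 / e for c, e in zip(observed, expected))


# Identical counts give 0; differing counts give a larger value.
same = chi_squared([10, 20, 30], [10, 20, 30])
different = chi_squared([10, 20], [15, 15])
print(same, round(different, 3))  # 0.0 3.333
```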
This page will describe the use of the chi-squared statistic for cryptanalysis. Ordinarily, statisticians use the chi-squared statistic for measuring the goodness of fit of data. Unlike statisticians, we make no assumptions about the distribution of our data, and draw no conclusions about the significance of the result. We simply use the method to suggest a possible decryption.
Example: Solving a Caesar Cipher
If we were to try solving a Caesar cipher by hand, a good first step would be to calculate the frequency distribution of the ciphertext characters. We could then compare it to the frequency distribution of English, and by shifting the two distributions relative to one another we could find the shift that was used to encipher the plaintext. The correct shift is the one at which the shifted English frequencies line up with the ciphertext frequencies, i.e. letters that are common in English are common in the ciphertext and letters that are rare in English are rare in the ciphertext.
The chi-squared statistic is a way for a computer to essentially perform this procedure. Let us say we have a message enciphered using the Caesar cipher:
aoljhlzhyjpwolypzvulvmaollhysplzaruvduhukzptwslzajpwoly
zpapzhafwlvmzbizapabapvujpwolypudopjolhjoslaalypuaolwsh
pualeapzzopmalkhjlyahpuubtilyvmwshjlzkvduaolhswohila
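Counting the letters of this ciphertext is the first step of the procedure. A minimal Python sketch, assuming only the standard library (the variable names are illustrative), could look like this:

```python
from collections import Counter

# The ciphertext from the example, joined into one string.
CIPHERTEXT = (
    "aoljhlzhyjpwolypzvulvmaollhysplzaruvduhukzptwslzajpwoly"
    "zpapzhafwlvmzbizapabapvujpwolypudopjolhjoslaalypuaolwsh"
    "pualeapzzopmalkhjlyahpuubtilyvmwshjlzkvduaolhswohila"
)

# Count letters only, ignoring case and any stray whitespace.
counts = Counter(c for c in CIPHERTEXT.upper() if c.isalpha())
total = sum(counts.values())

print(total)        # 162 letters in total
print(counts["A"])  # 18 occurrences of the ciphertext letter A
```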
We also know the probabilities of characters occurring in normal English text. The chi-squared statistic uses counts, not probabilities, so we need to use the probabilities to calculate the expected count for each letter. If the letter E occurs with a probability of 0.127, we would expect it to occur 12.7 times in 100 characters. To calculate the expected count, just multiply the probability by the length of the ciphertext. The ciphertext shown above is 162 characters long, so we expect E to appear 162*0.127 = 20.57 times.
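As a small worked example, the conversion from probability to expected count is a single multiplication. The sketch below uses the probability 0.127 for E quoted above; the value 0.082 for A is the one used later on this page, and the helper name `expected_count` is illustrative.

```python
def expected_count(probability, text_length):
    """Expected number of occurrences of a letter in text of the given length."""
    return probability * text_length


length = 162  # length of the example ciphertext

print(round(expected_count(0.127, length), 3))  # E: 20.574
print(round(expected_count(0.082, length), 3))  # A: 13.284
```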
To solve the Caesar cipher, we decipher the ciphertext with each of the 25 possible keys (plus key 0, which leaves the text unchanged) and calculate the chi-squared statistic for each key. This compares the letter counts in each decryption with what we would expect the counts to be if the text were English. To calculate the chi-squared statistic for the ciphertext above (key 0), we start with the letter A, which appears 18 times. If the text were English, we would expect it to appear 162*0.082 = 13.284 times. Using this information we calculate the following:

$$\frac{(C_A - E_A)^2}{E_A} = \frac{(18 - 13.284)^2}{13.284} \approx 1.674$$
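As a quick numerical check of the term for the letter A, here is a minimal Python sketch. The function name `chi_squared_term` is illustrative; the count of 18 and the probability of 0.082 are the values given in the text.

```python
def chi_squared_term(observed, expected):
    """Contribution of a single letter: (C - E)^2 / E."""
    return (observed - expected) ** 2 / expected


expected_a = 162 * 0.082  # expected count of A in 162 characters
a_term = chi_squared_term(18, expected_a)
print(round(a_term, 3))  # 1.674
```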
We repeat this procedure for each of the other letters, then add up the 26 results. For the undeciphered ciphertext the total is around 1634.09. To find the correct key we have to do this for every key; the results can be seen below:
decryption key   plaintext                       chi-squared
-------------------------------------------------------------
       0         AOLJHLZHYJPWOLYPZVULVMAOL ...       1634.09
       1         ZNKIGKYGXIOVNKXOYUTKULZNK ...       3441.13
       2         YMJHFJXFWHNUMJWNXTSJTKYMJ ...       2973.71
       3         XLIGEIWEVGMTLIVMWSRISJXLI ...       1551.67
       4         WKHFDHVDUFLSKHULVRQHRIWKH ...       1199.40
       5         VJGECGUCTEKRJGTKUQPGQHVJG ...       1466.62
       6         UIFDBFTBSDJQIFSJTPOFPGUIF ...       1782.26
       7         THECAESARCIPHERISONEOFTHE ...         33.67
       8         SGDBZDRZQBHOGDQHRNMDNESGD ...       1747.07
       9         RFCAYCQYPAGNFCPGQMLCMDRFC ...       1386.62
      10         QEBZXBPXOZFMEBOFPLKBLCQEB ...       3423.96
      11         PDAYWAOWNYELDANEOKJAKBPDA ...        809.38
      12         OCZXVZNVMXDKCZMDNJIZJAOCZ ...       4646.96
      13         NBYWUYMULWCJBYLCMIHYIZNBY ...        724.11
      14         MAXVTXLTKVBIAXKBLHGXHYMAX ...       2159.43
      15         LZWUSWKSJUAHZWJAKGFWGXLZW ...       1787.26
      16         KYVTRVJRITZGYVIZJFEVFWKYV ...       3527.17
      17         JXUSQUIQHSYFXUHYIEDUEVJXU ...       2967.66
      18         IWTRPTHPGRXEWTGXHDCTDUIWT ...       1368.70
      19         HVSQOSGOFQWDVSFWGCBSCTHVS ...        929.17
      20         GURPNRFNEPVCUREVFBARBSGUR ...        461.19
      21         FTQOMQEMDOUBTQDUEAZQARFTQ ...       4395.68
      22         ESPNLPDLCNTASPCTDZYPZQESP ...        703.43
      23         DROMKOCKBMSZROBSCYXOYPDRO ...       1226.79
      24         CQNLJNBJALRYQNARBXWNXOCQN ...       1817.85
      25         BPMKIMAIZKQXPMZQAWVMWNBPM ...       2939.16
We can now deduce that the key used to encipher the message was 7, since the chi-squared statistic for the text deciphered with that key (33.67) is far lower than for any other key; the next lowest value is 461.19, for key 20.
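The whole search can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code behind the figures on this page: `ENGLISH_FREQ` holds commonly quoted approximate English letter probabilities, which may differ from the table used here, so the individual scores will not match the listed values exactly. The function names are likewise illustrative.

```python
from string import ascii_uppercase

# Assumption: commonly quoted approximate English letter probabilities (A..Z).
# The page's own frequency table may differ slightly.
ENGLISH_FREQ = [
    0.08167, 0.01492, 0.02782, 0.04253, 0.12702, 0.02228, 0.02015,
    0.06094, 0.06966, 0.00153, 0.00772, 0.04025, 0.02406, 0.06749,
    0.07507, 0.01929, 0.00095, 0.05987, 0.06327, 0.09056, 0.02758,
    0.00978, 0.02360, 0.00150, 0.01974, 0.00074,
]

CIPHERTEXT = (
    "aoljhlzhyjpwolypzvulvmaollhysplzaruvduhukzptwslzajpwoly"
    "zpapzhafwlvmzbizapabapvujpwolypudopjolhjoslaalypuaolwsh"
    "pualeapzzopmalkhjlyahpuubtilyvmwshjlzkvduaolhswohila"
).upper()


def decipher(text, key):
    """Caesar decryption: shift each uppercase letter back by `key` places."""
    return "".join(chr((ord(c) - 65 - key) % 26 + 65) for c in text)


def chi_squared_vs_english(text):
    """Chi-squared statistic of the letter counts in `text` against English."""
    n = len(text)
    total = 0.0
    for letter, probability in zip(ascii_uppercase, ENGLISH_FREQ):
        expected = n * probability
        observed = text.count(letter)
        total += (observed - expected) ** 2 / expected
    return total


# Score every possible shift and keep the one with the lowest statistic.
scores = {key: chi_squared_vs_english(decipher(CIPHERTEXT, key))
          for key in range(26)}
best_key = min(scores, key=scores.get)

print(best_key)                                # 7
print(decipher(CIPHERTEXT, best_key)[:25])     # THECAESARCIPHERISONEOFTHE
```

Taking the minimum works well here because the message is fairly long. For very short ciphertexts the lowest score is only a suggestion, and the top few candidates are worth inspecting by eye.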