Text Characterisation

There are many different ways of characterising text. Some of these methods count the number of occurrences of certain characters or short strings; others measure the 'roughness' of the text's frequency distribution. Each method has its own applications.

Techniques

  • Chi-squared Statistic

    The Chi-squared Statistic is a measure of how similar two categorical probability distributions are. In cryptanalysis it can be used to break Vigenere-type ciphers, including the Quagmire ciphers.

  • Identifying Unknown Ciphers

    Most ciphers that people find cannot be immediately identified. This page sets out rules for identifying an unknown cipher as one of the known cipher algorithms.

  • Index of Coincidence

    The Index of Coincidence (I.C.) is used to characterise how 'rough' the frequency distribution of letters is. It is used when identifying substitution ciphers, since the frequency distribution of text enciphered with a monogram (single-letter) substitution cipher is 'spikier' than that of text enciphered with a digraphic substitution cipher such as Foursquare. The I.C. gives an indication of this 'spikiness'. It is also used when identifying the period of Vigenere ciphers.

  • Monogram, Bigram and Trigram Frequency Counts

    Frequency analysis is the practice of counting the number of occurrences of different ciphertext characters, in the hope that the information can be used to break ciphers.

  • Quadgram Statistics as a Fitness Measure

    Quadgram (or Tetragraph) Statistics are used to characterise text by adding up the log-likelihoods of all length-4 blocks in a piece of text. A high score means the text is very similar to English; a low score means it is not. This page describes how to calculate and use quadgram statistics.

  • Unicity Distance

    The Unicity Distance is a property of a given cipher algorithm. It answers the question 'if we performed a brute force attack, how much ciphertext would we need to be sure our solution was the true solution?'. Better ciphers have longer unicity distances.

  • Word Statistics as a Fitness Measure

    When breaking ciphers, we often need a way of determining how similar to English a certain piece of text is. This technique determines the similarity by finding English words in the text.
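
As a rough illustration of three of the measures listed above (monogram counts, the Index of Coincidence and the Chi-squared Statistic), here is a minimal Python sketch. The function names and the table of English letter frequencies are assumptions made for this example; they are approximate reference values, not figures taken from the pages linked above.

```python
from collections import Counter
import string

# Approximate relative letter frequencies for English text (assumed
# reference values; any comparable table could be substituted).
ENGLISH_FREQ = {
    'A': 0.08167, 'B': 0.01492, 'C': 0.02782, 'D': 0.04253, 'E': 0.12702,
    'F': 0.02228, 'G': 0.02015, 'H': 0.06094, 'I': 0.06966, 'J': 0.00153,
    'K': 0.00772, 'L': 0.04025, 'M': 0.02406, 'N': 0.06749, 'O': 0.07507,
    'P': 0.01929, 'Q': 0.00095, 'R': 0.05987, 'S': 0.06327, 'T': 0.09056,
    'U': 0.02758, 'V': 0.00978, 'W': 0.02360, 'X': 0.00150, 'Y': 0.01974,
    'Z': 0.00074,
}


def letter_counts(text):
    """Monogram counts: how often each letter A-Z occurs (case-insensitive)."""
    return Counter(c for c in text.upper() if c in string.ascii_uppercase)


def index_of_coincidence(text):
    """Probability that two letters drawn without replacement are equal."""
    counts = letter_counts(text)
    n = sum(counts.values())
    if n < 2:
        return 0.0  # not enough letters to form a pair
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


def chi_squared(text):
    """Chi-squared statistic of the letter counts against ENGLISH_FREQ.

    Lower values mean the letter distribution is closer to English.
    """
    counts = letter_counts(text)
    n = sum(counts.values())
    if n == 0:
        return 0.0  # no letters, nothing to compare
    return sum((counts[ch] - n * p) ** 2 / (n * p)
               for ch, p in ENGLISH_FREQ.items())
```

English plaintext typically gives an I.C. of about 0.066, while uniformly random letters give about 0.038 (1/26). A simple substitution cipher only relabels the letters, so it leaves the I.C. unchanged, which is why the measure helps tell cipher types apart.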

GQQ RPIGD GSCUWDE RGJO WDO WT IWTO WA CROEO EOJOD SGPEOE: SRGDSO, DGCPTO, SWIBPQEUWD, RGFUC, TOGEWD, BGEEUWD GDY YOEUTO - GTUECWCQO