Session 10: Week 19/20: Letter Frequencies

Document: Software Engineering 1: Course Notes

Part 1: Counting

Term 2: Weeks 11-20

Hints

In a world where there are code makers there will, inevitably, also be code breakers. One of the simplest tools in the armoury of a code breaker is a program to count the relative frequency of each letter appearing in the cyphertext. For any given language there will be a characteristic distribution of letter frequencies in the uncoded message (the "plaintext"). The most commonly used letter in English is e, by a wide margin; t is in second place, with a and o nearly tied for third; i, n and r are also very commonly used.
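
The counting program described above can be sketched in a few lines. This is only an illustrative sketch in Python, not the course's own solution; the function names letter_frequencies and ranked_letters are made up for this example, and it assumes that only the unaccented letters a to z are counted, with upper and lower case treated as the same letter.

```python
from collections import Counter


def letter_frequencies(text):
    """Return a dict mapping each letter that occurs in text to its
    relative frequency (its count divided by the total letter count).

    Assumption for this sketch: only a-z are counted, case is ignored,
    and digits, spaces and punctuation are skipped.
    """
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    total = len(letters)
    if total == 0:
        return {}  # no letters at all, so there is nothing to report
    counts = Counter(letters)
    return {letter: counts[letter] / total for letter in sorted(counts)}


def ranked_letters(text):
    """Return the letters occurring in text, most frequent first.

    Ties are broken alphabetically so the result is predictable.
    """
    freqs = letter_frequencies(text)
    return sorted(freqs, key=lambda letter: (-freqs[letter], letter))
```

For example, letter_frequencies("Hello") reports l with a relative frequency of 0.4 and each of h, e and o with 0.2. On a reasonably long piece of English plaintext, ranked_letters would be expected to start with e.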

If we know that a coded message uses a simple substitution cypher (such as the Caesar cypher we saw previously), then a simple count of the relative frequencies will let us make a fairly good guess at which cyphertext letters have been substituted for the most common English letters. Often this is enough to allow the remaining substitutions to be guessed easily.
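
For the special case of a Caesar cypher, a single good guess settles the whole key. The sketch below is illustrative only: it assumes that encoding shifts each letter forward through the alphabet by a fixed amount, and that the most frequent cyphertext letter stands for e. The names guess_caesar_shift and caesar_decode are invented for this example. On short messages, or messages with unusual letter usage, the guess can easily be wrong.

```python
from collections import Counter


def guess_caesar_shift(cyphertext):
    """Guess the shift, assuming the most frequent cyphertext letter
    is the encoding of plaintext 'e' (an assumption, not a certainty)."""
    letters = [ch for ch in cyphertext.lower() if "a" <= ch <= "z"]
    if not letters:
        return 0  # nothing to count, so fall back to "no shift"
    most_common = Counter(letters).most_common(1)[0][0]
    return (ord(most_common) - ord("e")) % 26


def caesar_decode(cyphertext, shift):
    """Shift every letter back by shift places, wrapping around the
    alphabet; anything that is not a letter a-z or A-Z is left alone."""
    result = []
    for ch in cyphertext:
        if "a" <= ch <= "z":
            result.append(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")))
        elif "A" <= ch <= "Z":
            result.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        else:
            result.append(ch)
    return "".join(result)


if __name__ == "__main__":
    # "meet me here before ten" encoded with a forward shift of 3.
    secret = "phhw ph khuh ehiruh whq"
    shift = guess_caesar_shift(secret)
    print(shift, caesar_decode(secret, shift))
```

A general substitution cypher has no single shift to recover, so there the same frequency ranking only suggests candidate letter pairings, which then have to be checked and refined by hand.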

Of course, real cyphers are much more sophisticated, and much harder to break, than this (how would you tackle a Vigenère cypher, for example?). But letter frequency counts still form an essential tool for code breaking, albeit in conjunction with many other techniques.

McMullin@ugmail.eeng.dcu.ie
Wed Mar 15 10:20:49 GMT 1995