Code breaking using a Markov Chain Monte Carlo method
Can you decipher this text?
WIUBSIWFBZFIBLRYS BYWBWIKURBOZPIUZBW ZKBHZWZOZBL BUIIPBZBUKYABIW BOZMBDNQUBIG
KBUR BEIKO KBUIBGYQYUBQ G KZSBGYSSZF QBITBKNQQYZWBA ZQZWUQBL BTINWOBUR
BEIMQBZWOBFYKSQBITBW ZKSMBUR B WUYK BGYSSZF BQNTT KYWFBTKIXBUKZHRIXZBZBOZWF
KINQBYWT HUYINQBOYQ ZQ BITBUR B M QBLRYHRBQAK ZOQBZSZKXYWFSMBTKIXBIW
BHRYSOBUIBZWIUR KBL BQZLBUR BOYQ ZQ BYWBZSSBITBYUQBGZKMYWFBO FK QBZXIWFBUR
BHRYSOK WBQIX BITBUR XBRZOBQLISS WBK OO W OBSYOQBZBOYQHRZKF
BITBANQBLZQBHIXYWFBTKIXBUR B M QBITBIUR KQBZWOBUR
MBHINSOBWIUBSIIPBUILZKOBZBSYFRUBIKBUR BQNW
This is a substitution cipher (also known as Caesar cipher) which simply replaces a character with another character. You shouldn’t use this encryption method for any serious encryption. You’ll soon see why.
Hint: The original text is English.
If you have a large piece of encrypted text, you could count each character and compare that to the general frequency of characters in English. For example the most common character in English is ‘e’, so the most used character in the encrypted text probably stands for ‘e’. Other common characters probably stand for ‘t’ and ‘a’, etc. You could try to decipher some parts of the text using this method. However the shorter the sample of the encrypted text you have the more vague these assumptions are. We need a better method!
Markov Chain
A Markov process is a process of state changes where each state transition has a specific probability and is only dependant on the current state.
With respect to the text deciphering problem we assume that a text, a sequence of characters is the realization of a Markov process, where each character is a state of the Markov chain.
Let’s take for example the string ‘Markov’. You could read that as the process went from state ‘M’ to state’A’ to state ‘R’ to state ‘K’ to state ‘O’ to state ‘V’. Each transition (M -> A, A -> R, etc.) has a certain probility. Taking the 26 characters of the alphabet, you can store that as 26 x 26 matrix (Note: This matrix is not symmetrical, i. e. M -> A is not necessarily equal to A -> M).
An important question is: What is the likelihood that a certain string is a realization of a certain Markov chain? Lets take the example ‘Markov’ again. What is the likelihood that this string is a realization of a Markov chain with identical transition probabilties ((A->A)=1/26, (A->B)=1/26, (A->C)=1/26, …)? There are 5 transitions which each occur with a probability of 1/26, so it is (1/26)^5 which is 0.000000084165336 . You might already see a problem here. Even with a short string of just 6 characters we’re getting very small probabilities. That’s why usually the log likelihood is used, which is log((1/26)^5) = -16.29048269010721 . That’s a figure we and especially a CPU can work much better with.
In the example we assumed the same probability for each of the transitions between the characters. Of course in reality that’s not the case. In fact each language has a quite characteristic transition probability matrix. For example in English the letter ‘Q’ is nearly always followed by an ‘U’, sometimes by a space, basically never by another letter.
We’ll get a big piece of literature written in English (“War and Peace”), estimate these transition probabilities and build a Markov chain.
How does that help us with the task of deciphering the text?
What we’re trying to achieve is to find a character mapping which decodes the encrypted text, so that the decrypted text has a high likelihood to be a realization of the Markov chain we just constructed.
Monte Carlo
We could now randomly create different character mappings, decode the text with these character mappings, calculate the likelihoods and hope that we find one with a high likelihood in a reasonable amount of time. Is that feasible? Well, assuming 27 characters (26 letters and space) there are 27! (roughly 10^28) possible character mappings! We need a better way to create these mappings!
Lets start with a random mapping, but instead of randomly creating new mappings we successively improve this mapping. For our first guess we could use some prior assumptions we can make about the mapping (see above, taking letter frequencies into account), but the method is so ‘greedy’, that this is not even necessary.
We start from a completely random mapping, decipher the text, calculate the likelihood that this deciphered text is a realization of the Markov chain. Then we slightly modify this mapping by randomly swapping two characters, e. g. A -> D and G -> C, then in the modified mapping G -> D and A -> C. Again decipher the text and calculate the likelihood. If the likelihood is better than the previous mapping, we accept the modified mapping. If it is worse, we might still accept it, with a probability reflecting the difference of the likelihoods. In practice that’s the likelihood ratio. The probability of acceptance is: p = e^(ll_mod - ll) (as we’re working with log likelihoods (ll) the likelihood ratio is the difference between the log likelihoods). If the modified log likelihood is greater than the previous one then p > 1, i. e. the modified mapping is accepted. If it is smaller then p < 1 but there’s still the possibility that we accept it. This is important because it allows the algorithm to escape from a local maximum. And this way of generating potential solutions is called Monte Carlo sampling.
It’s so efficient, that in our encrypted text example this sampling method finds a good solution within a few thousand (< 10000) trials (compare that to the figure 10^28)!
Implementation
You can find the full source code on my Gitlab repository.
Here I just want to point out the principal methods.
We need a method to count the transitions between the characters:
final char[] characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0 ".toCharArray();
long[][] transitionCounts(String input) {
long[][] matrix = new long[characters.length][characters.length];
char[] chars = input.toCharArray();
char last = chars[0];
for (int i = 1; i < chars.length; i++) {
matrix[index(last)][index(chars[i])]++;
last = chars[i];
}
return matrix;
}
The returned matrix will contain the counts how often a character was followed by another character,
e.g. matrix[0][0]
is how often an ‘A’ is followed by an ‘A’, matrix[0][1]
how often an ‘A’ is
followed by a ‘B’, etc.
This count matrix has to be turned into a probability matrix. Note: We have to
add 1
to the counts in order to avoid probabilities of 0
(as we’re working with the log of the
probabilities).
double[][] transitionProbabilities(long[][] transitionCounts) {
double[][] matrix = new double[transitionCounts.length][transitionCounts.length];
for (int i = 0; i < transitionCounts.length; i++) {
long total = 0;
for (int j = 0; j < transitionCounts[i].length; j++) {
total += transitionCounts[i][j] + 1;
}
for (int j = 0; j < transitionCounts[i].length; j++) {
matrix[i][j] = (double) (transitionCounts[i][j] + 1)
/ (double) total;
}
}
return matrix;
}
A method to calculate the (log) likelihood that a text could have been produced by the markov chain:
double likelihood(String text) {
double ll = 0;
char last = text.charAt(0);
for (int i = 1; i < text.length(); i++) {
char cur = text.charAt(i);
ll += Math.log(transitionsProbabilities[index(last)][index(cur)]);
last = cur;
}
return ll;
}
I wrapped these methods up in a class called MarkovChain.java, together with a method which parses a reference text to estimate the transition probabilities.
The algorithm itself (which can be found in MCMCTextDecrypter.java):
final Random rand = new Random();
String decipher(String referenceText, String encryptedText, int iterations) {
// Create a markov chain and initialize it with the reference text
MarkovChain mc = new MarkovChain(characters);
mc.initialize(referenceText);
// Start off with a random mapping (cipher)
CharacterMapping mapping = new CharacterMapping(characters);
// Try to decode the secret text with the random mapping
String decodedText = mapping.decodeText(encryptedText);
// Get the likelihood that this decoded text could have
// been generated from the markov chain
double maxLL = mc.likelihood(decodedText);
for (int i = 0; i < iterations; i++) {
// Randomly swap two characters of the mapping
CharacterMapping trial = mapping.clone();
trial.randomSwap();
// Try again to decrypt the text
String tmp = trial.decodeText(encryptedText);
// If the likelihood is better (the likelihood ratio is > 1)
// accept the modified mapping, if not accept it only with
// a probability equal to the likelihood ratio.
double ll = mc.likelihood(tmp);
if (rand.nextDouble() < Math.exp(ll - maxLL)) {
maxLL = ll;
decodedText = tmp;
mapping = trial;
}
}
return decodedText;
}
Running it with 10000 iterations gives you a pretty good chance to decipher the text.
Checkout the code and play around a bit with the Main.java class.
You will notice that:
-
You sometimes just get glibberish instead of the decrypted text. The MCMC algorithm doesn’t guarantee to find the highest likelihood possible. The original text is not the only text which has a high likelihood to be a result of the markov chain. Other texts although complete nonesense will have similar likelihoods. In fact…
-
The original text does not even have the highest likelihood possible. Slightly different character mappings can have a higher likelihood than the real one which was used to encrypt the text. For example you’ll quite often see a ‘QUST’ instead of ‘JUST’, because although not a word, it has a higher likelihood.
Here’s an example output of a run:
Start log likelihood: -3110.9801413752443
Iterations: 10000
Accepted (%): 2.82
Rejected (%): 97.18
Log likelihood of solution: -1217.6445330437575
Estimated cipher which was used:
ABCDEFGHIJKLMNOPQRSTUVWXYZ0
||||||||||||||||||||||||||||
ZEHO TFRY0PSXWIADKQUNGLCMJVB
Decrypted text:
NOT LONG AGO WHILE IN NORTH DAKOTA NEAR CANADA WE TOOK A TRIP ONE DAY QUST OVER
THE BORDER TO VISIT SEVERAL VILLAGES OF RUSSIAN PEASANTS WE FOUND THE BOYS AND
GIRLS OF NEARLY THE ENTIRE VILLAGE SUFFERING FROM TRACHOMA A DANGEROUS
INFECTIOUS DISEASE OF THE EYES WHICH SPREADS ALARMINGLY FROM ONE CHILD TO
ANOTHER WE SAW THE DISEASE IN ALL OF ITS VARYING DEGRES AMONG THE CHILDREN SOME
OF THEM HAD SWOLLEN REDDENED LIDS A DISCHARGE OF PUS WAS COMING FROM THE EYES OF
OTHERS AND THEY COULD NOT LOOK TOWARD A LIGHT OR THE SUN
(Log likelihood of the original text: -1221.9418810429845)
(It’s a random snippet from some book (“The Mother and Her Child”) I also found on Project Guttenberg).
This post was inspired by Text Decryption Using MCMC