Jul 25, 2018 - Code breaking using a Markov Chain Monte Carlo method

Code breaking using a Markov Chain Monte Carlo method

Can you decipher this text?

WIUBSIWFBZFIBLRYS BYWBWIKURBOZPIUZBW ZKBHZWZOZBL BUIIPBZBUKYABIW BOZMBDNQUBIG 
KBUR BEIKO KBUIBGYQYUBQ G KZSBGYSSZF QBITBKNQQYZWBA ZQZWUQBL BTINWOBUR 
BEIMQBZWOBFYKSQBITBW ZKSMBUR B WUYK BGYSSZF BQNTT KYWFBTKIXBUKZHRIXZBZBOZWF 
KINQBYWT HUYINQBOYQ ZQ BITBUR B M QBLRYHRBQAK ZOQBZSZKXYWFSMBTKIXBIW 
BHRYSOBUIBZWIUR KBL BQZLBUR BOYQ ZQ BYWBZSSBITBYUQBGZKMYWFBO FK  QBZXIWFBUR 
BHRYSOK WBQIX BITBUR XBRZOBQLISS WBK OO W OBSYOQBZBOYQHRZKF 
BITBANQBLZQBHIXYWFBTKIXBUR B M QBITBIUR KQBZWOBUR 
MBHINSOBWIUBSIIPBUILZKOBZBSYFRUBIKBUR BQNW

This is a substitution cipher (also known as Caesar cipher) which simply replaces a character with another character. You shouldn’t use this encryption method for any serious encryption. You’ll soon see why.

Hint: The original text is English.

If you have a large piece of encrypted text, you could count each character and compare that to the general frequency of characters in English. For example the most common character in English is ‘e’, so the most used character in the encrypted text probably stands for ‘e’. Other common characters probably stand for ‘t’ and ‘a’, etc. You could try to decipher some parts of the text using this method. However the shorter the sample of the encrypted text you have the more vague these assumptions are. We need a better method!

Markov Chain

A Markov process is a process of state changes where each state transition has a specific probability and is only dependant on the current state.

With respect to the text deciphering problem we assume that a text, a sequence of characters is the realization of a Markov process, where each character is a state of the Markov chain.

Let’s take for example the string ‘Markov’. You could read that as the process went from state ‘M’ to state’A’ to state ‘R’ to state ‘K’ to state ‘O’ to state ‘V’. Each transition (M -> A, A -> R, etc.) has a certain probility. Taking the 26 characters of the alphabet, you can store that as 26 x 26 matrix (Note: This matrix is not symmetrical, i. e. M -> A is not necessarily equal to A -> M).

An important question is: What is the likelihood that a certain string is a realization of a certain Markov chain? Lets take the example ‘Markov’ again. What is the likelihood that this string is a realization of a Markov chain with identical transition probabilties ((A->A)=1/26, (A->B)=1/26, (A->C)=1/26, …)? There are 5 transitions which each occur with a probability of 1/26, so it is (1/26)^5 which is 0.000000084165336 . You might already see a problem here. Even with a short string of just 6 characters we’re getting very small probabilities. That’s why usually the log likelihood is used, which is log((1/26)^5) = -16.29048269010721 . That’s a figure we and especially a CPU can work much better with.

In the example we assumed the same probability for each of the transitions between the characters. Of course in reality that’s not the case. In fact each language has a quite characteristic transition probability matrix. For example in English the letter ‘Q’ is nearly always followed by an ‘U’, sometimes by a space, basically never by another letter.

We’ll get a big piece of literature written in English (“War and Peace”), estimate these transition probabilities and build a Markov chain.

How does that help us with the task of deciphering the text?

What we’re trying to achieve is to find a character mapping which decodes the encrypted text, so that the decrypted text has a high likelihood to be a realization of the Markov chain we just constructed.

Monte Carlo

We could now randomly create different character mappings, decode the text with these character mappings, calculate the likelihoods and hope that we find one with a high likelihood in a reasonable amount of time. Is that feasible? Well, assuming 27 characters (26 letters and space) there are 27! (roughly 10^28) possible character mappings! We need a better way to create these mappings!

Lets start with a random mapping, but instead of randomly creating new mappings we successively improve this mapping. For our first guess we could use some prior assumptions we can make about the mapping (see above, taking letter frequencies into account), but the method is so ‘greedy’, that this is not even necessary.

We start from a completely random mapping, decipher the text, calculate the likelihood that this deciphered text is a realization of the Markov chain. Then we slightly modify this mapping by randomly swapping two characters, e. g. A -> D and G -> C, then in the modified mapping G -> D and A -> C. Again decipher the text and calculate the likelihood. If the likelihood is better than the previous mapping, we accept the modified mapping. If it is worse, we might still accept it, with a probability reflecting the difference of the likelihoods. In practice that’s the likelihood ratio. The probability of acceptance is: p = e^(ll_mod - ll) (as we’re working with log likelihoods (ll) the likelihood ratio is the difference between the log likelihoods). If the modified log likelihood is greater than the previous one then p > 1, i. e. the modified mapping is accepted. If it is smaller then p < 1 but there’s still the possibility that we accept it. This is important because it allows the algorithm to escape from a local maximum. And this way of generating potential solutions is called Monte Carlo sampling.

It’s so efficient, that in our encrypted text example this sampling method finds a good solution within a few thousand (< 10000) trials (compare that to the figure 10^28)!

Implementation

You can find the full source code on my Gitlab repository.

Here I just want to point out the principal methods.

We need a method to count the transitions between the characters:

final char[] characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0 ".toCharArray();

long[][] transitionCounts(String input) {
    long[][] matrix = new long[characters.length][characters.length];
    char[] chars = input.toCharArray();
    char last = chars[0];
    for (int i = 1; i < chars.length; i++) {
        matrix[index(last)][index(chars[i])]++;
        last = chars[i];
    }
    return matrix;
    }

The returned matrix will contain the counts how often a character was followed by another character, e.g. matrix[0][0] is how often an ‘A’ is followed by an ‘A’, matrix[0][1] how often an ‘A’ is followed by a ‘B’, etc.

This count matrix has to be turned into a probability matrix. Note: We have to add 1 to the counts in order to avoid probabilities of 0 (as we’re working with the log of the probabilities).

double[][] transitionProbabilities(long[][] transitionCounts) {
    double[][] matrix = new double[transitionCounts.length][transitionCounts.length];
    for (int i = 0; i < transitionCounts.length; i++) {
        long total = 0;
        for (int j = 0; j < transitionCounts[i].length; j++) {
            total += transitionCounts[i][j] + 1;
        }
        for (int j = 0; j < transitionCounts[i].length; j++) {
            matrix[i][j] = (double) (transitionCounts[i][j] + 1)
                    / (double) total;
        }
    }
    return matrix;
}

A method to calculate the (log) likelihood that a text could have been produced by the markov chain:

double likelihood(String text) {
    double ll = 0;
    char last = text.charAt(0);
    for (int i = 1; i < text.length(); i++) {
        char cur = text.charAt(i);
        ll += Math.log(transitionsProbabilities[index(last)][index(cur)]);
        last = cur;
    }
    return ll;
}

I wrapped these methods up in a class called MarkovChain.java, together with a method which parses a reference text to estimate the transition probabilities.

The algorithm itself (which can be found in MCMCTextDecrypter.java):

final Random rand = new Random();

String decipher(String referenceText, String encryptedText, int iterations) {

    // Create a markov chain and initialize it with the reference text
    MarkovChain mc = new MarkovChain(characters);
    mc.initialize(referenceText);

    // Start off with a random mapping (cipher)
    CharacterMapping mapping = new CharacterMapping(characters);

    // Try to decode the secret text with the random mapping
    String decodedText = mapping.decodeText(encryptedText);

    // Get the likelihood that this decoded text could have
    // been generated from the markov chain
    double maxLL = mc.likelihood(decodedText);

    for (int i = 0; i < iterations; i++) {

        // Randomly swap two characters of the mapping
        CharacterMapping trial = mapping.clone();
        trial.randomSwap();

        // Try again to decrypt the text
        String tmp = trial.decodeText(encryptedText);

        // If the likelihood is better (the likelihood ratio is > 1) 
        // accept the modified mapping, if not accept it only with
        // a probability equal to the likelihood ratio.
        double ll = mc.likelihood(tmp);
        if (rand.nextDouble() < Math.exp(ll - maxLL)) {
            maxLL = ll;
            decodedText = tmp;
            mapping = trial;
        }
    }

    return decodedText;
}

Running it with 10000 iterations gives you a pretty good chance to decipher the text.

Checkout the code and play around a bit with the Main.java class.

You will notice that:

  • You sometimes just get glibberish instead of the decrypted text. The MCMC algorithm doesn’t guarantee to find the highest likelihood possible. The original text is not the only text which has a high likelihood to be a result of the markov chain. Other texts although complete nonesense will have similar likelihoods. In fact…

  • The original text does not even have the highest likelihood possible. Slightly different character mappings can have a higher likelihood than the real one which was used to encrypt the text. For example you’ll quite often see a ‘QUST’ instead of ‘JUST’, because although not a word, it has a higher likelihood.

Here’s an example output of a run:

Start log likelihood: -3110.9801413752443
Iterations: 10000
Accepted (%): 2.82
Rejected (%): 97.18
Log likelihood of solution: -1217.6445330437575
Estimated cipher which was used:
ABCDEFGHIJKLMNOPQRSTUVWXYZ0 
||||||||||||||||||||||||||||
ZEHO TFRY0PSXWIADKQUNGLCMJVB

Decrypted text:
NOT LONG AGO WHILE IN NORTH DAKOTA NEAR CANADA WE TOOK A TRIP ONE DAY QUST OVER 
THE BORDER TO VISIT SEVERAL VILLAGES OF RUSSIAN PEASANTS WE FOUND THE BOYS AND 
GIRLS OF NEARLY THE ENTIRE VILLAGE SUFFERING FROM TRACHOMA A DANGEROUS 
INFECTIOUS DISEASE OF THE EYES WHICH SPREADS ALARMINGLY FROM ONE CHILD TO 
ANOTHER WE SAW THE DISEASE IN ALL OF ITS VARYING DEGRES AMONG THE CHILDREN SOME 
OF THEM HAD SWOLLEN REDDENED LIDS A DISCHARGE OF PUS WAS COMING FROM THE EYES OF 
OTHERS AND THEY COULD NOT LOOK TOWARD A LIGHT OR THE SUN

(Log likelihood of the original text: -1221.9418810429845)

(It’s a random snippet from some book (“The Mother and Her Child”) I also found on Project Guttenberg).

This post was inspired by Text Decryption Using MCMC

Jul 24, 2018 - Useful bash script snippets (1)

Useful bash script snippets (1)

Iterate though files in a directory and do something

for f in `ls someDirectory`
do 
	echo $f
done

As one liner:

for f in `ls someDirectory`; do echo $f; done;

You want to do something in directories which match a certain pattern, e.g. start with a number:

IFS=$(echo -en "\n\b")

for d in `find [0-9]* -type d`
do
	echo $d
done

Imagine you have a file called ‘data.txt’ in various subdirectories and you want to process all these files:

for f in /home/floki/**/data.txt
do
	echo $f
done

Process CSV files

Iterate through a CSV file (does not take quotes into account!):

rowNo=0

while IFS='\n' read -r row
do
	echo -n "Row $rowNo: "
	rowNo=$((rowNo+1))
	
	IFS=',' read -ra columns <<< "$row"
	for column in "${columns[@]}"
	do
		# trim leading and trainling white spaces
		column=`echo $column | xargs`
   		echo -n "[$column] "
	done
	echo ""
done < /home/floki/data.csv

Some other useful parameter substitutions

Use a default value if variable is not set:

HOME=${HOME-'/home/floki'} # HOME: /home/floki

String substitution, first occurrence:

dir=${HOME/floki/ragnar/} # dir: /home/ragnar

String substitution, global:

nonsense=${HOME//o/X/} # nonsense: /hXme/flXki

Substring removal

fullPath=/home/floki/test.txt

# Remove from beginning of the string

file=${fullPath##*/}  # file: test.txt`

ext=${file##*.}  # ext: txt

# Remove from the end of the string

name=${file%.*}  # name: test

path=${fullPath%/*}  # path: /home/floki

There are loads of amazing things you can do with Shell Parameter Expansion.

Jul 23, 2018 - Useful shell commands (1)

Useful shell commands (1)

A selection of useful shell commands

Find a program which blocks a certain port

For example listening on port 8080.

lsof -i :8080

Find a file which contains a certain string

For example find all java, xml and properties files which contain a the string myapp.db.user .

grep --include=\*.{java,xml,properties} -rnw "myapp.db.user" .

Check log files

Find specific entries using grep

You only want to see the errors. Grep for ‘error’ ignoring the case. And also add the 5 lines before and the 2 lines after the matching line, because they contain useful information.

grep -i -B 5 -A 2 "error" XYZ.log

Find specific entries using awk

You’re only interested in the 4th and 7th entry (word) of lines containing the string ‘Error’, e. g. because $4 is a filename and $7 is the error code.

cat XYZ.log | awk '/Error/ {print $4 $7}'

Image the log file contains entries like “… Uploaded Files: 5 ….” and you want to sum these figures up to get a total number of uploaded files:

cat XYZ.log | awk '/Uploaded/ {print $5}' | paste -sd+ | bc

SSH port forwarding

Local

Imagine you have a postgres server running on your machine 192.168.11.1 which only listens on localhost port 5432.

But from you desktop machine you want to use a GUI tool like pgAdmin.

As little extra challenge, the SSH server on the remote server doesn’t even listen on the default port 22, but on port 2222.

Solution: Connect to the database server via SSH and use local port forwarding, forwarding your local port 5555 to the server’s port 5432 . Then point pgAdmin to localhost:5555

ssh -L 5555:localhost:5432 user@192.168.11.1 -p 2222

A slightly different case:

Image your database server runs on 192.168.11.2 which is not directly accessible. But you can SSH into 192.168.11.1 and this server can access 192.168.11.2.

In that case simply forward your local port to 192.168.11.2:5432.

ssh -L 5555:192.168.11.2:5432 user@192.168.11.1 -p 2222

Remote

You have a postgres database running on your local machine (port 5432) and you want to test a program which expects the database running on 192.168.11.1 port 1234.

Open a remote tunnel mapping the port 1234 on the remote machine to your local port 5432.

ssh -R 1234:localhost:5432 user@192.168.11.1 

Dynamic (SOCKS)

If you want to tunnel the data from/to your webbrowser through a certain server:

ssh -C -D 1080 192.168.11.1

This will create a SOCKS server listening on 1080 on your local machine. Simply add that for your webbrowser’s proxy configuration.

For the websites you view via this SOCKS proxy it will appear like you came from 192.168.11.1.