Aug 3, 2018 - Useful R snippets (1)

Comments

Useful R snippets (1)

Coming from ‘traditional’ programming languages I struggled quite a bit getting my head around how to work with R dataframes. If you are in the same situation, have a look at the snippets, hope that’s helpful.

For the snippets I just use the ‘iris’ example dataset

library(datasets)
data("iris")
data.frame(iris)

df <- iris

First of all, forget about arrays for a while. It is tempting, but try to not see an R dataframe as array.

Access specific elements, rows, columns

You can address the column of a dataframe by its name (a bit like an associative array):

df$Sepal.Length

This returns a list, in which you can address the elements with the double bracket notation, e.g. get the first element:

df$Sepal.Length[[1]]

You can also set this element:

df$Sepal.Length[[1]] <- 123

Get all values of a column, e. g. the first column (returns a list!):

df[, 1]

Get all values of a row, e. g. the first row (returns a dataframe!):

df[1, ]

Create subsets of a dataframe

Specific rows

# like before, only include first row
subDf <- df[1, ]

# take a specific range:
subDf <- df[1:10, ]

# only rows which match a specific criteria
subDf <- df[ which(df$Species == 'setosa' | df$Species == 'virginica'), ]

Specific columns

Create a subset with only certain columns:

subDf <- subset(df, select=c('Sepal.Length', 'Sepal.Width'))

Merge rows

By renaming, which requires temporarily conversion from ‘factors’ to ‘characters’.

df$Species <- as.character(df$Species)
df[ df == 'versicolor' ] <- 'versicolorAndVirginica'
df[ df == 'virginica' ] <- 'versicolorAndVirginica'
df$Species <- as.factor(df$Species)

Add a new column

For example for the result of a calculation of another column

df$Sepal.HalfWidth <- df$Sepal.Width / 2

Iterate over dataframe using sapply

Apply a function to each data element and add the result back to dataframe into a new column

square <- function(n) {
  n * n;
}
df$Sepal.Width.Squared <- sapply(df$Sepal.Width, square)
# Could be shorted to:
df$Sepal.Width.Squared.2 <- sapply(df$Sepal.Width, function(n) n * n)

Add rows

df <- rbind(df, c(10, 5, 2.5, 1.5, 'virginica'))

Add columns

df$NewColumn <- (1:150)

Order data by some kind of summary statistic

Can be handy for plotting. E. g. order by mean, low to high, then plot:

ag <-aggregate(x = df[c('Sepal.Length')], by=df[c('Species')], mean)
orderedSpecies <- factor(df$Species, levels=ag[order(ag$Sepal.Length), 'Species'])
plot(df$Sepal.Length ~ orderedSpecies, ylab='Sepal Length', xlab="Species")

Jul 25, 2018 - Code breaking using a Markov Chain Monte Carlo method

Comments

Code breaking using a Markov Chain Monte Carlo method

Can you decipher this text?

WIUBSIWFBZFIBLRYS BYWBWIKURBOZPIUZBW ZKBHZWZOZBL BUIIPBZBUKYABIW BOZMBDNQUBIG 
KBUR BEIKO KBUIBGYQYUBQ G KZSBGYSSZF QBITBKNQQYZWBA ZQZWUQBL BTINWOBUR 
BEIMQBZWOBFYKSQBITBW ZKSMBUR B WUYK BGYSSZF BQNTT KYWFBTKIXBUKZHRIXZBZBOZWF 
KINQBYWT HUYINQBOYQ ZQ BITBUR B M QBLRYHRBQAK ZOQBZSZKXYWFSMBTKIXBIW 
BHRYSOBUIBZWIUR KBL BQZLBUR BOYQ ZQ BYWBZSSBITBYUQBGZKMYWFBO FK  QBZXIWFBUR 
BHRYSOK WBQIX BITBUR XBRZOBQLISS WBK OO W OBSYOQBZBOYQHRZKF 
BITBANQBLZQBHIXYWFBTKIXBUR B M QBITBIUR KQBZWOBUR 
MBHINSOBWIUBSIIPBUILZKOBZBSYFRUBIKBUR BQNW

This is a substitution cipher (also known as Caesar cipher) which simply replaces a character with another character. You shouldn’t use this encryption method for any serious encryption. You’ll soon see why.

Hint: The original text is English.

If you have a large piece of encrypted text, you could count each character and compare that to the general frequency of characters in English. For example the most common character in English is ‘e’, so the most used character in the encrypted text probably stands for ‘e’. Other common characters probably stand for ‘t’ and ‘a’, etc. You could try to decipher some parts of the text using this method. However the shorter the sample of the encrypted text you have the more vague these assumptions are. We need a better method!

Markov Chain

A Markov process is a process of state changes where each state transition has a specific probability and is only dependant on the current state.

With respect to the text deciphering problem we assume that a text, a sequence of characters is the realization of a Markov process, where each character is a state of the Markov chain.

Let’s take for example the string ‘Markov’. You could read that as the process went from state ‘M’ to state’A’ to state ‘R’ to state ‘K’ to state ‘O’ to state ‘V’. Each transition (M -> A, A -> R, etc.) has a certain probility. Taking the 26 characters of the alphabet, you can store that as 26 x 26 matrix (Note: This matrix is not symmetrical, i. e. M -> A is not necessarily equal to A -> M).

An important question is: What is the likelihood that a certain string is a realization of a certain Markov chain? Lets take the example ‘Markov’ again. What is the likelihood that this string is a realization of a Markov chain with identical transition probabilties ((A->A)=1/26, (A->B)=1/26, (A->C)=1/26, …)? There are 5 transitions which each occur with a probability of 1/26, so it is (1/26)^5 which is 0.000000084165336 . You might already see a problem here. Even with a short string of just 6 characters we’re getting very small probabilities. That’s why usually the log likelihood is used, which is log((1/26)^5) = -16.29048269010721 . That’s a figure we and especially a CPU can work much better with.

In the example we assumed the same probability for each of the transitions between the characters. Of course in reality that’s not the case. In fact each language has a quite characteristic transition probability matrix. For example in English the letter ‘Q’ is nearly always followed by an ‘U’, sometimes by a space, basically never by another letter.

We’ll get a big piece of literature written in English (“War and Peace”), estimate these transition probabilities and build a Markov chain.

How does that help us with the task of deciphering the text?

What we’re trying to achieve is to find a character mapping which decodes the encrypted text, so that the decrypted text has a high likelihood to be a realization of the Markov chain we just constructed.

Monte Carlo

We could now randomly create different character mappings, decode the text with these character mappings, calculate the likelihoods and hope that we find one with a high likelihood in a reasonable amount of time. Is that feasible? Well, assuming 27 characters (26 letters and space) there are 27! (roughly 10^28) possible character mappings! We need a better way to create these mappings!

Lets start with a random mapping, but instead of randomly creating new mappings we successively improve this mapping. For our first guess we could use some prior assumptions we can make about the mapping (see above, taking letter frequencies into account), but the method is so ‘greedy’, that this is not even necessary.

We start from a completely random mapping, decipher the text, calculate the likelihood that this deciphered text is a realization of the Markov chain. Then we slightly modify this mapping by randomly swapping two characters, e. g. A -> D and G -> C, then in the modified mapping G -> D and A -> C. Again decipher the text and calculate the likelihood. If the likelihood is better than the previous mapping, we accept the modified mapping. If it is worse, we might still accept it, with a probability reflecting the difference of the likelihoods. In practice that’s the likelihood ratio. The probability of acceptance is: p = e^(ll_mod - ll) (as we’re working with log likelihoods (ll) the likelihood ratio is the difference between the log likelihoods). If the modified log likelihood is greater than the previous one then p > 1, i. e. the modified mapping is accepted. If it is smaller then p < 1 but there’s still the possibility that we accept it. This is important because it allows the algorithm to escape from a local maximum. And this way of generating potential solutions is called Monte Carlo sampling.

It’s so efficient, that in our encrypted text example this sampling method finds a good solution within a few thousand (< 10000) trials (compare that to the figure 10^28)!

Implementation

You can find the full source code on my Gitlab repository.

Here I just want to point out the principal methods.

We need a method to count the transitions between the characters:

final char[] characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0 ".toCharArray();

long[][] transitionCounts(String input) {
    long[][] matrix = new long[characters.length][characters.length];
    char[] chars = input.toCharArray();
    char last = chars[0];
    for (int i = 1; i < chars.length; i++) {
        matrix[index(last)][index(chars[i])]++;
        last = chars[i];
    }
    return matrix;
    }

The returned matrix will contain the counts how often a character was followed by another character, e.g. matrix[0][0] is how often an ‘A’ is followed by an ‘A’, matrix[0][1] how often an ‘A’ is followed by a ‘B’, etc.

This count matrix has to be turned into a probability matrix. Note: We have to add 1 to the counts in order to avoid probabilities of 0 (as we’re working with the log of the probabilities).

double[][] transitionProbabilities(long[][] transitionCounts) {
    double[][] matrix = new double[transitionCounts.length][transitionCounts.length];
    for (int i = 0; i < transitionCounts.length; i++) {
        long total = 0;
        for (int j = 0; j < transitionCounts[i].length; j++) {
            total += transitionCounts[i][j] + 1;
        }
        for (int j = 0; j < transitionCounts[i].length; j++) {
            matrix[i][j] = (double) (transitionCounts[i][j] + 1)
                    / (double) total;
        }
    }
    return matrix;
}

A method to calculate the (log) likelihood that a text could have been produced by the markov chain:

double likelihood(String text) {
    double ll = 0;
    char last = text.charAt(0);
    for (int i = 1; i < text.length(); i++) {
        char cur = text.charAt(i);
        ll += Math.log(transitionsProbabilities[index(last)][index(cur)]);
        last = cur;
    }
    return ll;
}

I wrapped these methods up in a class called MarkovChain.java, together with a method which parses a reference text to estimate the transition probabilities.

The algorithm itself (which can be found in MCMCTextDecrypter.java):

final Random rand = new Random();

String decipher(String referenceText, String encryptedText, int iterations) {

    // Create a markov chain and initialize it with the reference text
    MarkovChain mc = new MarkovChain(characters);
    mc.initialize(referenceText);

    // Start off with a random mapping (cipher)
    CharacterMapping mapping = new CharacterMapping(characters);

    // Try to decode the secret text with the random mapping
    String decodedText = mapping.decodeText(encryptedText);

    // Get the likelihood that this decoded text could have
    // been generated from the markov chain
    double maxLL = mc.likelihood(decodedText);

    for (int i = 0; i < iterations; i++) {

        // Randomly swap two characters of the mapping
        CharacterMapping trial = mapping.clone();
        trial.randomSwap();

        // Try again to decrypt the text
        String tmp = trial.decodeText(encryptedText);

        // If the likelihood is better (the likelihood ratio is > 1) 
        // accept the modified mapping, if not accept it only with
        // a probability equal to the likelihood ratio.
        double ll = mc.likelihood(tmp);
        if (rand.nextDouble() < Math.exp(ll - maxLL)) {
            maxLL = ll;
            decodedText = tmp;
            mapping = trial;
        }
    }

    return decodedText;
}

Running it with 10000 iterations gives you a pretty good chance to decipher the text.

Checkout the code and play around a bit with the Main.java class.

You will notice that:

  • You sometimes just get glibberish instead of the decrypted text. The MCMC algorithm doesn’t guarantee to find the highest likelihood possible. The original text is not the only text which has a high likelihood to be a result of the markov chain. Other texts although complete nonesense will have similar likelihoods. In fact…

  • The original text does not even have the highest likelihood possible. Slightly different character mappings can have a higher likelihood than the real one which was used to encrypt the text. For example you’ll quite often see a ‘QUST’ instead of ‘JUST’, because although not a word, it has a higher likelihood.

Here’s an example output of a run:

Start log likelihood: -3110.9801413752443
Iterations: 10000
Accepted (%): 2.82
Rejected (%): 97.18
Log likelihood of solution: -1217.6445330437575
Estimated cipher which was used:
ABCDEFGHIJKLMNOPQRSTUVWXYZ0 
||||||||||||||||||||||||||||
ZEHO TFRY0PSXWIADKQUNGLCMJVB

Decrypted text:
NOT LONG AGO WHILE IN NORTH DAKOTA NEAR CANADA WE TOOK A TRIP ONE DAY QUST OVER 
THE BORDER TO VISIT SEVERAL VILLAGES OF RUSSIAN PEASANTS WE FOUND THE BOYS AND 
GIRLS OF NEARLY THE ENTIRE VILLAGE SUFFERING FROM TRACHOMA A DANGEROUS 
INFECTIOUS DISEASE OF THE EYES WHICH SPREADS ALARMINGLY FROM ONE CHILD TO 
ANOTHER WE SAW THE DISEASE IN ALL OF ITS VARYING DEGRES AMONG THE CHILDREN SOME 
OF THEM HAD SWOLLEN REDDENED LIDS A DISCHARGE OF PUS WAS COMING FROM THE EYES OF 
OTHERS AND THEY COULD NOT LOOK TOWARD A LIGHT OR THE SUN

(Log likelihood of the original text: -1221.9418810429845)

(It’s a random snippet from some book (“The Mother and Her Child”) I also found on Project Guttenberg).

This post was inspired by Text Decryption Using MCMC

Jul 24, 2018 - Useful bash script snippets (1)

Comments

Useful bash script snippets (1)

Iterate though files in a directory and do something

for f in `ls someDirectory`
do 
	echo $f
done

As one liner:

for f in `ls someDirectory`; do echo $f; done;

You want to do something in directories which match a certain pattern, e.g. start with a number:

IFS=$(echo -en "\n\b")

for d in `find [0-9]* -type d`
do
	echo $d
done

Imagine you have a file called ‘data.txt’ in various subdirectories and you want to process all these files:

for f in /home/floki/**/data.txt
do
	echo $f
done

Process CSV files

Iterate through a CSV file (does not take quotes into account!):

rowNo=0

while IFS='\n' read -r row
do
	echo -n "Row $rowNo: "
	rowNo=$((rowNo+1))
	
	IFS=',' read -ra columns <<< "$row"
	for column in "${columns[@]}"
	do
		# trim leading and trainling white spaces
		column=`echo $column | xargs`
   		echo -n "[$column] "
	done
	echo ""
done < /home/floki/data.csv

Some other useful parameter substitutions

Use a default value if variable is not set:

HOME=${HOME-'/home/floki'} # HOME: /home/floki

String substitution, first occurrence:

dir=${HOME/floki/ragnar/} # dir: /home/ragnar

String substitution, global:

nonsense=${HOME//o/X/} # nonsense: /hXme/flXki

Substring removal

fullPath=/home/floki/test.txt

# Remove from beginning of the string

file=${fullPath##*/}  # file: test.txt`

ext=${file##*.}  # ext: txt

# Remove from the end of the string

name=${file%.*}  # name: test

path=${fullPath%/*}  # path: /home/floki

There are loads of amazing things you can do with Shell Parameter Expansion.