Useful R snippets (1)

Coming from ‘traditional’ programming languages, I struggled quite a bit to get my head around how to work with R dataframes. If you are in the same situation, have a look at the snippets below; I hope they are helpful.

For the snippets I just use the ‘iris’ example dataset:

library(datasets)
data("iris")
data.frame(iris)

df <- iris

First of all, forget about arrays for a while. It is tempting, but try not to see an R dataframe as an array.

Access specific elements, rows, columns

You can address the column of a dataframe by its name (a bit like an associative array):

df$Sepal.Length

This returns a vector (not a list); you can address its elements with bracket notation, e.g. get the first element:

df$Sepal.Length[[1]]

You can also set this element:

df$Sepal.Length[[1]] <- 123

Get all values of a column, e.g. the first column (returns a vector, not a dataframe!):

df[, 1]

Get all values of a row, e. g. the first row (returns a dataframe!):

df[1, ]
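
To see what each of these expressions actually returns, class() can be called on the result. A small sketch on a fresh copy of iris (the variable names are only for illustration):

```r
# Work on a fresh copy so the check does not depend on earlier snippets
d <- iris

col_by_name  <- d$Sepal.Length        # numeric vector
col_by_index <- d[, 1]                # numeric vector (the dataframe dimension is dropped)
col_as_df    <- d[, 1, drop = FALSE]  # one-column dataframe
first_row    <- d[1, ]                # one-row dataframe

class(col_by_name)  # "numeric"
class(col_as_df)    # "data.frame"
class(first_row)    # "data.frame"
```

Passing drop = FALSE is useful when later code expects a dataframe even if only one column is selected.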

Create subsets of a dataframe

Specific rows

# like before, only include first row
subDf <- df[1, ]

# take a specific range:
subDf <- df[1:10, ]

# only rows which match specific criteria
subDf <- df[ which(df$Species == 'setosa' | df$Species == 'virginica'), ]
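
When matching against several values, the chain of == comparisons can get long. A possible alternative, sketched here on a fresh copy of iris, is the %in% operator, either inside the brackets or with subset() (the names wanted, subA and subB are made up for this example):

```r
d <- iris
wanted <- c('setosa', 'virginica')

# Logical indexing with %in%
subA <- d[d$Species %in% wanted, ]

# The same filter written with subset(); column names can be used directly
subB <- subset(d, Species %in% wanted)

nrow(subA)  # 100
```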

Specific columns

Create a subset with only certain columns:

subDf <- subset(df, select=c('Sepal.Length', 'Sepal.Width'))
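
The same column selection also works with plain bracket notation. A minimal sketch on a fresh copy of iris (subCols and noSpecies are illustrative names):

```r
d <- iris

# Select columns by name; the result is still a dataframe
subCols <- d[, c('Sepal.Length', 'Sepal.Width')]

# Keep everything except one column
noSpecies <- d[, names(d) != 'Species']

names(subCols)  # "Sepal.Length" "Sepal.Width"
```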

Merge rows

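Merge categories by renaming their values, which requires a temporary conversion from ‘factor’ to ‘character’. Restricting the replacement to the Species column avoids touching matching values in other columns:

df$Species <- as.character(df$Species)
df$Species[ df$Species == 'versicolor' ] <- 'versicolorAndVirginica'
df$Species[ df$Species == 'virginica' ] <- 'versicolorAndVirginica'
df$Species <- as.factor(df$Species)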
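
The detour via characters can probably be avoided by renaming the factor levels directly: if two levels get the same name, R merges them. A short sketch on a fresh copy of iris:

```r
d <- iris

# Give both levels the same new name; duplicate level names are merged
toMerge <- levels(d$Species) %in% c('versicolor', 'virginica')
levels(d$Species)[toMerge] <- 'versicolorAndVirginica'

table(d$Species)
```

Here toMerge is just a helper name for this example; table() shows how many rows ended up in each remaining category.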

Add a new column

For example, to store the result of a calculation on another column:

df$Sepal.HalfWidth <- df$Sepal.Width / 2

Iterate over dataframe using sapply

Apply a function to each element of a column and store the result in a new column of the dataframe:

square <- function(n) {
  n * n
}
df$Sepal.Width.Squared <- sapply(df$Sepal.Width, square)
# Could be shortened to:
df$Sepal.Width.Squared.2 <- sapply(df$Sepal.Width, function(n) n * n)
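
For simple arithmetic like this, sapply is not strictly needed, because R operators already work on whole vectors. A small comparison on a fresh copy of iris (viaSapply and vectorized are illustrative names):

```r
d <- iris

# Element by element with sapply
viaSapply <- sapply(d$Sepal.Width, function(n) n * n)

# The same calculation on the whole column at once
vectorized <- d$Sepal.Width^2

all.equal(viaSapply, vectorized)  # TRUE
```

sapply remains useful when the function cannot handle a whole vector in one call.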

Add rows

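Pass the new row as a dataframe (this assumes df still has the five original iris columns); a plain c() vector would coerce every value to character:

df <- rbind(df, data.frame(Sepal.Length = 10, Sepal.Width = 5, Petal.Length = 2.5, Petal.Width = 1.5, Species = 'virginica'))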
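
One thing to watch when adding rows: c() can only hold a single type, so mixing numbers and a string turns everything into character. A small sketch on a fresh copy of iris (variable names are illustrative) that compares a vector row with a dataframe row:

```r
d <- iris

# Row given as a vector: all values are characters before rbind sees them
viaVector <- rbind(d, c(10, 5, 2.5, 1.5, 'virginica'))

# Row given as a one-row dataframe: each column keeps its own type
newRow <- data.frame(Sepal.Length = 10, Sepal.Width = 5,
                     Petal.Length = 2.5, Petal.Width = 1.5,
                     Species = 'virginica')
viaDf <- rbind(d, newRow)

class(viaVector$Sepal.Length)  # check whether the column is still numeric
class(viaDf$Sepal.Length)      # "numeric"
```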

Add columns

df$NewColumn <- seq_len(nrow(df))
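
A new column must have as many values as the dataframe has rows, so a hard-coded length stops working once rows have been added. A minimal sketch that derives the length from the dataframe itself (RowId is a made-up column name):

```r
d <- iris
d <- rbind(d, d[1, ])        # one extra row, now 151 rows

# seq_len(nrow(d)) always matches the current number of rows
d$RowId <- seq_len(nrow(d))

tail(d$RowId, 1)  # 151
```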

Order data by some kind of summary statistic

This can be handy for plotting, e.g. order by mean, low to high, then plot:

ag <- aggregate(x = df['Sepal.Length'], by = df['Species'], FUN = mean)
orderedSpecies <- factor(df$Species, levels=ag[order(ag$Sepal.Length), 'Species'])
plot(df$Sepal.Length ~ orderedSpecies, ylab='Sepal Length', xlab="Species")
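
Base R also has reorder(), which seems to do the aggregate-and-order step in one call. A sketch on a fresh copy of iris, using Sepal.Width here because its means are not already in alphabetical species order:

```r
d <- iris

# Reorder the species levels by the mean sepal width of each species, low to high
orderedSpecies <- reorder(d$Species, d$Sepal.Width, FUN = mean)

levels(orderedSpecies)  # "versicolor" "virginica" "setosa"

# Uncomment to draw the plot with the reordered categories:
# plot(d$Sepal.Width ~ orderedSpecies, ylab = 'Sepal Width', xlab = 'Species')
```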