Useful R snippets (1)
Coming from ‘traditional’ programming languages, I struggled quite a bit to get my head around how to work with R dataframes. If you are in the same situation, have a look at the snippets below; I hope they are helpful.
For the snippets I just use the ‘iris’ example dataset:
library(datasets)
data("iris")
head(iris)  # iris is already a dataframe, so no conversion is needed; just peek at the first rows
df <- iris
First of all, forget about arrays for a while. It is tempting, but try not to think of an R dataframe as an array.
Access specific elements, rows, columns
You can address the column of a dataframe by its name (a bit like an associative array):
df$Sepal.Length
This returns a vector (not a list). You can address its elements with bracket notation; for a single element both [1] and [[1]] work, e.g. get the first element:
df$Sepal.Length[[1]]
You can also set this element:
df$Sepal.Length[[1]] <- 123
Get all values of a column, e.g. the first column (returns a vector, not a dataframe!):
df[, 1]
Get all values of a row, e.g. the first row (returns a dataframe!):
df[1, ]
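A quick way to see what each access style returns is to check its class. The following is a minimal sketch on a fresh copy of iris (the name `d` is only used here for illustration); `drop = FALSE` is the standard argument for keeping a single column as a dataframe.

```r
d <- iris  # fresh copy, independent of df above

class(d$Sepal.Length)        # "numeric": an atomic vector
class(d[["Sepal.Length"]])   # "numeric": same column, addressed by name
class(d[, 1])                # "numeric": a single column is dropped to a vector
class(d[, 1, drop = FALSE])  # "data.frame": drop = FALSE keeps the structure
class(d[1, ])                # "data.frame": a row can mix types, so it stays a dataframe
```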
Create subsets of a dataframe
Specific rows
# like before, only include first row
subDf <- df[1, ]
# take a specific range:
subDf <- df[1:10, ]
# only rows which match a specific criterion
subDf <- df[ which(df$Species == 'setosa' | df$Species == 'virginica'), ]
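When the criterion is "one of several values", `%in%` can be a little easier to read than chained `|` comparisons. This is a small sketch on a fresh copy of iris; the names `d`, `wanted` and `subD` are made up for the example.

```r
d <- iris  # fresh copy, independent of df above

wanted <- c("setosa", "virginica")
subD <- d[d$Species %in% wanted, ]  # logical vector selects the matching rows

nrow(subD)  # 100 for the unmodified iris data

# The unused level "versicolor" is still attached to the factor;
# droplevels() removes levels that no longer occur.
levels(subD$Species)
levels(droplevels(subD)$Species)
```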
Specific columns
Create a subset with only certain columns:
subDf <- subset(df, select=c('Sepal.Length', 'Sepal.Width'))
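The same selection can also be written with plain bracket indexing, by name or by position. A minimal sketch on a fresh copy of iris (all variable names here are illustrative):

```r
d <- iris  # fresh copy, independent of df above

byName     <- d[, c("Sepal.Length", "Sepal.Width")]  # select columns by name
byPosition <- d[, 1:2]                               # select columns by position
identical(byName, byPosition)

# Exclude a column by name instead of listing all the others
withoutSpecies <- d[, names(d) != "Species"]
names(withoutSpecies)
```

Selecting by name tends to be more robust than selecting by position, since it keeps working if the column order changes.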
Merge rows
By renaming values, which requires a temporary conversion from ‘factor’ to ‘character’:
df$Species <- as.character(df$Species)
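# restrict the replacement to the Species column instead of comparing every cell of the dataframe
df$Species[ df$Species == 'versicolor' ] <- 'versicolorAndVirginica'
df$Species[ df$Species == 'virginica' ] <- 'versicolorAndVirginica'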
df$Species <- as.factor(df$Species)
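An alternative that avoids the round trip through character is to rename the factor levels directly: assigning the same name to several levels merges them. This is a sketch on a fresh copy of iris, not necessarily the approach the snippet above intends.

```r
d <- iris  # fresh copy, independent of df above

# Give both levels the same new name; R merges levels that share a name
toMerge <- levels(d$Species) %in% c("versicolor", "virginica")
levels(d$Species)[toMerge] <- "versicolorAndVirginica"

levels(d$Species)  # "setosa" "versicolorAndVirginica"
table(d$Species)   # 50 setosa, 100 versicolorAndVirginica
```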
Add a new column
For example, to store the result of a calculation on another column:
df$Sepal.HalfWidth <- df$Sepal.Width / 2
Iterate over dataframe using sapply
Apply a function to each element of a column and add the result back to the dataframe as a new column:
square <- function(n) {
  n * n
}
df$Sepal.Width.Squared <- sapply(df$Sepal.Width, square)
# Could be shortened to:
df$Sepal.Width.Squared.2 <- sapply(df$Sepal.Width, function(n) n * n)
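For simple arithmetic like this, sapply is not strictly needed, because R's arithmetic operators already work element by element on whole vectors. A short sketch comparing the two on a fresh copy of iris (variable names are illustrative):

```r
d <- iris  # fresh copy, independent of df above

viaSapply <- sapply(d$Sepal.Width, function(n) n * n)  # one function call per element
viaVector <- d$Sepal.Width^2                           # one vectorised operation

all.equal(viaSapply, viaVector)  # TRUE
```

sapply remains useful when the function being applied is not itself vectorised.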
Add rows
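# use a one-row dataframe rather than c(), so the numeric columns stay numeric (assumes df has only the five original iris columns)
df <- rbind(df, data.frame(Sepal.Length = 10, Sepal.Width = 5, Petal.Length = 2.5, Petal.Width = 1.5, Species = 'virginica'))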
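It is worth checking the column types after appending a row: c() can only hold one type, so mixing numbers and a string turns every value into a string, and rbind then converts the affected columns. The sketch below contrasts that with appending a one-row dataframe, using a fresh copy of iris (the names `asVector` and `asRow` are made up for the example).

```r
d <- iris  # fresh copy, independent of df above

# c() coerces all five values to character before rbind sees them
asVector <- rbind(d, c(10, 5, 2.5, 1.5, "virginica"))
class(asVector$Sepal.Length)  # "character"

# A one-row dataframe keeps a separate type per column
newRow <- data.frame(Sepal.Length = 10, Sepal.Width = 5,
                     Petal.Length = 2.5, Petal.Width = 1.5,
                     Species = "virginica")
asRow <- rbind(d, newRow)
class(asRow$Sepal.Length)     # "numeric"
nrow(asRow)                   # 151
```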
Add columns
df$NewColumn <- seq_len(nrow(df))  # one value per row; a hard-coded 1:150 fails once rows have been added
Order data by some kind of summary statistic
This can be handy for plotting. E.g. order the species by mean sepal length, low to high, then plot:
ag <- aggregate(x = df[c('Sepal.Length')], by = df[c('Species')], mean)
orderedSpecies <- factor(df$Species, levels=ag[order(ag$Sepal.Length), 'Species'])
plot(df$Sepal.Length ~ orderedSpecies, ylab='Sepal Length', xlab="Species")
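Base R's reorder() can do the aggregate-and-relevel step in one call. The sketch below orders the species by mean Sepal.Width rather than Sepal.Length, purely because on the unmodified iris data that gives an order different from the alphabetical one; the variable names are illustrative.

```r
d <- iris  # fresh copy, independent of df above

# reorder() sorts the factor levels by a summary of a second variable
orderedByWidth <- reorder(d$Species, d$Sepal.Width, FUN = mean)

levels(d$Species)       # "setosa" "versicolor" "virginica" (alphabetical)
levels(orderedByWidth)  # "versicolor" "virginica" "setosa" (by mean width)
```

The reordered factor can then be passed to plot in the same way as orderedSpecies above.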