This document compares the performance of different approaches to the same task in R. The microbenchmark package is used to measure the time spent by each approach, and the results are shown numerically and plotted with ggplot2. The numeric tables show relative performance: the best method is scaled to 1.0 and the others show how many times slower they are.
The goal is to elucidate which is the best method to accomplish a certain task.
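As a quick illustration of this reporting convention (a sketch, not one of the benchmarks below), microbenchmark can print its own results in relative units:
library(microbenchmark)

# Two equivalent ways of computing square roots, rescaled so that
# the fastest expression reads 1.0.
res <- microbenchmark(sqrt(1:1e4), (1:1e4)^0.5)
print(res, unit = "relative")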
Index
- Generating a data.frame containing character data with and without stringsAsFactors
- Growing list vs preallocated list vs lapply
- $ vs [[ operator
- Comparison of two vector values
- R source code vs R compiled code vs C++ code
- Reduce vs vectorized functions
Generating a data.frame containing character data with and without stringsAsFactors
With this code I want to test the difference between using stringsAsFactors = TRUE and stringsAsFactors = FALSE when creating a new data.frame.
library(microbenchmark)

numElements <- 1e6
someStrings <- sapply(1:25, function(x) paste(sample(c(letters, LETTERS), 10, replace = TRUE), collapse = ""))
aNumericVector <- runif(numElements)
aStringVector <- sample(someStrings, numElements, replace = TRUE)
bStringVector <- sample(someStrings, numElements, replace = TRUE)
result <- microbenchmark(
data.frame(aNumericVector, aStringVector, bStringVector, stringsAsFactors = TRUE),
data.frame(aNumericVector, aStringVector, bStringVector, stringsAsFactors = FALSE)
)

## Unit: relative
##                expr     min       lq     mean  median       uq      max neval
##  stringsAsFactors=T 320.012 307.7241 304.4763 255.215 364.2376 378.7762   100
##  stringsAsFactors=F   1.000   1.0000   1.0000   1.000   1.0000   1.0000   100

Conclusion
Generating a data.frame containing character columns is quicker with stringsAsFactors = FALSE. Nonetheless, it should be taken into account that this option implies higher memory usage, as character strings are stored individually instead of as integer codes referencing the factor levels. For the same reason, further operations such as sorting by a character column can take more time than sorting by a factor column.
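The memory side of this trade-off can be inspected with object.size(); the snippet below is a sketch added for illustration, reusing the someStrings pool defined above:
# Same strings stored as a character vector vs as a factor.
x <- sample(someStrings, numElements, replace = TRUE)
object.size(x)          # character: one string pointer per element
object.size(factor(x))  # factor: integer codes plus the 25 unique levels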
Growing list vs preallocated list vs lapply
With the code shown below I want to test the differences between creating a list by growing it, by preallocating its elements, and by using the lapply function.
numElements <- 1e4
result <- microbenchmark(
{ v1 <- list() ; for(i in 1:numElements) v1[[i]] <- someStrings },
{ v2 <- vector('list', numElements) ; for(i in 1:numElements) v2[[i]] <- someStrings },
{ v3 <- lapply(1:numElements, function(i) someStrings)}
)

## Unit: relative
##              expr        min         lq       mean     median         uq      max neval
##        Empty list  99.312841 110.056732 101.092425 108.351391 105.734737 82.68170   100
## Preallocated list   3.523006   3.497653   3.502606   3.449916   3.530715 11.73394   100
##            lapply   1.000000   1.000000   1.000000   1.000000   1.000000  1.00000   100

Conclusion
There is no doubt that growing the list as items are added is a bad idea, since this method is much slower than the other two. The difference between preallocating the list and then populating it with a for loop, and generating it with the lapply function, is not as large, but lapply certainly has the advantage.
The result should be the same when working with a vector or a data.frame instead of a list.
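A sketch of the same three patterns applied to a numeric vector (illustrative only, not timed here):
n <- 1e4
grown <- c()                  # grows, and gets copied, as it is filled
for (i in 1:n) grown[i] <- i^2
prealloc <- numeric(n)        # allocated once, filled in place
for (i in 1:n) prealloc[i] <- i^2
applied <- vapply(1:n, function(i) i^2, numeric(1))
identical(grown, prealloc)    # TRUE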
$ vs [[ operator
The $ operator is constantly used in R code to access list and data.frame elements by name. The [[ operator can be used to do the same task, using numeric indexes instead. Is there any performance difference between them?
aList <- list( a = 5, b = 'list', c = list(c1 = 25))
result <- microbenchmark(
{ c(aList$a, aList$b, aList$c$c1) },
{ c(aList[[1]], aList[[2]], aList[[3]][[1]]) }
)

## Unit: relative
## expr min lq mean median uq max neval
## $ operator 1.750341 1.999318 1.648986 1.800327 1.799346 0.9032428 100
## [[ operator 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100

Conclusion
Although the difference between the two operators is very small, it should be taken into account when these operators are used inside a loop or any other repetitive structure. Multiply the small difference by the number of times the operator is used during the program execution to assess whether the effort is worth it.
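A side note on behaviour rather than timing: [[ also accepts character names, so moving away from $ does not force the use of numeric indexes:
identical(aList$a, aList[["a"]])               # TRUE
identical(aList$c$c1, aList[["c"]][["c1"]])    # TRUE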
Comparison of two vector values
Assume that you want to know which items in a vector v (values) are higher than the corresponding items (by position) in another vector t (threshold), and to set those values to 0. This task can be accomplished in several ways, for instance:
fgen <- function() runif(numElements, 1, 10)
v <- fgen()
t <- fgen()
result <- microbenchmark(
{ for(i in 1:length(v)) if(v[i] > t[i]) v[i] <- 0 },
{ v <- mapply(function(a,b) if(a > b) 0 else a, v, t) },
{ v[which(v > t)] <- 0 },
{ v[v > t] <- 0 },
{ v <- ifelse(v > t, 0, v) }
)

## Unit: relative
##    expr        min         lq       mean     median         uq        max neval
##     for 143.970820 143.186430 131.953767 131.256246 138.412731 106.010333   100
##  mapply 393.804541 439.768151 410.576911 385.978346 405.311462 589.016210   100
##   which   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000   100
##   v > t   5.333333   4.986675   4.288649   4.246307   4.177645   3.478734   100
##  ifelse  37.382439  35.266623  33.990739  29.547565  30.619796  74.242364   100

As can be seen, mapply produces the worst performance, followed by the for loop. The quickest way to do the work is almost the simplest one, using the which function. This function returns only the indexes of the affected elements, whereas the expression v[v > t] <- 0 builds a logical vector of the same length as v and t, all of whose elements must be tested for TRUE or FALSE before the assignment.
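As a quick sanity check (a sketch, not part of the timings), the which-based and logical-indexing forms produce identical results:
v0 <- fgen(); t0 <- fgen()
a <- v0; a[which(a > t0)] <- 0
b <- v0; b[b > t0] <- 0
identical(a, b)  # TRUE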
Simple functions can be vectorized by means of the Vectorize function in the base package. Let us see how this approach performs against the best one from the previous tests:
v <- fgen()
t <- fgen()
f <- function(a, b) if(a > b) 0 else a
vf <- Vectorize(f)
result <- microbenchmark(
{ v[which(v > t)] <- 0 },
{ v <- vf(v, t) }
)

## Unit: relative
## expr min lq mean median uq max neval
## which 1.0000 1.0000 1.000 1.000 1.0000 1.000 100
## Vectorize 416.5791 389.8873 401.682 401.581 389.6696 394.292 100

Conclusion
When it comes to applying a change to the items of a vector that satisfy a certain condition, first obtaining the indexes with the which function and then making the change is the most efficient of the approaches compared here.
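If this operation is needed in several places, the winning idiom can be wrapped in a small helper (a sketch with a hypothetical name):
zero_above <- function(values, threshold) {
  values[which(values > threshold)] <- 0
  values
}
zero_above(c(1, 5, 3), c(2, 2, 4))  # 1 0 3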
R source code vs R compiled code vs C++ code
Sometimes it is not easy to translate a loop into a vectorized expression or a call to apply. For instance, this happens when the operation performed in one iteration depends on the result of a previous iteration. In these cases the R function containing the loop can be translated to bytecode by means of the cmpfun function of the compiler package. Another alternative is implementing the loop in C++, taking advantage of the Rcpp package. But is it worth it?
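As an example of such a dependency, here is a sketch of a loop that resists vectorization because each element is built from the previous one (a hypothetical recurrence, shown only to motivate the compiled alternatives):
x <- numeric(10)
x[1] <- 1
for (i in 2:10) x[i] <- 0.9 * x[i - 1] + runif(1)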
Let us compare the performance of the same task implemented as an R function, as a compiled R function, and as a C++ function:
library(compiler)
library(Rcpp)

numElements <- 1e5
v <- fgen()
t <- fgen()
f <- function(v, t) for(i in 1:length(v)) if(v[i] > t[i]) v[i] <- 0
fc <- cmpfun(f)
cppFunction('
void fCpp(NumericVector v, NumericVector t) {
for(int i = 0; i < v.size(); i++)
v[i] = v[i] > t[i] ? 0 : v[i];
}
')
result <- microbenchmark(f(v, t), fc(v, t), fCpp(v, t))

## Unit: relative
##        expr       min        lq      mean    median        uq       max neval
##    R source 148.50908 142.68063 145.23261 139.38388 139.51544 146.35978   100
##  R compiled  39.12494  40.27761  41.52429  40.69591  41.57373  83.49558   100
##        Rcpp   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   100

As can be seen, the C++ function, embedded into R code with cppFunction, is considerably quicker than the other two alternatives. Even compiling to bytecode, without the effort of installing the Rcpp package, can be worth it (note that recent versions of R, 3.4.0 and later, byte-compile functions automatically through the just-in-time compiler, which narrows this particular gap).
Would the C++ implementation of this task be quicker than the which-based solution proposed in an earlier section? Let us see:
v <- fgen()
t <- fgen()
cppFunction('
void fCpp(NumericVector v, NumericVector t) {
for(int i = 0; i < v.size(); i++)
v[i] = v[i] > t[i] ? 0 : v[i];
}
')
result <- microbenchmark(v[which(v > t)] <- 0, fCpp(v, t))

## Unit: relative
## expr min lq mean median uq max neval
## which 1.173733 1.206826 4.280313 1.632834 3.949873 85.94283 100
## Rcpp 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 100

Although the improvement the C++ function provides over which is not impressive, we can certainly save some time if we are comfortable writing C++ code.
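One caveat worth noting: since NumericVector wraps the R vector without copying it, fCpp modifies its argument in place, which departs from R's usual copy-on-modify semantics. A sketch of a value-returning variant (hypothetical name fCppCopy):
cppFunction('
NumericVector fCppCopy(NumericVector v, NumericVector t) {
  NumericVector out = clone(v);  // deep copy: leaves the input untouched
  for (int i = 0; i < out.size(); i++)
    if (out[i] > t[i]) out[i] = 0;
  return out;
}
')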
Reduce vs vectorized functions
The Reduce function reduces the values stored in a vector by applying the same function to every item and the previously accumulated result. However, sometimes there are better ways to do the same; for instance, Reduce shouldn't be used to obtain the sum of a vector:
numElements <- 1e5
v <- fgen()
result <- microbenchmark(sum(v), Reduce('+', v))

## Unit: relative
## expr min lq mean median uq max neval
## sum(v) 1.0000 1.0000 1.0000 1.000 1.0000 1.0000 100
## Reduce("+", v) 280.3035 282.3651 271.0383 255.496 249.5048 427.1206 100
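For intuition about what Reduce is doing here, the fold can be made visible with accumulate = TRUE (a quick sketch, not one of the timed comparisons):
Reduce(`+`, c(1, 2, 3, 4))                     # ((1 + 2) + 3) + 4 = 10
Reduce(`+`, c(1, 2, 3, 4), accumulate = TRUE)  # 1 3 6 10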

Although the difference is remarkably smaller, Reduce is also slower than the prod function:
result <- microbenchmark(prod(v), Reduce('*', v))

## Unit: relative
## expr min lq mean median uq max neval
## prod(v) 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 100
## Reduce("*", v) 2.646571 2.727265 2.870891 2.780134 2.78997 7.104513 100

Sometimes Reduce is used because we are not aware that a certain function already handles whole vectors. This is the case with the paste function, which can join a vector of strings in a single call, without any explicit iteration:
numElements <- 1e4
aStringVector <- sample(someStrings, numElements, replace = TRUE)
result <- microbenchmark(paste(aStringVector, collapse = " "), Reduce(paste, aStringVector))

## Unit: relative
## expr min lq mean median uq max neval
## paste 1.000 1.000 1.000 1.000 1.00 1.000 100
## Reduce 4037.178 4017.092 4155.169 4202.428 4369.06 3992.414 100
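As a reminder (a short sketch), paste works element-wise across its vector arguments, while the collapse argument joins one vector into a single string in one call:
paste(c("a", "b"), c("x", "y"))          # "a x" "b y"
paste(c("a", "b", "c"), collapse = "-")  # "a-b-c"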

Conclusion
In general, Reduce is a fallback for applying an operation over a vector of values when no better alternative is available. As the previous tests show, functions already available in R for the same task are consistently more efficient.