
This document compares the performance of different approaches to the same task in R. To do so, the `microbenchmark` package is used to measure the time spent by each approach. The results are shown numerically and plotted using `ggplot2`. The numeric tables show relative performance: the best method is scaled to `1.0`, and the others show how many times slower they are. The goal is to elucidate the best method to accomplish each task.

## Index

- Generating a data.frame containing character data with and without `stringsAsFactors`
- Growing list vs preallocated list vs `lapply`
- `$` vs `[[` operator
- Comparison of two vector values
- R source code vs R compiled code vs C++ code
- `Reduce` vs vectorized functions

## Generating a data.frame containing character data with and without stringsAsFactors

With this code I want to test the difference between using `stringsAsFactors = TRUE` and `stringsAsFactors = FALSE` when creating a new data.frame.

```
library(microbenchmark)

numElements <- 1e6
someStrings <- sapply(1:25, function(x) paste(sample(c(letters, LETTERS), 10, replace = TRUE), collapse = ""))
aNumericVector <- runif(numElements)
aStringVector <- sample(someStrings, numElements, replace = TRUE)
bStringVector <- sample(someStrings, numElements, replace = TRUE)
result <- microbenchmark(
data.frame(aNumericVector, aStringVector, bStringVector, stringsAsFactors = TRUE),
data.frame(aNumericVector, aStringVector, bStringVector, stringsAsFactors = FALSE)
)
```

```
## Unit: relative
##               expr     min       lq     mean  median       uq      max neval
## stringsAsFactors=T 320.012 307.7241 304.4763 255.215 364.2376 378.7762   100
## stringsAsFactors=F   1.000   1.0000   1.0000   1.000   1.0000   1.0000   100
```

### Conclusion

Generating a `data.frame` containing character columns is quicker when `stringsAsFactors = FALSE` is used. Nonetheless, it should be taken into account that this option implies using more memory, since character strings are stored individually instead of as integer codes referencing the factor levels. For the same reason, further operations such as sorting by a character column can take more time than sorting by a factor column.
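The memory trade-off can be checked directly with `object.size`; a quick sketch (exact sizes vary by platform and R version):

```
# The same repeated strings stored as a character vector vs. as a factor.
# With few distinct values, the factor keeps one integer code per element,
# while the character vector keeps one pointer per element.
someStrings <- sapply(1:25, function(x)
  paste(sample(c(letters, LETTERS), 10, replace = TRUE), collapse = ""))
aStringVector <- sample(someStrings, 1e6, replace = TRUE)

asCharacter <- object.size(aStringVector)
asFactor <- object.size(factor(aStringVector))
print(asCharacter)
print(asFactor)
```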

## Growing list vs preallocated list vs lapply

With the code shown below I want to test the differences between creating a list by growing it, by preallocating its elements, and by using the `lapply` function.

```
numElements <- 1e4
result <- microbenchmark(
{ v1 <- list() ; for(i in 1:numElements) v1[[i]] <- someStrings },
{ v2 <- vector('list', numElements) ; for(i in 1:numElements) v2[[i]] <- someStrings },
{ v3 <- lapply(1:numElements, function(i) someStrings)}
)
```

```
## Unit: relative
##              expr       min         lq       mean     median         uq      max neval
##        Empty list 99.312841 110.056732 101.092425 108.351391 105.734737 82.68170   100
## Preallocated list  3.523006   3.497653   3.502606   3.449916   3.530715 11.73394   100
##            lapply  1.000000   1.000000   1.000000   1.000000   1.000000  1.00000   100
```

### Conclusion

There is no doubt that growing the list as items are added is a bad idea, since this method is much slower than the other two. The differences between preallocating the list and then populating it with a `for` loop, and generating it with the `lapply` function, are not as large, but `lapply` certainly has the advantage.

The result should be similar when working with a vector or a data.frame instead of a list.
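The same pattern can be sketched with an atomic vector; the computation (squares) is a toy example chosen just for illustration, and all three approaches yield the same result:

```
n <- 1000
v1 <- c()
for (i in 1:n) v1[i] <- i^2                      # growing
v2 <- numeric(n)
for (i in 1:n) v2[i] <- i^2                      # preallocated
v3 <- vapply(1:n, function(i) i^2, numeric(1))   # functional, like lapply
stopifnot(identical(v1, v2), identical(v2, v3))
```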

## $ vs [[ operator

The `$` operator is constantly used in R code to access list and data.frame elements by name. The `[[` operator can do the same task using numeric indexes instead. Is there any performance difference between them?

```
aList <- list( a = 5, b = 'list', c = list(c1 = 25))
result <- microbenchmark(
{ c(aList$a, aList$b, aList$c$c1) },
{ c(aList[[1]], aList[[2]], aList[[3]][[1]]) }
)
```

```
## Unit: relative
## expr min lq mean median uq max neval
## $ operator 1.750341 1.999318 1.648986 1.800327 1.799346 0.9032428 100
## [[ operator 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
```

### Conclusion

Although the difference between the two operators is very tight, it should be taken into account when these operators are used inside a loop or any other repetitive structure. Multiply the small difference by the number of times the operator is used during the program execution to assess whether the change is worth the effort.

## Comparison of two vector values

Assume that you want to know which items in a vector `v` (values) are higher than the corresponding items (by position) in another vector `t` (threshold), with the goal of setting those values to 0. This task can be accomplished in several ways, for instance:

```
fgen <- function() runif(numElements, 1, 10)
v <- fgen()
t <- fgen()
result <- microbenchmark(
{ for(i in 1:length(v)) if(v[i] > t[i]) v[i] <- 0 },
{ v <- mapply(function(a,b) if(a > b) 0 else a, v, t) },
{ v[which(v > t)] <- 0 },
{ v[v > t] <- 0 },
{ v <- ifelse(v > t, 0, v) }
)
```

```
## Unit: relative
##   expr        min         lq       mean     median         uq        max neval
##    for 143.970820 143.186430 131.953767 131.256246 138.412731 106.010333   100
## mapply 393.804541 439.768151 410.576911 385.978346 405.311462 589.016210   100
##  which   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000   100
##  v > t   5.333333   4.986675   4.288649   4.246307   4.177645   3.478734   100
## ifelse  37.382439  35.266623  33.990739  29.547565  30.619796  74.242364   100
```

As can be seen, `mapply` produces the worst performance, followed by the `for` loop. The quickest way to do the work is also nearly the simplest one: using the `which` function. This function returns the indexes of the affected elements, whereas with the expression `v[v > t] <- 0` a logical vector of the same length as `v` and `t` is created, and all of its `TRUE`/`FALSE` elements must be examined before the assignment.
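A minimal sketch confirming that the index-based and the logical-mask assignments modify exactly the same elements (the seed is fixed only for reproducibility):

```
set.seed(42)
v <- runif(1000, 1, 10)
t <- runif(1000, 1, 10)
v1 <- v; v1[which(v1 > t)] <- 0   # index vector from which()
v2 <- v; v2[v2 > t] <- 0          # full-length logical mask
stopifnot(identical(v1, v2))
```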

Simple functions can be vectorized by means of the `Vectorize` function in base R. Let us see how this approach performs against the best one from the previous test:

```
v <- fgen()
t <- fgen()
f <- function(a, b) if(a > b) 0 else a
vf <- Vectorize(f)
result <- microbenchmark(
{ v[which(v > t)] <- 0 },
{ v <- vf(v, t) }
)
```

```
## Unit: relative
## expr min lq mean median uq max neval
## which 1.0000 1.0000 1.000 1.000 1.0000 1.000 100
## Vectorize 416.5791 389.8873 401.682 401.581 389.6696 394.292 100
```

### Conclusion

When it comes to applying some change to the items of a vector that satisfy a certain condition, obtaining the indexes first with the `which` function and then making the change is the most efficient approach of those compared here.

## R source code vs R compiled code vs C++ code

Sometimes it is not easy to translate a loop into a vectorized expression or a call to `apply`. For instance, this happens when the operation performed in an iteration depends on the result of a previous one. In these cases, the R function containing the loop can be translated to bytecode by means of the `cmpfun` function of the `compiler` package. Another alternative is implementing that loop in C++, taking advantage of the `Rcpp` package. But is it worth it?

Let us compare the performance of the same task implemented as a R function, as a compiled R function and as a C++ function:

```
library(compiler)
library(Rcpp)

numElements <- 1e5
v <- fgen()
t <- fgen()
f <- function(v, t) for(i in 1:length(v)) if(v[i] > t[i]) v[i] <- 0
fc <- cmpfun(f)
cppFunction('
void fCpp(NumericVector v, NumericVector t) {
for(int i = 0; i < v.size(); i++)
v[i] = v[i] > t[i] ? 0 : v[i];
}
')
result <- microbenchmark(f(v, t), fc(v, t), fCpp(v, t))
```

```
## Unit: relative
##       expr       min        lq      mean    median        uq       max neval
##   R source 148.50908 142.68063 145.23261 139.38388 139.51544 146.35978   100
## R compiled  39.12494  40.27761  41.52429  40.69591  41.57373  83.49558   100
##       Rcpp   1.00000   1.00000   1.00000   1.00000   1.00000   1.00000   100
```

As can be seen, the C++ function, embedded into the R code with `cppFunction`, is considerably quicker than the other two alternatives. Even compiling to bytecode, without the effort of installing the `Rcpp` package, can be worth it.

Would the C++ implementation of this task be quicker than the `which`-based solution proposed in an earlier section? Let us see:

```
v <- fgen()
t <- fgen()
cppFunction('
void fCpp(NumericVector v, NumericVector t) {
for(int i = 0; i < v.size(); i++)
v[i] = v[i] > t[i] ? 0 : v[i];
}
')
result <- microbenchmark(v[which(v > t)] <- 0, fCpp(v, t))
```

```
## Unit: relative
## expr min lq mean median uq max neval
## which 1.173733 1.206826 4.280313 1.632834 3.949873 85.94283 100
## Rcpp 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 100
```

Although the improvement provided by the C++ function over `which` is not impressive, we can certainly save some time if we are comfortable writing C++ code.

## Reduce vs vectorized functions

The `Reduce` function reduces the values stored in a vector by applying the same function to each item and the previously accumulated result. However, sometimes there are better ways to do the same thing. For instance, `Reduce` shouldn't be used to obtain the sum of a vector:

```
numElements <- 1e5
v <- fgen()
result <- microbenchmark(sum(v), Reduce('+', v))
```

```
## Unit: relative
## expr min lq mean median uq max neval
## sum(v) 1.0000 1.0000 1.0000 1.000 1.0000 1.0000 100
## Reduce("+", v) 280.3035 282.3651 271.0383 255.496 249.5048 427.1206 100
```

Although the difference is remarkably smaller, `Reduce` is also slower than the `prod` function:

```
result <- microbenchmark(prod(v), Reduce('*', v))
```

```
## Unit: relative
## expr min lq mean median uq max neval
## prod(v) 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 100
## Reduce("*", v) 2.646571 2.727265 2.870891 2.780134 2.78997 7.104513 100
```

Sometimes `Reduce` is used because we aren't aware that a certain function is already vectorized. This is the case of the `paste` function, which is able to join a vector of strings without any explicit iteration:

```
numElements <- 1e4
aStringVector <- sample(someStrings, numElements, replace = TRUE)
result <- microbenchmark(paste(aStringVector, collapse = " "), Reduce(paste, aStringVector))
```

```
## Unit: relative
## expr min lq mean median uq max neval
## paste 1.000 1.000 1.000 1.000 1.00 1.000 100
## Reduce 4037.178 4017.092 4155.169 4202.428 4369.06 3992.414 100
```

### Conclusion

In general, `Reduce` is a fallback for applying an operation over a vector when no other alternative is available. When R already provides a function for the task, that function is consistently more efficient, as the previous tests show.
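As a closing sanity check, the vectorized functions and their `Reduce` counterparts agree on the results; only the speed differs. A small sketch, with a vector short enough that `prod` stays finite:

```
v <- runif(100, 1, 10)
stopifnot(all.equal(sum(v), Reduce(`+`, v)))    # same sum
stopifnot(all.equal(prod(v), Reduce(`*`, v)))   # same product

s <- sample(letters, 50, replace = TRUE)
# paste(x, y) uses sep = " " by default, matching collapse = " "
stopifnot(identical(paste(s, collapse = " "), Reduce(paste, s)))
```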