I have always known that "matrices are faster than data frames". Consider,
for instance, this function:
dumkoll <- function(n = 1000, df = TRUE){
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    if (df){
        ## Data frame version: replace x[i] inside the data frame.
        for (i in 2:NROW(dfr)){
            if (!(i %% 100)) cat("i = ", i, "\n")  # progress every 100 rows
            dfr$x[i] <- dfr$x[i - 1]
        }
    } else {
        ## Matrix version: same work, but on a numeric matrix.
        dm <- as.matrix(dfr)
        for (i in 2:NROW(dm)){
            if (!(i %% 100)) cat("i = ", i, "\n")  # progress every 100 rows
            dm[i, 1] <- dm[i - 1, 1]
        }
        dfr$x <- dm[, 1]
    }
}
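For comparison, a third variant would pull the column out into a plain
numeric vector, loop over that, and write it back into the data frame just
once, avoiding the repeated data-frame replacement altogether. A minimal
sketch (the name dumkoll_vec is mine, and I have not timed it):

dumkoll_vec <- function(n = 1000){
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    x <- dfr$x                # plain numeric vector, no data frame involved
    for (i in 2:n){
        x[i] <- x[i - 1]
    }
    dfr$x <- x                # a single assignment back into the data frame
    invisible(dfr)
}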
--------------------
> system.time(dumkoll())
   user  system elapsed
  0.046   0.000   0.045
> system.time(dumkoll(df = FALSE))
   user  system elapsed
  0.007   0.000   0.008
----------------------
OK, no big deal. But then I stumbled over a data frame with one million
records. With df = TRUE, the timing was:
----------------------------
      user     system    elapsed
 44677.141   1271.544  46016.754
----------------------------
That is almost 13 hours.
With df = FALSE, it took only six seconds: about 7500 times faster.
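If each dfr$x[i] <- ... assignment copies the whole column (or the whole
data frame), the df = TRUE loop would be roughly quadratic in n, which
would fit these numbers: 1000 times more rows, about a million times
slower. A quick way to probe the scaling on smaller sizes (a sketch, not
run here) would be:

for (n in c(10000, 20000, 40000)){
    ## If the loop is quadratic, doubling n should roughly quadruple 'elapsed'.
    t <- system.time(dumkoll(n))["elapsed"]
    cat("n =", n, " elapsed =", t, "seconds\n")
}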
I was really surprised by the huge difference, and I wonder whether this is
to be expected or whether it is some peculiarity of my installation: I'm
running R-3.0.3 under Ubuntu 13.10 on a MacBook Pro with 8 GB of memory.
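One way to check whether the data-frame assignments really duplicate the
object is tracemem(), provided the R build has memory profiling enabled
(a sketch):

dfr <- data.frame(x = rnorm(5), y = rnorm(5))
tracemem(dfr)            # prints a message each time dfr is duplicated
dfr$x[2] <- dfr$x[1]     # one replacement like the one in the loop
untracemem(dfr)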
Göran B.