I have always known that "matrices are faster than data frames"; consider, for instance, this function:

dumkoll <- function(n = 1000, df = TRUE){
    dfr <- data.frame(x = rnorm(n), y = rnorm(n))
    if (df){
        ## data-frame version: subassign one element at a time
        for (i in 2:NROW(dfr)){
            if (!(i %% 100)) cat("i = ", i, "\n")
            dfr$x[i] <- dfr$x[i - 1]
        }
    } else {
        ## matrix version: the same loop on a numeric matrix
        dm <- as.matrix(dfr)
        for (i in 2:NROW(dm)){
            if (!(i %% 100)) cat("i = ", i, "\n")
            dm[i, 1] <- dm[i - 1, 1]
        }
        dfr$x <- dm[, 1]
    }
}

--------------------
> system.time(dumkoll())

   user  system elapsed
  0.046   0.000   0.045

> system.time(dumkoll(df = FALSE))

   user  system elapsed
  0.007   0.000   0.008
----------------------

OK, no big deal, but I stumbled over a data frame with one million records. Then, with df = TRUE,
----------------------------
     user    system   elapsed
44677.141  1271.544 46016.754
----------------------------
This is around 12 hours.

With df = FALSE it took only about six seconds, roughly 7500 times faster.

I was really surprised by the huge difference, and I wonder whether this is to be expected or a peculiarity of my installation: Ubuntu 13.10 on a MacBook Pro with 8 GB of memory, running R 3.0.3.
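For what it's worth, a plausible explanation is R's copy-on-modify semantics: each `dfr$x[i] <- ...` inside the loop can force a fresh copy of the column (and, in older R versions, much of the data frame), giving roughly quadratic cost in n, whereas matrix element assignment modifies the object in place. A sketch of a workaround that keeps the data frame but loops over an extracted vector (this is my own illustrative rewrite of the df = TRUE branch, not code from the function above):

```r
## Sketch: avoid repeated data-frame subassignment by working on a
## plain vector and writing it back once.
n <- 1000
dfr <- data.frame(x = rnorm(n), y = rnorm(n))
x <- dfr$x                # extract once: a plain numeric vector
for (i in 2:n) {
    x[i] <- x[i - 1]      # vector assignment, no data-frame copy
}
dfr$x <- x                # write the modified column back once
```

If this explanation is right, the vector loop should scale about as well as the matrix loop, since both avoid touching the data frame inside the loop.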

Göran B.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
