env[[names[j]]][i] <- newValue does seem to take longer than
env$name[i] <- newValue when looping over the length of the
vector involved.  Using a list instead of an environment seems
to work better.
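
For instance, here is a minimal sketch of the effect (the names n,
e, nm, and lst are made up for illustration; timings will vary by
machine):
  > n <- 1e5
  > e <- as.environment(list(x = numeric(n)))
  > nm <- "x"
  > system.time(for(i in seq_len(n)) e$x[i] <- i)       # fixed name: fast
  > system.time(for(i in seq_len(n)) e[[nm]][i] <- i)   # computed name: slow
  > lst <- list(x = numeric(n))
  > system.time(for(i in seq_len(n)) lst[[nm]][i] <- i) # list: fast again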

Below are 4 functions that populate a data.frame row by row.
   f0  : the naïve approach, growing a data.frame with rbind()
   f1  : use 3 vectors in an environment, use env$name[i] <- newValue
   f1a : use n vectors in an environment, use env[[names[j]]][i] <- newValue
   f2  : like f1a, but using a list, use data[[j]][i] <- newValue

I compared times with
  > nRows <- c(10000, 20000, 40000)
  > fs <- list(f0, f1, f1a, f2)
  > times <- matrix(NA, nrow=length(nRows), ncol=length(fs),
  +     dimnames=list(nRow=as.character(nRows), func=c("f0","f1","f1a","f2")))
  > for(i in seq_along(nRows)) for(j in seq_along(fs)) times[i,j] <-
  +     system.time(fs[[j]](nRows[i]))["elapsed"]
  > times
        func
  nRow       f0   f1   f1a   f2
   10000  8.61 0.19  1.79 0.24
   20000 22.52 0.36  5.53 0.45
   40000 72.11 0.70 17.36 0.88

f0 and f1a grow super-linearly with the number of rows, while f1
and f2 scale roughly linearly up to 2 million rows and produce
identical results:
  > system.time(r1 <- f1(2e6))
    user  system elapsed
    48.38    2.25   52.72
  > system.time(r2 <- f2(2e6))
    user  system elapsed
    46.34    1.47   50.39
  > identical(r1, r2)
  [1] TRUE

(All timings were done on a five-year-old 32-bit Windows laptop with 2 GB of RAM.)

f0 <- function (nRow)
{
    ## the naïve approach: grow a data.frame with rbind() and
    ## assign into it row by row
    incrSize <- 10000
    curSize <- 10000
    data <- data.frame(x = numeric(curSize), y = numeric(curSize),
        z = numeric(curSize))
    for (i in seq_len(nRow)) {
        if (i > curSize) {
            data <- rbind(data, data.frame(x = numeric(incrSize),
                y = numeric(incrSize), z = numeric(incrSize)))
            curSize <- nrow(data)
        }
        data[i, ] <- c(i + 0.1, i + 0.2, i + 0.3)
    }
    data[seq_len(nRow), , drop = FALSE]
}


f1 <- function (nRow)
{
    ## 3 vectors in an environment, filled via data$x[i] <- newValue
    incrSize <- 10000
    curSize <- min(10000, nRow)
    data <- as.environment(list(x = numeric(curSize), y = numeric(curSize),
        z = numeric(curSize)))
    for (i in seq_len(nRow)) {
        if (i > curSize) {
            curSize <- min(curSize + incrSize, nRow)
            for (name in objects(data)) {
                length(data[[name]]) <- curSize
            }
        }
        data$x[i] <- i + 0.1
        data$y[i] <- i + 0.2
        data$z[i] <- i + 0.3
    }
    data.frame(as.list(data)[c("x","y","z")])
}

f1a <- function (nRow, colNames=c("x","y","z"))
{
    ## like f1, but columns are addressed by computed name:
    ## data[[colNames[j]]][i] <- newValue
    incrSize <- 10000
    curSize <- min(10000, nRow)
    data <- lapply(colNames, function(nm) numeric(curSize))
    names(data) <- colNames
    data <- as.environment(data)  # built from colNames, not hard-coded x/y/z
    colNums <- seq_along(colNames)
    for (i in seq_len(nRow)) {
        if (i > curSize) {
            curSize <- min(curSize + incrSize, nRow)
            for (name in objects(data)) {
                length(data[[name]]) <- curSize
            }
        }
        for(j in colNums) {
            data[[colNames[j]]][i] <- i + j/10
        }
    }
    data.frame(as.list(data)[colNames])
}

f2 <- function (nRow, colNames=c("x","y","z"))
{
    ## like f1a, but using a list instead of an environment:
    ## data[[j]][i] <- newValue
    incrSize <- 10000
    curSize <- min(10000, nRow)
    colNums <- seq_along(colNames)
    data <- lapply(colNames, function(nm) numeric(curSize))
    for (i in seq_len(nRow)) {
        if (i > curSize) {
            curSize <- min(curSize + incrSize, nRow)
            for (j in colNums) {
                length(data[[j]]) <- curSize
            }
        }
        for(j in colNums) {
            data[[j]][i] <- i + j/10
        }
    }
    names(data) <- colNames
    data.frame(data)
}
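
Since f1a and f2 take the column names as an argument, they also
cover the case (raised below) where the names are only known at run
time, e.g. (made-up names):
  > cols <- c("alpha", "beta")   # e.g. read from a file header
  > f2(5, colNames = cols)
    alpha beta
  1   1.1  1.2
  2   2.1  2.2
  3   3.1  3.2
  4   4.1  4.2
  5   5.1  5.2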

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
From: Peter Meilstrup [mailto:peter.meilst...@gmail.com]
Sent: Wednesday, February 15, 2012 1:39 AM
To: William Dunlap
Cc: r-help@r-project.org
Subject: Re: [R] Filling out a data frame row by row.... slow!

On Tue, Feb 14, 2012 at 2:31 PM, William Dunlap <wdun...@tibco.com> wrote:
If you must repeatedly append rows to a data.frame,
try making the dataset you are filling in a bunch
of independent vectors, perhaps in a new environment
to keep things organized.

One complication is that I don't know the names of the columns I'm
assigning to before I read them off the file. And crazily, if I change this:
       data$x[i] <- i + 0.1

where data is an environment and x a primitive vector, to use a computed name 
instead:

       data[[colname]][i] <- i + 0.1

Then I get back to way-superlinear performance. Eventually I found I could work 
around it like:

       eval(substitute(var[ix] <- data,
                       list(var = as.name(colname), ix = i, data = i + 0.1)),
            envir = data)

but... as workarounds go that seems to be on the crazy nuts end of the scale. 
Why does [[]] impose a performance penalty?

Peter
