environment[[names[j]]][i]<-I does seem to take longer than environment$name[i]<-I when looping over the length of the vector involved. Using a list instead of an environment seems to work better.
Below are 4 functions that populate a data.frame row by row. f0 : the naïve approach f1 : use 3 vectors in an environment, use env$name[i] <- newValue f1a : use n vectors in an environment, use env[[names[j]]][i] <- newValue f2 : like f1a, but using a list, use env[[j]][i] <- newValue I compared times with > nRows <- c(10000, 20000, 40000) > fs <- list(f0, f1, f1a, f2) > times <- matrix(NA, nrow=length(nRows), ncol=length(fs), dimnames=list(nRow=as.character(nRows), func=c("f0","f1","f1a","f2"))) > for(i in seq_along(nRows)) for(j in seq_along(fs)) times[i,j] <- system.time(fs[[j]](nRows[i]))["elapsed"] > times func nRow f0 f1 f1a f2 10000 8.61 0.19 1.79 0.24 20000 22.52 0.36 5.53 0.45 40000 72.11 0.70 17.36 0.88 f1 and f2 scale roughly linearly up 2 million rows and produce identical results: > system.time(r1 <- f1(2e6)) user system elapsed 48.38 2.25 52.72 > system.time(r2 <- f2(2e6)) user system elapsed 46.34 1.47 50.39 > identical(r1, r2) [1] TRUE (All timing done on a 5 year old 32-bit Windows laptop with 2 Gb of RAM.) f0 <- function (nRow) { incrSize <- 10000 curSize <- 10000 data <- data.frame(x = numeric(curSize), y = numeric(curSize), z = numeric(curSize)) for (i in seq_len(nRow)) { if (i > curSize) { data <- rbind(data, data.frame(x = numeric(incrSize), y = numeric(incrSize), z = numeric(incrSize))) curSize <- nrow(data) } data[i, ] <- c(i + 0.1, i + 0.2, i + 0.3) } data[seq_len(nRow), , drop = FALSE] } f1 <- function (nRow) { incrSize <- 10000 curSize <- min(10000, nRow) data <- as.environment(list(x = numeric(curSize), y = numeric(curSize), z = numeric(curSize))) for (i in seq_len(nRow)) { if (i > curSize) { curSize <- min(curSize + incrSize, nRow) for (name in objects(data)) { length(data[[name]]) <- curSize } } data$x[i] <- i + 0.1 data$y[i] <- i + 0.2 data$z[i] <- i + 0.3 } data.frame(as.list(data)[c("x","y","z")]) } f1a <- function (nRow, colNames=c("x","y","z")) { incrSize <- 10000 curSize <- min(10000, nRow) data <- as.environment(list(x = numeric(curSize), y = numeric(curSize), z = numeric(curSize))) colNums <- seq_along(colNames) for (i in seq_len(nRow)) { if (i > curSize) { curSize <- min(curSize + incrSize, nRow) for (name in objects(data)) { length(data[[name]]) <- curSize } } for(j in colNums) { data[[colNames[j]]][i] <- i + j/10 } } data.frame(as.list(data)[colNames]) } f2 <- function (nRow, colNames=c("x","y","z")) { incrSize <- 10000 curSize <- min(10000, nRow) colNums <- seq_along(colNames) data <- lapply(colNames, function(nm) numeric(curSize)) for (i in seq_len(nRow)) { if (i > curSize) { curSize <- min(curSize + incrSize, nRow) for (j in colNums) { length(data[[j]]) <- curSize } } for(j in colNums) { data[[j]][i] <- i + j/10 } } names(data) <- colNames data.frame(data) } Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com From: Peter Meilstrup [mailto:peter.meilst...@gmail.com] Sent: Wednesday, February 15, 2012 1:39 AM To: William Dunlap Cc: r-help@r-project.org Subject: Re: [R] Filling out a data frame row by row.... slow! On Tue, Feb 14, 2012 at 2:31 PM, William Dunlap <wdun...@tibco.com<mailto:wdun...@tibco.com>> wrote: If you must repeatedly append rows to a data.frame, try making the dataset you are filling in a bunch of independent vectors, perhaps in a new environment to keep things organized. One complication is I don't know the names of the columns I'm assigning to before I read them off the file. And crazily, if I change this: data$x[i] <- i + 0.1 where data is an environment and x a primitive vector, to use a computed name instead: data[[colname]][i] <- i + 0.1 Then I get back to way-superlinear performance. Eventually I found I could work around it like: eval(substitute(var[ix] <- data, list(var=as.name<http://as.name>(colname), ix=i, data = i+0.1)), envir = data) but... as workarounds go that seems to be on the crazy nuts end of the scale. Why does [[]] impose a performance penalty? Peter [[alternative HTML version deleted]]
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.