If you must repeatedly append rows to a data.frame, try making the dataset you are filling in a bunch of independent vectors, perhaps in a new environment to keep things organized, and expand each at the same time. At the very end make a data.frame out of those vectors. E.g., change the likes of
f0 <- function (nRow) { incrSize <- 10000 curSize <- 10000 data <- data.frame(x = numeric(curSize), y = numeric(curSize), z = numeric(curSize)) for (i in seq_len(nRow)) { if (i > curSize) { data <- rbind(data, data.frame(x = numeric(incrSize), y = numeric(incrSize), z = numeric(incrSize))) curSize <- nrow(data) } data[i, ] <- c(i + 0.1, i + 0.2, i + 0.3) } data[seq_len(nRow), , drop = FALSE] } to f1 <- function (nRow) { incrSize <- 10000 curSize <- min(10000, nRow) data <- as.environment(list(x = numeric(curSize), y = numeric(curSize), z = numeric(curSize))) for (i in seq_len(nRow)) { if (i > curSize) { curSize <- min(curSize + incrSize, nRow) for (name in objects(data)) { length(data[[name]]) <- curSize } } data$x[i] <- i + 0.1 data$y[i] <- i + 0.2 data$z[i] <- i + 0.3 } data.frame(as.list(data)) # use x=data$x, y=data$y, ... if order is important. } Here are some timing results for the above functions > system.time(r1 <- f1(5000)) user system elapsed 0.13 0.00 0.14 > system.time(r1 <- f1(15000)) user system elapsed 0.33 0.00 0.32 > system.time(r1 <- f1(25000)) user system elapsed 0.51 0.00 0.47 > > system.time(r0 <- f0(5000)) user system elapsed 5.23 0.02 5.13 > system.time(r0 <- f0(15000)) user system elapsed 21.75 0.00 20.67 > system.time(r0 <- f0(25000)) user system elapsed 87.31 0.01 86.00 > # results are same, except for the order of the columns > all.equal(r0[, c("x","y","z")], r1[, c("x","y","z")]) [1] TRUE For 2 million rows f1 is getting a little superlinear: 2e6/25000 * .5 = 40 seconds, if time linear in nRow, but I get 55 s. > system.time(r1 <- f1(2e6)) user system elapsed 52.19 3.81 54.69 Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com > -----Original Message----- > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On > Behalf Of Peter Meilstrup > Sent: Tuesday, February 14, 2012 1:47 PM > To: r-help@r-project.org > Subject: [R] Filling out a data frame row by row.... slow! > > I'm reading a file and using the file to populate a data frame. The way the > file is laid out, I need to fill in the data frame one row at a time. > > When I start reading my file, I don't know how many rows I will need. It's > on the order of a million. > > Being mindful of the time expense of reallocation, I decided on a strategy > of doubling the data frame size every time I needed to expand it ... > therefore memory is never more than 50% wasted, and it should still finish > in O(N) time. > > But it still, somehow has an O(N^2) performance characteristic. It seems > like just setting a single element is slow in a larger data frame as > compared to a smaller one. Here is a toy function to illustrate, > reallocating and filling in single rows in a data frame, and shows the > slowdown: > > populate.data.frame.test <- function(n=1000000, chunk=1000) { > i = 0; > df <- data.frame(a=numeric(0), b=numeric(0), c=numeric(0)); > t <- proc.time()[2] > for (i in 1:n) { > if (i %% chunk == 0) { > elapsed <- -(t - (t <- proc.time()[2])) > cat(sprintf("%d rows: %g rows per sec, nrows = %d\n", i, > chunk/elapsed, nrow(df))) > > } > > ##double data frame size if necessary > while (nrow(df)<i) { > df[max(i, 2*nrow(df)),] <- NA > cat(sprintf("Doubled to %d rows\n", nrow(df))); > } > > ##fill in one row > df[i, c('a', 'b', 'c')] <- list(runif(1), i, runif(1)) > } > } > > Is there a way to do this that avoids the slowdown? The data cannot be > represented as a matrix (different columns have different types.) > > Peter > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.