Hello, I'm mailing r-devel because I think the performance problem I'm having is best solved by re-writing reshape() in C. I've been reading the "writing R extensions" documentation and I have some questions about they best way to write the C bits, but first let me describe my problem:
I'm trying to reshape a long data frame with ~70 subjects measured at ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into a wide data frame with one column per locus ("time"). On my data (~300,000 rows for chromosome 1) it takes about twenty minutes on a 3.4GHz P4. Here's an R session that demonstrates it (this is pretty close to how my data actually looks): > version _ platform i686-redhat-linux-gnu arch i686 os linux-gnu system i686, linux-gnu status major 2 minor 3.1 year 2006 month 06 day 01 svn rev 38247 language R version.string Version 2.3.1 (2006-06-01) > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) + 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))), genotype=I(as.character(as.integer(runif(3e5, 1, 100))))) > system.time(testWide <- reshape(test, v.names=c("genotype"), timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE) [1] 1107.506 120.755 1267.568 0.000 0.000 I believe that this can be done a lot faster, and I think the problem comes from this loop in reshape.R (in reshapeWide): for(i in seq(length = length(times))) { thistime <- data[data[,timevar] %in% times[i],] rval[,varying[,i]] <- thistime[match(rval[,idvar],thistime[,idvar]), v.names] } I don't really understand everything going on under the hood here, but I believe the complexity of this loop is something like O(length(times)*(nrow(data)+(nrow(rval)*length(varying))). The profile shows the lion's share (90%) of the time being spent in [.data.frame. What I'd like to do is write a C loop to go through the source (long) data frame once (rather than once per time), and put the values into the destination rows/columns using hash tables to look up the right row/column. Assuming the hashing is constant-time bounded, then the reshape becomes O(nrow(data)*length(varying)). I'd like to use the abitrary-R-vector hashing functionality from unique.c, but I'm not sure how to do it. I think that functionality is useful to other C code, but the functions that I'm interested in are not part of the R api (they're defined with "static"). Assuming copy/paste is out, I can see two options: 1. to remove "static" from the declarations of some of the functions, and add prototypes for those functions to a new src/library/stats/src/reshape.c file (or to one of the header files in src/include), or 2. to add C functions to do the reshaping to src/main/unique.c and call those from src/library/stats/R/reshape.R. This is all assuming that it makes sense to include this in mainline R--obviously I think it's worthwhile, and I'm surprised other people aren't complaining more. I would be happy to write/rewrite until y'all are happy with how it's done, of course. I've written a proof-of-concept by copying and pasting the hashing functions, which (on the above data frame) runs 20 times faster than the R version of reshape. It still needs some debugging, to put it mildly (doesn't work properly on reals), but the basic idea appears to work. The change I made to the reshape R function looks like this: ===================== for(i in seq(length = length(times))) { - thistime <- data[data[,timevar] %in% times[i],] - rval[,varying[,i]] <- thistime[match(rval[,idvar],thistime[,idvar]), - v.names] + for (j in seq(length(v.names))) { + rval[,varying[j,i]] <- rep(vector(mode=typeof(data[[v.names[j]]]), 0), + length.out=nrow(rval)) + } } + colMap <- match(varying, names(rval)) + .Call("do_reshape_wide", + data[[idvar]], data[[timevar]], data[v.names], + rval, colMap, + v.names, times, rval[[idvar]]) + if (!is.null(new.row.names)) row.names(rval) <- new.row.names ===================== This part: rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval)) is to initialize the output with appropriately-typed vectors full of NAs; if there's a better/right way to do this please let me know. The do_reshape_wide C function looks like this: ===================== SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData, SEXP wideFrame, SEXP colMap, SEXP vnames, SEXP times, SEXP wideIds) { HashData idHash, timeHash; int longRows, numVarying; int rowNum, varying; int wideRow, curTime; SEXP wideCol, longCol; HashTableSetup(wideIds, &idHash); PROTECT(idHash.HashTable); DoHashing(wideIds, &idHash); HashTableSetup(times, &timeHash); PROTECT(timeHash.HashTable); DoHashing(times, &timeHash); longRows = length(longIds); numVarying = length(vnames); for (rowNum = 0; rowNum < longRows; rowNum++) { /* Lookup returns 1-based answers */ wideRow = Lookup(wideIds, longIds, rowNum, &idHash) - 1; curTime = Lookup(times, longTimes, rowNum, &timeHash) - 1; for (varying = 0; varying < numVarying; varying++) { /* colMap is 1-based */ wideCol = VECTOR_ELT(wideFrame, INTEGER(colMap)[(numVarying * curTime) + varying] - 1); longCol = VECTOR_ELT(longData, varying); SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol, rowNum)); } } UNPROTECT(2); } ===================== None examples I recall from "Writing R extensions" had void C function return types; is that a rule? I've put the code up here: http://arctur.us/r/creshape.tar.gz I've never tried to make a package before, but you should be able to just do a R CMD INSTALL on it, if you want to try it out. Here's an R session using that code: > require(creshape) Loading required package: creshape [1] TRUE > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) + 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))), genotype=I(as.character(as.integer(runif(3e5, 1, 100))))) > system.time(testWide <- creshape(test, v.names=c("genotype"), timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE) [1] 60.756 1.540 62.598 0.000 0.000 > system.time(testWide <- reshape(test, v.names=c("genotype"), timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE) [1] 1278.231 78.389 1387.739 0.000 0.000 Any comments are appreciated, Mitch ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel