If for each combination of X and Y there is at most one Z and Z is numeric (it could be made so in your example) then you could use xtabs which is faster:
> m <- n <- 10 > DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = 10*(1:(n*m))) > system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide")) [1] 0.33 0.00 0.33 NA NA > system.time(w2 <- xtabs(Z ~ Y + X, DF)) [1] 0.01 0.00 0.01 NA NA On 8/24/06, Mitch Skinner <[EMAIL PROTECTED]> wrote: > Hello, > > I'm mailing r-devel because I think the performance problem I'm having > is best solved by re-writing reshape() in C. I've been reading the > "writing R extensions" documentation and I have some questions about > they best way to write the C bits, but first let me describe my problem: > > I'm trying to reshape a long data frame with ~70 subjects measured at > ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into > a wide data frame with one column per locus ("time"). On my data > (~300,000 rows for chromosome 1) it takes about twenty minutes on a > 3.4GHz P4. Here's an R session that demonstrates it (this is pretty > close to how my data actually looks): > > > version > _ > platform i686-redhat-linux-gnu > arch i686 > os linux-gnu > system i686, linux-gnu > status > major 2 > minor 3.1 > year 2006 > month 06 > day 01 > svn rev 38247 > language R > version.string Version 2.3.1 (2006-06-01) > > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) + > 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))), > genotype=I(as.character(as.integer(runif(3e5, 1, 100))))) > > system.time(testWide <- reshape(test, v.names=c("genotype"), > timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE) > [1] 1107.506 120.755 1267.568 0.000 0.000 > > I believe that this can be done a lot faster, and I think the problem > comes from this loop in reshape.R (in reshapeWide): > > for(i in seq(length = length(times))) { > thistime <- data[data[,timevar] %in% times[i],] > rval[,varying[,i]] <- > thistime[match(rval[,idvar],thistime[,idvar]), > v.names] > } > > I don't really understand everything going on under the hood here, but I > believe the complexity of this loop is something like > O(length(times)*(nrow(data)+(nrow(rval)*length(varying))). The profile > shows the lion's share (90%) of the time being spent in [.data.frame. > > What I'd like to do is write a C loop to go through the source (long) > data frame once (rather than once per time), and put the values into the > destination rows/columns using hash tables to look up the right > row/column. > > Assuming the hashing is constant-time bounded, then the reshape becomes > O(nrow(data)*length(varying)). > > I'd like to use the abitrary-R-vector hashing functionality from > unique.c, but I'm not sure how to do it. I think that functionality is > useful to other C code, but the functions that I'm interested in are not > part of the R api (they're defined with "static"). Assuming copy/paste > is out, I can see two options: 1. to remove "static" from the > declarations of some of the functions, and add prototypes for those > functions to a new src/library/stats/src/reshape.c file (or to one of > the header files in src/include), or 2. to add C functions to do the > reshaping to src/main/unique.c and call those from > src/library/stats/R/reshape.R. > > This is all assuming that it makes sense to include this in mainline > R--obviously I think it's worthwhile, and I'm surprised other people > aren't complaining more. I would be happy to write/rewrite until y'all > are happy with how it's done, of course. > > I've written a proof-of-concept by copying and pasting the hashing > functions, which (on the above data frame) runs 20 times faster than the > R version of reshape. It still needs some debugging, to put it mildly > (doesn't work properly on reals), but the basic idea appears to work. > > The change I made to the reshape R function looks like this: > ===================== > for(i in seq(length = length(times))) { > - thistime <- data[data[,timevar] %in% times[i],] > - rval[,varying[,i]] <- > thistime[match(rval[,idvar],thistime[,idvar]), > - v.names] > + for (j in seq(length(v.names))) { > + rval[,varying[j,i]] <- > rep(vector(mode=typeof(data[[v.names[j]]]), 0), > + length.out=nrow(rval)) > + } > } > > + colMap <- match(varying, names(rval)) > + .Call("do_reshape_wide", > + data[[idvar]], data[[timevar]], data[v.names], > + rval, colMap, > + v.names, times, rval[[idvar]]) > + > if (!is.null(new.row.names)) > row.names(rval) <- new.row.names > ===================== > > This part: > rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval)) > is to initialize the output with appropriately-typed vectors full of > NAs; if there's a better/right way to do this please let me know. > > The do_reshape_wide C function looks like this: > ===================== > SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData, > SEXP wideFrame, SEXP colMap, > SEXP vnames, SEXP times, SEXP wideIds) { > HashData idHash, timeHash; > int longRows, numVarying; > int rowNum, varying; > int wideRow, curTime; > SEXP wideCol, longCol; > > HashTableSetup(wideIds, &idHash); > PROTECT(idHash.HashTable); > DoHashing(wideIds, &idHash); > > HashTableSetup(times, &timeHash); > PROTECT(timeHash.HashTable); > DoHashing(times, &timeHash); > > longRows = length(longIds); > numVarying = length(vnames); > > for (rowNum = 0; rowNum < longRows; rowNum++) { > /* Lookup returns 1-based answers */ > wideRow = Lookup(wideIds, longIds, rowNum, &idHash) - 1; > curTime = Lookup(times, longTimes, rowNum, &timeHash) - 1; > > for (varying = 0; varying < numVarying; varying++) { > /* colMap is 1-based */ > wideCol = VECTOR_ELT(wideFrame, > INTEGER(colMap)[(numVarying * curTime) > + varying] - 1); > longCol = VECTOR_ELT(longData, varying); > > SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol, > rowNum)); > } > } > UNPROTECT(2); > } > ===================== > > None examples I recall from "Writing R extensions" had void C function > return types; is that a rule? > > I've put the code up here: > http://arctur.us/r/creshape.tar.gz > > I've never tried to make a package before, but you should be able to > just do a R CMD INSTALL on it, if you want to try it out. Here's an R > session using that code: > > require(creshape) > Loading required package: creshape > [1] TRUE > > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) + > 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))), > genotype=I(as.character(as.integer(runif(3e5, 1, 100))))) > > system.time(testWide <- creshape(test, v.names=c("genotype"), > timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE) > [1] 60.756 1.540 62.598 0.000 0.000 > > system.time(testWide <- reshape(test, v.names=c("genotype"), > timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE) > [1] 1278.231 78.389 1387.739 0.000 0.000 > > Any comments are appreciated, > Mitch > > ______________________________________________ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel