You need to isolate the problem further, or give more detail about your data. This is what I get:

R> nr <- 2134
R> nc <- 14037
R> x <- matrix(runif(nr*nc), nr, nc)
R> n.na <- round(nr*nc/10)
R> x[sample(nr*nc, n.na)] <- NA
R> system.time(x.fixed <- na.roughfix(x))
   user  system elapsed
   8.44    0.39    8.85

R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB RAM.

Andy

________________________________
From: Mike Williamson [mailto:this.is....@gmail.com]
Sent: Thursday, July 01, 2010 12:48 PM
To: Liaw, Andy
Cc: r-help
Subject: Re: [R] anyone know why package "RandomForest" na.roughfix is so slow??

Andy,

You're right, I didn't supply any code, because my call was very simple and it was the call itself in question. However, here is the associated code I am using:

naFixTime <- system.time( {
    if (fltrResponse) {  ## TRUE: there are no NA's in the response... cleared via earlier steps
        message(paste(iAm, ": Missing values will now be imputed...\n", sep=""))
        try( dataSet <- rfImpute(dataSet[,!is.element(names(dataSet), response)],
                                 dataSet[,response]) )
    } else {  ## In this case, there is no "response" column in the data set
        message(paste(iAm, ": Missing values will now be filled in with median",
                      " values or most frequent levels", sep=""))
        try( dataSet <- na.roughfix(dataSet) )
    }
} )

As you can see, the "na.roughfix" call is made as simply as possible: I supply the entire dataSet (only parameters, no responses). I am not doing the prediction here (that is done later, and the prediction itself is not taking very long).

Here are some calculation times that I experienced:

# rows    # cols    time to run na.roughfix
=======   =======   =======================
2046      2833      ~ 2 minutes
2066      5626      ~ 6 minutes
2134      14037     ~ 30 minutes

These numbers are on a Windows server using the 64-bit version of 'R'.

Regards,
Mike

"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
-- xkcd

--
Help protect Wikipedia.
Donate now: http://wikimediafoundation.org/wiki/Support_Wikipedia/en


On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_l...@merck.com> wrote:

You have not shown any code on exactly how you use na.roughfix(), so I can only guess. If you are doing something like:

randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)

I would not be surprised that it's taking very long on large datasets. Most likely it's caused by the formula interface, not na.roughfix() itself.

If that is your case, try doing the imputation beforehand and run randomForest() afterward; e.g.,

myroughfixed <- na.roughfix(mybigdata)
randomForest(myroughfixed[list.of.predictor.columns], myroughfixed[[myresponse]],...)

HTH,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Mike Williamson
Sent: Wednesday, June 30, 2010 7:53 PM
To: r-help
Subject: [R] anyone know why package "RandomForest" na.roughfix is so slow??

Hi all,

I am using the package "random forest" for random forest predictions. I like the package. However, I have fairly large data sets, and it can often take *hours* just to go through the "na.roughfix" call, which simply goes through and cleans up any NA values to either the median (numerical data) or the most frequent occurrence (factors).

I am going to start doing some comparisons between na.roughfix() and some apply() functions which, it seems, are able to do the same job more quickly. But I hesitate to duplicate a function that is already in the package, since I presume the na.roughfix should be as quick as possible and it should also be well "tailored" to the requirements of random forest.

Has anyone else seen that this is really slow? (I haven't noticed rfImpute to be nearly as slow, but I cannot say for sure: my "predict" data sets are MUCH larger than my model data sets, so cleaning the prediction data set simply takes much longer.)

If so, any ideas how to speed this up?

Thanks!
Mike

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
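
The apply()-style alternative raised in the original message (median for numeric columns, most frequent level for factors) could look roughly like the base R sketch below. The names impute_column and rough_fix_sketch are hypothetical and are not part of randomForest. This is not the package's implementation, it assumes a data frame whose columns are numeric or factor, and its speed relative to na.roughfix() has not been measured here.

```r
# Hypothetical helper (not from randomForest): fill the NAs of one column.
impute_column <- function(col) {
  missing <- is.na(col)
  if (!any(missing)) return(col)          # nothing to do for complete columns
  if (is.numeric(col)) {
    # Numeric column: replace NAs with the median of the observed values.
    col[missing] <- median(col, na.rm = TRUE)
  } else if (is.factor(col)) {
    # Factor column: replace NAs with the most frequent level.
    counts <- table(col)                  # table() ignores NA by default
    col[missing] <- names(counts)[which.max(counts)]
  }
  col                                     # other column types are left untouched
}

# Hypothetical wrapper: apply the helper to every column of a data frame.
rough_fix_sketch <- function(df) {
  df[] <- lapply(df, impute_column)       # df[] keeps the data.frame structure
  df
}

# Small made-up data set, for illustration only.
toy <- data.frame(
  x = c(1, NA, 3, 10),
  g = factor(c("a", "b", NA, "b"))
)
fixed <- rough_fix_sketch(toy)
print(fixed)
```

Wrapping the call in system.time(), as in the benchmark earlier in the thread, would allow a side-by-side comparison with na.roughfix() on the real data.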