Re: [R] anyone know why package "RandomForest" na.roughfix is so slow??

Liaw, Andy Thu, 01 Jul 2010 10:12:40 -0700

You need to isolate the problem further, or give more detail about your
data.  This is what I get:
 
R> nr <- 2134
R> nc <- 14037
R> x <- matrix(runif(nr*nc), nr, nc)
R> n.na <- round(nr*nc/10)
R> x[sample(nr*nc, n.na)] <- NA
R> system.time(x.fixed <- na.roughfix(x))
   user  system elapsed 
   8.44    0.39    8.85


R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with
2GB ram.
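
One way to isolate the problem further is to time the same values stored as a matrix and as a data frame. This is only a sketch: the thread never states the class of `dataSet` (its use of `names()` suggests a data frame, but that is an assumption), and the sizes below are deliberately smaller than the ones reported so the comparison runs quickly.

```r
library(randomForest)

set.seed(42)
nr <- 500; nc <- 2000          # assumed sizes, much smaller than in the thread
x <- matrix(runif(nr * nc), nr, nc)
x[sample(nr * nc, round(nr * nc / 10))] <- NA

x.df <- as.data.frame(x)       # same values, stored as a data frame

# Time na.roughfix() on each representation of the same data
t.matrix <- system.time(fixed.m  <- na.roughfix(x))
t.df     <- system.time(fixed.df <- na.roughfix(x.df))

# user, system and elapsed times side by side
rbind(matrix = t.matrix, data.frame = t.df)[, 1:3]
```

If the two timings differ sharply, the storage class is a likely suspect; if they do not, the cause probably lies elsewhere in the data (for example, column types).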
 
Andy

________________________________

From: Mike Williamson [mailto:this.is....@gmail.com] 
Sent: Thursday, July 01, 2010 12:48 PM
To: Liaw, Andy
Cc: r-help
Subject: Re: [R] anyone know why package "RandomForest" na.roughfix is
so slow??


Andy,

    You're right, I didn't supply any code, because my call was very
simple and it was the call itself in question.  However, here is the
associated code I am using:


        naFixTime <- system.time( {
            if (fltrResponse) {
                ## TRUE: there are no NA's in the response... cleared via earlier steps
                message(paste(iAm, ": Missing values will now be imputed...\n", sep=""))
                try( dataSet <- rfImpute(dataSet[,!is.element(names(dataSet), response)],
                                         dataSet[,response]) )
            } else {
                ## In this case, there is no "response" column in the data set
                message(paste(iAm, ": Missing values will now be filled in with median",
                              " values or most frequent levels", sep=""))
                try( dataSet <- na.roughfix(dataSet) )
            }
        } )



    As you can see, the "na.roughfix" call is made as simply as
possible:  I supply the entire dataSet (only parameters, no responses).
I am not doing the prediction here (that is done later, and the
prediction itself is not taking very long).
    Here are some calculation times that I experienced:

# rows       # cols       time to run na.roughfix
=======     =======     ====================
  2046          2833             ~ 2 minutes
  2066          5626             ~ 6 minutes
  2134         14037             ~ 30 minutes

    These numbers are on a Windows server using the 64-bit version of
'R'.
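
The reported timings can be checked for how they scale with the number of columns. The sketch below uses only the approximate figures from the table above; the variable names are illustrative. A log-log slope of 1 would mean the time grows linearly with the column count.

```r
timings <- data.frame(
  rows    = c(2046, 2066, 2134),
  cols    = c(2833, 5626, 14037),
  minutes = c(2, 6, 30)        # approximate values from the table above
)

# Seconds spent per column; this would stay constant under linear scaling
timings$sec.per.col <- timings$minutes * 60 / timings$cols

# Slope of log(time) against log(cols); 1 would mean linear in the column count
scaling <- coef(lm(log(minutes) ~ log(cols), data = timings))[["log(cols)"]]

print(timings)
print(scaling)
```

With these three points the per-column cost rises at every step and the slope comes out at roughly 1.7, so the run time grows faster than linearly in the number of columns. Three approximate measurements are thin evidence, so this is a hint rather than a firm conclusion.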

                                          Regards,
                                                   Mike


"Telescopes and bathyscaphes and sonar probes of Scottish lakes,
Tacoma Narrows bridge collapse explained with abstract phase-space maps,
Some x-ray slides, a music score, Minard's Napoleonic war:
The most exciting frontier is charting what's already here."
 -- xkcd

--
Help protect Wikipedia. Donate now:
http://wikimediafoundation.org/wiki/Support_Wikipedia/en



On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy <andy_l...@merck.com> wrote:


        You have not shown any code on exactly how you use na.roughfix(),
        so I can only guess.
        
        If you are doing something like:
        
         randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)
        
        I would not be surprised that it's taking very long on large
        datasets.  Most likely it's caused by the formula interface, not
        na.roughfix() itself.
        
        If that is your case, try doing the imputation beforehand and run
        randomForest() afterward; e.g.,
        
        myroughfixed <- na.roughfix(mybigdata)
        randomForest(myroughfixed[list.of.predictor.columns],
        myroughfixed[[myresponse]],...)
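
A small runnable version of the suggestion above, using the built-in `iris` data as a stand-in. The data set, the number of injected NAs and `ntree = 50` are all assumptions chosen to keep the example short; they are not from the original post.

```r
library(randomForest)

set.seed(1)
dat <- iris
dat[sample(nrow(dat), 15), "Sepal.Length"] <- NA   # inject some missing values

# Step 1: impute once, outside of randomForest()
fixed <- na.roughfix(dat)

# Step 2: fit through the x/y interface rather than the formula interface
predictors <- setdiff(names(fixed), "Species")
fit <- randomForest(fixed[predictors], fixed[["Species"]], ntree = 50)

print(fit)
```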
        
        HTH,
        Andy
        

        -----Original Message-----
        From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
        On Behalf Of Mike Williamson
        Sent: Wednesday, June 30, 2010 7:53 PM
        To: r-help
        Subject: [R] anyone know why package "RandomForest" na.roughfix is so slow??
        
        Hi all,
        
           I am using the randomForest package for random forest predictions.
        I like the package.  However, I have fairly large data sets, and it
        can often take *hours* just to go through the "na.roughfix" call,
        which simply goes through and replaces any NA values with either the
        median (numerical data) or the most frequent level (factors).
           I am going to start doing some comparisons between na.roughfix()
        and some apply() functions which, it seems, are able to do the same
        job more quickly.  But I hesitate to duplicate a function that is
        already in the package, since I presume na.roughfix should be as
        quick as possible and should also be well "tailored" to the
        requirements of random forest.
        
           Has anyone else seen that this is really slow?  (I haven't noticed
        rfImpute to be nearly as slow, but I cannot say for sure:  my
        "predict" data sets are MUCH larger than my model data sets, so
        cleaning the prediction data set simply takes much longer.)
           If so, any ideas how to speed this up?
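
For the comparison mentioned above, a base-R stand-in for the median / most-frequent-level fill might look like the sketch below. `rough.fix` is a hypothetical helper, not part of randomForest, and it is not claimed to be faster. It also differs from na.roughfix() in at least one respect: ties between factor levels go to the first level here, whereas the package documentation says ties are broken at random.

```r
# Hypothetical helper: fill NAs column by column in a data frame
rough.fix <- function(df) {
  fix.column <- function(col) {
    if (!anyNA(col)) return(col)                 # nothing to do
    if (is.numeric(col)) {
      col[is.na(col)] <- median(col, na.rm = TRUE)
    } else if (is.factor(col)) {
      counts <- table(col)                       # NA is excluded by default
      col[is.na(col)] <- names(counts)[which.max(counts)]
    } else {
      stop("only numeric and factor columns are handled")
    }
    col
  }
  df[] <- lapply(df, fix.column)                 # keeps the data frame structure
  df
}

df <- data.frame(a = c(1, NA, 3, 10), b = factor(c("x", "y", NA, "y")))
rough.fix(df)
```

Wrapping both this and na.roughfix() in system.time() on the same data would give the comparison directly.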
        
                                     Thanks!
                                          Mike
        
        
        
        
        ______________________________________________
        R-help@r-project.org mailing list
        https://stat.ethz.ch/mailman/listinfo/r-help
        PLEASE do read the posting guide
        http://www.R-project.org/posting-guide.html
        and provide commented, minimal, self-contained, reproducible
code.
        
        Notice:  This e-mail message, together with any attachments,
contains
        information of Merck & Co., Inc. (One Merck Drive, Whitehouse
Station,
        New Jersey, USA 08889), and/or its affiliates Direct contact
information
        for affiliates is available at
        http://www.merck.com/contact/contacts.html) that may be
confidential,
        proprietary copyrighted and/or legally privileged. It is
intended solely
        for the use of the individual or entity named on this message.
If you are
        not the intended recipient, and have received this message in
error,
        please notify us immediately by reply e-mail and then delete it
from
        your system.
        
        

