Not that I want to pick on you, but can you turn off the HTML format in your messages? The mailing list balks at that format, and I can't reply in plain text with the previous messages formatted correctly (I had to manually remove the tabbed indents that Outlook added when I switched to plain text).
See reply inline below.

Andy

From: Dimitri Liakhovitski [mailto:ld7...@gmail.com]

Andy, thank you - and sorry for being a bit slow (see my questions below):

On Thu, May 6, 2010 at 8:37 AM, Liaw, Andy <andy_l...@merck.com> wrote:

See reply inline below.

Andy

From: Dimitri Liakhovitski
>
> I have a question about predictor importances in randomForest.
>
> Once I've run randomForest and got my object, I get their importances:
> rfresult$importance
> I also get the "standard errors" of the permutation-based importance
> measure: rfresult$importanceSD
>
> I have 2 questions:
>
> 1. Because I am dealing with regressions, I am getting an importance object
> (rfresult$importance) with two columns, labeled "%IncMSE" (the first column)
> and "IncNodePurity" (the second column). I assume it's the first one that is
> the mean decrease in accuracy due to permutation. Am I correct or am I
> wrong? I am confused because ?randomForest says: "For Regression, the first
> column is the mean decrease in accuracy and the second the mean decrease in
> MSE." - but it is the first column, not the second that has "MSE" in its
> header.

In regression trees, node impurity is measured by MSE, therefore the second measure that averages cumulative reduction in node impurity due to splits by a variable over all trees is labelled as "mean decrease in MSE".

Andy, but it is the FIRST column in $importance (not the SECOND) that is labeled "%IncMSE". The second column is labeled "IncNodePurity". So, I am confused - which one is the mean decrease in accuracy? Or, maybe I should ask again: In a case of regression trees, which of the two columns in $importance contains the predictor importances calculated by randomly permuting values and looking at how much worse the prediction has become? I assume it's the first column (labeled "%IncMSE"). Is this correct?

[AL]: Note I said "reduction in node impurity", which is another way of saying "increase in node purity" 8-).
I should think from the help page for importance() it should be clear which is which. When you permute the values of a variable in OOB data and make predictions, the expectation is that the MSE will increase, especially if the variable has some importance, thus the label "%IncMSE". Why do you need to assume?

> 2. According to this thread (
> http://www.mail-archive.com/r-h...@stat.math.ethz.ch/msg94873.html), the
> overall importance measure is mean(d[i]) / se(d[i]), where se(d[i]) is
> sd(d[i])/sqrt(ntree) (the "standard error").
> So, in order to get at the importance of predictors (and I want to use the
> permutation-based importance) - should I just take the first column of
> rfresult$importance or should I first divide rfresult$importance by
> rfresult$importanceSD - to get something analogous to z-scores and use
> those?

See the "scale" argument in ?importance. The recommended way of extracting components of an object in R is to use the extractor functions when they exist.

Andy, I've run randomForest (for regression) and just wrote: importance = TRUE. Now, I am just looking at $importance (without specifying anything at all, not scale either). So, if I do it that way - then to get the standardized permutation-based importances, should I divide the first column of $importance by $importanceSD - or has it been done by default so that the first column of $importance already contains the standardized importances?

[AL]: As I said, you are recommended to use importance() to extract variable importance. The recommendation is there to avoid confusion like yours. If you want to know what the components in the object give you, compared to what the extractor function returns, you can look inside the extractor function to find out for yourself. Really, I'm not trying to be difficult, but there are very good reasons for not accessing the components directly when extractor functions exist.
If the underlying components are somehow changed in the future, only the extractor functions are guaranteed to give you the "right thing". I added the extractor function for importance measures precisely because the way they are computed changed.

Thank you!
Dimitri

> Thank you very much!
>
> --
> Dimitri Liakhovitski
> Ninah.com
> dimitri.liakhovit...@ninah.com
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Notice: This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station, New Jersey, USA 08889), and/or its affiliates Direct contact information for affiliates is available at http://www.merck.com/contact/contacts.html) that may be confidential, proprietary copyrighted and/or legally privileged. It is intended solely for the use of the individual or entity named on this message. If you are not the intended recipient, and have received this message in error, please notify us immediately by reply e-mail and then delete it from your system.
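
For anyone reading this thread later, here is a minimal sketch in R of the comparison being discussed: what importance() returns with and without scaling, next to the raw $importance and $importanceSD components. The mtcars data set, the seed, ntree = 500 and all variable names are illustrative assumptions, not anything from the thread. It assumes the randomForest package is installed. The equalities are what I would expect from the discussion above, not a guarantee across package versions.

```r
# Sketch only: mtcars, the seed and the variable names are illustrative
# assumptions; they do not come from the thread.
library(randomForest)

set.seed(42)
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 500)

# Recommended route: the extractor function.
# type = 1 selects the permutation-based measure ("%IncMSE" for regression).
scaled   <- importance(rf, type = 1, scale = TRUE)   # divided by its standard error
unscaled <- importance(rf, type = 1, scale = FALSE)  # raw mean increase in MSE

# Direct component access, shown only to compare against the extractor.
raw_col <- rf$importance[, "%IncMSE"]
manual  <- raw_col / rf$importanceSD

# Does the raw component match the unscaled extractor output, and does
# dividing by importanceSD reproduce the scaled output?
same_unscaled <- isTRUE(all.equal(unname(unscaled[, 1]), unname(raw_col)))
same_scaled   <- isTRUE(all.equal(unname(scaled[, 1]), unname(manual)))

print(c(unscaled_matches = same_unscaled, scaled_matches = same_scaled))
```

If both comparisons come out TRUE, the first column of $importance holds the unscaled permutation importance and scale = TRUE (the default in importance()) performs the division by the standard error. Going through importance() rather than the components keeps the code working if the internal layout changes.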