Andy, thank you - and sorry for being a bit slow (see my questions below):

On Thu, May 6, 2010 at 8:37 AM, Liaw, Andy <andy_l...@merck.com> wrote:
> See reply inline below.
>
> Andy
>
> From: Dimitri Liakhovitski
> >
> > I have a question about predictor importances in randomForest.
> >
> > Once I've run randomForest and got my object, I get their importances:
> > rfresult$importance
> > I also get the "standard errors" of the permutation-based importance
> > measure: rfresult$importanceSD
> >
> > I have 2 questions:
> >
> > 1. Because I am dealing with regressions, I am getting an importance object
> > (rfresult$importance) with two columns, labeled "%IncMSE" (the first column)
> > and "IncNodePurity" (the second column). I assume it's the first one that is
> > the mean decrease in accuracy due to permutation. Am I correct or am I
> > wrong? I am confused because ?randomForest says: "For Regression, the first
> > column is the mean decrease in accuracy and the second the mean decrease in
> > MSE." - but it is the first column, not the second, that has "MSE" in its
> > header.
>
> In regression trees, node impurity is measured by MSE, therefore the
> second measure that averages cumulative reduction in node impurity due
> to splits by a variable over all trees is labelled as "mean decrease in
> MSE".

Andy, but it is the FIRST column in $importance (not the SECOND) that is
labeled "%IncMSE". The second column is labeled "IncNodePurity". So, I am
confused - which one is the mean decrease in accuracy?
Or, maybe I should ask again: in the case of regression trees, which of the
two columns in $importance contains the predictor importances calculated by
randomly permuting values and looking at how much worse the prediction has
become? I assume it's the first column (labeled "%IncMSE"). Is this correct?

> > 2. According to this thread (
> > http://www.mail-archive.com/r-h...@stat.math.ethz.ch/msg94873.html), the
> > overall importance measure is mean(d[i]) / se(d[i]), where se(d[i]) is
> > sd(d[i])/sqrt(ntree) (the "standard error").
> > So, in order to get at the importance of predictors (and I want to use the
> > permutation-based importance) - should I just take the first column of
> > rfresult$importance or should I first divide rfresult$importance by
> > rfresult$importanceSD - to get something analogous to z-scores and use
> > those?
>
> See the "scale" argument in ?importance. The recommended way of
> extracting components of an object in R is to use the extractor
> functions when they exist.

Andy, I've run randomForest (for regression) and just wrote:
importance = TRUE. Now, I am just looking at $importance (without specifying
anything at all, not scale either).
So, if I do it that way - then to get the standardized permutation-based
importances, should I divide the first column of $importance by
$importanceSD - or has it been done by default so that the first column of
$importance already contains the standardized importances?

Thank you!
Dimitri

> > Thank you very much!
> >
> > --
> > Dimitri Liakhovitski
> > Ninah.com
> > dimitri.liakhovit...@ninah.com
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
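The question about $importance versus the "scale" argument can be checked directly on a small fit. What follows is only a sketch under assumptions: the mtcars data, the seed and ntree = 500 are illustrative choices that are not part of the thread. It relies on the documented $importance and $importanceSD components and on the importance() extractor with its type and scale arguments. It compares the stored first column with the extractor's output, and the manual division by $importanceSD with the scaled output.

```r
# Sketch only: mtcars, the seed, and ntree = 500 are illustrative assumptions.
library(randomForest)

set.seed(42)
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 500)

# Permutation-based measure as stored in the fitted object (first column).
raw <- rf$importance[, "%IncMSE"]

# Manual standardization: stored value divided by its "standard error".
manual <- raw / rf$importanceSD

# The same measure through the extractor, with and without scaling.
scaled   <- importance(rf, type = 1, scale = TRUE)[, 1]
unscaled <- importance(rf, type = 1, scale = FALSE)[, 1]

# If the first check is TRUE, $importance holds the unscaled values;
# if the second is TRUE, the extractor's scaling is the same division.
print(all.equal(unname(raw), unname(unscaled)))
print(all.equal(unname(manual), unname(scaled)))
```

If both comparisons come back TRUE on your own object as well, then the first column of $importance is the raw permutation measure, and importance(rfresult, type = 1) already returns the version divided by $importanceSD, so no manual division is needed when using the extractor.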