Here is a great response I got from SO:

There is an important difference between the two importance measures: MeanDecreaseAccuracy is calculated using out-of-bag (OOB) data, while MeanDecreaseGini is not. For each tree, MeanDecreaseAccuracy is calculated on the observations that were not used to form that particular tree. In contrast, MeanDecreaseGini is a summary of how impure the leaf nodes of a tree are, and it is calculated on the same data that was used to fit the trees.
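[A minimal sketch of where each measure comes from in randomForest, using the built-in iris data; this example is mine, not part of the answer:

library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
# type = 1: MeanDecreaseAccuracy, the drop in accuracy when a predictor
# is permuted, measured on each tree's out-of-bag observations
importance(fit, type = 1)
# type = 2: MeanDecreaseGini, the total decrease in node impurity from
# splits on that predictor, measured on the in-bag (training) data
importance(fit, type = 2)
]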
When you bootstrap data, you are creating multiple copies of the same observations. The same observation can therefore be split into two copies, one used to form a tree and one treated as OOB and used to calculate the accuracy measure. As a result, data that randomForest thinks is OOB for MeanDecreaseAccuracy is not necessarily truly OOB in your bootstrap sample, which makes the estimate of MeanDecreaseAccuracy overly optimistic in the bootstrap iterations. The Gini index is immune to this, because it does not rely on evaluating importance on observations different from those used to fit the trees.

I suspect what you are trying to do is use the bootstrap to generate inference (p-values/confidence intervals) indicating which variables are "important" in the sense that they are actually predictive of your outcome. The bootstrap is not appropriate in this context, because Random Forests expects that OOB data is truly OOB, and this matters for building the forest in the first place. In general, the bootstrap is not universally applicable: it is only useful where the parameter you are estimating can be shown to have nice asymptotic properties and to be insensitive to "ties" in the data. A procedure like Random Forest, which relies on the availability of OOB data, is necessarily sensitive to ties.

You may want to look at the caret package in R, which runs random forest (or one of many other algorithms) inside a cross-validation loop to determine which variables are consistently important. See:
http://cran.open-source-solution.org/web/packages/caret/vignettes/caretSelection.pdf
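[To see the duplication problem concretely, here is a minimal sketch, mine and not from the thread, of how much of a bootstrap resample consists of repeated rows:

set.seed(42)
N <- 1000
idx <- sample(1:N, N, replace = TRUE)  # one bootstrap resample of row indices
length(unique(idx)) / N  # about 0.632: only ~63% of the rows are distinct
sum(duplicated(idx))     # the rest are duplicate copies of already-drawn rows
# randomForest then resamples again inside each tree, so a row it holds
# out as "OOB" can have an identical twin in-bag; the permutation
# importance is partly evaluated on data the trees have effectively seen,
# which produces the positive bias observed in the quoted code below.

And a minimal sketch of the cross-validated variable selection the answer recommends, using caret's rfe() with its bundled random-forest functions and the poster's mydata from the code quoted below; the fold count and subset sizes are illustrative choices, not prescriptions:

library(caret)
library(randomForest)
ctrl <- rfeControl(functions = rfFuncs,  # caret's random-forest wrappers
                   method = "cv",        # judge importance inside a CV loop
                   number = 10)          # 10-fold cross-validation
set.seed(123)
sel <- rfe(x = mydata[, c("a", "b", "c", "d", "e", "f")],
           y = mydata$y,
           sizes = 1:6,  # candidate predictor-subset sizes to evaluate
           rfeControl = ctrl)
sel               # resampling profile across subset sizes
predictors(sel)   # the variables selected as consistently important
]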
On Tue, Jan 28, 2014 at 8:54 AM, Dimitri Liakhovitski <
dimitri.liakhovit...@gmail.com> wrote:

> Thank you, Bert. I'll definitely ask there.
> In the meantime I just wanted to make sure that my R code (my function
> for the bootstrap and the bootstrap run) is correct, and that my abnormal
> bootstrap results are not a product of erroneous code.
> Thank you!
>
> On Mon, Jan 27, 2014 at 7:09 PM, Bert Gunter <gunter.ber...@gene.com> wrote:
>
>> I **think** this kind of methodological issue might be better at SO
>> (stats.stackexchange.com). It's not really about R programming, which
>> is the main focus of this list. And yes, I know they do intersect.
>> Nevertheless...
>>
>> Cheers,
>> Bert
>>
>> Bert Gunter
>> Genentech Nonclinical Biostatistics
>> (650) 467-7374
>>
>> "Data is not information. Information is not knowledge. And knowledge
>> is certainly not wisdom."
>> H. Gilbert Welch
>>
>> On Mon, Jan 27, 2014 at 3:47 PM, Dimitri Liakhovitski
>> <dimitri.liakhovit...@gmail.com> wrote:
>> > Hello!
>> > Below, I:
>> > 1. Create a data set with a bunch of factors. All of them are
>> > predictors, and 'y' is the dependent variable.
>> > 2. Run a classification Random Forests fit with predictor importance
>> > and look at two measures of importance: MeanDecreaseAccuracy and
>> > MeanDecreaseGini.
>> > 3. Run two bootstrap runs, one for each of the two Random Forests
>> > importance measures mentioned above.
>> >
>> > Question: Could anyone please explain why I am getting such a huge
>> > positive bias across the board (for all predictors) for
>> > MeanDecreaseAccuracy?
>> >
>> > Thanks a lot!
>> > Dimitri
>> >
>> > #-------------------------------------------------------------
>> > # Creating a data set:
>> > #-------------------------------------------------------------
>> >
>> > N <- 1000
>> > myset1 <- c(1, 2, 3, 4, 5)
>> > probs1a <- c(.05, .10, .15, .40, .30)
>> > probs1b <- c(.05, .15, .10, .30, .40)
>> > probs1c <- c(.05, .05, .10, .15, .65)
>> > myset2 <- c(1, 2, 3, 4, 5, 6, 7)
>> > probs2a <- c(.02, .03, .10, .15, .20, .30, .20)
>> > probs2b <- c(.02, .03, .10, .15, .20, .20, .30)
>> > probs2c <- c(.02, .03, .10, .10, .10, .25, .40)
>> > myset.y <- c(1, 2)
>> > probs.y <- c(.65, .30)  # sample() rescales weights that do not sum to 1
>> >
>> > set.seed(1)
>> > y <- as.factor(sample(myset.y, N, replace = TRUE, probs.y))
>> > set.seed(2)
>> > a <- as.factor(sample(myset1, N, replace = TRUE, probs1a))
>> > set.seed(3)
>> > b <- as.factor(sample(myset1, N, replace = TRUE, probs1b))
>> > set.seed(4)
>> > c <- as.factor(sample(myset1, N, replace = TRUE, probs1c))
>> > set.seed(5)
>> > d <- as.factor(sample(myset2, N, replace = TRUE, probs2a))
>> > set.seed(6)
>> > e <- as.factor(sample(myset2, N, replace = TRUE, probs2b))
>> > set.seed(7)
>> > f <- as.factor(sample(myset2, N, replace = TRUE, probs2c))
>> >
>> > mydata <- data.frame(a, b, c, d, e, f, y)
>> >
>> > #-------------------------------------------------------------
>> > # Single Random Forests run with predictor importance.
>> > #-------------------------------------------------------------
>> >
>> > library(randomForest)
>> > set.seed(123)
>> > rf1 <- randomForest(y ~ ., data = mydata, importance = TRUE)
>> > importance(rf1)[, 3:4]
>> >
>> > #-------------------------------------------------------------
>> > # Bootstrapping run
>> > #-------------------------------------------------------------
>> >
>> > library(boot)
>> >
>> > ### Defining two functions to be used for bootstrapping:
>> >
>> > # myrf3 returns MeanDecreaseAccuracy:
>> > myrf3 <- function(usedata, idx) {
>> >   set.seed(123)
>> >   out <- randomForest(y ~ ., data = usedata[idx, ], importance = TRUE)
>> >   return(importance(out)[, 3])
>> > }
>> >
>> > # myrf4 returns MeanDecreaseGini:
>> > myrf4 <- function(usedata, idx) {
>> >   set.seed(123)
>> >   out <- randomForest(y ~ ., data = usedata[idx, ], importance = TRUE)
>> >   return(importance(out)[, 4])
>> > }
>> >
>> > ### Two bootstrap runs:
>> > rfboot3 <- boot(mydata, myrf3, R = 10)
>> > rfboot4 <- boot(mydata, myrf4, R = 10)
>> >
>> > ### Results
>> > rfboot3  # for MeanDecreaseAccuracy
>> > colMeans(rfboot3$t) - importance(rf1)[, 3]
>> >
>> > rfboot4  # for MeanDecreaseGini
>> > colMeans(rfboot4$t) - importance(rf1)[, 4]
>> >
>> > --
>> > Dimitri Liakhovitski
>
> --
> Dimitri Liakhovitski

--
Dimitri Liakhovitski