Here is a great response I got from SO:

There is an important difference between the two importance measures: MeanDecreaseAccuracy is calculated using out-of-bag (OOB) data, while MeanDecreaseGini is not. For each tree, MeanDecreaseAccuracy is calculated on the observations that were not used to form that particular tree. In contrast, MeanDecreaseGini is a summary of how impure the leaf nodes of a tree are, and it is calculated on the same data that was used to fit the trees.
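[A minimal sketch of where each measure comes from in randomForest, using the built-in iris data; this example is mine, not part of the answer:

library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
# type = 1: MeanDecreaseAccuracy, the drop in accuracy when a predictor
# is permuted, measured on each tree's out-of-bag observations
importance(fit, type = 1)
# type = 2: MeanDecreaseGini, the total decrease in node impurity from
# splits on that predictor, measured on the in-bag (training) data
importance(fit, type = 2)
]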
When you bootstrap data, you are creating multiple copies of the same observations. The same observation can therefore be split into two copies, one used to form a tree and one treated as OOB and used to calculate the accuracy measure. As a result, data that randomForest thinks is OOB for MeanDecreaseAccuracy is not necessarily truly OOB in your bootstrap sample, which makes the estimate of MeanDecreaseAccuracy overly optimistic in the bootstrap iterations. The Gini index is immune to this, because it does not rely on evaluating importance on observations different from those used to fit the trees.

I suspect what you are trying to do is use the bootstrap to generate inference (p-values/confidence intervals) indicating which variables are "important" in the sense that they are actually predictive of your outcome. The bootstrap is not appropriate in this context, because Random Forests expects that OOB data is truly OOB, and this matters for building the forest in the first place. In general, the bootstrap is not universally applicable: it is only useful where the parameter you are estimating can be shown to have nice asymptotic properties and to be insensitive to "ties" in the data. A procedure like Random Forest, which relies on the availability of OOB data, is necessarily sensitive to ties.

You may want to look at the caret package in R, which runs random forest (or one of many other algorithms) inside a cross-validation loop to determine which variables are consistently important. See:
http://cran.open-source-solution.org/web/packages/caret/vignettes/caretSelection.pdf
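[To see the duplication problem concretely, here is a minimal sketch, mine and not from the thread, of how much of a bootstrap resample consists of repeated rows:

set.seed(42)
N <- 1000
idx <- sample(1:N, N, replace = TRUE)  # one bootstrap resample of row indices
length(unique(idx)) / N  # about 0.632: only ~63% of the rows are distinct
sum(duplicated(idx))     # the rest are duplicate copies of already-drawn rows
# randomForest then resamples again inside each tree, so a row it holds
# out as "OOB" can have an identical twin in-bag; the permutation
# importance is partly evaluated on data the trees have effectively seen,
# which produces the positive bias observed in the quoted code below.

And a minimal sketch of the cross-validated variable selection the answer recommends, using caret's rfe() with its bundled random-forest functions and the poster's mydata from the code quoted below; the fold count and subset sizes are illustrative choices, not prescriptions:

library(caret)
library(randomForest)
ctrl <- rfeControl(functions = rfFuncs,  # caret's random-forest wrappers
                   method = "cv",        # judge importance inside a CV loop
                   number = 10)          # 10-fold cross-validation
set.seed(123)
sel <- rfe(x = mydata[, c("a", "b", "c", "d", "e", "f")],
           y = mydata$y,
           sizes = 1:6,  # candidate predictor-subset sizes to evaluate
           rfeControl = ctrl)
sel               # resampling profile across subset sizes
predictors(sel)   # the variables selected as consistently important
]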
On Tue, Jan 28, 2014 at 8:54 AM, Dimitri Liakhovitski <
dimitri.liakhovit...@gmail.com> wrote:

> Thank you, Bert. I'll definitely ask there.
> In the meantime I just wanted to make sure that my R code (my function
> for the bootstrap and the bootstrap run) is correct, and that my abnormal
> bootstrap results are not a product of erroneous code.
> Thank you!
>
> On Mon, Jan 27, 2014 at 7:09 PM, Bert Gunter <gunter.ber...@gene.com> wrote:
>
>> I **think** this kind of methodological issue might be better at SO
>> (stats.stackexchange.com). It's not really about R programming, which
>> is the main focus of this list. And yes, I know they do intersect.
>> Nevertheless...
>>
>> Cheers,
>> Bert
>>
>> Bert Gunter
>> Genentech Nonclinical Biostatistics
>> (650) 467-7374
>>
>> "Data is not information. Information is not knowledge. And knowledge
>> is certainly not wisdom."
>> H. Gilbert Welch
>>
>> On Mon, Jan 27, 2014 at 3:47 PM, Dimitri Liakhovitski
>> <dimitri.liakhovit...@gmail.com> wrote:
>> > Hello!
>> > Below, I:
>> > 1. Create a data set with a bunch of factors. All of them are
>> > predictors, and 'y' is the dependent variable.
>> > 2. Run a classification Random Forests fit with predictor importance
>> > and look at two measures of importance: MeanDecreaseAccuracy and
>> > MeanDecreaseGini.
>> > 3. Run two bootstrap runs, one for each of the two Random Forests
>> > importance measures mentioned above.
>> >
>> > Question: Could anyone please explain why I am getting such a huge
>> > positive bias across the board (for all predictors) for
>> > MeanDecreaseAccuracy?
>> >
>> > Thanks a lot!
>> > Dimitri
>> >
>> > #-------------------------------------------------------------
>> > # Creating a data set:
>> > #-------------------------------------------------------------
>> >
>> > N <- 1000
>> > myset1 <- c(1, 2, 3, 4, 5)
>> > probs1a <- c(.05, .10, .15, .40, .30)
>> > probs1b <- c(.05, .15, .10, .30, .40)
>> > probs1c <- c(.05, .05, .10, .15, .65)
>> > myset2 <- c(1, 2, 3, 4, 5, 6, 7)
>> > probs2a <- c(.02, .03, .10, .15, .20, .30, .20)
>> > probs2b <- c(.02, .03, .10, .15, .20, .20, .30)
>> > probs2c <- c(.02, .03, .10, .10, .10, .25, .40)
>> > myset.y <- c(1, 2)
>> > probs.y <- c(.65, .30)  # sample() rescales weights that do not sum to 1
>> >
>> > set.seed(1)
>> > y <- as.factor(sample(myset.y, N, replace = TRUE, probs.y))
>> > set.seed(2)
>> > a <- as.factor(sample(myset1, N, replace = TRUE, probs1a))
>> > set.seed(3)
>> > b <- as.factor(sample(myset1, N, replace = TRUE, probs1b))
>> > set.seed(4)
>> > c <- as.factor(sample(myset1, N, replace = TRUE, probs1c))
>> > set.seed(5)
>> > d <- as.factor(sample(myset2, N, replace = TRUE, probs2a))
>> > set.seed(6)
>> > e <- as.factor(sample(myset2, N, replace = TRUE, probs2b))
>> > set.seed(7)
>> > f <- as.factor(sample(myset2, N, replace = TRUE, probs2c))
>> >
>> > mydata <- data.frame(a, b, c, d, e, f, y)
>> >
>> > #-------------------------------------------------------------
>> > # Single Random Forests run with predictor importance.
>> > #-------------------------------------------------------------
>> >
>> > library(randomForest)
>> > set.seed(123)
>> > rf1 <- randomForest(y ~ ., data = mydata, importance = TRUE)
>> > importance(rf1)[, 3:4]
>> >
>> > #-------------------------------------------------------------
>> > # Bootstrapping run
>> > #-------------------------------------------------------------
>> >
>> > library(boot)
>> >
>> > ### Defining two functions to be used for bootstrapping:
>> >
>> > # myrf3 returns MeanDecreaseAccuracy:
>> > myrf3 <- function(usedata, idx) {
>> >   set.seed(123)
>> >   out <- randomForest(y ~ ., data = usedata[idx, ], importance = TRUE)
>> >   return(importance(out)[, 3])
>> > }
>> >
>> > # myrf4 returns MeanDecreaseGini:
>> > myrf4 <- function(usedata, idx) {
>> >   set.seed(123)
>> >   out <- randomForest(y ~ ., data = usedata[idx, ], importance = TRUE)
>> >   return(importance(out)[, 4])
>> > }
>> >
>> > ### Two bootstrap runs:
>> > rfboot3 <- boot(mydata, myrf3, R = 10)
>> > rfboot4 <- boot(mydata, myrf4, R = 10)
>> >
>> > ### Results
>> > rfboot3  # for MeanDecreaseAccuracy
>> > colMeans(rfboot3$t) - importance(rf1)[, 3]
>> >
>> > rfboot4  # for MeanDecreaseGini
>> > colMeans(rfboot4$t) - importance(rf1)[, 4]
>> >
>> > --
>> > Dimitri Liakhovitski
>
> --
> Dimitri Liakhovitski

--
Dimitri Liakhovitski