Hello!

Below, I:
1. Create a data set with several factors. All of them are predictors, and 'y' is the dependent variable.
2. Run a classification Random Forests run with predictor importance. I look at 2 measures of importance: MeanDecreaseAccuracy and MeanDecreaseGini.
3. Run 2 bootstrap runs, one for each of the 2 Random Forests measures of importance mentioned above.

Question: Could anyone please explain why I am getting such a huge positive bias across the board (for all predictors) for MeanDecreaseAccuracy?

Thanks a lot!
Dimitri

#-------------------------------------------------------------
# Creating a data set:
#-------------------------------------------------------------

N <- 1000

# Levels and sampling probabilities for the 5-level predictors (a, b, c)
myset1  <- c(1, 2, 3, 4, 5)
probs1a <- c(.05, .10, .15, .40, .30)
probs1b <- c(.05, .15, .10, .30, .40)
probs1c <- c(.05, .05, .10, .15, .65)

# Levels and sampling probabilities for the 7-level predictors (d, e, f)
myset2  <- c(1, 2, 3, 4, 5, 6, 7)
probs2a <- c(.02, .03, .10, .15, .20, .30, .20)
probs2b <- c(.02, .03, .10, .15, .20, .20, .30)
probs2c <- c(.02, .03, .10, .10, .10, .25, .40)

# Levels and sampling probabilities for the dependent variable
# (these weights sum to 0.95; sample() does not require them to sum to one)
myset.y <- c(1, 2)
probs.y <- c(.65, .30)

set.seed(1)
y <- as.factor(sample(myset.y, N, replace = TRUE, prob = probs.y))
set.seed(2)
a <- as.factor(sample(myset1, N, replace = TRUE, prob = probs1a))
set.seed(3)
b <- as.factor(sample(myset1, N, replace = TRUE, prob = probs1b))
set.seed(4)
c <- as.factor(sample(myset1, N, replace = TRUE, prob = probs1c))
set.seed(5)
d <- as.factor(sample(myset2, N, replace = TRUE, prob = probs2a))
set.seed(6)
e <- as.factor(sample(myset2, N, replace = TRUE, prob = probs2b))
set.seed(7)
f <- as.factor(sample(myset2, N, replace = TRUE, prob = probs2c))

mydata <- data.frame(a, b, c, d, e, f, y)

#-------------------------------------------------------------
# Single Random Forests run with predictor importance.
#-------------------------------------------------------------

library(randomForest)
set.seed(123)
rf1 <- randomForest(y ~ ., data = mydata, importance = TRUE)
# Columns 3 and 4 are MeanDecreaseAccuracy and MeanDecreaseGini
importance(rf1)[, c(3:4)]

#-------------------------------------------------------------
# Bootstrapping run
#-------------------------------------------------------------

library(boot)

### Defining two functions to be used for bootstrapping:

# myrf3 returns MeanDecreaseAccuracy:
myrf3 <- function(usedata, idx) {
  set.seed(123)
  out <- randomForest(y ~ ., data = usedata[idx, ], importance = TRUE)
  return(importance(out)[, 3])
}

# myrf4 returns MeanDecreaseGini:
myrf4 <- function(usedata, idx) {
  set.seed(123)
  out <- randomForest(y ~ ., data = usedata[idx, ], importance = TRUE)
  return(importance(out)[, 4])
}

### 2 bootstrap runs:
rfboot3 <- boot(mydata, myrf3, R = 10)
rfboot4 <- boot(mydata, myrf4, R = 10)

### Results
rfboot3  # for MeanDecreaseAccuracy
colMeans(rfboot3$t) - importance(rf1)[, 3]

rfboot4  # for MeanDecreaseGini
colMeans(rfboot4$t) - importance(rf1)[, 4]

-- 
Dimitri Liakhovitski