Hello!
Below, I:
1. Create a data set made up of factors. Variables a through f are the
predictors and 'y' is the dependent variable.
2. Fit a classification Random Forest with predictor importance and look at
2 measures of importance: MeanDecreaseAccuracy and MeanDecreaseGini.
3. Do 2 bootstrap runs, one for each of the 2 Random Forests measures of
importance mentioned above.

Question: Could anyone please explain why I am getting such a huge positive
bias across the board (for all predictors) for MeanDecreaseAccuracy?

Thanks a lot!
Dimitri


#----------------------------------------------------------------
# Creating a data set:
#-------------------------------------------------------------

N<-1000
myset1<-c(1,2,3,4,5)
probs1a<-c(.05,.10,.15,.40,.30)
probs1b<-c(.05,.15,.10,.30,.40)
probs1c<-c(.05,.05,.10,.15,.65)
myset2<-c(1,2,3,4,5,6,7)
probs2a<-c(.02,.03,.10,.15,.20,.30,.20)
probs2b<-c(.02,.03,.10,.15,.20,.20,.30)
probs2c<-c(.02,.03,.10,.10,.10,.25,.40)
myset.y<-c(1,2)
probs.y<-c(.65,.30)   # sums to 0.95; sample() rescales the weights to sum to 1

set.seed(1)
y<-as.factor(sample(myset.y,N,replace=TRUE,probs.y))
set.seed(2)
a<-as.factor(sample(myset1, N, replace = TRUE,probs1a))
set.seed(3)
b<-as.factor(sample(myset1, N, replace = TRUE,probs1b))
set.seed(4)
c<-as.factor(sample(myset1, N, replace = TRUE,probs1c))
set.seed(5)
d<-as.factor(sample(myset2, N, replace = TRUE,probs2a))
set.seed(6)
e<-as.factor(sample(myset2, N, replace = TRUE,probs2b))
set.seed(7)
f<-as.factor(sample(myset2, N, replace = TRUE,probs2c))

mydata<-data.frame(a,b,c,d,e,f,y)
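
A quick sanity check on the class weights, as a base-R sketch: probs.y as written sums to 0.95, and sample() treats prob as weights that need not sum to one. The snippet redraws y with the same seed and compares the observed class shares with the rescaled weights. The names y_chk, observed and expected are only for illustration.

```r
# Redraw the dependent variable with the same seed and weights as above
set.seed(1)
N <- 1000
probs.y <- c(.65, .30)                      # sums to 0.95, not 1
y_chk <- sample(c(1, 2), N, replace = TRUE, prob = probs.y)

# Observed class shares vs. the weights rescaled to sum to 1
observed <- as.vector(prop.table(table(y_chk)))
expected <- probs.y / sum(probs.y)
round(rbind(observed, expected), 3)
```

So the classes of y are split roughly 68/32 rather than 65/30.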


#-------------------------------------------------------------
# Single Random Forests run with predictor importance.
#-------------------------------------------------------------

library(randomForest)
set.seed(123)
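rf1<-randomForest(y~.,data=mydata,importance=TRUE)
# Columns 3 and 4 of the importance matrix, selected by name:
importance(rf1)[,c("MeanDecreaseAccuracy","MeanDecreaseGini")]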

#-------------------------------------------------------------
# Bootstrapping run
#-------------------------------------------------------------

library(boot)

### Defining two functions to be used for bootstrapping:

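# myrf3 returns MeanDecreaseAccuracy for the resampled rows idx:
myrf3<-function(usedata,idx){
  set.seed(123)   # same forest seed in every replicate
  out<-randomForest(y~.,data=usedata[idx,],importance=TRUE)
  return(importance(out)[,"MeanDecreaseAccuracy"])
}

# myrf4 returns MeanDecreaseGini for the resampled rows idx:
myrf4<-function(usedata,idx){
  set.seed(123)   # same forest seed in every replicate
  out<-randomForest(y~.,data=usedata[idx,],importance=TRUE)
  return(importance(out)[,"MeanDecreaseGini"])
}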

### 2 bootstrap runs:
rfboot3<-boot(mydata,myrf3,R=10)
rfboot4<-boot(mydata,myrf4,R=10)

### Results
rfboot3   # for MeanDecreaseAccuracy
# Mean of the bootstrap replicates minus the single-run value:
colMeans(rfboot3$t)-importance(rf1)[,"MeanDecreaseAccuracy"]

rfboot4   # for MeanDecreaseGini
colMeans(rfboot4$t)-importance(rf1)[,"MeanDecreaseGini"]
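
One thing that might be worth checking here (a rough sketch, not a confirmed explanation): by default randomForest draws its own with-replacement sample for each tree from usedata[idx,], and idx already contains repeated rows. So a row that is out-of-bag for a tree can have an identical copy in-bag. The base-R snippet below only counts how often that happens for one simulated tree. It does not call randomForest, and the names oob_copy_rate, inbag_pos, oob_pos, leak_original and leak_bootstrap are made up for illustration.

```r
# Sketch: how often does an out-of-bag row have a copy of the same original row in-bag?
# Assumes randomForest-style sampling: n rows drawn with replacement for each tree.
set.seed(1)
N <- 1000

oob_copy_rate <- function(idx) {
  n <- length(idx)
  inbag_pos <- sample.int(n, n, replace = TRUE)  # rows of usedata[idx,] used to grow one tree
  oob_pos   <- setdiff(seq_len(n), inbag_pos)    # rows of usedata[idx,] left out of that tree
  # Share of OOB rows whose original row number also occurs among the in-bag rows
  mean(idx[oob_pos] %in% idx[inbag_pos])
}

leak_original  <- oob_copy_rate(seq_len(N))                       # single run: each row appears once
leak_bootstrap <- oob_copy_rate(sample.int(N, N, replace = TRUE)) # a boot()-style resample

c(original = leak_original, bootstrap = leak_bootstrap)
```

If the bootstrap rate is well above zero, the out-of-bag accuracy in each bootstrap replicate could be optimistic, and permuting a predictor would then appear to cost more accuracy than it does in the single run. I have not verified that this accounts for the whole difference.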

-- 
Dimitri Liakhovitski

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
