Hi,

There is an excellent article at http://www.biomedcentral.com/1471-2105/9/307 
by Strobl et al. describing variable importance in random forests.  Does 
anyone have any suggestions (besides imputation or removal of cases) for how to 
deal with data that have missing values in the predictor variables?

Below is an excerpt of some code referenced in the article.  I have commented 
out one line and added one additional line.  The code runs beautifully if only 
complete cases are included.  When cases with missing predictor values are 
kept, the forest is still built, but the code breaks at the variable 
importance step.

# From http://www.biomedcentral.com/content/supplementary/1471-2105-8-25-S1.R

require("party")

arabidopsis_url <- 
"http://www.biomedcentral.com/content/supplementary/1471-2105-5-132-S1.txt";

arabidopsis <- read.table(arabidopsis_url, header = TRUE,
                          sep = " ", na.strings = "X")

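# Original line (keeps complete cases only), commented out:
#arabidopsis <- subset(arabidopsis, complete.cases(arabidopsis))
# Added line: drop only the rows where the response 'edit' is missing
arabidopsis <- subset(arabidopsis, !is.na(edit))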

arabidopsis <- arabidopsis[, !(names(arabidopsis) %in% c("X0", "loc"))]

my_cforest_control <- cforest_control(teststat = "quad",
    testtype = "Univ", mincriterion = 0, ntree = 50, mtry = 3,
    replace = TRUE)

my_cforest <- cforest(edit ~ ., data = arabidopsis,
                      controls = my_cforest_control)
varimp_cforest <-  varimp(my_cforest)
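
In case it helps to see where the gaps are, here is a minimal sketch of how 
the missing values per predictor can be counted using base R only.  It runs 
on a made-up toy data frame: the names toy, x1 and x2 and all of the values 
are hypothetical and are not taken from the arabidopsis data.

```r
# Toy stand-in for the real data (hypothetical column names and values).
toy <- data.frame(
  edit = factor(c(1, 0, 1, 0, 1)),
  x1   = c("A", NA, "C", "G", NA),
  x2   = c(0.2, 0.5, NA, 0.1, 0.9)
)

# Number of missing values in each predictor (response column excluded).
na_per_predictor <- colSums(is.na(toy[, names(toy) != "edit"]))
print(na_per_predictor)

# Number of rows that a complete-cases restriction would throw away.
n_dropped <- sum(!complete.cases(toy))
print(n_dropped)
```

Running the same two checks on arabidopsis (after the subset step above) 
should show how many cases would be lost by the complete-cases approach.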

By the way, the same issue arises for the randomForest package.

Does anyone have any suggestions?  I'm more interested in the variable 
importance than the tree per se.

Thanks,

Jason

Jason Jones, PhD
Medical Informatics
[EMAIL PROTECTED]
801.707.6898
