Dear all R experts, I have a question about using a held-out test set to assess results estimated from a classification tree model. I annotated what each line does in the R code chunk below. Basically, I split the data, named usedta, 70% vs. 30%, with the training set getting 70% and the test set 30% of the original cases. After splitting the data, I first fit a classification tree on the training set, and then evaluate the fitted tree on the test set. It turns out that if I use no predictors at all and make predictions by simply betting on the majority class of the zero-one coding of the binary response variable, I do better on the test set than the classification tree does. What would this imply, and what could cause this problem? Does it mean that a classification tree is not an appropriate method for my data, or is it because I have too few predictors? Thanks a lot!
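Not from the original post, but the comparison described above can be made concrete with a small base-R sketch. The data here are synthetic (a hypothetical imbalanced binary outcome standing in for h2, with a placeholder model that guesses at random); the point is that on an imbalanced outcome, the majority-class "no-information" accuracy is the bar any model must clear.

```r
# Sketch (synthetic data, not the poster's usedta): compare a model's
# test-set accuracy against the majority-class baseline.
set.seed(1)
# Imbalanced binary outcome, roughly 85% "goodHlth"
y.test <- factor(sample(c("goodHlth", "poorHlth"), 676,
                        replace = TRUE, prob = c(0.85, 0.15)))
# 'pred' stands in for a model's test-set predictions
# (here just random guesses, as a placeholder)
pred <- factor(sample(levels(y.test), 676, replace = TRUE),
               levels = levels(y.test))
# Accuracy of the placeholder model
model.acc <- mean(pred == y.test)
# No-information rate: always predict the most frequent class
baseline.acc <- max(table(y.test)) / length(y.test)
# A model is only informative if model.acc exceeds baseline.acc
c(model = model.acc, baseline = baseline.acc)
```

With a roughly 85/15 split, the baseline accuracy is about 0.85, which is why a tree that overfits the training data can easily fall below it out of sample.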
Jun Xu, PhD
Professor
Department of Sociology
Ball State University
Muncie, IN 47306
USA

Using the estimates, I get the following rate of correct predictions on the test set. Equivalently, the misclassification error rate is 1 - 0.837 = 0.163.

> (tab[1,1] + tab[2,2]) / sum(tab)
[1] 0.837

Without any predictors, I get the following rate by betting on the majority class every time, again using data from the test set. In this case, the misclassification error rate is 1 - 0.85 = 0.15.

> table(h2.test)
h2.test
1poorHlth 0goodHlth
      101       575
> 575/(575+101)
[1] 0.85

R Code Chunk

# load the tree package for classification trees
library(tree)
# set the seed for the random number generator for replication
set.seed(47306)
# do the 7/3 split with 70% of the cases allotted to the training set
# AND create the training set identifier
class.train <- sample(1:nrow(usedta), nrow(usedta)*0.7)
# create the test set indicator (negative indices drop the training cases)
class.test <- (-class.train)
# create a vector for the binary response variable from the test set
# for future cross-tabulation
h2.test <- usedta$h2[class.test]
# count the training set cases
Ntrain <- length(usedta$h2[class.train])
# fit the classification tree model using the training set;
# h2 is the binary response and the other variables are predictors
tree.h2 <- tree(h2 ~ age + educ + female + white + married + happy,
                data = usedta, subset = class.train,
                control = tree.control(nobs = Ntrain, mindev = 0.003))
# summary results
summary(tree.h2)
# make predictions of h2 using the test set
tree.h2.pred <- predict(tree.h2, usedta[class.test, ], type = "class")
# cross-tabulate the predictions against the test-set response
tab <- table(tree.h2.pred, h2.test)
tab
# calculate the proportion correctly predicted in the test set
(tab[1,1] + tab[2,2]) / sum(tab)
# calculate the proportion correctly predicted using the naive approach
# of betting on the majority category
table(h2.test)[2] / sum(tab)
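Not part of the original post: since the question mentions cross-validation but the code above uses a single train/test split, here is a base-R sketch of k-fold cross-validation of the misclassification error, demonstrated on the naive majority-class rule. The data frame `toy` and its column are hypothetical stand-ins for usedta and h2; the same loop structure would apply to any fitted model.

```r
# Sketch: k-fold cross-validation of misclassification error in base R,
# using a toy data frame in place of the poster's 'usedta'.
set.seed(47306)
n <- 200
toy <- data.frame(h2 = factor(sample(c("goodHlth", "poorHlth"), n,
                                     replace = TRUE, prob = c(0.85, 0.15))))
k <- 5
# randomly assign each case to one of k folds
fold <- sample(rep(1:k, length.out = n))
err <- numeric(k)
for (i in 1:k) {
  train <- toy[fold != i, , drop = FALSE]
  test  <- toy[fold == i, , drop = FALSE]
  # "fit" the majority-class rule on the training folds
  majority <- names(which.max(table(train$h2)))
  # misclassification error on the held-out fold
  err[i] <- mean(test$h2 != majority)
}
# cross-validated error of the naive baseline
mean(err)
```

For the tree itself, the tree package also provides cv.tree() with FUN = prune.misclass, which uses this same idea to choose the tree size; a tree grown with a small mindev such as 0.003 may be large enough to overfit, which is one common reason it underperforms the majority-class baseline on held-out data.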