Hi, I have a question about the rpart command in R. I used seven continuous predictor variables in the model, and the variable "TB122" was chosen for the first split. Looking at the output, however, there are four variables that improve the predicted class membership equally (TB122, TB139, TB144, and TB118); the output is pasted below.
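As background for the question, rpart's default splitting criterion for classification is the Gini index, and the "improve" figures it reports can be reproduced from the class counts as an observation-weighted impurity reduction. A minimal sketch follows; note that the child class counts here are hypothetical (only the parent counts 197/71 and the child sizes 188/80 come from the output in this message):

```r
## Sketch: computing a Gini-based split "improvement" by hand.
## Child class counts are hypothetical, chosen only to match the
## node sizes (188 left, 80 right) in the rpart output below.

gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)              # Gini impurity of a node
}

parent <- c(197, 71)        # class counts at node 1 (from the output)
left   <- c(170, 18)        # hypothetical: the 188 obs sent left
right  <- c(27, 53)         # hypothetical: the 80 obs sent right

## Improvement = impurity reduction, weighted by node sizes
improve <- sum(parent) * gini(parent) -
           sum(left)  * gini(left)   -
           sum(right) * gini(right)
improve
```

Two candidate splits that produce exactly the same partition of the data yield the same improvement, which is how exact ties like the one above can arise; my understanding is that rpart then simply reports the first such variable it evaluated, and the tied splits are interchangeable. The criterion itself can be switched to the information (entropy) measure via `rpart(..., parms = list(split = "information"))`.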
Node number 1: 268 observations,  complexity param=0.6
  predicted class=0  expected loss=0.3
    class counts:   197    71
   probabilities:  0.735  0.265
  left son=2 (188 obs) right son=3 (80 obs)
  Primary splits:
      TB122 < 80  to the left, improve=50, (0 missing)
      TB139 < 90  to the left, improve=50, (0 missing)
      TB144 < 90  to the left, improve=50, (0 missing)
      TB118 < 90  to the left, improve=50, (0 missing)
      TB129 < 100 to the left, improve=40, (0 missing)

I need to know what method R is using to select the best variable for the node. Somewhere I read that the best split = greatest improvement in predictive accuracy = maximum homogeneity of the yes/no groups resulting from the split = reduction of impurity. I also read that the Gini index, chi-square, or G-square can be used to evaluate the level of impurity. For this function in R:

1) Why exactly did R pick TB122 over the other variables, despite the fact that they all had the same level of improvement? Was TB122 chosen for the first node because the groups "TB122 < 80" and "TB122 > 80" were the most homogeneous (i.e., had the least impurity)?

2) If R is using impurity to determine the best splits, which measure (the Gini index, chi-square, or G-square) is R using?

Thanks!
Katie

--
View this message in context: http://n4.nabble.com/rpart-classification-and-regression-trees-CART-tp962680p962680.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.