I'm working with some data from which a client would like to make a decision tree predicting brand preference based on inputs such as price, speed, etc. After running the decision tree analysis using rpart, it appears that this data is not capable of predicting brand preference.
Here's the data set:

BRND      PRI     PROM    FORM    FAMI    DRRE    FREC    MODE    SPED    REVW
Brand 1   0.6989  0.4731  0.7849  0.6989  0.7419  0.6022  0.8817  0.9032  0.6452
Brand 2   0.8621  0.3793  0.8621  0.931   0.7586  0.6897  0.8966  0.9655  0.8276
Brand 3   0.6     0.1     0.6     0.7     0.9     0.7     0.7     0.8     0.6
Brand 4   0.6429  0.25    0.5714  0.5     0.6071  0.5     0.75    0.8214  0.5
Brand 5   0.7586  0.4224  0.7328  0.6638  0.7328  0.6379  0.8621  0.8621  0.6897
Brand 6   0.75    0.0833  0.5833  0.4167  0.5     0.4167  0.75    0.6667  0.5
Brand 7   0.7742  0.4839  0.6129  0.5161  0.8065  0.6452  0.7742  0.9032  0.6129
Brand 8   0.6429  0.2679  0.6964  0.7143  0.875   0.5536  0.8036  0.9464  0.6607
Brand 9   0.575   0.175   0.65    0.55    0.625   0.375   0.825   0.85    0.475
Brand 10  0.8095  0.5238  0.6667  0.6429  0.6667  0.5952  0.8571  0.8095  0.5714
Brand 11  0.6308  0.3     0.6077  0.5846  0.6769  0.5231  0.7462  0.8846  0.6
Brand 12  0.7212  0.3152  0.7152  0.6545  0.6606  0.503   0.8061  0.8909  0.6
Brand 13  0.7419  0.2258  0.6129  0.5806  0.7097  0.6129  0.871   0.9677  0.3226
Brand 14  0.7176  0.2706  0.6353  0.5647  0.6941  0.4471  0.7176  0.9412  0.5176
Brand 15  0.7287  0.3437  0.5995  0.5788  0.8527  0.5478  0.8217  0.8941  0.6227
Brand 16  0.7     0.4     0.6     0.4     1       0.4     0.9     0.9     0.5
Brand 17  0.7193  0.3333  0.6667  0.6667  0.7018  0.5263  0.7719  0.8596  0.7018
Brand 18  0.7778  0.4127  0.6508  0.6349  0.7937  0.6032  0.8571  0.9206  0.619
Brand 19  0.8028  0.2817  0.6197  0.4366  0.7042  0.4366  0.7183  0.9155  0.5634
Brand 20  0.7736  0.2453  0.6226  0.3774  0.5849  0.3019  0.717   0.8679  0.4717
Brand 21  0.8481  0.2152  0.6329  0.4051  0.6329  0.4557  0.6962  0.8481  0.3418
Brand 22  0.75    0.3333  0.6667  0.5     0.6667  0.5833  0.9167  0.9167  0.4167

Here are my R commands:

> library(rpart)
> test.df = read.csv("test.csv")
> head(test.df)
     BRND    PRI   PROM   FORM   FAMI   DRRE   FREC   MODE   SPED   REVW
1 Brand 1 0.6989 0.4731 0.7849 0.6989 0.7419 0.6022 0.8817 0.9032 0.6452
2 Brand 2 0.8621 0.3793 0.8621 0.9310 0.7586 0.6897 0.8966 0.9655 0.8276
3 Brand 3 0.6000 0.1000 0.6000 0.7000 0.9000 0.7000 0.7000 0.8000 0.6000
4 Brand 4 0.6429 0.2500 0.5714 0.5000 0.6071 0.5000 0.7500 0.8214 0.5000
5 Brand 5 0.7586 0.4224 0.7328 0.6638 0.7328 0.6379 0.8621 0.8621 0.6897
6 Brand 6 0.7500 0.0833 0.5833 0.4167 0.5000 0.4167 0.7500 0.6667 0.5000
> testTree = rpart(BRND ~ PRI + PROM + FORM + FAMI + DRRE + FREC + MODE +
+     SPED + REVW, method="class", data=test.df)
> printcp(testTree)

Classification tree:
rpart(formula = BRND ~ PRI + PROM + FORM + FAMI + DRRE + FREC +
    MODE + SPED + REVW, data = test.df, method = "class")

Variables actually used in tree construction:
[1] FORM

Root node error: 21/22 = 0.95455

n= 22

        CP nsplit rel error xerror xstd
1 0.047619      0   1.00000 1.0476    0
2 0.010000      1   0.95238 1.0476    0

I note that only one variable (FORM) was actually used in tree construction. When I plot the tree using:

> plot(testTree)
> text(testTree)

...I get a tree with a single split. It looks to me like I'm doing everything right, and this data is just not capable of predicting brand preference. Am I missing anything?

Thanks very much in advance for any thoughts!

-Vik
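One thing that may be worth checking, offered as a rough sketch rather than a definitive diagnosis: the "Root node error: 21/22" line suggests that every brand appears exactly once, so each class has a single observation. The snippet below rebuilds the posted data inline (instead of reading test.csv) and looks at the class counts and at rpart's default control settings. The names class_counts, root_error, ctrl and n_splits are just illustrative, and BRND ~ . is assumed to be equivalent to the formula above because the data frame has no other columns.

```r
library(rpart)

# Data exactly as posted; brand names are quoted because they contain a space.
test.df <- read.table(header = TRUE, stringsAsFactors = TRUE, text = '
BRND PRI PROM FORM FAMI DRRE FREC MODE SPED REVW
"Brand 1" 0.6989 0.4731 0.7849 0.6989 0.7419 0.6022 0.8817 0.9032 0.6452
"Brand 2" 0.8621 0.3793 0.8621 0.931 0.7586 0.6897 0.8966 0.9655 0.8276
"Brand 3" 0.6 0.1 0.6 0.7 0.9 0.7 0.7 0.8 0.6
"Brand 4" 0.6429 0.25 0.5714 0.5 0.6071 0.5 0.75 0.8214 0.5
"Brand 5" 0.7586 0.4224 0.7328 0.6638 0.7328 0.6379 0.8621 0.8621 0.6897
"Brand 6" 0.75 0.0833 0.5833 0.4167 0.5 0.4167 0.75 0.6667 0.5
"Brand 7" 0.7742 0.4839 0.6129 0.5161 0.8065 0.6452 0.7742 0.9032 0.6129
"Brand 8" 0.6429 0.2679 0.6964 0.7143 0.875 0.5536 0.8036 0.9464 0.6607
"Brand 9" 0.575 0.175 0.65 0.55 0.625 0.375 0.825 0.85 0.475
"Brand 10" 0.8095 0.5238 0.6667 0.6429 0.6667 0.5952 0.8571 0.8095 0.5714
"Brand 11" 0.6308 0.3 0.6077 0.5846 0.6769 0.5231 0.7462 0.8846 0.6
"Brand 12" 0.7212 0.3152 0.7152 0.6545 0.6606 0.503 0.8061 0.8909 0.6
"Brand 13" 0.7419 0.2258 0.6129 0.5806 0.7097 0.6129 0.871 0.9677 0.3226
"Brand 14" 0.7176 0.2706 0.6353 0.5647 0.6941 0.4471 0.7176 0.9412 0.5176
"Brand 15" 0.7287 0.3437 0.5995 0.5788 0.8527 0.5478 0.8217 0.8941 0.6227
"Brand 16" 0.7 0.4 0.6 0.4 1 0.4 0.9 0.9 0.5
"Brand 17" 0.7193 0.3333 0.6667 0.6667 0.7018 0.5263 0.7719 0.8596 0.7018
"Brand 18" 0.7778 0.4127 0.6508 0.6349 0.7937 0.6032 0.8571 0.9206 0.619
"Brand 19" 0.8028 0.2817 0.6197 0.4366 0.7042 0.4366 0.7183 0.9155 0.5634
"Brand 20" 0.7736 0.2453 0.6226 0.3774 0.5849 0.3019 0.717 0.8679 0.4717
"Brand 21" 0.8481 0.2152 0.6329 0.4051 0.6329 0.4557 0.6962 0.8481 0.3418
"Brand 22" 0.75 0.3333 0.6667 0.5 0.6667 0.5833 0.9167 0.9167 0.4167
')

# How many rows does each class (brand) have? Here, one row per brand.
class_counts <- table(test.df$BRND)
max_per_class <- max(class_counts)

# Predicting any single brand at the root gets every other row wrong.
root_error <- (nrow(test.df) - max_per_class) / nrow(test.df)

# rpart's default controls: a node needs minsplit rows before a split is
# attempted, and every leaf must hold at least minbucket rows.
ctrl <- rpart.control()
print(ctrl[c("minsplit", "minbucket", "cp")])

# Refit with the defaults and count the splits that were made.
fit <- rpart(BRND ~ ., data = test.df, method = "class")
n_splits <- max(fit$cptable[, "nsplit"])
print(n_splits)
```

With the defaults (minsplit = 20, minbucket = 7) and only 22 rows, neither child of the first split can reach 20 rows, so at most one split is possible whatever the predictors look like. That may explain the single branch independently of how informative the inputs are. A classifier also seems to need several observations per brand (for example, one row per respondent) before it can learn anything about brand membership.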