Greetings tree and forest coders,

I'm interested in comparing random forests with regression tree / bagged tree models. I'd like to propose a basis for doing this, get feedback, and document it here. I kept it in this thread since that makes sense.
In this case I think it's appropriate to compare the R^2 values as one basic measure. I'm also going to compare mean error (ME), mean absolute error (MAE), and root mean squared error (RMSE). This means that I need estimates from each approach so that I can form residuals. **As I see it, the important details are in how to set up the models so that I have comparable estimates, particularly in how the trees/forests are trained and evaluated.**

For regression/bagging trees, the typical approach for my application is 100 runs of 10-fold CV. In each run all the values are estimated in an out-of-bag sense: each "fold" is estimated while it is withheld from fitting, so the fit is not inflated. The estimates are then averaged over the 100 runs at each point to get an average simulation, and this is used to calculate the residuals and the measures mentioned above. Somewhat more specifically, the steps are:

1. I fit a model.
2. I prune it via inspection.
3. I loop 100 times on xpred.rpart(model,xval=10,cp=cp at bottom of cptable from pruned fit) to generate the 100 runs (bagging is thus performed while holding the cp criterion fixed?).
4. I average these pointwise.
5. I calculate the desired stats/quantities for comparison to other models.

For randomForest, I would want to fit the model in a similar way, i.e. 100 runs of 10-fold CV. I think the 10-fold part is clear; the 100 runs, maybe less so. To get 10-fold OOB estimates, I set replace=FALSE, sampsize=.9*nrow(x). Then I get a randomForest with $predicted being the average OOB estimate over all trees for which each point was OOB. I would assume that each tree is built on a different random 90/10 split of the data set, so the number of runs is really more like the number of trees constructed. If I wanted to be really thorough, I could fit 100 random forests, get the $predicted for each, and then average these pointwise.
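A minimal sketch of the rpart side as described above, under stated assumptions: `mtcars` with `mpg` as the response is only stand-in data, pruning "by inspection" is replaced by picking the cp with the smallest cross-validated error, and residuals are taken as observed minus estimate. None of these choices come from the actual application.

```r
library(rpart)  # recommended package that ships with R

set.seed(1)
dat <- mtcars  # stand-in data (assumption, for illustration only)

# Step 1: fit a deliberately large tree; model = TRUE stores the model frame.
fit <- rpart(mpg ~ ., data = dat, method = "anova", model = TRUE,
             control = rpart.control(cp = 0.001, minsplit = 10))

# Step 2: stand-in for pruning by inspection (assumption): smallest xerror.
cp_keep <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = cp_keep)
cp_fix  <- pruned$cptable[nrow(pruned$cptable), "CP"]  # cp at bottom of cptable

# Step 3: 100 runs of 10-fold CV with cp held fixed. Each call to
# xpred.rpart() draws new random folds and returns one held-out
# prediction per row.
n_runs  <- 100
cv_pred <- replicate(n_runs,
                     as.vector(xpred.rpart(pruned, xval = 10, cp = cp_fix)))

# Step 4: average the runs pointwise.
avg_pred <- rowMeans(cv_pred)

# Step 5: residuals and comparison measures.
res   <- dat$mpg - avg_pred
stats <- c(ME   = mean(res),
           MAE  = mean(abs(res)),
           RMSE = sqrt(mean(res^2)),
           R2   = 1 - sum(res^2) / sum((dat$mpg - mean(dat$mpg))^2))
print(round(stats, 3))
```

Note that each xpred.rpart() run refits the tree on 90% of the rows for every fold, so the rows of `cv_pred` are always held-out predictions; only the fold assignment changes between runs.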
But fitting 100 forests seems like overkill; isn't the lesson of plot.randomForest that the error converges to some limit as the number of trees goes up (at least from what I've seen)?

Thus, my primary concern is the amount of data used for training and cross-validating the model in an out-of-bag sense: can I meaningfully compare 10-fold OOB estimates from xpred.rpart to a random forest fit using 90% of the data as sampsize?

Of secondary concern is the number of bagging trees versus the number of trees in the random forest. As long as the average estimate error is nearing some limit with the number of bagging trees I'm using, I think this is all that matters. So this is more of a methodological difference to be retained, similar to the differences in pruning under bagging and random forests, though I should probably specify similar node sizes for each.

Am I overlooking anything of grave consequence? Any and all thoughts are welcome. If you are aware of any comparisons of rpart and randomForest in the literature for any field (for regression) of which I am ignorant, I would appreciate the tip. I have read over "Newer Classification and Regression Tree Techniques: Bagging and Random Forests for Ecological Prediction" by Prasad, Iverson, and Liaw. I may have missed it, but I did not see a discussion of maintaining consistency in the way the models were trained, though it is a very nice paper overall and contains many interesting approaches and points.

Thanks in advance,
James
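P.S. For the random forest side, a minimal sketch of the setup described above, again with `mtcars` as stand-in data (an assumption for illustration). The values `ntree = 1000` and `nodesize = 5` are arbitrary choices for the sketch, not recommendations.

```r
library(randomForest)  # CRAN package, not part of base R

set.seed(1)
dat <- mtcars  # stand-in data (assumption, for illustration only)
n   <- nrow(dat)

rf <- randomForest(mpg ~ ., data = dat,
                   ntree    = 1000,              # arbitrary for this sketch
                   replace  = FALSE,             # subsample without replacement
                   sampsize = floor(0.9 * n),    # each tree sees 90% of the rows
                   nodesize = 5)                 # minimum terminal node size

# rf$predicted is the OOB estimate for each row, averaged over the trees
# that did not train on that row; rf$oob.times counts those trees.
res <- dat$mpg - rf$predicted
print(round(c(ME = mean(res), MAE = mean(abs(res)),
              RMSE = sqrt(mean(res^2))), 3))
print(summary(rf$oob.times))

# rf$mse is the OOB mean squared error after 1, 2, ..., ntree trees,
# which is the curve plot(rf) draws for a regression forest.
print(tail(rf$mse, 3))
```

With this setup each row is out-of-bag in only about 10% of the trees, so each OOB estimate averages roughly ntree/10 trees. That may be worth keeping in mind when matching the number of trees against 100 xpred.rpart runs, where every row gets exactly one held-out prediction per run.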