Fiona Callaghan asked about using the bootstrap instead of cross-validation in the tree pruning step. It turns out that cross-validation works better than the bootstrap for trees. The issue is a subtle one. The bootstrap can be thought of as two steps:
1. Deduction: Evaluate the behavior of some statistic Z under repeated sampling from the discrete distribution F-hat, i.e., the original data. This gives a direct evaluation of how Z behaves under F-hat.

2. Induction: Assume that (behavior of Z under sampling from F) = (behavior of Z under sampling from F-hat).

It turns out that trees behave differently under discrete distributions than they do under continuous ones, so step 2 fails. Essentially, there are fewer places to split in the discrete case, tree creation is less noisy, and the bootstrap gives an overoptimistic view. I remember Brad Efron giving a talk on this long ago (I was still a student!), so the details are fuzzy; I think that he solved it by sampling from a smoothed version of the empirical CDF.

Terry Therneau
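A minimal R sketch of the two steps, using the median as a stand-in for the statistic Z (in the pruning problem Z would be the tree's error estimate; the median is just for illustration):

    set.seed(1)
    x <- rexp(100)       # observed data; F-hat puts mass 1/n on each point
    B <- 2000
    ## Step 1 (deduction): behavior of Z under repeated sampling from F-hat
    zstar <- replicate(B, median(sample(x, replace = TRUE)))
    ## Step 2 (induction): take the spread under F-hat as the spread under F
    sd(zstar)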
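The "fewer places to split" point is easy to verify: a bootstrap sample of n points contains, on average, only about 1 - 1/e = 63.2% of the distinct data values, so a tree grown on it sees roughly a third fewer candidate split points than a sample from a continuous F would offer:

    set.seed(1)
    n <- 1000
    x <- rnorm(n)        # n distinct values, as under a continuous F
    frac <- replicate(200, length(unique(sample(x, replace = TRUE))) / n)
    mean(frac)           # about 0.632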
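And a rough sketch of the smoothed-bootstrap idea, i.e., sampling from a smoothed version of the empirical CDF by jittering each resampled point with kernel noise. The Gaussian kernel and the bw.nrd0() bandwidth (R's default for density()) are my assumptions for illustration, not necessarily what Efron used:

    smooth_boot <- function(x, n = length(x)) {
      ## resample from F-hat, then add kernel noise so draws are continuous
      sample(x, n, replace = TRUE) + rnorm(n, sd = bw.nrd0(x))
    }
    set.seed(1)
    x <- rexp(100)
    length(unique(smooth_boot(x)))   # ~100 distinct values, unlike a plain bootstrap sample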