Matthew,

Your interpretation that the error rates are calculated on the training
data is incorrect.

Andy Liaw's help file says: "err.rate -- (classification only) vector
error rates of the prediction on the input data, the i-th element being
the (OOB) error rate for all trees up to the i-th." My understanding is
that the error rate is calculated by sending each OOB case (after a few
trees, every case in the original data will have served as OOB for some
trees) down all the trees up to the i-th for which it is OOB, and taking
the majority vote. The plot of an rf object shows that the OOB error
declines quickly once the ensemble becomes sizable: increasing the
variation among trees works! (If the error were based on the training
set, you wouldn't see such a drop, since each individual tree overfits
its own bootstrap sample.)
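You can watch this happen yourself. A minimal sketch (iris is used
purely as stand-in data; any classification set would do):

library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

## err.rate has one row per tree; the "OOB" column is the cumulative
## out-of-bag error using all trees up to the i-th, and the remaining
## columns are the per-class OOB error rates.
head(rf$err.rate)

## plot.randomForest plots err.rate against the number of trees; the
## OOB error starts high and drops as the ensemble grows.
plot(rf)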
Weidong

On Sun, Nov 27, 2011 at 3:21 AM, Matthew Francis
<mattjamesfran...@gmail.com> wrote:
> Thanks for the help. Let me explain in more detail how I think
> randomForest works, so that you (or others) can more easily see the
> error of my ways.
>
> The function first takes a random sample of the data, of the size
> specified by the sampsize argument. With this it fully grows a tree,
> resulting in a horribly over-fitted classifier for that random
> sub-set. It then repeats this with a different sample to generate the
> next tree, and so on.
>
> Now, my understanding is that after each tree is constructed, a test
> prediction for the *whole* training data set is made by combining the
> results of all trees so far (e.g. for classification, the majority
> vote of the individual tree predictions). From this an error rate is
> determined (applicable to the ensemble applied to the training data)
> and reported in the err.rate member of the returned randomForest
> object. If you look at the error rate (or plot it using the default
> plot method) you see that it starts out very high when only one or a
> few over-fitted trees are contributing, but once the forest gets
> larger the error rate drops, since the ensemble is doing its job. It
> doesn't make sense to me that this error rate is for a sub-set of the
> data, since the sub-set in question changes at each step (i.e. at
> each tree construction).
>
> By doing cross-validation tests, making 'training' and 'test' sets
> from the data I have, I find that I get error rates on the test sets
> comparable to the error rate obtained from the predicted member of
> the returned randomForest object. So that does seem to be the
> 'correct' error.
>
> By my understanding, the error reported for the i-th tree is that
> obtained using all trees up to and including the i-th to make an
> ensemble prediction. Therefore the final error reported should be the
> same as that obtained by running predict.randomForest on the training
> set, because by my understanding that should return a result
> identical to the one used to generate the error rate for the final
> tree constructed?
>
> Sorry that is a bit long winded, but I hope someone can point out
> where I'm going wrong and set me straight.
>
> Thanks!
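This is the crux of the discrepancy: rf$predicted holds the OOB votes,
while predict() with the training data passed as newdata runs every
case through all trees, including the ones that trained on it. A short
sketch (iris again as stand-in data; the identical() check is my
reading of the help page for predict.randomForest):

library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

## Omitting newdata returns the stored OOB predictions.
identical(predict(rf), rf$predicted)   # expected TRUE

## Passing the training data as newdata sends each case down *all*
## trees, including trees whose bootstrap sample contained that case,
## so the apparent error is optimistically low.
resub_pred <- predict(rf, newdata = iris)

mean(resub_pred != iris$Species)     # near-zero "resubstitution" error
mean(rf$predicted != iris$Species)   # honest OOB error rate
tail(rf$err.rate[, "OOB"], 1)        # agrees with the OOB error above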
> On Sun, Nov 27, 2011 at 11:44 AM, Weidong Gu <anopheles...@gmail.com> wrote:
>> Hi Matthew,
>>
>> The error rate reported by randomForest is the prediction error based
>> on the out-of-bag (OOB) data. It is therefore different from the
>> prediction error on the original data: each tree was built on a
>> bootstrap sample (containing about 63% of the original cases), and
>> the OOB error rate is likely higher than the prediction error on the
>> original data, as you observed.
>>
>> Weidong
>>
>> On Sat, Nov 26, 2011 at 3:02 PM, Matthew Francis
>> <mattjamesfran...@gmail.com> wrote:
>>> I've been using the R package randomForest, but there is an aspect I
>>> cannot work out the meaning of. After calling the randomForest
>>> function, the returned object contains an element called predicted,
>>> which is the prediction obtained using all the trees (at least
>>> that's my understanding). I've checked that this prediction set has
>>> the error rate reported by err.rate.
>>>
>>> However, if I send the training data back into the
>>> predict.randomForest function, I find I get a different result from
>>> the stored set of predictions. This is true for both classification
>>> and regression. I find the predictions obtained this way also have
>>> a much lower error rate and perform very well (suspiciously well...)
>>> on measures such as AUC.
>>>
>>> My understanding is that the two predictions above should be the
>>> same. Since they are not, I must not be understanding something
>>> properly. Any ideas what's going on?
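Finally, the cross-validation check Matthew describes takes only a few
lines; the OOB error should roughly agree with the error on a genuinely
held-out set. A sketch (iris once more as stand-in data, with an
arbitrary 100/50 split):

library(randomForest)

set.seed(1)
idx   <- sample(nrow(iris), 100)   # arbitrary training split
train <- iris[idx, ]
test  <- iris[-idx, ]

rf2 <- randomForest(Species ~ ., data = train, ntree = 500)

oob_err  <- tail(rf2$err.rate[, "OOB"], 1)
test_err <- mean(predict(rf2, newdata = test) != test$Species)
c(OOB = oob_err, holdout = test_err)   # the two should be comparable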