Not only that, but in the same help page, in the same "Value" section, it says:

    predicted: the predicted values of the input data based on out-of-bag samples

so people really should read the help pages instead of speculating... If the error rates were not based on OOB samples, they would drop to (near) 0 rather quickly, as each tree is intentionally overfitting its training set.

Andy
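A minimal sketch of the distinction Andy describes; the iris data, the seed, and the variable names are illustrative choices, not from the thread:

    library(randomForest)

    set.seed(1)
    rf <- randomForest(Species ~ ., data = iris)

    ## OOB predictions stored in the fit: each case is predicted only
    ## by the trees whose bootstrap sample did NOT contain it.
    mean(rf$predicted != iris$Species)   # honest OOB error rate

    ## Resubstitution: sending the training data back through predict()
    ## scores every case with ALL trees, including the overfitted trees
    ## grown on it, so the error is deceptively close to zero.
    resub <- predict(rf, newdata = iris)
    mean(resub != iris$Species)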
> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Weidong Gu
> Sent: Sunday, November 27, 2011 10:56 AM
> To: Matthew Francis
> Cc: r-help@r-project.org
> Subject: Re: [R] Question about randomForest
>
> Matthew,
>
> Your interpretation that the error rates are calculated on the
> training data is incorrect.
>
> In Andy Liaw's help file: "err.rate -- (classification only) vector
> error rates of the prediction on the input data, the i-th element
> being the (OOB) error rate for all trees up to the i-th."
>
> My understanding is that the error rate is calculated by passing each
> OOB case through all trees, up to the i-th, for which it is out of
> bag, and taking the majority vote. (After a few trees, every case in
> the original data will have served as OOB for some trees.) The plot
> of an rf object shows that the OOB error declines quickly once the
> ensemble becomes sizable: increased variation among trees works! (If
> the rates were based on the training sets, you wouldn't see such a
> drop, since each tree overfits its training set.)
>
> Weidong
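The err.rate behaviour Weidong describes can be checked directly; a sketch reusing the illustrative fit from above:

    ## err.rate has one row per tree: row i is the cumulative OOB error
    ## of the ensemble of trees 1..i. For classification, the first
    ## column is the overall "OOB" rate, followed by per-class rates.
    head(rf$err.rate)
    plot(rf)   # the default plot method draws these same curves

    ## The last row is the OOB error of the full forest:
    rf$err.rate[rf$ntree, "OOB"]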
> On Sun, Nov 27, 2011 at 3:21 AM, Matthew Francis
> <mattjamesfran...@gmail.com> wrote:
> > Thanks for the help. Let me explain in more detail how I think
> > randomForest works, so that you (or others) can more easily see the
> > error of my ways.
> >
> > The function first takes a random sample of the data, of the size
> > specified by the sampsize argument. With this it fully grows a tree,
> > resulting in a horribly over-fitted classifier for that random
> > sub-set. It then repeats this with a different sample to generate
> > the next tree, and so on.
> >
> > Now, my understanding is that after each tree is constructed, a test
> > prediction for the *whole* training data set is made by combining
> > the results of all trees (e.g. for classification, the majority vote
> > of the individual tree predictions). From this an error rate is
> > determined (applicable to the ensemble applied to the training data)
> > and reported in the err.rate member of the returned randomForest
> > object. If you look at the error rate (or plot it using the default
> > plot method), you see that it starts out very high when only one or
> > a few over-fitted trees are contributing, but once the forest gets
> > larger the error rate drops, since the ensemble is doing its job. It
> > doesn't make sense to me that this error rate is for a sub-set of
> > the data, since the sub-set in question changes at each step (i.e.
> > at each tree construction).
> >
> > By doing cross-validation tests, making 'training' and 'test' sets
> > from the data I have, I do find that I get error rates on the test
> > sets comparable to the error rate obtained from the predicted member
> > of the returned randomForest object. So that does seem to be the
> > 'correct' error.
> >
> > By my understanding, the error reported for the i-th tree is that
> > obtained using all trees up to and including the i-th tree to make
> > an ensemble prediction. Therefore the final error reported should be
> > the same as that obtained using the predict.randomForest function on
> > the training set, because by my understanding that should return an
> > identical result to the one used to generate the error rate for the
> > final tree constructed??
> >
> > Sorry that is a bit long-winded, but I hope someone can point out
> > where I'm going wrong and set me straight.
> >
> > Thanks!
> >
> > On Sun, Nov 27, 2011 at 11:44 AM, Weidong Gu
> > <anopheles...@gmail.com> wrote:
> >> Hi Matthew,
> >>
> >> The error rate reported by randomForest is the prediction error
> >> based on out-of-bag (OOB) data. Therefore, it is different from the
> >> prediction error on the original data: each tree was built on a
> >> bootstrap sample (containing about 63% of the distinct original
> >> cases), and the OOB error rate is likely higher than the prediction
> >> error on the original data, as you observed.
> >>
> >> Weidong
> >>
> >> On Sat, Nov 26, 2011 at 3:02 PM, Matthew Francis
> >> <mattjamesfran...@gmail.com> wrote:
> >>> I've been using the R package randomForest, but there is an aspect
> >>> I cannot work out the meaning of. After calling the randomForest
> >>> function, the returned object contains an element called
> >>> predicted, which is the prediction obtained using all the trees
> >>> (at least that's my understanding). I've checked that this
> >>> prediction set has the error rate reported by err.rate.
> >>>
> >>> However, if I send the training data back into the
> >>> predict.randomForest function, I find I get a different result
> >>> from the stored set of predictions. This is true for both
> >>> classification and regression. I find the predictions obtained
> >>> this way also have a much lower error rate and perform very well
> >>> (suspiciously well...) on measures such as AUC.
> >>>
> >>> My understanding is that the two predictions above should be the
> >>> same. Since they are not, I must not be understanding something
> >>> properly. Any ideas what's going on?
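The resolution of Matthew's puzzle, as a sketch reusing the illustrative fit from above: the two predictions differ by design, because predict() without newdata returns the stored OOB predictions, while predict() on the training data is resubstitution.

    ## With no newdata, predict() returns the stored OOB predictions,
    ## so it agrees with rf$predicted case by case:
    all(predict(rf) == rf$predicted)   # TRUE

    ## Each bootstrap sample contains roughly 63.2% of the distinct
    ## cases, so about 1/e of the data is out of bag for every tree:
    set.seed(2)
    mean(!(1:1000 %in% sample(1000, replace = TRUE)))   # ~ 0.37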