You need to be _extremely_ careful when assigning levels of factors. Look at this example:
R> x1 = factor(c("a", "b", "c"))
R> x2 = factor(c("a", "c", "c"))
R> x3 = x2
R> levels(x3) <- levels(x1)
R> x3
[1] a b b
Levels: a b c

Assigning to levels() relabels the existing levels by position rather than
matching them by name, so the second level of x2 ("c") is silently renamed
to "b".

I'll try to add more XXXXproofing in the code...

Andy

> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Haring, Tim (LWF)
> Sent: Thursday, December 10, 2009 5:00 AM
> To: r-help@r-project.org
> Subject: [R] different randomForest performance for same data
>
> Hello,
>
> I came across a problem when building a randomForest model.
> Maybe someone can help me.
> I have a training and a test dataset with a discrete response
> and ten predictors (numeric and factor variables). The two
> datasets are similar in terms of number of predictors, names of
> variables and data types of variables (factor, numeric), except
> that one predictor has 20 levels in the training
> dataset and only 19 levels in the test dataset.
> I found that the model performance is different when I train
> and test a model with the unchanged datasets on the one hand
> and after assigning the levels of the training dataset to the
> test dataset on the other. I only assign the levels and do not
> change the dataset itself, yet the models perform differently.
> Why???
>
> Here is my code:
>
> > library(randomForest)
> > load("datasets.RData") # import traindat and testdat
> > nlevels(traindat$predictor1)
> [1] 20
> > nlevels(testdat$predictor1)
> [1] 19
> > nrow(traindat)
> [1] 9838
> > nrow(testdat)
> [1] 3841
> > set.seed(10)
> > rf_orig <- randomForest(x=traindat[,-1], y=traindat[,1],
>       xtest=testdat[,-1], ytest=testdat[,1], ntree=100)
> > data.frame(rf_orig$test$err.rate)[100,1] # Error on test dataset
> [1] 0.3082531
> > # assign the levels of the training dataset to the test
> > # dataset for predictor 1
> > levels(testdat$predictor1) <- levels(traindat$predictor1)
> > nlevels(traindat$predictor1)
> [1] 20
> > nlevels(testdat$predictor1)
> [1] 20
> > nrow(traindat)
> [1] 9838
> > nrow(testdat)
> [1] 3841
> > set.seed(10)
> > rf_mod <- randomForest(x=traindat[,-1], y=traindat[,1],
>       xtest=testdat[,-1], ytest=testdat[,1], ntree=100)
> > data.frame(rf_mod$test$err.rate)[100,1] # Error on test dataset
> [1] 0.4808644 # is different
>
> Cheers,
> TIM
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
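
For the x1/x2 example at the top of this message, one way to give a factor the full set of levels without relabeling its values is to rebuild it with factor(), which matches values to levels by name. The sketch below reuses x1 and x2 from that example; the data frames `train` and `test` and their column `p` are hypothetical stand-ins for traindat, testdat and predictor1, not the poster's data, and this is only an illustration of the idea, not the change planned for randomForest.

```r
x1 <- factor(c("a", "b", "c"))
x2 <- factor(c("a", "c", "c"))

# Rebuild the factor: values are matched to the new levels by name,
# so "c" stays "c" and the unused level "b" is simply added.
x3 <- factor(x2, levels = levels(x1))
as.character(x3)   # "a" "c" "c"
levels(x3)         # "a" "b" "c"

# Hypothetical toy data frames standing in for the training and test sets.
train <- data.frame(p = factor(c("a", "b", "c", "b")))
test  <- data.frame(p = factor(c("a", "c", "c")))

# Align the test factor with the training levels without changing the data.
test$p <- factor(test$p, levels = levels(train$p))
as.character(test$p)   # "a" "c" "c"
```

Note that any value in the test factor that is not among the training levels becomes NA with this approach, so it is worth checking anyNA(test$p) afterwards.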