Hi,

I have been using the AUCRF package on my data and was initially impressed by the high OOB-AUC it reports. After a while, though, I began to suspect that the number reflects some sort of bias, which motivated me to test the package on random data generated with rnorm().
The design is simple: 100 observations, 50 in class 0 and 50 in class 1. The number of variables is a knob I tune (the idea being that if there is a bias, the reported performance should increase as more noise variables are added). By construction there is no signal in the data, so an unbiased AUC estimate should sit close to 0.5. The results worry me: the OOB-AUC is well above 0.5, and it climbs even higher as the number of variables grows. Am I misunderstanding anything here?

Below is the R code I used for the test:

library(AUCRF)

Nvar  <- 200                                  # number of (noise) variables
Label <- as.factor(c(rep(0, 50), rep(1, 50))) # class label: 50/50 split
AUC_r <- NULL

for (k in 1:10) {       # k controls the randomness of the generated data
  set.seed(k)
  Arandom <- matrix(rnorm(Nvar * length(Label)), ncol = Nvar)
  DF <- data.frame(Arandom, Label = Label)
  for (j in 1:20) {     # j controls the randomness of the OOB sampling
    if (j %% 10 == 0) cat(k, j, "\n")        # progress report
    set.seed(j)
    fit <- AUCRF(Label ~ ., data = DF)
    AUC_r <- cbind(AUC_r, fit$AUCcurve$AUC)  # one column per run
  }
}

# Mean OOB-AUC across all runs, as a function of the number of variables kept
plot(fit$AUCcurve$k, apply(AUC_r, 1, mean), type = "b", pch = 3,
     xlab = "# of Vars", ylab = "OOB-AUC", lwd = 2, col = 2, ylim = c(0.4, 1))

Thanks,
-Jack

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
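P.S. As a further sanity check on the same noise design, one could fit a plain random forest with no variable selection and compute the OOB AUC directly from its out-of-bag votes; if that value stays near 0.5 while AUCRF's reported OOB-AUC climbs, the inflation would point to the selection step rather than the forest itself. A rough sketch, assuming the randomForest and pROC packages are installed (pROC is used here only as a convenient way to compute an AUC):

```r
library(randomForest)  # assumed installed
library(pROC)          # assumed installed; provides roc()/auc()

set.seed(1)
Nvar  <- 200
Label <- as.factor(c(rep(0, 50), rep(1, 50)))
DF <- data.frame(matrix(rnorm(Nvar * length(Label)), ncol = Nvar),
                 Label = Label)

rf <- randomForest(Label ~ ., data = DF)

# rf$votes holds out-of-bag vote fractions, one column per class level;
# the fraction of votes for class "1" serves as an OOB score per observation
oob_auc <- auc(roc(DF$Label, rf$votes[, "1"], quiet = TRUE))
print(oob_auc)  # on pure noise this should sit near 0.5
```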