Hi,

I am using the AUCRF package on my data and was at first impressed by the
high OOB-AUC. After a while, though, I began to suspect some sort of bias,
which motivated me to test with random data (generated with rnorm).

The design is very simple: 100 observations, 50 in class 0 and 50 in class
1. The number of variables is the parameter I tune (the idea being that if
there is bias, the apparent performance should increase with more
variables).

Presumably there is no signal in the data, so the true, unbiased AUC should
not be far from 0.5.
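As a sanity check on that expectation, here is a minimal base-R sketch
(independent of AUCRF; the helper name `auc` is my own) of the null
distribution of the AUC for pure-noise scores with 50 observations per
class, computed via the Mann-Whitney rank identity:

```r
set.seed(1)
lab <- c(rep(0, 50), rep(1, 50))

# AUC of a score vector via the Mann-Whitney rank identity:
# AUC = (sum of ranks in class 1 - n1*(n1+1)/2) / (n0*n1)
auc <- function(score) {
  r <- rank(score)
  (sum(r[lab == 1]) - 50 * 51 / 2) / (50 * 50)
}

# null distribution: AUC of completely random scores
null_auc <- replicate(2000, auc(rnorm(100)))
round(c(mean = mean(null_auc), sd = sd(null_auc)), 3)
```

The mean sits right at 0.5, with a spread of only a few hundredths, so a
single noise variable should not get anywhere near the OOB-AUC values I am
seeing.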

The results worry me: the OOB AUC is well above 0.5, and it climbs even
higher as the number of variables increases.
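For what it's worth, the same "more variables, higher apparent AUC" pattern
shows up even without AUCRF if one simply picks the best single variable by
AUC on pure noise. A minimal base-R sketch of that selection effect (the
helper names `auc` and `best_auc` are my own):

```r
set.seed(1)
lab <- c(rep(0, 50), rep(1, 50))

# AUC of a score vector via the Mann-Whitney rank identity
auc <- function(score) {
  r <- rank(score)
  (sum(r[lab == 1]) - 50 * 51 / 2) / (50 * 50)
}

# best single-variable AUC among nvar pure-noise variables,
# allowing either direction of the score (max of AUC and 1 - AUC)
best_auc <- function(nvar) {
  X <- matrix(rnorm(100 * nvar), ncol = nvar)
  max(apply(X, 2, function(x) max(auc(x), 1 - auc(x))))
}

# average best AUC over 50 noise data sets, for increasing nvar
res <- sapply(c(10, 50, 200), function(n) mean(replicate(50, best_auc(n))))
round(res, 3)
```

The best-variable AUC drifts upward with the number of noise variables,
which is the kind of optimism I suspect is inflating the OOB-AUC curve.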

Am I misunderstanding anything here?

Below is the R code I used to test:

library(AUCRF)  # provides AUCRF()

Nvar <- 200  # number of variables
Label <- as.factor(c(rep(0, 50), rep(1, 50)))  # class labels
AUC_r <- NULL  # one column per fit; rows follow fit$AUCcurve$k

for (k in 1:10) {  # seed for generating the random data
  set.seed(k)
  Arandom <- matrix(rnorm(Nvar * length(Label)), ncol = Nvar)
  DF <- data.frame(Arandom, Label = Label)
  for (j in 1:20) {  # seed for the OOB randomness
    if (j %% 10 == 0) cat(k, j, "\n")
    set.seed(j)
    fit <- AUCRF(Label ~ ., data = DF)
    AUC_r <- cbind(AUC_r, fit$AUCcurve$AUC)
  }
}

plot(fit$AUCcurve$k, apply(AUC_r, 1, mean), type = "b", pch = 3,
     xlab = "# of Vars", ylab = "OOB-AUC", lwd = 2, col = 2,
     ylim = c(0.4, 1))


Thanks,

-Jack

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.