Hello, I am struggling to produce an MDS plot using the randomForest package with a moderately large data set. My data set has one categorical response variable, 7 predictor variables and just under 19000 observations. That means my proximity matrix is approximately 19000 by 19000 (one row and one column per observation), which is quite large. To train a random forest on a data set this large I have to use my institution's high-performance computer. Using this setup I was able to train a randomForest with the proximity argument set to TRUE. At this point I wanted to construct an MDS plot using the following:

MDSplot(nech.rf, nech.d$pd.fl, palette=c(1,2,3), pch=as.numeric(nech.d$pd.fl))

where "nech.rf" is the randomForest object and "nech.d$pd.fl" is the classification factor. With the architecture listed below, I have been waiting approximately 2 days for this to run, and I am not sure that it will ever finish. Can anyone recommend a way to tweak the MDSplot function so that it runs faster? Using a much smaller data set, I tried changing the cmdscale arguments (i.e. eigenvalues) within the MDSplot function, but that did not seem to have any effect on the overall running time. Alternatively, could someone comment on whether I am dreaming to think that this will ever finish? This is probably the best computer that I will have access to, so I was hoping that I could somehow get this to run. Perhaps someone reading the list has experience with randomForest and large data sets and can comment on my situation. Below the architecture information I have constructed a dummy example to illustrate what I am doing, but given the nature of the problem it does not completely reflect my situation. Any help would be much appreciated!

Thanks!

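
To make the running-time question concrete, here is a minimal sketch of how the cost could be gauged on subsets before committing to the full matrix. It assumes that MDSplot essentially applies cmdscale to 1 - proximity. The helper name time_mds, the toy data and the subset sizes are hypothetical and chosen only for illustration.

```r
library(randomForest)

set.seed(17)

## Toy data standing in for the real data set (illustrative only)
n <- 200
d <- data.frame(var1 = runif(n),
                var2 = runif(n),
                cls  = factor(rep(c("CLASS-1", "CLASS-2"), each = n / 2)))

rf <- randomForest(x = d[, 1:2], y = d$cls, proximity = TRUE)

## Hypothetical helper: time classical MDS on a random subset of m cases.
## The subset of the proximity matrix is converted to a dissimilarity
## (1 - proximity) and passed to cmdscale directly.
time_mds <- function(rf, m, k = 2) {
  idx <- sample(nrow(rf$proximity), m)
  dis <- 1 - rf$proximity[idx, idx]
  system.time(cmdscale(dis, eig = TRUE, k = k))[["elapsed"]]
}

sizes   <- c(50, 100, 200)
elapsed <- sapply(sizes, function(m) time_mds(rf, m))
print(data.frame(size = sizes, seconds = elapsed))
```

Plotting the elapsed time against the subset size for a few larger sizes should give a rough idea of how the time grows and whether the full run is feasible.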
Sam

----
Computer specs and sessionInfo()

OS: Suse Linux
Memory: 64 GB
Processors: Intel Itanium 2, 64 x 1500 MHz

And:

> sessionInfo()
R version 2.6.2 (2008-02-08)
ia64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] randomForest_4.6-6

loaded via a namespace (and not attached):
[1] rcompgen_0.1-17

###
# Dummy Example
###

require(randomForest)
set.seed(17)

## Number of points
x <- 10

df <- rbind(
  data.frame(var1=runif(x, 10, 50),
             var2=runif(x, 2, 7),
             var3=runif(x, 0.2, 0.35),
             var4=runif(x, 1, 2),
             var5=runif(x, 5, 8),
             var6=runif(x, 1, 2),
             var7=runif(x, 5, 8),
             cls=factor("CLASS-2")
  ),
  data.frame(var1=runif(x, 10, 50),
             var2=runif(x, -3, 3),
             var3=runif(x, 0.1, 0.25),
             var4=runif(x, 1, 2),
             var5=runif(x, 5, 8),
             var6=runif(x, 1, 2),
             var7=runif(x, 5, 8),
             cls=factor("CLASS-1")
  )
)

df.rf <- randomForest(y=df[,8], x=df[,1:7], proximity=TRUE, importance=TRUE)

MDSplot(df.rf, df$cls, k=2, palette=c(1,2,3,4), pch=as.numeric(df$cls))