Dear Experts,

I have fitted MARS and GAM models on a real dataset. My goal is prediction. I have repeated a random training/test split many times to estimate the out-of-sample accuracy, using the Mean Squared Error (MSE) as the evaluation criterion.

The reviewers of my paper have asked me to run simulations, since a simulation study may be a more objective way to compare the performance of these 2 algorithms. My goal is to find out which method (GAM or MARS) performs better, in the sense of a lower MSE, and under what circumstances.

I want to study the influence of 3 factors: the sample size (n), the percentage of Y-outliers, and the percentage of missing data in the predictors (X). The levels I want to consider are:
Sample size: n=50; n=100; n=200; n=300 and n=500
Y-outliers: 10%; 20%; 30%; 40% and 50% of Y-outliers
Missing data: 10%; 20%; 30%; 40% and 50% of X missing data

Below is the reproducible R code for GAM and MARS that I use to calculate the MSE over many repeated training/test splits. How can I modify it to simulate the sample size, the presence of Y-outliers and the presence of missing data?

### MSE CROSS-VALIDATION GAM (GAM1)
install.packages("ISLR")
library(ISLR)
install.packages("mgcv")
library(mgcv)

set.seed(123)
# Create a list to store the results
lst <- list()
# This statement does the repetitions (looping)
for (i in 1:1000) {
  n <- dim(Wage)[1]
  p <- 0.667
  # Random split: about 2/3 for training, the rest for testing
  sam <- sample(1:n, floor(p * n), replace = FALSE)
  Training <- Wage[sam, ]
  Testing <- Wage[-sam, ]
  # Fit on the training part only (fitting on all of Wage would leak the test data)
  GAM1 <- gam(wage ~ education + s(age, bs = "ps") + year, data = Training)
  ypred <- predict(GAM1, newdata = Testing)
  y <- Testing$wage
  MSE <- mean((y - ypred)^2)
  lst[i] <- MSE
}
mean(unlist(lst))

### MSE CROSS-VALIDATION MARS (mars1)
install.packages("ISLR")
library(ISLR)
install.packages("earth")
library(earth)

set.seed(123)
# Create a list to store the results
lst <- list()
# This statement does the repetitions (looping)
for (i in 1:1000) {
  n <- dim(Wage)[1]
  p <- 0.667
  # Random split: about 2/3 for training, the rest for testing
  sam <- sample(1:n, floor(p * n), replace = FALSE)
  Training <- Wage[sam, ]
  Testing <- Wage[-sam, ]
  # Fit on the training part only (fitting on all of Wage would leak the test data)
  mars1 <- earth(wage ~ age + as.factor(education) + year, data = Training)
  ypred <- predict(mars1, newdata = Testing)
  y <- Testing$wage
  MSE <- mean((y - ypred)^2)
  lst[i] <- MSE
}
mean(unlist(lst))

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
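
One possible way to build the three factors into the loops above is sketched here. It is only a sketch under stated assumptions, not a definitive design. The helper names (add_y_outliers, add_x_missing, one_replicate, fit_lm) and the toy data frame are hypothetical. Outliers are assumed to be responses shifted by k = 5 standard deviations, missing values are assumed to be missing completely at random in each predictor, only the training part is contaminated, and incomplete rows are dropped with na.omit() because not every fitting function accepts NA values. To keep the block runnable in base R, lm() on synthetic data stands in for the real models.

```r
# Hypothetical helpers; names and mechanisms are assumptions, not from any package.

# Shift a fraction `prop` of the responses by +/- k standard deviations.
add_y_outliers <- function(data, yname, prop, k = 5) {
  n <- nrow(data)
  m <- floor(prop * n)
  if (m > 0) {
    idx   <- sample.int(n, m)
    shift <- k * sd(data[[yname]])
    signs <- sample(c(-1, 1), m, replace = TRUE)
    data[[yname]][idx] <- data[[yname]][idx] + signs * shift
  }
  data
}

# Set a fraction `prop` of each predictor to NA (missing completely at random).
add_x_missing <- function(data, xnames, prop) {
  n <- nrow(data)
  m <- floor(prop * n)
  for (x in xnames) {
    if (m > 0) data[[x]][sample.int(n, m)] <- NA
  }
  data
}

# One replicate: subsample n rows, split, contaminate the training part,
# fit on complete cases, and return the MSE on the clean test part.
one_replicate <- function(data, n, out_prop, miss_prop, fit_fun,
                          yname, xnames, p = 0.667) {
  sub   <- data[sample.int(nrow(data), n), , drop = FALSE]
  sam   <- sample.int(n, floor(p * n))
  train <- sub[sam, , drop = FALSE]
  test  <- sub[-sam, , drop = FALSE]
  train <- add_y_outliers(train, yname, out_prop)
  train <- add_x_missing(train, xnames, miss_prop)
  train <- na.omit(train)  # complete-case analysis (an assumption)
  # Return NA if the fit or the prediction fails (e.g. too few complete rows)
  tryCatch({
    fit  <- fit_fun(train)
    pred <- as.numeric(predict(fit, newdata = test))
    mean((test[[yname]] - pred)^2)
  }, error = function(e) NA_real_)
}

# Synthetic stand-in data (NOT the Wage data) so the sketch runs in base R.
set.seed(123)
toy <- data.frame(age  = runif(1000, 18, 80),
                  year = sample(2003:2009, 1000, replace = TRUE))
toy$wage <- 50 + 0.5 * toy$age + rnorm(1000, sd = 10)

# Stand-in fitting function. With the packages from the post loaded,
# the real ones might look like:
#   fit_gam  <- function(d) mgcv::gam(wage ~ education + s(age, bs = "ps") + year, data = d)
#   fit_mars <- function(d) earth::earth(wage ~ age + education + year, data = d)
fit_lm <- function(d) lm(wage ~ age + year, data = d)

# Full factorial design: 5 x 5 x 5 = 125 scenarios.
grid <- expand.grid(n         = c(50, 100, 200, 300, 500),
                    out_prop  = c(0.1, 0.2, 0.3, 0.4, 0.5),
                    miss_prop = c(0.1, 0.2, 0.3, 0.4, 0.5))

reps <- 10  # kept small here; the post uses 1000 repetitions
grid$mse <- mapply(function(n, o, m) {
  mse <- replicate(reps, one_replicate(toy, n, o, m, fit_lm,
                                       "wage", c("age", "year")))
  mean(mse, na.rm = TRUE)
}, grid$n, grid$out_prop, grid$miss_prop)

head(grid)
```

To compare GAM and MARS, the same grid would be run once per fitting function, with Wage in place of toy and "age", "education", "year" as the predictor names. With small n and a high missing-data percentage, very few complete rows may remain in the training part, so imputation instead of na.omit() may be worth considering.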