Dear R users,

I'm trying to understand how correlated predictors affect the relative importance measure in stochastic gradient boosted trees (J. Friedman). As Friedman describes, with single decision trees (referring to Breiman's CART algorithm) the relative importance measure is augmented by a strategy involving surrogate splits, intended to uncover the masking of influential variables by others highly associated with them. This strategy is most helpful with single decision trees, where the opportunity for variables to participate in splitting is limited by the size of the tree. In the context of boosting, however, the number of splitting opportunities is vastly increased and surrogate unmasking is less essential.

Based on the results from the simulated example below, if I have, say, two variables that are highly correlated, the relative importance measure derived from boosting tends to be high for one of the predictors and low for the other. I'm trying to reconcile this observation with Friedman's description above, from which I understand that these two variables should receive about the same measure of importance. I'd appreciate your comments.

require(gbm)
require(MASS)

# Generate multivariate random data such that X1 is moderately correlated
# with X2, strongly correlated with X3, and not correlated with X4 or X5.
cov.m <- matrix(c(1,   0.5, 0.9, 0, 0,
                  0.5, 1,   0.2, 0, 0,
                  0.9, 0.2, 1,   0, 0,
                  0,   0,   0,   1, 0,
                  0,   0,   0,   0, 1), 5, 5, byrow = TRUE)
n <- 2000                                # number of observations
X <- mvrnorm(n, rep(0, 5), cov.m)
Y <- apply(X, 1, sum)
SNR <- 10                                # signal-to-noise ratio
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(n, 0, sigma)
mydata <- data.frame(X, Y)

# Fit model (should take less than 20 seconds on an average modern computer)
gbm1 <- gbm(formula = Y ~ X1 + X2 + X3 + X4 + X5,
            data = mydata,
            distribution = "gaussian",
            n.trees = 500,
            interaction.depth = 2,
            n.minobsinnode = 10,
            shrinkage = 0.1,
            bag.fraction = 0.5,
            train.fraction = 1,
            cv.folds = 5,
            keep.data = TRUE,
            verbose = TRUE)

# Plot variable influence
best.iter <- gbm.perf(gbm1, plot.it = TRUE, method = "cv")
print(best.iter)
summary(gbm1, n.trees = best.iter)       # based on the estimated best number of trees
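To show what I mean more concretely, here is a minimal sketch on top of the simulated data above (the seed loop and the call to relative.influence() are my own additions, not something taken from Friedman's paper). Since bag.fraction = 0.5 makes each fit stochastic, refitting with a few different seeds shows whether the importance assigned to the highly correlated pair X1/X3 is stable, or whether the "winner" flips from run to run.

# Minimal sketch (assumes the objects created above are still in the workspace):
# check the simulated correlations, then refit the model under several seeds
# and compare the split-based relative influence of each predictor.

round(cor(X), 2)                         # confirm the simulated correlation structure

imp <- sapply(1:5, function(s) {
  set.seed(s)
  fit <- gbm(Y ~ X1 + X2 + X3 + X4 + X5, data = mydata,
             distribution = "gaussian", n.trees = 500,
             interaction.depth = 2, shrinkage = 0.1,
             bag.fraction = 0.5)
  relative.influence(fit, n.trees = 500)
})
round(imp, 1)                            # one column of importances per seed
rowMeans(imp)                            # importance averaged across the refits

If your version of gbm provides permutation.test.gbm(), then summary(gbm1, n.trees = best.iter, method = permutation.test.gbm) gives a permutation-based alternative to the split-improvement measure, which may spread importance across correlated predictors differently.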