Dear R users,

I'm trying to understand how correlated predictors affect the relative importance measure in stochastic gradient boosted trees (J. Friedman). As Friedman describes, with single decision trees (referring to Breiman's CART algorithm) the relative importance measure is augmented by a strategy involving surrogate splits, intended to uncover the masking of influential variables by others highly associated with them. This strategy is most helpful with single decision trees, where the opportunity for variables to participate in splitting is limited by the size of the tree. In the context of boosting, however, the number of splitting opportunities is vastly increased and surrogate unmasking is less essential.

Based on the results from the simulated example below, if I have, say, two variables that are highly correlated, the relative importance measure derived from boosting tends to be high for one of the predictors and low for the other. I'm trying to reconcile this observation with Friedman's description above, from which I understand that these two variables should receive about the same measure of importance. I'd appreciate your comments.

require(gbm)
require(MASS)

# Generate multivariate random data such that X1 is moderately correlated
# with X2, strongly correlated with X3, and not correlated with X4 or X5.
cov.m <- matrix(c(1,   0.5, 0.9, 0, 0,
                  0.5, 1,   0.2, 0, 0,
                  0.9, 0.2, 1,   0, 0,
                  0,   0,   0,   1, 0,
                  0,   0,   0,   0, 1), 5, 5, byrow = TRUE)
n <- 2000                                # number of observations
X <- mvrnorm(n, rep(0, 5), cov.m)
Y <- apply(X, 1, sum)
SNR <- 10                                # signal-to-noise ratio
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(n, 0, sigma)
mydata <- data.frame(X, Y)

# Fit model (should take less than 20 seconds on an average modern computer)
gbm1 <- gbm(formula = Y ~ X1 + X2 + X3 + X4 + X5,
            data = mydata,
            distribution = "gaussian",
            n.trees = 500,
            interaction.depth = 2,
            n.minobsinnode = 10,
            shrinkage = 0.1,
            bag.fraction = 0.5,
            train.fraction = 1,
            cv.folds = 5,
            keep.data = TRUE,
            verbose = TRUE)

# Plot variable influence
best.iter <- gbm.perf(gbm1, plot.it = TRUE, method = "cv")
print(best.iter)
summary(gbm1, n.trees = best.iter)       # based on the estimated best number of trees
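To show what I mean more concretely, here is a minimal sketch on top of the simulated data above (the seed loop and the call to relative.influence() are my own additions, not something taken from Friedman's paper). Since bag.fraction = 0.5 makes each fit stochastic, refitting with a few different seeds shows whether the importance assigned to the highly correlated pair X1/X3 is stable, or whether the "winner" flips from run to run.

# Minimal sketch (assumes the objects created above are still in the workspace):
# check the simulated correlations, then refit the model under several seeds
# and compare the split-based relative influence of each predictor.

round(cor(X), 2)                         # confirm the simulated correlation structure

imp <- sapply(1:5, function(s) {
  set.seed(s)
  fit <- gbm(Y ~ X1 + X2 + X3 + X4 + X5, data = mydata,
             distribution = "gaussian", n.trees = 500,
             interaction.depth = 2, shrinkage = 0.1,
             bag.fraction = 0.5)
  relative.influence(fit, n.trees = 500)
})
round(imp, 1)                            # one column of importances per seed
rowMeans(imp)                            # importance averaged across the refits

If your version of gbm provides permutation.test.gbm(), then summary(gbm1, n.trees = best.iter, method = permutation.test.gbm) gives a permutation-based alternative to the split-improvement measure, which may spread importance across correlated predictors differently.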