On Mon, 01-Mar-2010 at 12:01PM -0500, Max Kuhn wrote:

|> In theory, the choice between two perfectly correlated predictors is
|> random. Therefore, the importance should be "diluted" by half.
|> However, this is implementation dependent.
|>
|> For example, run this:
|>
|> set.seed(1)
|> n <- 100
|> p <- 10
|>
|> data <- as.data.frame(matrix(rnorm(n*(p-1)), nrow = n))
|> data$dup <- data[, p-1]
|>
|> data$y <- 2 + 4 * data$dup - 2 * data$dup^2 + rnorm(n)
|>
|> data <- data[, sample(1:ncol(data))]
|>
|> str(data)
|>
|> library(gbm)
|> fit <- gbm(y~., data = data,
|>            distribution = "gaussian",
|>            interaction.depth = 10,
|>            n.trees = 100,
|>            verbose = FALSE)
|> summary(fit)
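(Before my question, an aside: if I read ?relative.influence right, the raw,
unnormalised numbers behind that summary can be pulled out directly, which
makes it easier to see whether the influence really is split 50/50 between
the two copies.)

relative.influence(fit, n.trees = 100)  # unnormalised influence per variable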
What happens if there's a third identical copy?

> data$DUP <- data$dup
> fit <- gbm(y~., data = data,
+            distribution = "gaussian",
+            interaction.depth = 10,
+            n.trees = 100,
+            verbose = FALSE)
> summary(fit)
    var     rel.inf
1   DUP 55.98653321
2   dup 42.99934344
3    V2  0.30763599
4    V1  0.17108839
5    V4  0.14272470
6    V3  0.13069450
7    V6  0.07839121
8    V7  0.07109805
9    V5  0.06080096
10   V8  0.05168955
11   V9  0.00000000
>

So V9, which was identical to dup, has now gone off the radar altogether.
At first I thought that might be because 100 trees wasn't nearly enough, so
I increased it to 6000 and added some cross-validation (a sketch of that run
is in the postscript below).  Doing a summary at the optimal number of trees
still gives a similar result.  I have to admit to being somewhat puzzled.

-- 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
   ___    Patrick Connolly
 {~._.~}                   Great minds discuss ideas
 _( Y )_                   Average minds discuss events
(:_~*~_:)                  Small minds discuss people
 (_)-(_)                   ..... Eleanor Roosevelt
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
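P.S. In case it helps anyone reproduce this, the bigger run was roughly along
these lines.  It is only a sketch: the cv.folds value is arbitrary, and I'm
taking "the optimal number of trees" to mean what gbm.perf() reports with
method = "cv".

fit2 <- gbm(y ~ ., data = data,
            distribution = "gaussian",
            interaction.depth = 10,
            n.trees = 6000,
            cv.folds = 5,       # fold count chosen arbitrarily
            verbose = FALSE)
best.iter <- gbm.perf(fit2, method = "cv")  # CV estimate of the optimal number of trees
summary(fit2, n.trees = best.iter)          # relative influence at that iteration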