"biased regression coefficients" is nonsense. The coefficients are unbiased: their expectation (in the appropriate model) is the true value of the parameters (when estimated by, e.g. least squares).
The problem is model selection. I suggest you consult a local statistician, as you seem confused about the basic concepts.

Bert Gunter
Genentech Nonclinical Biostatistics


On Tue, Aug 3, 2010 at 1:42 PM, Michael Haenlein <haenl...@escpeurope.eu> wrote:
> Thanks for all your comments!
>
> @Dennis: Are there any thresholds that I can use to evaluate the Variance
> Inflation Factor? I think I learned at some point that VIF should be less
> than 10, but probably that is too conservative? You mentioned in your
> example that a VIF of 13 is "not big enough to raise a red flag". So is the
> cut-off more around 15 or 20?
>
> @Bert: The purpose of my regression is inference, that is, to know whether
> and to what extent x1, x2 and x1*x2 influence y. It's less about prediction
> than about understanding the relative impact of different variables. So, if
> I get your message correctly, correlation among the predictors is likely to
> be an issue in my case, as it leads to biased regression coefficients (which
> is what I feared).
>
> Thanks,
>
> Michael
>
>
> -----Original Message-----
> From: Bert Gunter [mailto:gunter.ber...@gene.com]
> Sent: Tuesday, August 03, 2010 22:37
> To: Dennis Murphy
> Cc: haenl...@gmail.com; r-help@r-project.org
> Subject: Re: [R] Collinearity in Moderated Multiple Regression
>
> Absolutely right.
>
> But I think it's also worth adding that when the predictors _are_
> correlated, the estimates of their coefficients depend on which of them
> are included in the model. This means that one should generally not try to
> interpret the individual coefficients, e.g. as a way to assess their
> relative importance. Rather, they should just be viewed as the machinery
> that produces the prediction surface, and it is that surface one needs to
> consider to understand the model.
>
> In my experience, this elementary fact is not understood by many
> (most?) nonstatistical practitioners using multiple regression -- and this
> ignorance gets them into a world of trouble.
>
> -- Bert
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
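To put numbers on the point about correlated predictors, here is a minimal sketch with simulated data; the variable names, the 0.8 correlation and the coefficient values are illustrative assumptions, not figures from the thread:

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + sqrt(1 - 0.8^2) * rnorm(n)  # x2 correlated with x1 (about 0.8)
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)         # both true slopes equal 2

coef(lm(y ~ x1 + x2))  # x1 slope estimated near its true value of 2
coef(lm(y ~ x1))       # drop x2: the x1 slope shifts toward 2 + 0.8*2 = 3.6

Each fit is unbiased for its own model, but the "effect of x1" is not a single number that can be read off and ranked against x2; it depends on which other correlated predictors are included.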
>
> On Tue, Aug 3, 2010 at 12:57 PM, Dennis Murphy <djmu...@gmail.com> wrote:
>>
>> Hi:
>>
>> On Tue, Aug 3, 2010 at 6:51 AM, <haenl...@gmail.com> wrote:
>>
>> > I'm sorry -- I think I chose a bad example. Let me start over again:
>> >
>> > I want to estimate a moderated regression model of the following form:
>> > y = a*x1 + b*x2 + c*x1*x2 + e
>>
>> No intercept? What's your null model, then?
>>
>> > Based on my understanding, including an interaction term (x1*x2)
>> > in the regression in addition to x1 and x2 leads to issues of
>> > multicollinearity, as x1*x2 is likely to covary to some degree with x1
>> > (and x2).
>>
>> Is it possible you're confusing interaction with multicollinearity?
>> You've stated that x1 and x2 are weakly correlated; the product term
>> is going to be correlated with each of its constituent covariates, but
>> unless that correlation is above 0.9 (some say 0.95) in magnitude,
>> multicollinearity is not really a substantive issue. As others have
>> suggested, if you're concerned about multicollinearity, then fit the
>> interaction model and use the vif() function from package car (or
>> elsewhere) to check for it.
>> Multicollinearity has to do with ill-conditioning in the model matrix;
>> interaction means that the response y is influenced by the product of
>> the x1 and x2 covariates as well as by the individual covariates. They
>> are not the same thing. Perhaps an example will help.
>>
>> Here's your x1 and x2 with a manufactured response:
>>
>> df <- data.frame(x1 = rep(1:3, each = 3),
>>                  x2 = rep(1:3, 3))
>> # Response is generated to produce a significant interaction
>> df$y <- 0.5 + df$x1 + 1.2 * df$x2 + 2.5 * df$x1 * df$x2 + rnorm(9)
>> df
>>   x1 x2         y
>> 1  1  1  5.968255
>> 2  1  2  7.566212
>> 3  1  3 13.420006
>> 4  2  1  9.025791
>> 5  2  2 16.382381
>> 6  2  3 20.923113
>> 7  3  1 11.669916
>> 8  3  2 20.714224
>> 9  3  3 31.757423
>>
>> m1 <- lm(y ~ x1 * x2, data = df)
>> summary(m1)
>> <snip>
>>
>> Coefficients:
>>             Estimate Std. Error t value Pr(>|t|)
>> (Intercept)   2.3642     2.6214   0.902  0.40846
>> x1           -0.1200     1.2135  -0.099  0.92505
>> x2            0.2549     1.2135   0.210  0.84193
>> x1:x2         3.1589     0.5617   5.624  0.00246 **
>> ---
>> Residual standard error: 1.123 on 5 degrees of freedom
>> Multiple R-squared: 0.9882, Adjusted R-squared: 0.9812
>> F-statistic: 139.9 on 3 and 5 DF,  p-value: 3.053e-05
>>
>> # So the model has insignificant marginal covariate effects but a
>> # strong interaction effect.
>>
>> library(car)
>> vif(m1)
>>    x1    x2 x1:x2
>>     7     7    13
>>
>> # None of these is big enough to raise a red flag re
>> # multicollinearity. Let's look at the correlation
>> # matrix of the two covariates and their interaction.
>>
>> with(df, cor(cbind(x1, x2, x1 * x2)))
>>           x1        x2
>> x1 1.0000000 0.0000000 0.6793662
>> x2 0.0000000 1.0000000 0.6793662
>>    0.6793662 0.6793662 1.0000000
>>
>> The correlation of the interaction with the other two covariates is
>> 0.68, which is nowhere close to the 0.9-or-above correlations that
>> signal potential multicollinearity.
>>
>> HTH,
>> Dennis
>>
>> > One recommendation I have seen in this context is to use mean centering,
>> > but apparently this does not solve the problem (see: Echambadi, Raj and
>> > James D. Hess (2007), "Mean-centering does not alleviate collinearity
>> > problems in moderated multiple regression models," Marketing Science,
>> > 26 (3), 438-45). So my question is: Which R function can I use to
>> > estimate this type of model?
>> >
>> > Sorry for the confusion caused by my previous message,
>> >
>> > Michael
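On the mean-centering question, a small sketch (rebuilding Dennis's toy data from above, with a seed added; this refit is an illustration, not code posted in the thread) shows why centering changes the collinearity diagnostics without changing the model: centering is just a reparametrization, so the fitted values and the interaction estimate are untouched.

library(car)                          # for vif(), as in Dennis's example
set.seed(1)                           # seed added so the sketch is reproducible
df <- data.frame(x1 = rep(1:3, each = 3), x2 = rep(1:3, 3))
df$y <- 0.5 + df$x1 + 1.2 * df$x2 + 2.5 * df$x1 * df$x2 + rnorm(9)
m1 <- lm(y ~ x1 * x2, data = df)      # uncentered fit

df$x1c <- df$x1 - mean(df$x1)         # mean-center the covariates
df$x2c <- df$x2 - mean(df$x2)
m2 <- lm(y ~ x1c * x2c, data = df)    # centered fit

vif(m1); vif(m2)                           # VIFs drop (to 1 in this balanced design) ...
all.equal(fitted(m1), fitted(m2))          # ... but the fitted values are identical
c(coef(m1)["x1:x2"], coef(m2)["x1c:x2c"])  # and so is the interaction estimate

Whether that counts as "solving" a collinearity problem is the point Echambadi and Hess dispute: the diagnostics look better, but no new information enters the model.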
>> >
>> > On Aug 3, 2010 3:42pm, David Winsemius <dwinsem...@comcast.net> wrote:
>> > >
>> > > I think you are attributing to "collinearity" a problem that is due to
>> > > your small sample size. You are predicting 9 points with 3 predictor
>> > > terms, and incorrectly concluding that there is some "inconsistency"
>> > > because you get an R^2 that is above some number you deem surprising.
>> > > (I got values between 0.2 and 0.4 on several runs.)
>> > >
>> > > Try:
>> > >
>> > > x1
>> > > x2
>> > > x3
>> > > y
>> > > model
>> > > summary(model)
>> > >
>> > > # Multiple R-squared: 0.04269
>> > >
>> > > --
>> > > David.
>> > >
>> > > On Aug 3, 2010, at 9:10 AM, Michael Haenlein wrote:
>> > >
>> > > > Dear all,
>> > > >
>> > > > I have one dependent variable y and two independent variables x1 and
>> > > > x2 which I would like to use to explain y. x1 and x2 are design
>> > > > factors in an experiment and are not correlated with each other. For
>> > > > example, assume that:
>> > > >
>> > > > x1 <- rep(1:3, each = 3)
>> > > > x2 <- rep(1:3, 3)
>> > > > cor(x1, x2)
>> > > >
>> > > > The problem is that I want to analyze not only the effect of x1 and
>> > > > x2 on y but also that of their interaction x1*x2. Evidently this
>> > > > interaction term has a substantial correlation with both x1 and x2:
>> > > >
>> > > > x3 <- x1 * x2
>> > > > cor(x1, x3)
>> > > > cor(x2, x3)
>> > > >
>> > > > I therefore expect that a simple regression of y on x1, x2 and x1*x2
>> > > > will lead to biased results due to multicollinearity. For example,
>> > > > even when y is completely random and unrelated to x1 and x2, I obtain
>> > > > a substantial R2 for a simple linear model which includes all three
>> > > > variables. This evidently does not make sense:
>> > > >
>> > > > y <- rnorm(9)
>> > > > model <- lm(y ~ x1 + x2 + x3)
>> > > > summary(model)
>> > > >
>> > > > Is there some function within R, or in some separate library, that
>> > > > allows me to estimate such a regression without obtaining
>> > > > inconsistent results?
>> > > >
>> > > > Thanks for your help in advance,
>> > > >
>> > > > Michael
>> > > >
>> > > >
>> > > > Michael Haenlein
>> > > > Associate Professor of Marketing
>> > > > ESCP Europe
>> > > > Paris, France
>> > >
>> > > David Winsemius, MD
>> > > West Hartford, CT
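Michael's puzzle about the "substantial R2" from pure noise is what David's sample-size point explains, and it is easy to quantify. A small simulation sketch (illustrative, not code from the thread): with 9 observations and 3 predictor terms, the expected R^2 under pure noise is 3/(9 - 1) = 0.375, so values in the 0.2-0.4 range are just what chance produces.

set.seed(42)
x1 <- rep(1:3, each = 3)   # the same 3 x 3 design discussed in the thread
x2 <- rep(1:3, 3)
r2 <- replicate(1000, summary(lm(rnorm(9) ~ x1 * x2))$r.squared)
mean(r2)                         # about 3/8 = 0.375, the null expectation with 3 terms and n = 9
quantile(r2, c(0.1, 0.5, 0.9))   # R^2 between 0.2 and 0.4, and well beyond, is routine for noise

With more observations the null expectation p/(n - 1) shrinks toward zero, so a sizable R^2 from so few points says very little by itself.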