Thank you both for your advice. I'll follow up on it, but it is good to know that this is a known effect.
Claus

On Wed, Mar 31, 2010 at 3:02 PM, Stephan Kolassa <stephan.kola...@gmx.de> wrote:
> Hi Claus,
>
> welcome to the wonderful world of collinearity (or multicollinearity, as
> some call it)! You have a near linear relationship between some of your
> predictors, which can (and in your case does) lead to extreme parameter
> estimates, which in some cases almost cancel out (a coefficient of +/-40 on
> a categorical variable in logistic regression is a lot, and the intercept
> and two of the roman parameter estimates almost cancel out) but which are
> rather unstable (hence your high p-values).
>
> Belsley, Kuh and Welsch did some work on condition indices and variance
> decomposition proportions, and variance inflation factors are quite popular
> for diagnosing multicollinearity - google these terms for a bit, and
> enlightenment will surely follow.
>
> What can you do? You should definitely think long and hard about your data.
> Should you be doing separate regressions for some factor levels? Should you
> drop a factor from the analysis? Should you do a categorical analogue of
> Principal Components Analysis on your data before the regression? I
> personally have never done this, but correspondence analysis has been
> recommended as a "discrete alternative" to PCA on this list, see a couple of
> books by M. J. Greenacre.
>
> Best of luck!
> Stephan
>
>
> claus orourke wrote:
>>
>> Dear list,
>>
>> I want to perform a logistic regression analysis with multiple
>> categorical predictors (i.e., a logit) on some data where there is a
>> very definite relationship between one predictor and the
>> response (dependent) variable. The problem I have is that in such a
>> case the p value goes very high (while I as a naive newbie would
>> expect it to crash towards 0).
>>
>> I'll illustrate my problem with some toy data.
>> Say I have the following data as an input frame:
>>
>>    roman animal colour
>> 1  alpha    dog  black
>> 2   beta    cat  white
>> 3  alpha    dog  black
>> 4  alpha    cat  black
>> 5   beta    dog  white
>> 6  alpha    cat  black
>> 7  gamma    dog  white
>> 8  alpha    cat  black
>> 9  gamma    dog  white
>> 10  beta    cat  white
>> 11 alpha    dog  black
>> 12 alpha    cat  black
>> 13 gamma    dog  white
>> 14 alpha    cat  black
>> 15  beta    dog  white
>> 16  beta    cat  black
>> 17 alpha    cat  black
>> 18  beta    dog  white
>>
>> In this toy data you can see that roman:alpha and roman:beta are
>> pretty good predictors of colour
>>
>> Let's say I perform logistic analysis directly on the raw data with
>> colour as a response variable:
>>
>> > options(contrasts=c("contr.treatment","contr.poly"))
>> > anal1 <- glm(data$colour~data$roman+data$animal,family=binomial)
>>
>> then I find that my P values for each individual level coefficient
>> approach 1:
>>
>> Coefficients:
>>                 Estimate Std. Error z value Pr(>|z|)
>> (Intercept)       -41.65   19609.49  -0.002    0.998
>> data$romanbeta     42.35   19609.49   0.002    0.998
>> data$romangamma    43.74   31089.48   0.001    0.999
>> data$animaldog     20.48   13866.00   0.001    0.999
>>
>> while I expect the p value for roman:beta to be quite low because it
>> is a good predictor of colour:white
>>
>> On the other hand, if I then run an anova with a Chi-sq test on the
>> result model, I find as I would expect that 'roman' is a good
>> predictor of colour.
>>
>> > anova(anal1,test="Chisq")
>>
>> Analysis of Deviance Table
>>
>> Model: binomial, link: logit
>>
>> Response: data$colour
>>
>> Terms added sequentially (first to last)
>>
>>
>>             Df Deviance Resid. Df Resid. Dev P(>|Chi|)
>> NULL                           17    24.7306
>> data$roman   2  19.3239        15     5.4067 6.366e-05 ***
>> data$animal  1   1.5876        14     3.8191    0.2077
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Can anyone please explain why my p value is so high for the individual
>> levels?
>>
>> Sorry for what is likely a stupid question.
>>
>> Claus
>>
>> p.s., when I run logistic analysis on data that is more 'randomised'
>> everything comes out as I expect.
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
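
For anyone who wants to reproduce the numbers quoted in this thread, here is a minimal sketch in R. It rebuilds the toy data frame from the table in the original post, cross-tabulates roman against colour, and prints the per-coefficient Wald tests next to the sequential deviance table. The object names (d, tab, fit, aov_tab) and the use of a data= argument are my own choices, not the original poster's. Exact estimates and standard errors may vary a little between R versions, so treat the output as illustrative only.

```r
# Toy data copied from the table in the original post (18 rows).
d <- data.frame(
  roman  = c("alpha", "beta", "alpha", "alpha", "beta", "alpha",
             "gamma", "alpha", "gamma", "beta", "alpha", "alpha",
             "gamma", "alpha", "beta", "beta", "alpha", "beta"),
  animal = c("dog", "cat", "dog", "cat", "dog", "cat",
             "dog", "cat", "dog", "cat", "dog", "cat",
             "dog", "cat", "dog", "cat", "cat", "dog"),
  colour = c("black", "white", "black", "black", "white", "black",
             "white", "black", "white", "white", "black", "black",
             "white", "black", "white", "black", "black", "white"),
  stringsAsFactors = TRUE
)

# Counts of colour within each level of roman.
tab <- table(d$roman, d$colour)
print(tab)

# Same model as in the post, written with a data= argument.
# glm() may warn about fitted probabilities of 0 or 1 here;
# the warnings are suppressed only to keep the output short.
fit <- suppressWarnings(
  glm(colour ~ roman + animal, family = binomial, data = d)
)

# Wald z tests for the individual coefficients (the large p-values).
print(summary(fit)$coefficients)

# Sequential analysis of deviance (the small p-value for roman).
aov_tab <- anova(fit, test = "Chisq")
print(aov_tab)
```

In the cross-tabulation every alpha row is black and every gamma row is white, which may be worth keeping in mind when deciding which factor levels to look at first while following up on Stephan's suggestions.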