Dear R gurus I am analysing data from a study of behaviour and shade utilization of chimpanzees. I am using GLMs in R (version 2.13.0) to test whether shade/sun utilization is predicted by behaviour observed. I am thus interested in whether an interaction of behaviour (as a predictor) and presence in the sun/shade (also predictor) predicts the counts I have for the respective categories.
I have my data organised as such: behaviour location specific total Travel Sun 131 303 Travel Shade 172 303 Foraging Sun 248 651 Foraging Shade 403 651 Vigilance Sun 97 224 Vigilance Shade 127 224 Rest Sun 502 1143 Rest Shade 641 1143 Abnormal Sun 33 58 Abnormal Shade 25 58 Play Sun 58 173 Play Shade 115 173 SelfGrooming Sun 183 595 SelfGrooming Shade 412 595 SocialGrooming Sun 59 358 SocialGrooming Shade 299 358 Other Sun 4 39 Other Shade 35 39 Hidden Sun 120 656 Hidden Shade 536 656 I have coded the response variable as a specific count of the times individuals were in the sun or shade, for each behaviour, out of a total number of times a specific behaviour was observed (regardless of location [sun/shade]). These are represented by the columns 'specific' and 'total' respectively. I had originally coded these values as a proportion variable, but had similar mismatch problems between R and Statistica (as described below). The GLM I am running is a binomial one (as my count response variables are divided dichotomously by the sun/shade predictor variable) with a logit link function. My problem is this: I originally ran the data through another stats program (Statistica) and got significant effects for all first- and second-order effects. When I examined the raw data, the patterns seen in the raw data suggested that these outcomes (of the GLM) conformed to the raw data (i.e. confirmed the GLM results). I then ran the * same* data through R using the following code: > behdata<-read.csv("behaviourshade.csv",header=TRUE) > behdata #Just to check that everything is there and working... > behav<-behdata$behaviour > loc<-behdata$location > prop<-behdata$proportion > spec<-behdata$specific > total<-behdata$total > model<-glm((cbind(spec,total))~behav*loc,family=binomial,data=behdata) > summary(model) Call: glm(formula = (cbind(spec, total)) ~ behav * loc, family = binomial, data = behdata) Deviance Residuals: [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.8416 0.2393 -3.517 0.000436 *** behavForagingfeeding 0.3620 0.2475 1.463 0.143585 behavHidden 0.6395 0.2462 2.597 0.009396 ** behavOther 0.7334 0.3338 2.197 0.028044 * behavPlay 0.4332 0.2678 1.618 0.105739 behavRest 0.2632 0.2443 1.077 0.281320 behavSelfGrooming 0.4740 0.2477 1.914 0.055644 . behavSocialGrooming 0.6615 0.2518 2.627 0.008602 ** behavTravel 0.2753 0.2576 1.069 0.285142 behavVigilance 0.2741 0.2638 1.039 0.298732 locSun 0.2776 0.3237 0.858 0.391077 behavForagingfeeding:locSun -0.7631 0.3382 -2.257 0.024036 * behavHidden:locSun -1.7743 0.3436 -5.164 2.41e-07 *** behavOther:locSun -2.4467 0.6593 -3.711 0.000206 *** behavPlay:locSun -0.9621 0.3772 -2.551 0.010752 * behavRest:locSun -0.5221 0.3318 -1.573 0.115615 behavSelfGrooming:locSun -1.0892 0.3406 -3.197 0.001387 ** behavSocialGrooming:locSun -1.9005 0.3615 -5.258 1.46e-07 *** behavTravel:locSun -0.5499 0.3533 -1.556 0.119597 behavVigilance:locSun -0.5471 0.3632 -1.506 0.131952 --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 4.5553e+02 on 19 degrees of freedom Residual deviance: -1.3700e-13 on 0 degrees of freedom (19 observations deleted due to missingness) AIC: 165.12 Number of Fisher Scoring iterations: 3 When I ran it through R I got VERY different results. The R GLM suggested that behaviour and the behaviour*location interactions were significant predictors of the counts but the location (sun/shade) was not. In addition, within each factor, where differences lay were very different between tests. While I understand that it is entirely possible to have significant 2nd order interactions when 1st order effects may not be significant, these patterns described seem to defy apparently obvious patterns in the raw data. To further complicate things, I was reading through Crawley's R book where he describes the relationships between orthogonal and non-orthogonal studies and the order of factor entry; how order can complicate GLM type analyses for non-orthogonal studies. My data are orthogonal, yet the order in which I enter variables into the model influences which factors are significant and which are not. Why is this the case? Am I doing something wrong here with my data structure or coding? For example, if I switch the order from 'behav*loc' to 'loc*behav' I get yet another set of results that match neither the first R GLM results, nor the original results from the other program. I have checked the model for overdispersion and found that it is not overdispersed. What am I doing wrong here? How can the same dataset be generating such vastly different outcomes? I suspect that it may lie in the way in which the model is fitted (R does iteratively reweighted least squares whereas, Statistica may use something entirely different; what exactly, I have no clue...) but I am none the wiser in this regard, so I really don't know. Regards, in despiration... Luke *PhD Candidate* *School of Animal, Plant and Environmental Sciences* *University of the Witwatersrand* *Johannesburg, South Africa* [[alternative HTML version deleted]]
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.