On Sun, 14 Oct 2012, Eiko Fried wrote:
Thank you for the detailed answer, that was really helpful. I did some
extensive reading and calculating in the hours since your reply, and
have a few (hopefully much more informed) follow-up questions.
1) In vignette("countreg", package = "pscl"), log-likelihood, AIC, and
BIC values are listed for the negative binomial (NB), zero-inflated
(ZI), ZI NB, hurdle NB, and standard Poisson models. Although I found a
way to determine the log-likelihood and degrees of freedom for all
models, the AIC and BIC values are not displayed by default, nor when
using the code given in the vignette. How do I calculate these? (AIC is
given by default for only 2 of the models, BIC for none.)
The code underlying the vignette can always be inspected:
edit(vignette("countreg", package = "pscl"))
The JSS also hosts a cleaned-up version of the replication code.
Most likelihood-based models provide a logLik() method which extracts the
fitted log-likelihood along with the corresponding degrees of freedom.
Then the default AIC() method can compute the AIC and AIC(..., k = log(n))
computes the corresponding BIC. This is hinted at in Table 3 of the
vignette. If "n", the number of observations, can be extracted by nobs(),
then BIC() also works. This is not yet the case for zeroinfl/hurdle
objects, though.
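A minimal sketch of the computation, assuming the hurdle NB fit fm_hurdle
from the vignette (n is taken from residuals() because nobs() is not
available for these objects):

ll <- logLik(fm_hurdle)            # log-likelihood with "df" attribute
n <- length(residuals(fm_hurdle))  # number of observations

AIC(fm_hurdle)              # AIC via the default method
AIC(fm_hurdle, k = log(n))  # BIC: same formula with penalty log(n)

## equivalently, directly from the logLik object
-2 * as.numeric(ll) + log(n) * attr(ll, "df")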
2) For the zero-inflated models, the first block of count-model
coefficients is only included in the output for comparison, correct?
That is, in a paper I would report only the second block of
(zero-inflation) model coefficients? Or do I misunderstand something?
No, no, yes. Hurdle and zero-inflation models have two linear predictors,
one for the zero hurdle/inflation and one for the truncated/un-inflated
count component. Both are typically of interest for different aspects of
the data.
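As an illustration, assuming the ZINB fit fm_zinb from the vignette, both
coefficient blocks can be extracted (and reported) separately:

coef(fm_zinb, model = "count")  # count component (NB mean)
coef(fm_zinb, model = "zero")   # zero-inflation component (binary part)
summary(fm_zinb)                # shows both blocks with standard errors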
I am confused because, in the large table comparing coefficients, the
authors primarily compare the first block of coefficients (p. 18):
R> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
+ "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb)
R> sapply(fm, function(x) coef(x)[1:8])
All models considered have a predictor for the mean of the count
component, hence this can be compared across all models.
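For example (model names as in the vignette), the count blocks can be
extracted explicitly and compared side by side; coef(x)[1:8] in the code
above simply picks out this first block for the two-part models:

coef(fm_nbin)                     # NB: count-mean coefficients
coef(fm_hurdle, model = "count")  # hurdle NB: truncated count component
coef(fm_zinb, model = "count")    # ZINB: un-inflated count component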
3)
There are various formal tests for this, e.g., dispersiontest()
in package "AER".
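For reference, a small usage sketch (assuming a Poisson glm fit called
fm_pois):

library("AER")
dispersiontest(fm_pois)                             # H1: overdispersion
dispersiontest(fm_pois, alternative = "two.sided")  # allows underdispersion too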
I have to run 9 models - I am testing the influence of several
predictors on different individual symptoms of a mental disorder, as
"counted" in the last week (0 = never in the last week, to 3 = on all
days within the last week). So I'm regressing the same predictors onto
9 different symptoms in 9 models.
That's not really a count response. I guess an ordered response model (see
e.g. polr() in MASS or package ordinal) would probably be more
appropriate.
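A minimal sketch of that alternative, with placeholder data/variable
names (d, symptom, x1, x2) rather than anything from your models:

library("MASS")
d$symptom <- ordered(d$symptom, levels = 0:3)  # treat 0-3 rating as ordered
fm_polr <- polr(symptom ~ x1 + x2, data = d, Hess = TRUE)
summary(fm_polr)  # proportional odds logistic regression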
dispersiontest() for these 9 models results in 3-4 overdispersed models
(depending on whether I test one- or two-sided at the p = .05 level), 2
underdispersed models, and 4 non-significant models. By far the largest
dispersion value is 2.1, in a model that is not overdispersed according to
the test; but that is the symptom with 80% zeros, so maybe that plays a
role here.
Does NB also make sense in underdispersed models?
The variance function of the NB2 model is mu + 1/theta * mu^2. Hence for
finite theta, the variance is always larger than the mean.
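A one-line check of that claim (the values are just for illustration):

mu <- 2; theta <- 0.5
mu + mu^2 / theta  # NB2 variance = 10, i.e., always > mu for finite theta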
However, overdispersion can already matter before this is
detected by a significance test. Hence, if in doubt, I would
simply use an NB model and you're on the safe side. And if the
NB's estimated theta parameter turns out to be extremely large
(say beyond 20 or 30), then you can still switch back to Poisson
if you want.
Out of the 9 models, the 3 overdispersed NB models "worked", with theta
estimates between 0.4 and 8.6. The results look just fine, so I guess NB is
appropriate here.
4 other models (including the 2 underdispersed ones) gave warnings (theta
iteration / alternation limit reached), and although the other values look
ok (estimates, log-likelihood, AIC), the theta estimates are between 1,000
and 11,000. Could you recommend what to do here?
As I said before: A theta around 20 or 30 is already so large that it is
virtually identical to a Poisson fit (theta = Inf). These values clearly
indicate that theta is not finite.
However, this almost certainly stems from your response which is not
really count data. As I said above: An ordered response model will
probably work better here. If you have mainly variation between 0 vs.
larger but not much among the levels 1/2/3, a binary response (0 vs.
larger) may be best.
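A corresponding sketch for the binary case, again with placeholder names
and assuming the raw 0-3 rating is stored in d$symptom:

d$sympt_any <- as.integer(d$symptom > 0)  # 0 vs. any symptom in the last week
fm_bin <- glm(sympt_any ~ x1 + x2, data = d, family = binomial)
summary(fm_bin)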
hth,
Z
Thanks
T