On Sun, 14 Oct 2012, Eiko Fried wrote:

Thank you for the detailed answer, that was really helpful. I did some extensive reading and calculating in the hours since your reply, and have a few (hopefully much better informed) follow-up questions.

1) In the vignette("countreg", package = "pscl"), LLH, AIC, and BIC values are listed for the models negative binomial (NB), zero-inflated (ZI), ZI NB, hurdle NB, and Poisson (standard). Although I found a way to determine the LLH and DF for all models, the AIC and BIC values are not displayed by default, nor by the code given in the vignette. How do I calculate them? (AIC is given by default for only 2 of the models, BIC for none.)

The code underlying the vignette can always be inspected:
edit(vignette("countreg", package = "pscl"))
The JSS also hosts a cleaned-up version of the replication code.

Most likelihood-based models provide a logLik() method which extracts the fitted log-likelihood along with the corresponding degrees of freedom. Then the default AIC() method can compute the AIC and AIC(..., k = log(n)) computes the corresponding BIC. This is hinted at in Table 3 of the vignette. If "n", the number of observations, can be extracted by nobs(), then also BIC() works. This is not yet the case for zeroinfl/hurdle, though.
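To make this concrete, here is a minimal sketch of that recipe using a toy Poisson glm() (since the zeroinfl()/hurdle() fits from the vignette are not reproduced here); for zeroinfl/hurdle objects, where nobs() is not yet available, you would supply n yourself, e.g. as the number of residuals:

```r
## Toy data and Poisson fit; any model with a logLik() method works the same way
set.seed(1)
d <- data.frame(y = rpois(100, 2), x = rnorm(100))
fm <- glm(y ~ x, family = poisson, data = d)

ll  <- logLik(fm)            # log-likelihood, with "df" attribute
aic <- AIC(fm)               # default penalty k = 2: -2*logLik + 2*df
n   <- nobs(fm)              # number of observations
bic <- AIC(fm, k = log(n))   # BIC is AIC with penalty k = log(n)

all.equal(aic, -2 * as.numeric(ll) + 2 * attr(ll, "df"))
all.equal(bic, BIC(fm))      # BIC() works here because nobs() is available
```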

2) For the zero-inflated models, the first block of count model coefficients is only in the output in order to compare the changes, correct? That is, in my results in a paper I would only report the second block of (zero-inflation) model coefficients? Or do I misunderstand something?

No, no, yes. Hurdle and zero-inflation models have two linear predictors, one for the zero hurdle/inflation and one for the truncated/un-inflated count component. Both are typically of interest for different aspects of the data.

I am confused because in their large table comparing coefficients, they primarily compare the first block of coefficients (p. 18):
R> fm <- list("ML-Pois" = fm_pois, "Quasi-Pois" = fm_qpois, "NB" = fm_nbin,
+   "Hurdle-NB" = fm_hurdle, "ZINB" = fm_zinb)
R> sapply(fm, function(x) coef(x)[1:8])

All models considered have a predictor for the mean of the count component, hence this can be compared across all models.


3) 
      There are various formal tests for this, e.g., dispersiontest()
      in package "AER".


I have to run 9 models - I am testing the influence of several predictors on different individual symptoms of a mental disorder, "counted" over the last week (0 = never in the last week, to 3 = on all days within the last week). So I am regressing the same predictors onto 9 different symptoms in 9 models.

That's not really a count response. An ordered response model (see, e.g., polr() in MASS or package "ordinal") would probably be more appropriate.
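A minimal sketch of what that would look like, on simulated data with the same 0-3 scoring (your variable and predictor names will differ): treat the score as an ordered factor and fit a proportional-odds model with polr().

```r
library(MASS)

## Simulated stand-in for one symptom scored 0-3, plus one predictor
set.seed(1)
d <- data.frame(
  sym = factor(sample(0:3, 200, replace = TRUE,
                      prob = c(0.5, 0.25, 0.15, 0.1)),
               levels = 0:3, ordered = TRUE),
  x = rnorm(200)
)

## Proportional-odds logistic regression; Hess = TRUE so summary() gives SEs
fm_ord <- polr(sym ~ x, data = d, Hess = TRUE)
summary(fm_ord)   # slope for x plus the 0|1, 1|2, 2|3 cutpoints
```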

dispersiontest() for these 9 models yields 3-4 overdispersed models (depending on whether one tests one- or two-sided at the p = .05 level), 2 underdispersed models, and 4 non-significant models. By far the largest dispersion value is 2.1, in a model that is not overdispersed according to the test - but that is the symptom with 80% zeros, so maybe that plays a role here.

Does NB also make sense for underdispersed models?

The variance function of the NB2 model is mu + 1/theta * mu^2. Hence, for finite theta, the variance is always larger than the mean - the NB model can only capture overdispersion, never underdispersion.
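A quick numeric check of this (my own toy simulation, not from the thread): in R's rnbinom(), the size argument plays the role of theta, so the sample variance of NB2 draws should land near mu + mu^2/theta and always above the mean.

```r
## Simulate NB2 draws and compare sample mean/variance to the
## theoretical variance function mu + mu^2/theta
set.seed(1)
mu <- 2; theta <- 1.5
y <- rnbinom(1e5, mu = mu, size = theta)   # size = theta in R's NB2 parameterization
c(mean = mean(y), var = var(y), theory = mu + mu^2 / theta)
```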

      However, overdispersion can already matter before this is
      detected by a significance test. Hence, if in doubt, I would
      simply use an NB model and you're on the safe side. And if the
      NB's estimated theta parameter turns out to be extremely large
      (say beyond 20 or 30), then you can still switch back to Poisson
      if you want.

Out of the 9 models, the 3 overdispersed NB models "worked" with theta
estimates between 0.4 and 8.6. The results look just fine, so I guess NB is
appropriate here. 

4 other models (including the 2 underdispersed ones) gave warnings (theta iteration / alternation limit reached), and although the other values look ok (estimates, LLH, AIC), the theta estimates are between 1,000 and 11,000. Could you recommend what to do here?

As I said before: A theta around 20 or 30 is already so large that it is virtually identical to a Poisson fit (theta = Inf). These values clearly indicate that theta is not finite.
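This degenerate behaviour is easy to provoke (a toy illustration, not your data): fitting glm.nb() from MASS to equidispersed Poisson data typically triggers the same "iteration limit reached" warning, drives theta to a huge value, and yields a fit that is practically identical to the Poisson one.

```r
library(MASS)

## Equidispersed data: the NB fit has nothing to gain over Poisson
set.seed(1)
d <- data.frame(y = rpois(500, 2), x = rnorm(500))
fm_p  <- glm(y ~ x, family = poisson, data = d)
fm_nb <- glm.nb(y ~ x, data = d)   # typically warns: iteration limit reached

fm_nb$theta                        # huge theta, i.e. effectively theta = Inf
all.equal(coef(fm_p), coef(fm_nb), tolerance = 0.01)
```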

However, this almost certainly stems from your response which is not really count data. As I said above: An ordered response model will probably work better here. If you have mainly variation between 0 vs. larger but not much among the levels 1/2/3, a binary response (0 vs. larger) may be best.

hth,
Z


Thanks
T

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
