Dear List, I posted this in R-mixed and did not receive any feedback. I might post it in the wrong place. I re-post in R-help and hope to receive any suggestions and\or thoughts regarding data analysis.
The objective of the study is to investigate effects of soil properties on insect outbreaks. There are four study fields (or sites). Data were collected from 1996 through 2009. Below is sampling number per site per year. > table(alldfv3$year, alldfv3$site) 1 2 3 4 1996 256 265 222 197 1997 239 272 249 236 1998 216 239 216 222 1999 236 246 216 232 2000 237 251 247 246 2001 236 248 0 0 2002 233 249 210 208 2003 261 300 265 256 2004 268 281 278 263 2005 265 282 273 263 2006 268 283 277 263 2007 274 285 278 0 2008 276 288 280 0 2009 268 293 279 0 As you can see, data were not collected in some study sites in some years and sampling numbers per site per year are also vary. Sampling was conducted every 8 feet in each study site. Investigator recorded GPS (long and lat) of each sampling point. Within each sampling point, soil insects were collected and counted and soil data were also collected from each sampling site only in 2007. Investigator assumes that soil properties (e.g. pH, Mg, Ca, lime, etc) are relatively consistent year to year. Thus, measurements of soil properties in 2007 were applied to other years. So, GPS reading for insects and soil properties are different. To investigate the effect of soils on insect, first I ran thin plate spline regression to fit surface to soil data. And then, I obtained predicted soil properties using insect GPS readings so that I have the same GPS readings between insect and soils data. Due to the corlinearity among predictors, I conducted principal component analysis and obtained first 4 principal components accounting over 85 % total variation. Using coefficient of 4 principal components, new 4 predictors were obtained as follows (i.e., PC1 ~ PC4); > head(df.chf09) chafer year longitude latitude paddock2 PC1 PC2 PC3 PC4 12152 0 2009 -87.85045 33.90220 11 1.1790748 -2.895178 -1.522787 -2.33674507 12153 0 2009 -87.85045 33.90214 11 0.7852655 -3.758568 -1.691573 -2.16018891 12154 0 2009 -87.85046 33.90206 11 0.5025919 -4.447035 -1.794957 -1.17294024 12155 0 2009 -87.85046 33.90199 11 0.5575338 -4.587492 -1.894354 -0.24759008 12156 0 2009 -87.85046 33.90192 11 0.7923091 -4.505882 -2.194825 0.04366179 12157 0 2009 -87.85046 33.90186 11 1.0542226 -4.448612 -2.605783 0.35938751 Below is summary of an insect species (chafer). year chf.sum chf.mean chf.max chf.min chf.n 0 1 2 3 4 5 6 7 8 13 1996 1100 1.17 13 0 940 498 161 122 68 34 22 18 4 12 1 1997 250 0.25 8 0 996 855 80 35 14 6 4 1 0 1 0 1998 57 0.06 4 0 893 854 28 6 3 2 0 0 0 0 0 1999 4 0.00 1 0 930 926 4 0 0 0 0 0 0 0 0 2000 0 0.00 0 0 981 981 0 0 0 0 0 0 0 0 0 2001 16 0.03 5 0 484 474 7 2 0 0 1 0 0 0 0 2002 20 0.02 3 0 900 886 10 2 2 0 0 0 0 0 0 2003 48 0.04 5 0 1082 1050 22 7 1 1 1 0 0 0 0 2004 92 0.08 4 0 1090 1019 55 12 3 1 0 0 0 0 0 2005 236 0.22 6 0 1083 931 104 28 10 6 2 2 0 0 0 2006 30 0.03 3 0 1091 1067 20 2 2 0 0 0 0 0 0 2007 16 0.02 2 0 837 822 14 1 0 0 0 0 0 0 0 2008 28 0.03 4 0 844 822 18 3 0 1 0 0 0 0 0 2009 93 0.11 4 0 840 764 62 12 1 1 0 0 0 0 0 , where first column: year of data collection second column: sum of insect collected in each year third column: mean (sum/obs) fourth/fifth column: maximum/minimum number of insects sixth column: total observation number seventh through end of columns: frequencies of each respective insect counts. As you can see, I have data with extreme number of zeros in each year and by different insects. Even in 2000, there is no insect count at all. So, I tried fitting 'zero inflated Poisson model' to the data using cozigam function as follows: > chf.res1 <- cozigam(chafer ~ s(longitude, latitude) + s(PC1) + s(PC2) + s(PC3) + s(PC4) + (year) + paddock2, family=poisson, data=df.chf09) iteration = 2 norm = 1.023552 iteration = 3 norm = 0.4258558 iteration = 4 norm = 0.003208923 iteration = 5 norm = 0.006238185 iteration = 6 norm = 0.006029638 iteration = 7 norm = 0.005037579 iteration = 8 norm = 0.004390153 iteration = 9 norm = 0.003970208 iteration = 10 norm = 0.003670730 iteration = 11 norm = 0.00343349 iteration = 12 norm = 0.003229298 iteration = 13 norm = 0.003043741 iteration = 14 norm = 0.002869686 iteration = 15 norm = 0.002703575 iteration = 16 norm = 0.002543601 iteration = 17 norm = 0.002388817 iteration = 18 norm = 0.002238705 iteration = 19 norm = 0.002092954 iteration = 20 norm = 0.001951361 Error in solve.default(t(X.c) %*% diag(w.c^2) %*% X.c + Lambda) : system is computationally singular: reciprocal condition number = 1.60481e-25 > chf.res1 <- cozigam(chafer ~ s(longitude, latitude) + s(PC1) + s(PC2) + s(PC3) + s(PC4), family=poisson, data=df.chf09) iteration = 2 norm = 1.019157 iteration = 3 norm = 0.3510161 iteration = 4 norm = 0.05217535 iteration = 5 norm = 0.04782636 iteration = 6 norm = 0.03668330 iteration = 7 norm = 0.02957568 iteration = 8 norm = 0.02715226 iteration = 9 norm = 0.0258863 iteration = 10 norm = 0.02498481 iteration = 11 norm = 0.02419356 iteration = 12 norm = 0.02343089 iteration = 13 norm = 0.02267359 iteration = 14 norm = 0.02191685 iteration = 15 norm = 0.02116107 iteration = 16 norm = 0.02040789 iteration = 17 norm = 0.01965911 iteration = 18 norm = 0.01891649 iteration = 19 norm = 0.01818169 iteration = 20 norm = 0.01745623 ========================================== estimated alpha = 0.05016903 ( NaN ) estimated delta = 0.1491459 ( NaN ) ========================================== Warning messages: 1: In sqrt(V.theta[1, 1]) : NaNs produced 2: In sqrt(V.theta[2, 2]) : NaNs produced > summary(chf.res1) Family: poisson Link function: log Formula: chafer ~ s(longitude, latitude) + s(PC1) + s(PC2) + s(PC3) + s(PC4) Parametric coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.70922 NA NA NA alpha 0.05017 NA NA NA delta1 0.14915 NA NA NA Approximate significance of smooth terms: edf Est.rank Chi.sq p-value s(longitude,latitude) 2.000 2 -0.158 1.000 s(PC1) 1.386 3 0.011 1.000 s(PC2) 1.000 1 -0.387 1.000 s(PC3) 1.737 4 1.153 0.886 s(PC4) 1.000 1 -0.191 1.000 Scale est. = 1 n = 840 This is beyond my scope of understanding and fixing problems. My questions are: 1. Am I using the right method to analyze the data in this particular case? 2. If so, how can I fix the error above? 3. If not, do you have any suggestion or thoughts to handle this kind of data sets? Besides these questions, if you share any other comments and/or suggestions, I would appreciate that. Please let me know if there is lack of information you may need before comments. Thank you very much in advance!! Steve Hong [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.