[R] Fwd: helps in data analysis

Steve Hong Thu, 13 Jan 2011 19:51:41 -0800

 Dear List,

I posted this in R-mixed and did not receive any feedback.  I might post it
in the wrong place.  I re-post in R-help and hope to receive any suggestions
and\or thoughts regarding data analysis.


The objective of the study is to investigate effects of soil properties on
insect outbreaks.  There are four study fields (or sites).  Data were
collected from 1996 through 2009.  Below is sampling number per site per
year.
> table(alldfv3$year, alldfv3$site)

         1   2   3   4
  1996 256 265 222 197
  1997 239 272 249 236
  1998 216 239 216 222
  1999 236 246 216 232
  2000 237 251 247 246
  2001 236 248   0   0
  2002 233 249 210 208
  2003 261 300 265 256
  2004 268 281 278 263
  2005 265 282 273 263
  2006 268 283 277 263
  2007 274 285 278   0
  2008 276 288 280   0
  2009 268 293 279   0
As you can see, data were not collected in some study sites in some years
and sampling numbers per site per year are also vary.  Sampling was
conducted every 8 feet in each study site.  Investigator recorded GPS (long
and lat) of each sampling point.  Within each sampling point, soil insects
were collected and counted and soil data were also collected from each
sampling site only in 2007.  Investigator assumes that soil properties (e.g.
pH, Mg, Ca, lime, etc) are relatively consistent year to year.  Thus,
measurements of soil properties in 2007 were applied to other years.  So,
GPS reading for insects and soil properties are different.  To investigate
the effect of soils on insect, first I ran thin plate spline regression to
fit surface to soil data.  And then, I obtained predicted soil properties
using insect GPS readings so that I have the same GPS readings between
insect and soils data.

Due to the corlinearity among predictors, I conducted principal component
analysis and obtained first 4 principal components accounting over 85 %
total variation.  Using coefficient of 4 principal components, new 4
predictors were obtained as follows (i.e., PC1 ~ PC4);

> head(df.chf09)
      chafer year longitude latitude paddock2       PC1       PC2
PC3         PC4
12152      0 2009 -87.85045 33.90220       11 1.1790748 -2.895178 -1.522787
-2.33674507
12153      0 2009 -87.85045 33.90214       11 0.7852655 -3.758568 -1.691573
-2.16018891
12154      0 2009 -87.85046 33.90206       11 0.5025919 -4.447035 -1.794957
-1.17294024
12155      0 2009 -87.85046 33.90199       11 0.5575338 -4.587492 -1.894354
-0.24759008
12156      0 2009 -87.85046 33.90192       11 0.7923091 -4.505882 -2.194825
0.04366179
12157      0 2009 -87.85046 33.90186       11 1.0542226 -4.448612 -2.605783
0.35938751

Below is summary of an insect species (chafer).

year     chf.sum chf.mean chf.max chf.min chf.n    0   1   2  3  4  5  6 7
8 13
1996    1100     1.17      13       0   940  498 161 122 68 34 22 18 4 12  1
1997     250     0.25       8       0   996  855  80  35 14  6  4  1 0  1  0
1998      57     0.06       4       0   893  854  28   6  3  2  0  0 0  0  0
1999       4     0.00       1       0   930  926   4   0  0  0  0  0 0  0  0
2000       0     0.00       0       0   981  981   0   0  0  0  0  0 0  0  0
2001      16     0.03       5       0   484  474   7   2  0  0  1  0 0  0  0
2002      20     0.02       3       0   900  886  10   2  2  0  0  0 0  0  0
2003      48     0.04       5       0  1082 1050  22   7  1  1  1  0 0  0  0
2004      92     0.08       4       0  1090 1019  55  12  3  1  0  0 0  0  0
2005     236     0.22       6       0  1083  931 104  28 10  6  2  2 0  0  0
2006      30     0.03       3       0  1091 1067  20   2  2  0  0  0 0  0  0
2007      16     0.02       2       0   837  822  14   1  0  0  0  0 0  0  0
2008      28     0.03       4       0   844  822  18   3  0  1  0  0 0  0  0
2009      93     0.11       4       0   840  764  62  12  1  1  0  0 0  0  0

, where
first column: year of data collection
second column: sum of insect collected in each year
third column: mean (sum/obs)
fourth/fifth column: maximum/minimum number of insects
sixth column: total observation number
seventh through end of columns: frequencies of each respective insect
counts.

As you can see, I  have data with extreme number of zeros in each year and
by different insects.  Even in 2000, there is no insect count at all.  So, I
tried fitting 'zero inflated Poisson model' to the data using cozigam
function as follows:

> chf.res1 <- cozigam(chafer ~ s(longitude, latitude) + s(PC1) + s(PC2) +
s(PC3) + s(PC4) + (year) + paddock2, family=poisson, data=df.chf09)
iteration = 2    norm = 1.023552
iteration = 3    norm = 0.4258558
iteration = 4    norm = 0.003208923
iteration = 5    norm = 0.006238185
iteration = 6    norm = 0.006029638
iteration = 7    norm = 0.005037579
iteration = 8    norm = 0.004390153
iteration = 9    norm = 0.003970208
iteration = 10   norm = 0.003670730
iteration = 11   norm = 0.00343349
iteration = 12   norm = 0.003229298
iteration = 13   norm = 0.003043741
iteration = 14   norm = 0.002869686
iteration = 15   norm = 0.002703575
iteration = 16   norm = 0.002543601
iteration = 17   norm = 0.002388817
iteration = 18   norm = 0.002238705
iteration = 19   norm = 0.002092954
iteration = 20   norm = 0.001951361
Error in solve.default(t(X.c) %*% diag(w.c^2) %*% X.c + Lambda) :
  system is computationally singular: reciprocal condition number =
1.60481e-25

> chf.res1 <- cozigam(chafer ~ s(longitude, latitude) + s(PC1) + s(PC2) +
s(PC3) + s(PC4), family=poisson, data=df.chf09)
iteration = 2    norm = 1.019157
iteration = 3    norm = 0.3510161
iteration = 4    norm = 0.05217535
iteration = 5    norm = 0.04782636
iteration = 6    norm = 0.03668330
iteration = 7    norm = 0.02957568
iteration = 8    norm = 0.02715226
iteration = 9    norm = 0.0258863
iteration = 10   norm = 0.02498481
iteration = 11   norm = 0.02419356
iteration = 12   norm = 0.02343089
iteration = 13   norm = 0.02267359
iteration = 14   norm = 0.02191685
iteration = 15   norm = 0.02116107
iteration = 16   norm = 0.02040789
iteration = 17   norm = 0.01965911
iteration = 18   norm = 0.01891649
iteration = 19   norm = 0.01818169
iteration = 20   norm = 0.01745623
==========================================
estimated alpha = 0.05016903 ( NaN )
estimated delta = 0.1491459 ( NaN )
==========================================

Warning messages:
1: In sqrt(V.theta[1, 1]) : NaNs produced
2: In sqrt(V.theta[2, 2]) : NaNs produced
> summary(chf.res1)
Family: poisson
Link function: log
Formula:
chafer ~ s(longitude, latitude) + s(PC1) + s(PC2) + s(PC3) +
    s(PC4)
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.70922         NA      NA       NA
alpha        0.05017         NA      NA       NA
delta1       0.14915         NA      NA       NA
Approximate significance of smooth terms:
                        edf Est.rank Chi.sq p-value
s(longitude,latitude) 2.000        2 -0.158   1.000
s(PC1)                1.386        3  0.011   1.000
s(PC2)                1.000        1 -0.387   1.000
s(PC3)                1.737        4  1.153   0.886
s(PC4)                1.000        1 -0.191   1.000
Scale est. = 1         n = 840
This is beyond my scope of understanding and fixing problems.
My questions are:
1. Am I using the right method to analyze the data in this particular case?
2. If so, how can I fix the error above?
3. If not, do you have any suggestion or thoughts to handle this kind of
data sets?
Besides these questions, if you share any other comments and/or suggestions,
I would appreciate that.

Please let me know if there is lack of information you may need before
comments.

Thank you very much in advance!!

Steve Hong

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Fwd: helps in data analysis

Reply via email to