Re: [R] High concurvity/ collinearity between time and temperature in GAM predicting deaths but low ACF. Does this matter?

Ebert,Timothy Aaron Sun, 05 Jun 2022 17:55:08 -0700

You are welcome to ask here. However, you should try contacting the authors of 
the gam package. Package authors are often extraordinarily helpful.
Tim

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of jade.shodan--- via 
R-help
Sent: Sunday, June 5, 2022 3:02 PM
To: r-help@r-project.org
Subject: [R] High concurvity/ collinearity between time and temperature in GAM 
predicting deaths but low ACF. Does this matter?

[External Email]

Hello everyone,

A few days ago I asked a question about concurvity in a GAM (the anologue of 
collinearity in a GLM) implemented in mgcv. I think my question was a bit 
unfocussed, so I am retrying again, but with additional information included 
about the autocorrelation function. I have also posted about this on Cross 
Validated. Given all the model output, it might make for easier 
reading:https://urldefense.proofpoint.com/v2/url?u=https-3A__stats.stackexchange.com_questions_577790_high-2Dconcurvity-2Dcollinearity-2Dbetween-2Dtime-2Dand-2Dtemperature-2Din-2Dgam-2Dpredicting-2Ddea&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=UnxUj1p0u7yNcAwVR0Na6FmRBibHxhuhscEcpFA2qRQ&e=

As mentioned previously, I have problems with concurvity in my thesis research, 
and don't have access to a statistician who works with time series, GAMs or R. 
I'd be very grateful for any (partial) answer, however short. I'll gladly 
return the favour where I can! For really helpful input I'd be more than happy 
to offer co-authorship on publication. Deadlines are very close, and I'm 
heading towards having no results at all if I can't solve this concurvity issue 
:(

I'm using GAMs to try to understand the relationship between deaths and 
heat-related variables (e.g. temperature and humidity), using daily time series 
over a 14-year period from a tropical, low-income country. My aim is to 
understand the relationship between these variables and deaths, rather than 
pure prediction performance.

The GAMs include distributed lag models (set up as 7-column matrices, see code 
at bottom of post), since deaths may occur over several days following exposure.

Simple GAMs with just time, lagged temperature and lagged precipitation (a 
potential confounder) show very high concurvity between lagged temperature and 
time, regardless of the many different ways I have tried to decompose time. The 
autocorrelation functions
(ACF) however, shows values close to zero, only just about breaching the 
'significance line' in a few instances. It does show patterning though, 
although the regularity is difficult to define.

My questions are:
1) Should I be worried about the high concurvity, or can I ignore it given the 
mostly non-significant ACF? I've read dozens of heat-mortality modelling 
studies and none report on concurvity between weather variables and time 
(though one 2012 paper discussed autocorrelation).

2) If I cannot ignore it, what should I do to resolve it? Would including an 
autoregressive term be appropriate, and if so, where can I find a coded example 
of how to do this? I've also come across sequential regression][1]. Would this 
be more or less appropriate? If appropriate, a pointer to an example would be 
really appreciated!

Some example GAMs are specified as follows:
```r
conc38b <- gam(deaths~te(year, month, week, weekday,
bs=c("cr","cc","cc","cc")) + heap +
                      te(temp_max, lag, k=c(10, 3)) +
                      te(precip_daily_total, lag, k=c(10, 3)),
                      data = dat, family = nb, method = 'REML', select = TRUE,
                      knots = list(month = c(0.5, 12.5), week = c(0.5, 52.5), 
weekday = c(0, 6.5))) ``` Concurvity for the above model between (temp_max, 
lag) and (year, month, week, weekday) is 0.91:

```r
$worst
                                    para te(year,month,week,weekday)
te(temp_max,lag) te(precip_daily_total,lag)
para                        1.000000e+00                1.125625e-29
     0.3150073                  0.6666348
te(year,month,week,weekday) 1.400648e-29                1.000000e+00
     0.9060552                  0.6652313
te(temp_max,lag)            3.152795e-01                8.998113e-01
     1.0000000                  0.5781015
te(precip_daily_total,lag)  6.666348e-01                6.652313e-01
     0.5805159                  1.0000000
```

Output from ```gam.check()```:
```r
Method: REML   Optimizer: outer newton
full convergence after 16 iterations.
Gradient range [-0.01467332,0.003096643] (score 8915.994 & scale 1).
Hessian positive definite, eigenvalue range [5.048053e-05,26.50175].
Model rank =  544 / 544

Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that 
k is too low, especially if edf is close to k'.

                                  k'      edf k-index p-value
te(year,month,week,weekday) 319.0000  26.6531    0.96    0.06 .
te(temp_max,lag)             29.0000   3.3681      NA      NA
te(precip_daily_total,lag)   27.0000   0.0051      NA      NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ```

Some output from ```summary(conc38b)```:
```r
Approximate significance of smooth terms:
                                  edf Ref.df  Chi.sq p-value
te(year,month,week,weekday) 26.653127    319 166.803 < 2e-16 ***
te(temp_max,lag)             3.368129     27  11.130 0.00145 **
te(precip_daily_total,lag)   0.005104     27   0.002 0.69636
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.839   Deviance explained = 53.3%
-REML =   8916  Scale est. = 1         n = 5107
```

Below are the ACF plots (note limit y-axis = 0.1 for clarity of pattern). They 
show peaks at 5 and 15, and then there seems to be a recurring pattern at 
multiples of approx. 30 (suggesting month is not modelled adequately?). Not 
sure what would cause the spikes at 5 and 15. There is heaping of deaths on the 
15th day of each month, to which deaths with unknown date were allocated. This 
heaping was modelled with categorical variable/ factor ```heap``` with 169 
levels (0 for all non-heaping days and 1-168 (i.e. 14 * 12 for each subsequent 
heaping day over the 14-year period):

  [2]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_FzKyM.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=rhd6ZkNNDyYd1zntgAjnNzZFkYPFica9xzx9ruBHG9g&e=
  [3]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_fE3aL.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=DUqm7oXz2zc3oaDR6ESbWGKZdinIsZf-ULGgDsyIOfM&e=

I get an identical looking ACF when I decompose time into (year, month, 
monthday) as in model conc39 below, although concurvity between (temp_max, lag) 
and the time term has now dropped somewhat to 0.83:

```r
conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap +
                     te(temp_max, lag, k=c(10, 4)) +
                     te(precip_daily_total, lag, k=c(10, 4)),
                     data = dat, family = nb, method = 'REML', select = TRUE,
                     knots = list(month = c(0.5, 12.5))) ``` ```r

Method: REML   Optimizer: outer newton
full convergence after 14 iterations.
Gradient range [-0.001578187,6.155096e-05] (score 8915.763 & scale 1).
Hessian positive definite, eigenvalue range [1.894391e-05,26.99215].
Model rank =  323 / 323

Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that 
k is too low, especially if edf is close to k'.

                                k'     edf k-index p-value
te(year,month,monthday)    79.0000 25.0437    0.93  <2e-16 ***
te(temp_max,lag)           39.0000  4.0875      NA      NA
te(precip_daily_total,lag) 36.0000  0.0107      NA      NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 ``` Some output 
from ```summary(conc39)```:
```r
Approximate significance of smooth terms:
                                edf Ref.df  Chi.sq  p-value
te(year,month,monthday)    38.75573     99 187.231  < 2e-16 ***
te(temp_max,lag)            4.06437     37  25.927 1.66e-06 ***
te(precip_daily_total,lag)  0.01173     36   0.008    0.557
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.839   Deviance explained = 53.8%
-REML =   8915  Scale est. = 1         n = 5107
```

```r
$worst
                                   para te(year,month,monthday)
te(temp_max,lag) te(precip_daily_total,lag)
para                       1.000000e+00            3.261007e-31
0.3313549                  0.6666532
te(year,month,monthday)    3.060763e-31            1.000000e+00
0.8266086                  0.5670777
te(temp_max,lag)           3.331014e-01            8.225942e-01
1.0000000                  0.5840875
te(precip_daily_total,lag) 6.666532e-01            5.670777e-01
0.5939380                  1.0000000
```

Modelling time as ```te(year, doy)``` with a cyclic spline for doy and various 
choices for k or as ```s(time)``` with various k does not reduce concurvity 
either.

The default approach in time series studies of heat-mortality is to model time 
with fixed df, generally between 7-10 df per year of data.
I am, however, apprehensive about this approach because a) mortality profiles 
vary with locality due to sociodemographic and environmental characteristics 
and b) the choice of df is based on higher income countries (where nearly all 
these studies have been done) with different mortality profiles and so may not 
be appropriate for tropical, low-income countries.

Although the approach of fixing (high) df does remove more temporal patterns 
from the ACF (see model and output below), concurvity between time and lagged 
temperature has now risen to 0.99! Moreover, temperature (which has been a 
consistent, highly significant predictor in every model of the tens (hundreds?) 
I have run, has now turned non-significant. I am guessing this is because time 
is now a very wiggly function that not only models/ removes seasonal variation, 
but also some of the day-to-day variation that is needed for the temperature 
smooth  :

```r
conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap +
                      te(temp_max, lag, k=c(10,3)) +
                      te(precip_daily_total, lag, k=c(10,3)),
                      data = dat, family = nb, method = 'REML', select = TRUE) 
``` Output from ```gam.check(conc20a, rep = 1000)```:

```r
Method: REML   Optimizer: outer newton
full convergence after 9 iterations.
Gradient range [-0.0008983099,9.546022e-05] (score 8750.13 & scale 1).
Hessian positive definite, eigenvalue range [0.0001420112,15.40832].
Model rank =  336 / 336

Basis dimension (k) checking results. Low p-value (k-index<1) may indicate that 
k is too low, especially if edf is close to k'.

                                 k'      edf k-index p-value
s(time)                    111.0000 111.0000    0.98    0.56
te(temp_max,lag)            29.0000   0.6548      NA      NA
te(precip_daily_total,lag)  27.0000   0.0046      NA      NA
```
Output from ```concurvity(conc20a, full=FALSE)$worst```:

```r
                                   para      s(time) te(temp_max,lag)
te(precip_daily_total,lag)
para                       1.000000e+00 2.462064e-19        0.3165236
                0.6666348
s(time)                    2.462398e-19 1.000000e+00        0.9930674
                0.6879284
te(temp_max,lag)           3.170844e-01 9.356384e-01        1.0000000
                0.5788711
te(precip_daily_total,lag) 6.666348e-01 6.879284e-01        0.5788381
                1.0000000

```

Some output from ```summary(conc20a)```:
```r
Approximate significance of smooth terms:
                                 edf Ref.df  Chi.sq p-value
s(time)                    1.110e+02    111 419.375  <2e-16 ***
te(temp_max,lag)           6.548e-01     27   0.895   0.249
te(precip_daily_total,lag) 4.598e-03     27   0.002   0.868
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.843   Deviance explained = 56.1%
-REML = 8750.1  Scale est. = 1         n = 5107
```

ACF functions:

[4]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_7nbXS.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=hXhWH-VySi9i27hNDKK184WiLooYmhdni7_7JOLhRcI&e=
[5]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_pNnZU.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=HV6sMbzNkG-NdJZBjZRAjrvCDl2VtMPfl5rY2Ss8muM&e=

Data can be found on my [GitHub][6] site in the file 
[data_cross_validated_post2.rds][7]. A csv version is also available.
This is my code:

```r
library(readr)
library(mgcv)

df <- read_rds("data_crossvalidated_post2.rds")

# Create matrices for lagged weather variables (6 day lags) based on example by 
Simon Wood # in his 2017 book ("Generalized additive models: an introduction 
with R", p. 349) and # gamair package documentation 
(https://urldefense.proofpoint.com/v2/url?u=https-3A__cran.r-2Dproject.org_web_packages_gamair_gamair.pdf&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=RmLMXgEhg76PLgzNq7CbZkD29EGQili4Pkd7ESctrJ0&e=
 , p. 54)

lagard <- function(x,n.lag=7) {
n <- length(x); X <- matrix(NA,n,n.lag)
for (i in 1:n.lag) X[i:n,i] <- x[i:n-i+1] X }

dat <- list(lag=matrix(0:6,nrow(df),7,byrow=TRUE),
deaths=df$deaths_total,doy=df$doy, year = df$year, month = df$month, weekday = 
df$weekday, week = df$week, monthday = df$monthday, time = df$time, 
heap=df$heap, heap_bin = df$heap_bin, precip_hourly_dailysum = 
df$precip_hourly_dailysum) dat$temp_max <- lagard(df$temp_max) dat$temp_min <- 
lagard(df$temp_min) dat$temp_mean <- lagard(df$temp_mean) dat$wbgt_max <- 
lagard(df$wbgt_max) dat$wbgt_mean <- lagard(df$wbgt_mean) dat$wbgt_min <- 
lagard(df$wbgt_min) dat$temp_wb_nasa_max <- lagard(df$temp_wb_nasa_max) 
dat$sh_mean <- lagard(df$sh_mean) dat$solar_mean <- lagard(df$solar_mean) 
dat$wind2m_mean <- lagard(df$wind2m_mean) dat$sh_max <- lagard(df$sh_max) 
dat$solar_max <- lagard(df$solar_max) dat$wind2m_max <- lagard(df$wind2m_max) 
dat$temp_wb_nasa_mean <- lagard(df$temp_wb_nasa_mean) 
dat$precip_hourly_dailysum <- lagard(df$precip_hourly_dailysum) 
dat$precip_hourly <- lagard(df$precip_hourly) dat$precip_daily_total <- lagard( 
df$precip_daily_total) dat$temp <- lagard(df$temp) dat$sh <- lagard(df$sh) 
dat$rh <- lagard(df$rh) dat$solar <- lagard(df$solar) dat$wind2m <- 
lagard(df$wind2m)

conc38b <- gam(deaths~te(year, month, week, weekday,
bs=c("cr","cc","cc","cc")) + heap +
                      te(temp_max, lag, k=c(10, 3)) +
                      te(precip_daily_total, lag, k=c(10, 3)),
                      data = dat, family = nb, method = 'REML', select = TRUE,
                      knots = list(month = c(0.5, 12.5), week = c(0.5, 52.5), 
weekday = c(0, 6.5)))

conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap +
                     te(temp_max, lag, k=c(10, 4)) +
                     te(precip_daily_total, lag, k=c(10, 4)),
                     data = dat, family = nb, method = 'REML', select = TRUE,
                     knots = list(month = c(0.5, 12.5)))

conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap +
                      te(temp_max, lag, k=c(10,3)) +
                      te(precip_daily_total, lag, k=c(10,3)),
                      data = dat, family = nb, method = 'REML', select = TRUE)

```
Thank you if you've read this far!! :-))

  [1]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__scholar.google.co.uk_scholar-3Foutput-3Dinstlink-26q-3Dinfo-3APKdjq7ZwozEJ-3Ascholar.google.com_-26hl-3Den-26as-5Fsdt-3D0-2C5-26scillfp-3D17865929886710916120-26oi-3Dlle&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=05YTFJr01J0QkoraKsVufRrfVvevvADTCCCjxskRbfY&e=
  [2]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_FzKyM.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=rhd6ZkNNDyYd1zntgAjnNzZFkYPFica9xzx9ruBHG9g&e=
  [3]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_fE3aL.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=DUqm7oXz2zc3oaDR6ESbWGKZdinIsZf-ULGgDsyIOfM&e=
  [4]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_7nbXS.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=hXhWH-VySi9i27hNDKK184WiLooYmhdni7_7JOLhRcI&e=
  [5]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__i.stack.imgur.com_pNnZU.png&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=HV6sMbzNkG-NdJZBjZRAjrvCDl2VtMPfl5rY2Ss8muM&e=
  [6]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_JadeShodan_heat-2Dmortality&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=zxKWkmFT-DWpRAx6t0DRbV9ldSbgyIE6V3LdJBm4ULU&e=
  [7]: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_JadeShodan_heat-2Dmortality_blob_main_data-5Fcross-5Fvalidated-5Fpost2.rds&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=VIm9gkrPHFVXJFZeg6sMnKYGcLD5HR3BKqh4z8iIQjc&e=

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_r-2Dhelp&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=QP0xRvlj8tppy7hmTxhzrD62Hd1N4mXYmPKW48XiiRg&e=
PLEASE do read the posting guide 
https://urldefense.proofpoint.com/v2/url?u=http-3A__www.R-2Dproject.org_posting-2Dguide.html&d=DwIFaQ&c=sJ6xIWYx-zLMB3EPkvcnVg&r=9PEhQh2kVeAsRzsn7AkP-g&m=OcAaOvxwBr8hQWb7DaBa1490imjSyKQGzC2TbT3r6Xq3uAna_UgeAo-SNr5SwaYD&s=M7FO2Q09m7Or14GXa16BtzX1E8Xa5AWB-eN9ZTGlcjA&e=
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] High concurvity/ collinearity between time and temperature in GAM predicting deaths but low ACF. Does this matter?

Reply via email to