Re: [R] mgcv: how select significant predictor vars when using gam(...select=TRUE) using automatic optimization

jholstei Fri, 19 Apr 2013 01:42:09 -0700

Simon,

that was very instructive; very special thanks to you.
I had already noticed that the model was bad, but it was not clear to me that
transforming the predictors to, say, a more centered distribution would help
here.
And thanks for pointing out Tweedie. I had noticed that the error structure is
far from normal and more like gamma or Poisson, but Gamma made things worse.


Best regards,
Jan 





On 18 Apr 2013 at 17:25, Simon Wood wrote:

> Jan,
> 
> Thanks for the data (off list). The p-value computations are based on the 
> approximation that things are approximately normal on the linear predictor 
> scale, but actually they are nowhere close to normal in this case, which is 
> why the p-values look inconsistent. The reason that the approximate normality 
> assumption doesn't hold is that the model is quite a poor fit. If you take a 
> look at gam.check(fit) you'll see that the constant variance assumption of 
> quasi(link=log) is violated quite badly, and the residual distribution is 
> really quite odd (plot residuals against fitted as well). Also see 
> plot(fit,pages=1,scale=0) - it shows ballooning confidence intervals and 
> smooth estimates that are so low in places that they might as well be minus 
> infinity (given log link) - clearly something is wrong with this model!
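
The checks named above can be tried on any fitted gam object. The following is only a minimal sketch on simulated data: the variables x and y, the sample size and the Tweedie settings are made up for illustration and have nothing to do with the data discussed in this thread.

```r
library(mgcv)

set.seed(42)
n <- 400
# Hypothetical skewed predictor and a non-negative response with exact zeros
x <- rlnorm(n, meanlog = 0, sdlog = 1)
mu <- exp(0.5 + 0.8 * sin(log(x)))
y <- rTweedie(mu, p = 1.5, phi = 1)
dat <- data.frame(x = x, y = y)

fit <- gam(y ~ s(log(x)), family = Tweedie(p = 1.5, link = "log"),
           data = dat, method = "REML")

# Basis-dimension check plus the standard residual plots
gam.check(fit)

# Deviance residuals against fitted values
res <- residuals(fit, type = "deviance")
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Deviance residuals")
abline(h = 0, lty = 2)

# Each smooth drawn on its own y-axis scale
plot(fit, pages = 1, scale = 0)
```

A funnel shape in the residual plot, or confidence bands that balloon at the edges of a smooth, would be the kind of warning sign described above.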
> 
> I would be inclined to reset all the zeros to 0 (rather than 0.01), and then 
> to try Tweedie(p=1.5,link=log) as the family. Also, the predictor variables 
> are very skewed, which is giving leverage problems, so I would transform them 
> to give less skew, e.g. something like
> 
> fit <- gam(target ~ s(log(mgs)) + s(I(gsd^.5)) + s(I(mud^.25)) + s(log(ssCmax)),
>            family = Tweedie(p = 1.6, link = log), data = df, method = "REML")
> 
> gives a model that is closer to being reasonable (p-values are then 
> consistent between select=TRUE and FALSE).
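
One way to check this kind of consistency is to fit the same model with and without select=TRUE and put the smooth-term p-values side by side. This is a sketch under assumptions: the data are simulated, x1 and x2 are hypothetical predictors (only x1 carries signal), and fit_model is a helper invented for this example.

```r
library(mgcv)

set.seed(1)
n <- 500
# Hypothetical skewed predictors: x1 carries signal, x2 is pure noise
x1 <- rlnorm(n)
x2 <- rlnorm(n)
mu <- exp(1 + sin(log(x1)))
y <- rTweedie(mu, p = 1.5, phi = 1)
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Illustrative helper: same model, with or without the extra selection penalty
fit_model <- function(select) {
  gam(y ~ s(log(x1)) + s(log(x2)),
      family = Tweedie(p = 1.5, link = "log"),
      data = dat, method = "REML", select = select)
}

fit_plain  <- fit_model(FALSE)
fit_select <- fit_model(TRUE)

# Smooth-term p-values from both fits, side by side
pvals <- cbind(select_FALSE = summary(fit_plain)$s.table[, "p-value"],
               select_TRUE  = summary(fit_select)$s.table[, "p-value"])
print(signif(pvals, 3))
```

If the two columns disagree sharply for the same term, that suggests checking the model fit before trusting either set of p-values.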
> 
> best,
> Simon
> 
> On 18/04/13 14:24, Simon Wood wrote:
>> Jan,
>> 
>> Thanks for this. Is there any chance that you could send me the data off
>> list and I'll try to figure out what is happening? (Under the
>> understanding that I'll only use the data for investigating this issue,
>> of course).
>> 
>> best,
>> Simon
>> 
>> On 18/04/13 11:11, Jan Holstein wrote:
>>> Simon,
>>> 
>>> thanks for the reply. I guess I'm pretty much up to date with
>>> mgcv 1.7-22.
>>> Upgrading to R 3.0.0 also didn't change anything.
>>> 
>>> Unfortunately using method="REML" does not make any difference:
>>> 
>>> ####### first with "select=FALSE"
>>>> fit<-gam(target
>>>> ~s(mgs)+s(gsd)+s(mud)+s(ssCmax),family=quasi(link=log),data=wspe1,method="REML",select=FALSE)
>>>> 
>>>> summary(fit)
>>> 
>>> Family: quasi
>>> Link function: log
>>> Formula:
>>> target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
>>> Parametric coefficients:
>>>             Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)   -4.724      7.462  -0.633    0.527
>>> Approximate significance of smooth terms:
>>>             edf Ref.df      F p-value
>>> s(mgs)    3.118  3.492  0.099   0.974
>>> s(gsd)    6.377  7.044 15.596  <2e-16 ***
>>> s(mud)    8.837  8.971 18.832  <2e-16 ***
>>> s(ssCmax) 3.886  4.051  2.342   0.052 .
>>> ---
>>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>> R-sq.(adj) =  0.403   Deviance explained = 40.6%
>>> REML score =  33186  Scale est. = 8.7812e+05  n = 4511
>>> 
>>> 
>>> 
>>> 
>>> 
>>> #### Then using "select=TRUE"
>>> 
>>>> fit2<-gam(target
>>>> ~s(mgs)+s(gsd)+s(mud)+s(ssCmax),family=quasi(link=log),data=wspe1,method="REML",select=TRUE)
>>>> 
>>>> summary(fit2)
>>> Family: quasi
>>> Link function: log
>>> Formula:
>>> target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
>>> Parametric coefficients:
>>>             Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)   -6.406      5.239  -1.223    0.222
>>> Approximate significance of smooth terms:
>>>             edf Ref.df     F p-value
>>> s(mgs)    2.844      8 25.43  <2e-16 ***
>>> s(gsd)    6.071      9 14.50  <2e-16 ***
>>> s(mud)    6.875      8 21.79  <2e-16 ***
>>> s(ssCmax) 3.787      8 18.42  <2e-16 ***
>>> ---
>>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>> R-sq.(adj) =    0.4   Deviance explained = 40.1%
>>> REML score =  33203  Scale est. = 8.8359e+05  n = 4511
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> I played around with other families/link functions, with no success
>>> regarding the "select" behaviour.
>>> 
>>> Well, look at the structure of my data:
>>> <http://r.789695.n4.nabble.com/file/n4664586/screen-capture-1.png>
>>> 
>>> All possible predictor variables look like this in principle, and taken
>>> alone, each and every one is significant according to its p-value (but
>>> not all can be at the same time).
>>> In theory, the target variable should be a hypersurface in 11-dimensional
>>> space with lots of noise, but interactions of more than 2 variables get
>>> costly (not to think of 11), and often enough (also without interaction)
>>> the solution does not converge at minimal step size. If it does, results
>>> are usually not as good as without interaction.
>>> 
>>> Any comment/advice on model setup is warmly welcome here.
>>> 
>>> Since I don't want to try out all 2047 possible combinations of up to
>>> eleven predictor variables for each target variable, I currently see no
>>> other way than educated manual guessing.
>>> 
>>> If you know another way of (semi-)automated model tuning/reduction, I
>>> would very much appreciate it.
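
The figure 2047 is the number of non-empty subsets of eleven predictors, 2^11 - 1. Besides select=TRUE, mgcv also provides shrinkage bases (bs="ts" and bs="cs") whose penalty can shrink a whole term towards zero, so that the effective degrees of freedom (edf) hint at which terms contribute little. The sketch below is only an illustration on simulated data, with hypothetical predictors x1 and x2 of which only x1 matters; it is not a recommendation from this thread.

```r
library(mgcv)

# Non-empty subsets of 11 candidate predictors
n_subsets <- 2^11 - 1
print(n_subsets)

set.seed(7)
n <- 400
dat <- data.frame(x1 = runif(n), x2 = runif(n))
# Hypothetical truth: only x1 affects the response
dat$y <- sin(2 * pi * dat$x1) + rnorm(n, sd = 0.3)

# bs = "ts": thin plate regression spline with shrinkage, so a whole
# term can be penalized towards zero
fit <- gam(y ~ s(x1, bs = "ts") + s(x2, bs = "ts"),
           data = dat, method = "REML")

# Effective degrees of freedom per smooth term
edf <- summary(fit)$s.table[, "edf"]
print(round(edf, 2))
```

A term whose edf ends up close to zero is a candidate for dropping, although the caveats about model fit raised elsewhere in this thread apply here as well.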
>>> 
>>> best regards,
>>> Jan
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/mgcv-how-select-significant-predictor-vars-when-using-gam-select-TRUE-using-automatic-optimization-tp4664510p4664586.html
>>> 
>>> Sent from the R help mailing list archive at Nabble.com.
>>> 
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>> 
>> 
>> 
> 
> 
> -- 
> Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
> +44 (0)1225 386603               http://people.bath.ac.uk/sw283



