Simon, that was very instructive; very special thanks to you. I had already noticed that the model was bad, but it was not clear to me that transforming the predictors to, say, a more centered distribution would help here. And thanks for pointing out Tweedie: I had noticed that the error structure is far from normal and more like gamma or Poisson, but Gamma made things worse.
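For anyone following the thread, here is a minimal sketch of that kind of refit. It runs on simulated data, not on my data set: the data frame `sim`, its two columns, and all coefficients are made up for illustration, and p = 1.5 is only the starting value Simon suggested, not a tuned choice.

```r
library(mgcv)  # gam(), Tweedie(), rTweedie(), gam.check()

set.seed(1)
n <- 500

# Hypothetical, strongly right-skewed predictors (stand-ins for mgs and mud).
sim <- data.frame(mgs = rlnorm(n, meanlog = 0, sdlog = 1),
                  mud = rlnorm(n, meanlog = 0, sdlog = 1))

# Simulated non-negative response with exact zeros (Tweedie, 1 < p < 2).
mu <- exp(0.5 + 0.6 * log(sim$mgs) - 0.3 * sim$mud^0.25)
sim$target <- rTweedie(mu, p = 1.5, phi = 1)

# Smooth the transformed predictors so that a few extreme values
# do not dominate the fit (the leverage problem mentioned in the thread).
fit_tw <- gam(target ~ s(log(mgs)) + s(I(mud^0.25)),
              family = Tweedie(p = 1.5, link = log),
              data = sim, method = "REML")

gam.check(fit_tw)  # residual plots and basis-dimension diagnostics
summary(fit_tw)    # approximate p-values for the smooth terms
```

The same pattern should carry over to the real data by swapping in the actual data frame and the full set of transformed predictors.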
Best regards,
Jan

On 18 Apr 2013 at 17:25, Simon Wood wrote:

> Jan,
>
> Thanks for the data (off list). The p-value computations are based on the
> approximation that things are approximately normal on the linear predictor
> scale, but actually they are nowhere close to normal in this case, which is
> why the p-values look inconsistent. The reason that the approximate normality
> assumption doesn't hold is that the model is quite a poor fit. If you take a
> look at gam.check(fit) you'll see that the constant variance assumption of
> quasi(link=log) is violated quite badly, and the residual distribution is
> really quite odd (plot residuals against fitted as well). Also see
> plot(fit,pages=1,scale=0) - it shows ballooning confidence intervals and
> smooth estimates that are so low in places that they might as well be minus
> infinity (given log link) - clearly something is wrong with this model!
>
> I would be inclined to reset all the 0's to 0 (rather than 0.01), and then to
> try Tweedie(p=1.5,link=log) as the family. Also the predictor variables are
> very skewed which is giving leverage problems, so I would transform them to
> give less skew. e.g. Something like
>
> fit<-gam(target~s(log(mgs))+s(I(gsd^.5))+s(I(mud^.25))+s(log(ssCmax)),
>          family=Tweedie(p=1.6,link=log),data=df,method="REML")
>
> gives a model that is closer to being reasonable (p-values are then
> consistent between select=TRUE and FALSE).
>
> best,
> Simon
>
> On 18/04/13 14:24, Simon Wood wrote:
>> Jan,
>>
>> Thanks for this. Is there any chance that you could send me the data off
>> list and I'll try to figure out what is happening? (Under the
>> understanding that I'll only use the data for investigating this issue,
>> of course).
>>
>> best,
>> Simon
>>
>> On 18/04/13 11:11, Jan Holstein wrote:
>>> Simon,
>>>
>>> thanks for the reply, I guess I'm pretty much up to date using
>>> mgcv 1.7-22.
>>> Upgrading to R 3.0.0 also didn't make any difference.
>>>
>>> Unfortunately using method="REML" does not make any difference:
>>>
>>> ####### first with "select=FALSE"
>>>> fit<-gam(target~s(mgs)+s(gsd)+s(mud)+s(ssCmax),family=quasi(link=log),data=wspe1,method="REML",select=F)
>>>> summary(fit)
>>>
>>> Family: quasi
>>> Link function: log
>>>
>>> Formula:
>>> target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
>>>
>>> Parametric coefficients:
>>>             Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)   -4.724      7.462  -0.633    0.527
>>>
>>> Approximate significance of smooth terms:
>>>             edf Ref.df      F p-value
>>> s(mgs)    3.118  3.492  0.099   0.974
>>> s(gsd)    6.377  7.044 15.596  <2e-16 ***
>>> s(mud)    8.837  8.971 18.832  <2e-16 ***
>>> s(ssCmax) 3.886  4.051  2.342   0.052 .
>>> ---
>>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>
>>> R-sq.(adj) = 0.403   Deviance explained = 40.6%
>>> REML score = 33186   Scale est. = 8.7812e+05   n = 4511
>>>
>>>
>>> #### Then using "select=T"
>>>
>>>> fit2<-gam(target~s(mgs)+s(gsd)+s(mud)+s(ssCmax),family=quasi(link=log),data=wspe1,method="REML",select=TRUE)
>>>> summary(fit2)
>>>
>>> Family: quasi
>>> Link function: log
>>>
>>> Formula:
>>> target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
>>>
>>> Parametric coefficients:
>>>             Estimate Std. Error t value Pr(>|t|)
>>> (Intercept)   -6.406      5.239  -1.223    0.222
>>>
>>> Approximate significance of smooth terms:
>>>             edf Ref.df     F p-value
>>> s(mgs)    2.844      8 25.43  <2e-16 ***
>>> s(gsd)    6.071      9 14.50  <2e-16 ***
>>> s(mud)    6.875      8 21.79  <2e-16 ***
>>> s(ssCmax) 3.787      8 18.42  <2e-16 ***
>>> ---
>>> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>
>>> R-sq.(adj) = 0.4   Deviance explained = 40.1%
>>> REML score = 33203   Scale est. = 8.8359e+05   n = 4511
>>>
>>>
>>> I played around with other families/link functions with no success regarding
>>> the "select" behaviour.
>>>
>>> Well, look at the structure of my data:
>>> <http://r.789695.n4.nabble.com/file/n4664586/screen-capture-1.png>
>>>
>>> All possible predictor variables in principle look like this, and taken
>>> alone, each and every one is significant according to its p-value (but not
>>> all can be at the same time).
>>> In theory, the target variable should be a hypersurface in 11-dimensional
>>> space with lots of noise, but interaction of more than 2 variables gets
>>> costly (not to mention 11), and often enough (also without interaction) the
>>> solution does not converge at minimal step size. If it does, results are
>>> usually not as good as without interaction.
>>>
>>> Any comment/advice on model setup is warmly welcome here.
>>>
>>> Since I don't want to try out all possible 2047 combinations of up to eleven
>>> predictor variables for each target variable, I currently see no other way
>>> than educated manual guessing.
>>>
>>> If you know another way of (semi-)automated model tuning/reduction, I would
>>> very much appreciate it.
>>>
>>> best regards,
>>> Jan
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/mgcv-how-select-significant-predictor-vars-when-using-gam-select-TRUE-using-automatic-optimization-tp4664510p4664586.html
>>> Sent from the R help mailing list archive at Nabble.com.
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
> --
> Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
> +44 (0)1225 386603 http://people.bath.ac.uk/sw283
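
As a footnote to the quoted exchange, the consistency check between select=FALSE and select=TRUE can be sketched as follows. This is only an illustration on simulated data: the names `sim`, `x1`, `x2` and `y` are hypothetical, `x2` has no effect by construction, and the Tweedie power p = 1.5 is an assumed value, so the output shows the mechanics rather than anything about the data discussed in this thread.

```r
library(mgcv)  # gam(), Tweedie(), rTweedie()

set.seed(2)
n <- 400

# Hypothetical data: y depends on x1 only; x2 is pure noise.
sim <- data.frame(x1 = runif(n), x2 = runif(n))
sim$y <- rTweedie(exp(1 + sin(2 * pi * sim$x1)), p = 1.5, phi = 1)

form <- y ~ s(x1) + s(x2)

# Same model twice; select = TRUE adds an extra penalty to each smooth
# so that a term can be shrunk all the way to zero.
fit_plain  <- gam(form, family = Tweedie(p = 1.5, link = log),
                  data = sim, method = "REML", select = FALSE)
fit_select <- gam(form, family = Tweedie(p = 1.5, link = log),
                  data = sim, method = "REML", select = TRUE)

# Approximate p-values of the smooth terms, side by side.
pvals <- data.frame(
  term         = rownames(summary(fit_plain)$s.table),
  select_FALSE = summary(fit_plain)$s.pv,
  select_TRUE  = summary(fit_select)$s.pv
)
print(pvals)
```

If the two columns disagree strongly for a term, that is a hint to look at gam.check() before trusting either set of p-values.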