Hi, I have a set of measurements that are made on a daily basis over many years. I would like to produce a *non-parametric* smooth of these data to estimate the seasonal cycle - to achieve this, I have been using the cyclic cubic splines from the mgcv package. This works superbly in most situations, but not all.
The problem is that for various practical reasons the data is not available all year around at some sites - in some cases only in summer for example. The code below (hopefully) should give you an indication of the problem. #Setup set.seed(42) n <- 500 #Generate some randomly distributed observations during (southern) summer t <- rnorm(n,mean=0,sd=25)%%365 logchl <- cos(t/365*2*pi)^2 +rnorm(n,sd=0.025) #Now add a seasonal GAM library(mgcv) mdl <- gam(logchl ~ s(t,bs="cc"),knots=list(t=c(0,365))) #Plot plot(t,logchl,xlab="Day of year",ylab="Log Chl",xlim=c(0,365),xaxs="i") t.grid <- data.frame(t=seq(0,365,len=500)) lines(t.grid$t,predict(mdl,newdata=t.grid),lwd=2,col="red",xpd=NA) When you run it, you can see that the spline smoother is extrapolating well outside the range of the data during the data-sparse middle of the year. This is, after all, consistent with how a spline is defined. But in my case it causes a lot of problems - in some cases the extrapolations can be really extreme e.g. reaching +/-1e10. This is a pain. I agree that trying to produce an estimate of the seasonal cycle in this case is a bit silly - I'm doing it more for consistency with the rest of the analysis. Ideally I would like to produce something that combines the features of a gam with linear interpolation e.g. still smoothes where there is data, but in the absence of data (or low data density regions), essentially does a linear interpolation between the high-density regions. Is there anyway that this can be achieved, either in mgcv or other packages? An alternative approach might be to make the weight given to an observation proportional to the density of data in that region - I haven't tried this as yet. I'd love to hear peoples opinions these two approaches matter, or suggestions for other approaches. Best wishes, Mark List of approaches tried thus far -------------------------------------------- Removing all spline knots in the data-poor regions Adaptive smoother bases ie s(t,bs="ad") Loess smoothers Non-parametric kernel regressions (e.g. ksmooth()) [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.