[R] Setting up 3D tensor product interactions in mgcv
Hi, I am trying to fit a smoothing model where there are three dimensions over which I can smooth (x,y,z). I expect interactions between some, or all, of these terms, and so I have set up my model as mdl <- gam(PA ~ s(x) + s(y) + s(z) + te(x,y) + te(x,z) + te(y,z) + te(x,y,z),...) I have recently read about the ti(), "tensor product interaction smoother", which takes care of these interaction terms elegantly and does the nesting properly. The help file says "This is much better than attempting the same thing with ‘s’or ‘te’ terms representing the interactions (although mgcv does not forbid it)." There is a 2D example there also. But I don't understand how I should set this up for my 3D example. Do I simply replace the te's above with ti? Or is there more to it than that? Does anyone have experience with this, and can explain how I should do it properly? Mark __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Seasonal smoothing of data with large gaps (mgcv)
Hi, I have a set of measurements that are made on a daily basis over many years. I would like to produce a *non-parametric* smooth of these data to estimate the seasonal cycle - to achieve this, I have been using the cyclic cubic splines from the mgcv package. This works superbly in most situations, but not all. The problem is that for various practical reasons the data is not available all year around at some sites - in some cases only in summer for example. The code below (hopefully) should give you an indication of the problem. #Setup set.seed(42) n <- 500 #Generate some randomly distributed observations during (southern) summer t <- rnorm(n,mean=0,sd=25)%%365 logchl <- cos(t/365*2*pi)^2 +rnorm(n,sd=0.025) #Now add a seasonal GAM library(mgcv) mdl <- gam(logchl ~ s(t,bs="cc"),knots=list(t=c(0,365))) #Plot plot(t,logchl,xlab="Day of year",ylab="Log Chl",xlim=c(0,365),xaxs="i") t.grid <- data.frame(t=seq(0,365,len=500)) lines(t.grid$t,predict(mdl,newdata=t.grid),lwd=2,col="red",xpd=NA) When you run it, you can see that the spline smoother is extrapolating well outside the range of the data during the data-sparse middle of the year. This is, after all, consistent with how a spline is defined. But in my case it causes a lot of problems - in some cases the extrapolations can be really extreme e.g. reaching +/-1e10. This is a pain. I agree that trying to produce an estimate of the seasonal cycle in this case is a bit silly - I'm doing it more for consistency with the rest of the analysis. Ideally I would like to produce something that combines the features of a gam with linear interpolation e.g. still smoothes where there is data, but in the absence of data (or low data density regions), essentially does a linear interpolation between the high-density regions. Is there anyway that this can be achieved, either in mgcv or other packages? An alternative approach might be to make the weight given to an observation proportional to the density of data in that region - I haven't tried this as yet. I'd love to hear peoples opinions these two approaches matter, or suggestions for other approaches. Best wishes, Mark List of approaches tried thus far Removing all spline knots in the data-poor regions Adaptive smoother bases ie s(t,bs="ad") Loess smoothers Non-parametric kernel regressions (e.g. ksmooth()) [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Named list of data.frames to data.frame with names
Hi, I very frequently end up in a situation where I have a named list of data.frames that I wish to combine. e.g. l <- list(A=data.frame(x=rnorm(5), y=rnorm(5)), B=data.frame(x=rnorm(3),y=rnorm(3)), C=data.frame(x=rnorm(4),y=rnorm(4)), D=data.frame(x=rnorm(7),y=rnorm(7))) I would like to combine these data.frames into a single data.frame, with the column-names. This is easy with rbind and do.call l2 <- do.call(rbind,l) However, I would also like l2 to contain a column containing the names in the original list? Is there a more elegant way to do this than: l2 <- do.call(rbind,l) l2$name <- rep(names(l),times=sapply(l,nrow)) Mark __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] MRF smoothers in MGCV - specifying neighbours
Hi, Does anyone have an example of a Markov Random Field smoother (MRF) in MGCV where they have specified the neighbourhood directly, rather than supplying polygons? Does anyone understand how the rules should be? Based on the columb example, I have setup my data set and neighbourhood like so: > head(nb.l) $`10/10` [1] 135 155 153 $`10/2` [1] 27 8 6 $`10/3` [1] 48 7 28 26 $`10/4` [1] 69 27 49 47 $`10/5` [1] 48 70 68 $`10/7` [1] 115 95 93 > head(obs) x y truth x.idx y.idx xy.idx 24 1.4835147 0.8026673 2.36052041310 13/10 26 1.0452111 0.4673685 1.831674111 8 11/8 43 2.1514977 -0.2640058 -2.881202617 5 17/5 46 2.8473951 0.5445714 3.634779920 9 20/9 53 1.7983253 -0.6905912 -2.547398415 3 15/3 86 -0.1839814 -0.7824026 -0.5776616 5 25/2 > but get the following error: > mdl <- gam(truth ~ s(xy.idx,bs="mrf",xt=list(nb=nb.l)),data=obs,method="REML") Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) : mismatch between nb/polys supplied area names and data area names However, there is a perfect match between the nb list names and the data area names: > all(levels(obs$xy.idx) %in% names(nb.l)) [1] TRUE > Any suggestions where to start? Mark [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] MRF smoothers in MGCV - specifying neighbours
Hi Roger and Simon, Thanks for the replies. Simon's suggestion of an isolated or missing neighbourhood doesn't hold either. I've attached the code below - its my attempt to solve the FELSPLINE sausage using mrf rather than a soap smoother. Its a bit convoluted, but should run ok. I thought this would be a good starting example to get a GMRF running, but then hit the problem mentioned. My attempts to track the bug suggest that there is something wierd in the knots argument that is being supplied to smooth.construct.mrf.smooth.spec() - but haven't come so much further than that. Code follows. Mark #GMRF Example #Solves the classic FELSPINE problem using #GMRF in mgcv. This example is a modified version of the #example from smooth.construct.so.smooth.spec() #in the mgcv package rm(list=ls()) library(mgcv) #Extract boundary fsb <- fs.boundary() #Create an underlying grid and evaluate the function on it #Based on mgcv::fs.boundary() example dx<-0.2;dy<-0.2#Grid steps id.fmt <- "%i/%i" x.vals <- seq(-1,4,by=dx) y.vals<-seq(-1,1,by=dy) grd <- expand.grid(x=x.vals,y=y.vals) tru.mat <- matrix(fs.test(grd$x,grd$y),length(x.vals),length(y.vals)) grd$truth <- as.vector(tru.mat) grd$x.idx <- as.numeric(factor(grd$x,x.vals)) grd$y.idx <- as.numeric(factor(grd$y,y.vals)) grd$xy.idx <- sprintf(id.fmt,grd$x.idx,grd$y.idx) grd <- subset(grd,!is.na(truth)) ## Simulate some fitting data, inside boundary... n.samps<-250 x <- runif(n.samps)*5-1 y <- runif(n.samps)*2-1 obs <- data.frame(x=x,y=y) obs$truth <- fs.test(obs$x,obs$y,b=1) obs$z <- obs$truth + rnorm(n.samps)*.3 ## add noise pt.inside <- inSide(fsb,x=x,y=y) ## remove outsiders ## Associate observation with grid cell obs$x.rnd <- round(obs$x/dx)*dx obs$y.rnd <- round(obs$y/dy)*dy obs$x.idx <- as.numeric(factor(obs$x.rnd,x.vals)) obs$y.idx <- as.numeric(factor(obs$y.rnd,y.vals)) obs$xy.idx <- sprintf(id.fmt,obs$x.idx,obs$y.idx) obs$xy.idx <- factor(obs$xy.idx,levels=grd$xy.idx) #Filter observations that are outside or don't have an associated grid cell obs <- subset(obs,pt.inside & xy.idx %in% grd$xy.idx ) ## plot boundary with truth and data locations par(mfrow=c(1,2)) image(x.vals,y.vals,tru.mat,col=heat.colors(100),xlab="x",ylab="y") contour(x.vals,y.vals,tru.mat,levels=seq(-5,5,by=.25),add=TRUE) lines(fsb$x,fsb$y); points(obs$x,obs$y,pch=3); #Plot grid plot(y~x,grd) lines(fsb$x,fsb$y); #Setup neighbourhood adjancey nb <- grd[,c("x.idx","y.idx","xy.idx")] nb$N <- factor(sprintf(id.fmt,nb$x.idx,nb$y.idx+1),levels=nb$xy.idx) nb$S <- factor(sprintf(id.fmt,nb$x.idx,nb$y.idx-1),levels=nb$xy.idx) nb$E <- factor(sprintf(id.fmt,nb$x.idx+1,nb$y.idx),levels=nb$xy.idx) nb$W <- factor(sprintf(id.fmt,nb$x.idx-1,nb$y.idx),levels=nb$xy.idx) nb.mat <- sapply(nb[,c("N","S","E","W")],as.numeric) nb.l <- lapply(split(nb.mat,nb$xy.idx),function(x) x[!is.na(x)]) #Fit MRF gam mdl <- gam(z ~ s(xy.idx,bs="mrf",xt=list(nb=nb.l)),data=obs,method="REML") On 8 May 2014 15:15, Simon Wood wrote: > Hi Mark, > > I'm not sure what is happening here - there is no chance that nb.l > contains a neighbourhood not in the levels of obs$xy.idx, I suppose? i.e. is > > all(names(nb.l)%in%levels(obs$xy.idx)) > > also TRUE? Here is some code illustrating what nb should look like (and in > response to Roger Bivand's suggestion I also tried this replacing all the > labels with things like "x/y", but it still works). > > > ## example mrf fit using polygons > library(mgcv) > ## Load Columbus Ohio crime data (see ?columbus for details and credits) > data(columb) ## data frame > data(columb.polys) ## district shapes list > xt <- list(polys=columb.polys) ## neighbourhood structure info for MRF > par(mfrow=c(2,2)) > ## First a full rank MRF... > b0 <- gam(crime ~ s(district,bs="mrf",xt=xt),data=columb,method="REML") > > ## same fit based on direct neighbour spec... > nb <- mgcv:::pol2nb(columb.polys)$nb > xt <- list(nb=nb) > b <- gam(crime ~ s(district,bs="mrf",xt=xt),data=columb,method="REML") > > best, > Simon > > > > > On 08/05/14 01:58, Mark Payne wrote: > >> Hi, >> >> Does anyone have an example of a Markov Random Field smoother (MRF) in >> MGCV >> where they have specified the neighbourhood directly, rather than >> supplying >> polygons? Does anyone understand how the rules should be? Based on the >> columb example, I have setup my data set and neighbourhood like so: >> >> head(nb.l) >>> >> $`10/10` >> [1] 135 155 153 >> >> $`10/2` >> [1] 27 8 6
[R] Negative binomial parameterisation in mgcv
Dear R-help, The negative binomial distribution has several different parameterisations, but I can't seem to figure out what the exact one used in mgcv's negbin family is? negbin() requires a theta argument, but its not clear anywhere in the documentation (that I can find), how this parameter should be interpreted - for example, can it be less than 1? Best wishes, MArk __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] scatter.smooth() line colour
Hi David, Thanks for the reply. I agree - it just seemed that it was something fairly obvious that was missing from an otherwise handy little function Would it be appropriate to file it as a "request for improvement" in the bug tracking system? Mark On 22 June 2012 16:30, David L Carlson - dcarl...@tamu.edu <+r-help+trevva+3b274928d0.dcarlson#tamu@spamgourmet.com> wrote: > You are correct about scatter.smooth, but loess.smooth > (listed along with scatter.smooth on the same > help page) gives you the way to get what you want: > > x <- rnorm(25) > y <- rnorm(25) > plot(x, y) > lines(loess.smooth(x,y), col="red", lty=2, lwd=2) > > -- > David L Carlson > Associate Professor of Anthropology > Texas A&M University > College Station, TX 77843-4352 > > >> -Original Message- >> From: r-help-boun...@r-project.org [mailto:r-help-bounces@r- >> project.org] On Behalf Of r-help.20.tre...@spamgourmet.com >> Sent: Friday, June 22, 2012 6:04 AM >> To: r-help@r-project.org >> Subject: [R] scatter.smooth() line colour >> >> Hi, >> >> I really like the scatter.smooth() function - its extremeley useful. >> However, as far as I understand it, there is no way to change the >> properties of the smoothing line e.g. col, lty, lwd. The >> scatter.smooth function accepts a ... argument, but these are not >> passed to the plotting function that actually plots the smoother - >> only to the function that plots the points. Could I please therefore >> request that an argument be added to this function to give easier >> control over line properties? e.g. line.par=list() >> >> Best wishes, >> >> Mark Payne >> >> __ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting- >> guide.html >> and provide commented, minimal, self-contained, reproducible code. > > __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Manually modifying an hclust dendrogram to remove singletons
Hi, Thanks for the replies - they have helped shaped my thinking and are starting to push me in a better direction. Maybe I should explain a little more about what I'm trying to achieve. I am analysing satellite data across the global ocean, and am interested in trying to classify areas of the ocean according to the similarity between the pixels. Singletons in this case therefore represent individual pixels that are different to the rest in terms of the similarity metric, but aren't really all that interesting in terms of the broad picture - I consider them "outliers" or "noise". However, they are annoying when it comes to splitting up the dendrogram, because I'm mainly interested in the reclassification of large areas of ocean at each step, rather than changes in the similarity. The dynamic tree-cut approach looks like a promising and sensible solution to the problem - I'll see if I can get something out of it. However, this discussion has started me wondering how I can use the spatial proximity of the pixels in the analysis - does anyone have any insights? Can the WGCNA approach be used in such a context? Best wishes, Mark Payne __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.