Here are 2 functions, which.just.above and which.just.below, which may help you. They tell you which element of a reference dataset is the first one just above (or just below) each element of the main dataset (x). They return NA if no reference element lies above (or below) an element of x. The strict argument controls whether the inequalities are strict or whether equality is acceptable. They are vectorized, so they are pretty quick.
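A rough sketch of the idea that makes this vectorized, assuming for simplicity that both x and reference are already sorted (the functions themselves also handle unsorted input): merge the two vectors, sort once, and take a running count of the reference elements seen so far. The toy values and the names merged, is.ref, and flags are only for illustration.

```r
# Toy inputs (hypothetical values chosen for illustration).
reference <- c(11, 12, 13, 14, 15)   # assumed already sorted
x <- c(14, 14.5)                     # assumed already sorted

# Merge x and reference and flag which merged elements came from reference.
# Putting x first means order() places an x value ahead of an equal
# reference value, so ties are not counted: a strict "just below".
merged <- c(x, reference)
is.ref <- c(rep(FALSE, length(x)), rep(TRUE, length(reference)))
flags <- is.ref[order(merged)]

# Running count of reference values seen so far, read off at the
# positions occupied by the x values.
below <- cumsum(flags)[!flags]
below[below == 0] <- NA              # no reference element below
print(below)                         # 3 4
```

Swapping the order of the two vectors in the merge is what turns the strict comparison into a non-strict one.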
E.g.,

   > which.just.below(c(14,14.5), 11:15, strict=TRUE)
   [1] 3 4
   > which.just.above(c(14,14.5), 11:15, strict=FALSE)
   [1] 4 5

They should work with any class of data that order() and sort() work on. In particular, POSIXct times work. The attached file has a demonstration function called 'test' with some examples.

In your case the 'reference' data would be the times at which your backup measurements were taken and the 'x' data would be the times of the pings. You can look at the elements of 'reference' just before and just after each ping (or just the pings that are missing locations) and decide how to combine the data from the bracketing reference elements to impute a location for the ping.

Here are the functions, in case the attachment doesn't make it through. I'm sure some mailer will throw in some newlines so it will be corrupted.

which.just.above <- function(x, reference, strict = TRUE)
{
    # output[k] will be index of smallest value in reference vector
    # larger than x[k].  If strict=FALSE, replace 'larger than' by
    # 'larger than or equal to'.
    # We should allow NA's in x (but we don't).  NA's in reference
    # should not be allowed.
    if(any(is.na(x)) || any(is.na(reference)))
        stop("NA's in input")
    if(strict)
        i <- c(rep(TRUE, length(reference)), rep(FALSE, length(x)))[
            order(c(reference, x))]
    else
        i <- c(rep(FALSE, length(x)), rep(TRUE, length(reference)))[
            order(c(x, reference))]
    i <- cumsum(i)[!i] + 1
    i[i > length(reference)] <- NA
    # i is length of x and has values in range 1:length(reference) or NA
    # following needed if reference is not sorted
    i <- order(reference)[i]
    # following needed if x is not sorted
    i[order(order(x))]
}

which.just.below <- function(x, reference, strict = TRUE)
{
    # output[k] will be index of largest value in reference vector
    # less than x[k].  If strict=FALSE, replace 'less than' by
    # 'less than or equal to'.  Neither x nor reference need be
    # sorted, although they should not have NA's (in theory, NA's
    # in x are ok, but not in reference).
    if(any(is.na(x)) || any(is.na(reference)))
        stop("NA's in input")
    if(!strict)
        i <- c(rep(TRUE, length(reference)), rep(FALSE, length(x)))[
            order(c(reference, x))]
    else
        i <- c(rep(FALSE, length(x)), rep(TRUE, length(reference)))[
            order(c(x, reference))]
    i <- cumsum(i)[!i]
    i[i <= 0] <- NA
    # i is length of x and has values in range 1:length(reference) or NA
    # following needed if reference is not sorted
    i <- order(reference)[i]
    # following needed if x is not sorted
    i[order(order(x))]
}

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Tim Clark
> Sent: Thursday, May 21, 2009 9:45 PM
> To: r-help@r-project.org
> Subject: [R] Need a faster function to replace missing data
>
> Dear List,
>
> I need some help in coming up with a function that will take
> two data sets, determine if a value is missing in one, find a
> value in the second that was taken at about the same time,
> and substitute the second value in for where the first should
> have been. My problem is from a fish tracking study. We put
> acoustic tags in fish and track them for several days.
> Location data is supposed to be automatically recorded every
> time we detect a "ping" from the fish. Unfortunately the GPS
> had some problems and sometimes the fish's depth was recorded
> but not its location. I fortunately had a back-up GPS that
> was taking location data every five minutes. I would like to
> merge the two files, replacing the missing value in the vscan
> (automatic) file with the location from the garmin file.
> Since we were getting vscan records every 1-2 seconds and
> garmin records every 5 minutes, I need to find the right
> place in the vscan file to place the garmin record - i.e. the
> closest in time, but not greater than 5 minutes. I have
> written a function that does this. However, it works with my
> test data but locks up my computer with my real data.
> I have several million vscan records and several thousand garmin
> records. Is there a better way to do this?
>
>
> My function and test data:
>
> myvscan<-data.frame(c(1,NA,1.5),times(c("12:00:00","12:14:00",
> "12:20:00")))
> names(myvscan)<-c("Latitude","DateTime")
> mygarmin<-data.frame(c(20,30,40),times(c("12:00:00","12:10:00",
> "12:15:00")))
> names(mygarmin)<-c("Latitude","DateTime")
>
> minute.diff<-1/24/12 #Time diff is in days, so this is 5 minutes
> for (k in 1:nrow(myvscan))
> {
>   if (is.na(myvscan$Latitude[k]))
>   {
>     if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) < minute.diff )
>     {
>       index.min.date<-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k]))
>       myvscan$Latitude[k]<-mygarmin$Latitude[index.min.date]
> }}}
>
> I appreciate your help and advice.
>
> Aloha,
>
> Tim
>
> Tim Clark
> Department of Zoology
> University of Hawaii
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

test <- function(x, ref)
{
    data.frame(
        belowStrict    = ref[which.just.below(x, ref, strict=TRUE)],
        belowNonstrict = ref[which.just.below(x, ref, strict=FALSE)],
        x              = x,
        aboveNonstrict = ref[which.just.above(x, ref, strict=FALSE)],
        aboveStrict    = ref[which.just.above(x, ref, strict=TRUE)]
    )[order(x), ]
}

ref <- 11:14
x <- c(10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5)
print(test(x, ref))

times <- as.POSIXct("2009-05-22") + seq(0, 5 * 24*60*60, length.out=15)
reftimes <- as.POSIXct("2009-05-23") + seq(0, 4 * 24*60*60, by=24*60*60)
print(test(times, reftimes))
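
For the imputation step, one possible way to combine the bracketing reference elements is to take whichever is nearer in time, and to use it only when it falls inside the five-minute window. The following is only a sketch under stated assumptions, not part of the original functions: it uses base R's findInterval() on a sorted reference as a stand-in for which.just.below(..., strict=FALSE), plain seconds since midnight instead of chron times, and the hypothetical names ping, backup, and max.gap.

```r
# Hypothetical data modeled on the test data in the question;
# times are seconds since midnight (12:00:00, 12:14:00, 12:20:00).
ping <- data.frame(Latitude = c(1, NA, 1.5),
                   Time = c(43200, 44040, 44400))
backup <- data.frame(Latitude = c(20, 30, 40),
                     Time = c(43200, 43800, 44100))  # must be sorted by Time
max.gap <- 5 * 60   # assumed tolerance: 5 minutes, in seconds

# Index of the last backup time <= each ping time (0 if there is none),
# and the index of the next backup record after it.
below <- findInterval(ping$Time, backup$Time)
above <- below + 1
below[below == 0] <- NA
above[above > nrow(backup)] <- NA

# Time gaps to the bracketing backup records; a missing neighbor
# counts as infinitely far away.
gap.below <- ping$Time - backup$Time[below]
gap.above <- backup$Time[above] - ping$Time
gap.below[is.na(gap.below)] <- Inf
gap.above[is.na(gap.above)] <- Inf

# Pick the nearer of the two neighbors for every ping at once.
nearest <- ifelse(gap.below <= gap.above, below, above)
gap <- pmin(gap.below, gap.above)

# Fill only the missing latitudes that have a backup record close enough.
fill <- is.na(ping$Latitude) & gap < max.gap
ping$Latitude[fill] <- backup$Latitude[nearest[fill]]
print(ping)
```

With this toy data the 12:14:00 ping picks up the 12:15:00 backup latitude (40), the same answer the loop in the question gives, but without iterating over rows.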