Re: [R] Need a faster function to replace missing data

William Dunlap Fri, 22 May 2009 10:23:34 -0700

Here are 2 functions, which.just.above and which.just.below,
which may help you.  They will tell which element in a reference
dataset is the first just above (or just below) each element
in the main dataset (x).  They return NA if there is no reference
element above (or below) an element of x.  The strict argument
lets you say if the inequalities are strict or if equality is
acceptable.
They are vectorized so are pretty quick.


E.g.,
   > which.just.below(c(14,14.5), 11:15, strict=TRUE)
   [1] 3 4
   > which.just.above(c(14,14.5), 11:15, strict=FALSE)
   [1] 4 5
They should work with any class of data that order() and sort()
work on.  In particular, POSIXct times work.  The attached file
has a demonstration function called 'test' with some examples.

In your case the 'reference' data would be the times at which your
backup measurements were taken and the 'x' data would be the
times of the pings.  You can look at the elements of 'reference' just
before and just after each ping (or just the pings that are missing
locations) and decide how to combine the data from the bracketing
reference elements to impute a location for the ping.
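
One way that combining step might look is sketched below.  It takes
the nearer of the two bracketing fixes and uses it only if it is
within 5 minutes of the ping.  Everything here is an assumption made
for illustration: the data frames ping and fix, their column names,
and the nearest-fix rule.  To keep the example self-contained,
findInterval() on sorted fix times stands in for which.just.below and
which.just.above.

```r
# Made-up data mirroring the small test case in the original question.
start <- as.POSIXct("2009-05-22 12:00:00", tz = "UTC")
ping <- data.frame(time = start + c(0, 840, 1200),   # 12:00, 12:14, 12:20
                   lat  = c(1, NA, 1.5))
fix  <- data.frame(time = start + c(0, 600, 900),    # 12:00, 12:10, 12:15
                   lat  = c(20, 30, 40))
max.gap <- 5 * 60          # seconds; fixes farther away than this are ignored

t.ping <- as.numeric(ping$time)
t.fix  <- as.numeric(fix$time)     # must be sorted in increasing order

# Number of fixes at or before each ping, which gives the bracketing indices.
n.le  <- findInterval(t.ping, t.fix)
below <- ifelse(n.le >= 1, n.le, NA)              # last fix at or before ping
above <- ifelse(n.le < nrow(fix), n.le + 1, NA)   # first fix after ping

gap.below <- t.ping - t.fix[below]
gap.above <- t.fix[above] - t.ping

# Pick the nearer bracketing fix; ties go to the earlier one.
use.above <- !is.na(gap.above) & (is.na(gap.below) | gap.above < gap.below)
nearest   <- ifelse(use.above, above, below)
gap       <- ifelse(use.above, gap.above, gap.below)

# Fill only the missing locations that have a fix close enough in time.
fill <- is.na(ping$lat) & !is.na(gap) & gap <= max.gap
ping$lat[fill] <- fix$lat[nearest[fill]]
print(ping)
```

There is no loop over pings, so the cost is dominated by the sort or
binary search rather than by one pass over the reference data per
ping.  Interpolating between the two bracketing fixes instead of
taking the nearer one would only change the last few lines.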

Here are the functions, in case the attachment doesn't make it
through.  I'm sure some mailer will throw in some newlines so
it will be corrupted.

"which.just.above" <-
function(x, reference, strict = T)
{
        # output[k] will be index of smallest value in reference vector
        # larger than x[k].  If strict=F, replace 'larger than' by
        # 'larger than or equal to'.
        # We should allow NA's in x (but we don't). NA's in reference
        # should not be allowed.
        if(any(is.na(x)) || any(is.na(reference))) stop("NA's in input")
        if(strict)
                i <- c(rep(T, length(reference)), rep(F, length(x)))[order(
                        c(reference, x))]
        else i <- c(rep(F, length(x)), rep(T, length(reference)))[order(c(
                        x, reference))]
        i <- cumsum(i)[!i] + 1.
        i[i > length(reference)] <- NA
        # i is length of x and has values in range 1:length(reference) or NA
        # following needed if reference is not sorted
        i <- order(reference)[i]
        # following needed if x is not sorted
        i[order(order(x))]
}

"which.just.below" <-
function(x, reference, strict = T)
{
        # output[k] will be index of largest value in reference vector
        # less than x[k].  If strict=F, replace 'less than' by
        # 'less than or equal to'.  Neither x nor reference need be
        # sorted, although they should not have NA's (in theory, NA's
        # in x are ok, but not in reference).
        if(any(is.na(x)) || any(is.na(reference))) stop("NA's in input")
        if(!strict)
                i <- c(rep(T, length(reference)), rep(F, length(x)))[order(
                        c(reference, x))]
        else i <- c(rep(F, length(x)), rep(T, length(reference)))[order(c(
                        x, reference))]
        i <- cumsum(i)[!i]
        i[i <= 0] <- NA
        # i is length of x and has values in range 1:length(reference) or NA
        # following needed if reference is not sorted
        i <- order(reference)[i]
        # following needed if x is not sorted
        i[order(order(x))]
}
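
In case the order/cumsum step is hard to follow, here is a small
trace of what which.just.below does with strict=F.  The input values
are made up, and the reference is already sorted so that the final
order(reference) lookup can be left out.

```r
reference <- c(11, 12, 13, 14, 15)
x         <- c(14, 14.5, 10)

# TRUE marks a reference value, FALSE marks an x value.  reference comes
# first in c() and order() leaves ties in their original order, so a
# reference value equal to an x value sorts ahead of it and gets counted.
is.ref <- c(rep(TRUE, length(reference)), rep(FALSE, length(x)))
merged <- is.ref[order(c(reference, x))]
print(merged)

# Running count of reference values seen so far, read off at the x
# positions.  The counts come out in sorted-x order: 10, 14, 14.5.
count <- cumsum(merged)[!merged]
count[count <= 0] <- NA        # 10 has no reference value below it
print(count)

# Put the counts back into the original order of x.
idx <- count[order(order(x))]
print(idx)
```

Swapping the order of the two vectors inside c() is what switches
between strict and non-strict comparison, since it decides which side
of a tie is counted first.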

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

> -----Original Message-----
> From: r-help-boun...@r-project.org 
> [mailto:r-help-boun...@r-project.org] On Behalf Of Tim Clark
> Sent: Thursday, May 21, 2009 9:45 PM
> To: r-help@r-project.org
> Subject: [R] Need a faster function to replace missing data
> 
> 
> Dear List,
> 
> I need some help in coming up with a function that will take 
> two data sets, determine if a value is missing in one, find a 
> value in the second that was taken at about the same time, 
> and substitute the second value in for where the first should 
> have been.  My problem is from a fish tracking study.  We put 
> acoustic tags in fish and track them for several days.  
> Location data is supposed to be automatically recorded every 
> time we detect a "ping" from the fish.  Unfortunately the GPS 
> had some problems and sometimes the fish's depth was recorded 
> but not its location.  I fortunately had a back-up GPS that 
> was taking location data every five minutes.  I would like to 
> merge the two files, replacing the missing value in the vscan 
> (automatic) file with the location from the garmin file.  
> Since we were getting vscan records every 1-2 seconds and 
> garmin records every 5 minutes, I need to find the right 
> place in the vscan file to place the garmin record - i.e. the
>  closest in time, but not greater than 5 minutes.  I have 
> written a function that does this. However, it works with my 
> test data but locks up my computer with my real data.  I have 
> several million vscan records and several thousand garmin 
> records.  Is there a better way to do this?
> 
> 
> My function and test data:
> 
> library(chron)   # provides times()
> myvscan<-data.frame(c(1,NA,1.5),times(c("12:00:00","12:14:00",
> "12:20:00")))
> names(myvscan)<-c("Latitude","DateTime")
> mygarmin<-data.frame(c(20,30,40),times(c("12:00:00","12:10:00",
> "12:15:00")))
> names(mygarmin)<-c("Latitude","DateTime")
> 
> minute.diff<-1/24/12   #Time diff is in days, so this is 5 minutes
> for (k in 1:nrow(myvscan))  
> {
> if (is.na(myvscan$Latitude[k]))
> {
> if ((min(abs(mygarmin$DateTime-myvscan$DateTime[k]))) < minute.diff )
> {
> index.min.date<-which.min(abs(mygarmin$DateTime-myvscan$DateTime[k]))
> myvscan$Latitude[k]<-mygarmin$Latitude[index.min.date] 
> }}}
> 
> I appreciate your help and advice.
> 
> Aloha,
> 
> Tim
> 
> 
> 
> 
> Tim Clark
> Department of Zoology 
> University of Hawaii
> 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

"which.just.above" <- 
function(x, reference, strict = T)
{
        # output[k] will be index of smallest value in reference vector
        # larger than x[k].  If strict=F, replace 'larger than' by
        # 'larger than or equal to'.
        # We should allow NA's in x (but we don't). NA's in reference
        # should not be allowed.
        if(any(is.na(x)) || any(is.na(reference))) stop("NA's in input")
        if(strict)
                i <- c(rep(T, length(reference)), rep(F, length(x)))[order(
                        c(reference, x))]
        else i <- c(rep(F, length(x)), rep(T, length(reference)))[order(c(
                        x, reference))]
        i <- cumsum(i)[!i] + 1.
        i[i > length(reference)] <- NA
        # i is length of x and has values in range 1:length(reference) or NA
        # following needed if reference is not sorted 
        i <- order(reference)[i]
        # following needed if x is not sorted
        i[order(order(x))]
}

"which.just.below" <- 
function(x, reference, strict = T)
{
        # output[k] will be index of largest value in reference vector
        # less than x[k].  If strict=F, replace 'less than' by
        # 'less than or equal to'.  Neither x nor reference need be
        # sorted, although they should not have NA's (in theory, NA's
        # in x are ok, but not in reference).
        if(any(is.na(x)) || any(is.na(reference))) stop("NA's in input")
        if(!strict)
                i <- c(rep(T, length(reference)), rep(F, length(x)))[order(
                        c(reference, x))]
        else i <- c(rep(F, length(x)), rep(T, length(reference)))[order(c(
                        x, reference))]
        i <- cumsum(i)[!i]
        i[i <= 0] <- NA
        # i is length of x and has values in range 1:length(reference) or NA
        # following needed if reference is not sorted 
        i <- order(reference)[i]
        # following needed if x is not sorted
        i[order(order(x))]
}

test <- function(x, ref) {
   data.frame(
         belowStrict=ref[which.just.below(x,ref,strict=T)],
         belowNonstrict= ref[which.just.below(x,ref,strict=F)],
         x=x,
         aboveNonstrict= ref[which.just.above(x,ref,strict=F)],
         aboveStrict=ref[which.just.above(x,ref,strict=T)]
   )[order(x),]
}
ref <- 11:14
x <- c(10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5)
print(test(x,ref))
times<-as.POSIXct("2009-05-22") + seq(0, 5 * 24*60*60, len=15)
reftimes<-as.POSIXct("2009-05-23") + seq(0, 4 * 24*60*60, by=24*60*60)
print(test(times,reftimes))
