Re: [R] merging and obtaining the nearest value

Rui Barradas Sun, 19 Aug 2012 15:55:44 -0700

Hello,

You're right, your solution is much faster, but it doesn't removeduplicates.

When I ran f4() with larger datasets it poduced an error,


Error in findInterval(x, vec) : 'vec' must be sorted non-decreasingly

So here they all are.

f1 <- function(A, B){
    m <- merge(A, B)

result <- do.call( rbind, lapply(split(m, list(m$DATE, m$TYPE)),function(x){

        if(nrow(x)){
              a <- abs(x$DATE - x$Special_Date)
              x[which.min(a), ] }}) )
    result$Difference <- result$DATE - result$Special_Date
    result$Special_Date <- NULL
    rownames(result) <- seq_len(nrow(result))
    result
}

closestValue <- function (x, vec)
{
    # for each value in x, find closest value in vec.
    # Break ties by using highest.
    # Assume vec is sorted.
    intervalNo <- findInterval(x, vec)
    lowerValue <- vec[pmax(1, intervalNo)]
    upperValue <- vec[pmin(length(vec), intervalNo+1)]
    ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}
f4 <- function (A, B)  {
    A$TYPE <- as.factor(A$TYPE)
    uA <- levels(A$TYPE)
    As <- split(A$DATE, A$TYPE)
    B <- B[order(B$TYPE, B$Special_Date), ]
    Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
    closest <- numeric(nrow(A))
    split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
    A$Difference <- A$DATE - closest
    A
}

# Test data, not many types
nA <- 1e3
nB <- 1e4
set.seed(1)
ta <- sample(LETTERS, nA, TRUE); da <- sample(1e2, nA, TRUE)
tb <- sample(LETTERS, nB, TRUE); db <- sample(nB, nB, TRUE)

aa <- data.frame(TYPE = ta, DATE = da)
bb <- data.frame(TYPE = tb, Special_Date = db)

t1 <- system.time(r1 <- f1(aa, bb))
t4 <- system.time(r4 <- f4(aa, bb))
rbind(t1 = t1, t4 = t4)

sum( duplicated(r4) )  # 165

Rui Barradas
Em 19-08-2012 22:58, William Dunlap escreveu:

And the following, f4,  uses the same algorithm as f2 but codes
it somewhat more efficiently.  It uses the same closestValue()
function.
f4 <- function (A, B)  {
     A$TYPE <- as.factor(A$TYPE)
     uA <- levels(A$TYPE)
     As <- split(A$DATE, A$TYPE)
     Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
     closest <- numeric(nrow(A))
     split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
     A$Difference <- A$DATE - closest
     A
}

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf
Of William Dunlap
Sent: Sunday, August 19, 2012 1:49 PM
To: Francesco; r-help@r-project.org
Subject: Re: [R] merging and obtaining the nearest value

The following, f2(A,B), should do well with lots of rows in A and B
as long as the number of types is not huge.

f2 <- function(A, B) {
     types <- as.character(unique(A$TYPE))
     result <- numeric(nrow(A))
     Bs <- split(B$Special_Date, B$TYPE)
     for(type in types) {
         w <- A$TYPE == type
         # can omit the sort() below if you know that B$Special_Date is sorted.
         result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
     }
     A$Difference <- A$DATE - result
     A
}

closestValue <- function (x, vec)
{
     # for each value in x, find closest value in vec.
     # Break ties by using highest.
     # Assume vec is sorted.
     intervalNo <- findInterval(x, vec)
     lowerValue <- vec[pmax(1, intervalNo)]
     upperValue <- vec[pmin(length(vec), intervalNo+1)]
     ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
}

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf
Of William Dunlap
Sent: Sunday, August 19, 2012 9:43 AM
To: Francesco; r-help@r-project.org
Subject: Re: [R] merging and obtaining the nearest value

How many different types are there?  Just a handful or many thousands?
For this sort of problem it is often handy to write a function which generates
datasets of the sort you are thinking of but parameterized by the
number of rows, levels, etc., so you can see how the execution time
varies with these things.

If there are just a few types, try looping over types and using findInterval
to see where A$Date fits into the sequence of B$Special_Date.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On

Behalf

Of Francesco
Sent: Sunday, August 19, 2012 4:01 AM
To: r-help@r-project.org
Subject: Re: [R] merging and obtaining the nearest value

Dear Riu, Many thanks for your suggestion

However these are just simplified examples... in reality the dataset A
contains millions of observations and B several thousands of rows...
Could I still use a modified form of your suggestion?

Thanks

On 19 August 2012 12:51, Rui Barradas <ruipbarra...@sapo.pt> wrote:

Hello,

Try the following.


A <- read.table(text="

TYPE   DATE
A            2
A            5
A            20
B            10
B            2
", header = TRUE)


B <- read.table(text="

TYPE  Special_Date
A              2
A              6
A              20
A              22
B              5
B              6
", header = TRUE)

result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, m$TYPE)),
function(x){
         a <- abs(x$DATE - x$Special_Date)
         if(nrow(x)) x[which(min(a) == a), ] }) )
result$Difference <- result$DATE - result$Special_Date
result$Special_Date <- NULL
rownames(result) <- seq_len(nrow(result))
result


Also, it's a good practice to post data examples using dput(). For instance,

dput(A)
structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = c("TYPE",
"DATE"), class = "data.frame", row.names = c(NA, -5L))

Now all we have to do is run the statement A <- structure(... etc...) to
have an exact copy of the data example.
Anyway, your example with input and the wanted result was very welcome.

Hope this helps,

Rui Barradas

Em 19-08-2012 11:10, Francesco escreveu:

Dear R-help

Î would like to know if there is a short solution in R for this
merging problem...

Let say I have a dataset A as:

TYPE   DATE
A            2
A            5
A            20
B            10
B            2

(there can be duplicates for the same type and date)

and I have another dataset B as :

TYPE  Special_Date
A              2
A              6
A              20
A              22
B              5
B              6

The question is : I would like to obtain the difference between the
date of each observation in A and the closest special date in B with
the same type. In case of ties I would take the latest date of the
two.

For example I would obtain here

TYPE   DATE   Difference
A            2            0=2-2
A            5            -1=5-6
A            20            0=20-20
B            10           +4=10-6
B            2             -3=2-5

Do you know how to (simply?) obtain this in R?

Many thanks!
Best Regards

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] merging and obtaining the nearest value

Reply via email to