Re: [R] merging and obtaining the nearest value

Francesco Mon, 20 Aug 2012 04:40:41 -0700

Dear R helpers,

What to say ? thank you very much for your very interesting algorithmic
suggestions!
Everything is fine now


Have a nice day,
Best,


On 20 August 2012 03:50, William Dunlap <wdun...@tibco.com> wrote:

> Yes, if the Special_Dates are not sorted then f4 needs to sort them.
> Perhaps closestValue should just sort its vec argument.
>
> I didn't realize that the output should not have any duplicate
> entries.  I thought it should have the same number of rows
> as the input A.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>
> > -----Original Message-----
> > From: Rui Barradas [mailto:ruipbarra...@sapo.pt]
> > Sent: Sunday, August 19, 2012 3:54 PM
> > To: William Dunlap; Francesco
> > Cc: r-help
> > Subject: Re: [R] merging and obtaining the nearest value
> >
> > Hello,
> >
> > You're right, your solution is much faster, but it doesn't remove
> > duplicates.
> > When I ran f4() with larger datasets it poduced an error,
> >
> > Error in findInterval(x, vec) : 'vec' must be sorted non-decreasingly
> >
> > So here they all are.
> >
> > f1 <- function(A, B){
> >      m <- merge(A, B)
> >      result <- do.call( rbind, lapply(split(m, list(m$DATE, m$TYPE)),
> > function(x){
> >          if(nrow(x)){
> >                a <- abs(x$DATE - x$Special_Date)
> >                x[which.min(a), ] }}) )
> >      result$Difference <- result$DATE - result$Special_Date
> >      result$Special_Date <- NULL
> >      rownames(result) <- seq_len(nrow(result))
> >      result
> > }
> >
> > closestValue <- function (x, vec)
> > {
> >      # for each value in x, find closest value in vec.
> >      # Break ties by using highest.
> >      # Assume vec is sorted.
> >      intervalNo <- findInterval(x, vec)
> >      lowerValue <- vec[pmax(1, intervalNo)]
> >      upperValue <- vec[pmin(length(vec), intervalNo+1)]
> >      ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
> > }
> > f4 <- function (A, B)  {
> >      A$TYPE <- as.factor(A$TYPE)
> >      uA <- levels(A$TYPE)
> >      As <- split(A$DATE, A$TYPE)
> >      B <- B[order(B$TYPE, B$Special_Date), ]
> >      Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
> >      closest <- numeric(nrow(A))
> >      split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
> >      A$Difference <- A$DATE - closest
> >      A
> > }
> >
> > # Test data, not many types
> > nA <- 1e3
> > nB <- 1e4
> > set.seed(1)
> > ta <- sample(LETTERS, nA, TRUE); da <- sample(1e2, nA, TRUE)
> > tb <- sample(LETTERS, nB, TRUE); db <- sample(nB, nB, TRUE)
> >
> > aa <- data.frame(TYPE = ta, DATE = da)
> > bb <- data.frame(TYPE = tb, Special_Date = db)
> >
> > t1 <- system.time(r1 <- f1(aa, bb))
> > t4 <- system.time(r4 <- f4(aa, bb))
> > rbind(t1 = t1, t4 = t4)
> >
> > sum( duplicated(r4) )  # 165
> >
> > Rui Barradas
> > Em 19-08-2012 22:58, William Dunlap escreveu:
> > > And the following, f4,  uses the same algorithm as f2 but codes
> > > it somewhat more efficiently.  It uses the same closestValue()
> > > function.
> > > f4 <- function (A, B)  {
> > >      A$TYPE <- as.factor(A$TYPE)
> > >      uA <- levels(A$TYPE)
> > >      As <- split(A$DATE, A$TYPE)
> > >      Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA))
> > >      closest <- numeric(nrow(A))
> > >      split(closest, A$TYPE) <- mapply(closestValue, As, Bs)
> > >      A$Difference <- A$DATE - closest
> > >      A
> > > }
> > >
> > > Bill Dunlap
> > > Spotfire, TIBCO Software
> > > wdunlap tibco.com
> > >
> > >
> > >> -----Original Message-----
> > >> From: r-help-boun...@r-project.org [mailto:
> r-help-boun...@r-project.org] On
> > Behalf
> > >> Of William Dunlap
> > >> Sent: Sunday, August 19, 2012 1:49 PM
> > >> To: Francesco; r-help@r-project.org
> > >> Subject: Re: [R] merging and obtaining the nearest value
> > >>
> > >> The following, f2(A,B), should do well with lots of rows in A and B
> > >> as long as the number of types is not huge.
> > >>
> > >> f2 <- function(A, B) {
> > >>      types <- as.character(unique(A$TYPE))
> > >>      result <- numeric(nrow(A))
> > >>      Bs <- split(B$Special_Date, B$TYPE)
> > >>      for(type in types) {
> > >>          w <- A$TYPE == type
> > >>          # can omit the sort() below if you know that B$Special_Date
> is sorted.
> > >>          result[w] <- closestValue(A$DATE[w], sort(Bs[[type]]))
> > >>      }
> > >>      A$Difference <- A$DATE - result
> > >>      A
> > >> }
> > >>
> > >> closestValue <- function (x, vec)
> > >> {
> > >>      # for each value in x, find closest value in vec.
> > >>      # Break ties by using highest.
> > >>      # Assume vec is sorted.
> > >>      intervalNo <- findInterval(x, vec)
> > >>      lowerValue <- vec[pmax(1, intervalNo)]
> > >>      upperValue <- vec[pmin(length(vec), intervalNo+1)]
> > >>      ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue)
> > >> }
> > >>
> > >> Bill Dunlap
> > >> Spotfire, TIBCO Software
> > >> wdunlap tibco.com
> > >>
> > >>
> > >>> -----Original Message-----
> > >>> From: r-help-boun...@r-project.org [mailto:
> r-help-boun...@r-project.org] On
> > Behalf
> > >>> Of William Dunlap
> > >>> Sent: Sunday, August 19, 2012 9:43 AM
> > >>> To: Francesco; r-help@r-project.org
> > >>> Subject: Re: [R] merging and obtaining the nearest value
> > >>>
> > >>> How many different types are there?  Just a handful or many
> thousands?
> > >>> For this sort of problem it is often handy to write a function which
> generates
> > >>> datasets of the sort you are thinking of but parameterized by the
> > >>> number of rows, levels, etc., so you can see how the execution time
> > >>> varies with these things.
> > >>>
> > >>> If there are just a few types, try looping over types and using
> findInterval
> > >>> to see where A$Date fits into the sequence of B$Special_Date.
> > >>>
> > >>>
> > >>> Bill Dunlap
> > >>> Spotfire, TIBCO Software
> > >>> wdunlap tibco.com
> > >>>
> > >>>
> > >>>> -----Original Message-----
> > >>>> From: r-help-boun...@r-project.org [mailto:
> r-help-boun...@r-project.org] On
> > >> Behalf
> > >>>> Of Francesco
> > >>>> Sent: Sunday, August 19, 2012 4:01 AM
> > >>>> To: r-help@r-project.org
> > >>>> Subject: Re: [R] merging and obtaining the nearest value
> > >>>>
> > >>>> Dear Riu, Many thanks for your suggestion
> > >>>>
> > >>>> However these are just simplified examples... in reality the
> dataset A
> > >>>> contains millions of observations and B several thousands of rows...
> > >>>> Could I still use a modified form of your suggestion?
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>> On 19 August 2012 12:51, Rui Barradas <ruipbarra...@sapo.pt> wrote:
> > >>>>> Hello,
> > >>>>>
> > >>>>> Try the following.
> > >>>>>
> > >>>>>
> > >>>>> A <- read.table(text="
> > >>>>>
> > >>>>> TYPE   DATE
> > >>>>> A            2
> > >>>>> A            5
> > >>>>> A            20
> > >>>>> B            10
> > >>>>> B            2
> > >>>>> ", header = TRUE)
> > >>>>>
> > >>>>>
> > >>>>> B <- read.table(text="
> > >>>>>
> > >>>>> TYPE  Special_Date
> > >>>>> A              2
> > >>>>> A              6
> > >>>>> A              20
> > >>>>> A              22
> > >>>>> B              5
> > >>>>> B              6
> > >>>>> ", header = TRUE)
> > >>>>>
> > >>>>> result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE,
> m$TYPE)),
> > >>>>> function(x){
> > >>>>>          a <- abs(x$DATE - x$Special_Date)
> > >>>>>          if(nrow(x)) x[which(min(a) == a), ] }) )
> > >>>>> result$Difference <- result$DATE - result$Special_Date
> > >>>>> result$Special_Date <- NULL
> > >>>>> rownames(result) <- seq_len(nrow(result))
> > >>>>> result
> > >>>>>
> > >>>>>
> > >>>>> Also, it's a good practice to post data examples using dput(). For
> instance,
> > >>>>>
> > >>>>> dput(A)
> > >>>>> structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label =
> c("A",
> > >>>>> "B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names =
> c("TYPE",
> > >>>>> "DATE"), class = "data.frame", row.names = c(NA, -5L))
> > >>>>>
> > >>>>> Now all we have to do is run the statement A <- structure(...
> etc...) to
> > >>>>> have an exact copy of the data example.
> > >>>>> Anyway, your example with input and the wanted result was very
> welcome.
> > >>>>>
> > >>>>> Hope this helps,
> > >>>>>
> > >>>>> Rui Barradas
> > >>>>>
> > >>>>> Em 19-08-2012 11:10, Francesco escreveu:
> > >>>>>> Dear R-help
> > >>>>>>
> > >>>>>> Î would like to know if there is a short solution in R for this
> > >>>>>> merging problem...
> > >>>>>>
> > >>>>>> Let say I have a dataset A as:
> > >>>>>>
> > >>>>>> TYPE   DATE
> > >>>>>> A            2
> > >>>>>> A            5
> > >>>>>> A            20
> > >>>>>> B            10
> > >>>>>> B            2
> > >>>>>>
> > >>>>>> (there can be duplicates for the same type and date)
> > >>>>>>
> > >>>>>> and I have another dataset B as :
> > >>>>>>
> > >>>>>> TYPE  Special_Date
> > >>>>>> A              2
> > >>>>>> A              6
> > >>>>>> A              20
> > >>>>>> A              22
> > >>>>>> B              5
> > >>>>>> B              6
> > >>>>>>
> > >>>>>> The question is : I would like to obtain the difference between
> the
> > >>>>>> date of each observation in A and the closest special date in B
> with
> > >>>>>> the same type. In case of ties I would take the latest date of the
> > >>>>>> two.
> > >>>>>>
> > >>>>>> For example I would obtain here
> > >>>>>>
> > >>>>>> TYPE   DATE   Difference
> > >>>>>> A            2            0=2-2
> > >>>>>> A            5            -1=5-6
> > >>>>>> A            20            0=20-20
> > >>>>>> B            10           +4=10-6
> > >>>>>> B            2             -3=2-5
> > >>>>>>
> > >>>>>> Do you know how to (simply?) obtain this in R?
> > >>>>>>
> > >>>>>> Many thanks!
> > >>>>>> Best Regards
> > >>>>>>
> > >>>>>> ______________________________________________
> > >>>>>> R-help@r-project.org mailing list
> > >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>>>>> PLEASE do read the posting guide
> > >>>>>> http://www.R-project.org/posting-guide.html
> > >>>>>> and provide commented, minimal, self-contained, reproducible code.
> > >>>>>
> > >>>> ______________________________________________
> > >>>> R-help@r-project.org mailing list
> > >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > >>>> and provide commented, minimal, self-contained, reproducible code.
> > >>> ______________________________________________
> > >>> R-help@r-project.org mailing list
> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > >>> and provide commented, minimal, self-contained, reproducible code.
> > >> ______________________________________________
> > >> R-help@r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-help
> > >> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > >> and provide commented, minimal, self-contained, reproducible code.
> > > ______________________________________________
> > > R-help@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
>
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] merging and obtaining the nearest value

Reply via email to