Dear R helpers, What to say ? thank you very much for your very interesting algorithmic suggestions! Everything is fine now
Have a nice day, Best, On 20 August 2012 03:50, William Dunlap <wdun...@tibco.com> wrote: > Yes, if the Special_Dates are not sorted then f4 needs to sort them. > Perhaps closestValue should just sort its vec argument. > > I didn't realize that the output should not have any duplicate > entries. I thought it should have the same number of rows > as the input A. > > Bill Dunlap > Spotfire, TIBCO Software > wdunlap tibco.com > > > > -----Original Message----- > > From: Rui Barradas [mailto:ruipbarra...@sapo.pt] > > Sent: Sunday, August 19, 2012 3:54 PM > > To: William Dunlap; Francesco > > Cc: r-help > > Subject: Re: [R] merging and obtaining the nearest value > > > > Hello, > > > > You're right, your solution is much faster, but it doesn't remove > > duplicates. > > When I ran f4() with larger datasets it poduced an error, > > > > Error in findInterval(x, vec) : 'vec' must be sorted non-decreasingly > > > > So here they all are. > > > > f1 <- function(A, B){ > > m <- merge(A, B) > > result <- do.call( rbind, lapply(split(m, list(m$DATE, m$TYPE)), > > function(x){ > > if(nrow(x)){ > > a <- abs(x$DATE - x$Special_Date) > > x[which.min(a), ] }}) ) > > result$Difference <- result$DATE - result$Special_Date > > result$Special_Date <- NULL > > rownames(result) <- seq_len(nrow(result)) > > result > > } > > > > closestValue <- function (x, vec) > > { > > # for each value in x, find closest value in vec. > > # Break ties by using highest. > > # Assume vec is sorted. > > intervalNo <- findInterval(x, vec) > > lowerValue <- vec[pmax(1, intervalNo)] > > upperValue <- vec[pmin(length(vec), intervalNo+1)] > > ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue) > > } > > f4 <- function (A, B) { > > A$TYPE <- as.factor(A$TYPE) > > uA <- levels(A$TYPE) > > As <- split(A$DATE, A$TYPE) > > B <- B[order(B$TYPE, B$Special_Date), ] > > Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA)) > > closest <- numeric(nrow(A)) > > split(closest, A$TYPE) <- mapply(closestValue, As, Bs) > > A$Difference <- A$DATE - closest > > A > > } > > > > # Test data, not many types > > nA <- 1e3 > > nB <- 1e4 > > set.seed(1) > > ta <- sample(LETTERS, nA, TRUE); da <- sample(1e2, nA, TRUE) > > tb <- sample(LETTERS, nB, TRUE); db <- sample(nB, nB, TRUE) > > > > aa <- data.frame(TYPE = ta, DATE = da) > > bb <- data.frame(TYPE = tb, Special_Date = db) > > > > t1 <- system.time(r1 <- f1(aa, bb)) > > t4 <- system.time(r4 <- f4(aa, bb)) > > rbind(t1 = t1, t4 = t4) > > > > sum( duplicated(r4) ) # 165 > > > > Rui Barradas > > Em 19-08-2012 22:58, William Dunlap escreveu: > > > And the following, f4, uses the same algorithm as f2 but codes > > > it somewhat more efficiently. It uses the same closestValue() > > > function. > > > f4 <- function (A, B) { > > > A$TYPE <- as.factor(A$TYPE) > > > uA <- levels(A$TYPE) > > > As <- split(A$DATE, A$TYPE) > > > Bs <- split(B$Special_Date, factor(B$TYPE, levels = uA)) > > > closest <- numeric(nrow(A)) > > > split(closest, A$TYPE) <- mapply(closestValue, As, Bs) > > > A$Difference <- A$DATE - closest > > > A > > > } > > > > > > Bill Dunlap > > > Spotfire, TIBCO Software > > > wdunlap tibco.com > > > > > > > > >> -----Original Message----- > > >> From: r-help-boun...@r-project.org [mailto: > r-help-boun...@r-project.org] On > > Behalf > > >> Of William Dunlap > > >> Sent: Sunday, August 19, 2012 1:49 PM > > >> To: Francesco; r-help@r-project.org > > >> Subject: Re: [R] merging and obtaining the nearest value > > >> > > >> The following, f2(A,B), should do well with lots of rows in A and B > > >> as long as the number of types is not huge. > > >> > > >> f2 <- function(A, B) { > > >> types <- as.character(unique(A$TYPE)) > > >> result <- numeric(nrow(A)) > > >> Bs <- split(B$Special_Date, B$TYPE) > > >> for(type in types) { > > >> w <- A$TYPE == type > > >> # can omit the sort() below if you know that B$Special_Date > is sorted. > > >> result[w] <- closestValue(A$DATE[w], sort(Bs[[type]])) > > >> } > > >> A$Difference <- A$DATE - result > > >> A > > >> } > > >> > > >> closestValue <- function (x, vec) > > >> { > > >> # for each value in x, find closest value in vec. > > >> # Break ties by using highest. > > >> # Assume vec is sorted. > > >> intervalNo <- findInterval(x, vec) > > >> lowerValue <- vec[pmax(1, intervalNo)] > > >> upperValue <- vec[pmin(length(vec), intervalNo+1)] > > >> ifelse(x - lowerValue < upperValue - x, lowerValue, upperValue) > > >> } > > >> > > >> Bill Dunlap > > >> Spotfire, TIBCO Software > > >> wdunlap tibco.com > > >> > > >> > > >>> -----Original Message----- > > >>> From: r-help-boun...@r-project.org [mailto: > r-help-boun...@r-project.org] On > > Behalf > > >>> Of William Dunlap > > >>> Sent: Sunday, August 19, 2012 9:43 AM > > >>> To: Francesco; r-help@r-project.org > > >>> Subject: Re: [R] merging and obtaining the nearest value > > >>> > > >>> How many different types are there? Just a handful or many > thousands? > > >>> For this sort of problem it is often handy to write a function which > generates > > >>> datasets of the sort you are thinking of but parameterized by the > > >>> number of rows, levels, etc., so you can see how the execution time > > >>> varies with these things. > > >>> > > >>> If there are just a few types, try looping over types and using > findInterval > > >>> to see where A$Date fits into the sequence of B$Special_Date. > > >>> > > >>> > > >>> Bill Dunlap > > >>> Spotfire, TIBCO Software > > >>> wdunlap tibco.com > > >>> > > >>> > > >>>> -----Original Message----- > > >>>> From: r-help-boun...@r-project.org [mailto: > r-help-boun...@r-project.org] On > > >> Behalf > > >>>> Of Francesco > > >>>> Sent: Sunday, August 19, 2012 4:01 AM > > >>>> To: r-help@r-project.org > > >>>> Subject: Re: [R] merging and obtaining the nearest value > > >>>> > > >>>> Dear Riu, Many thanks for your suggestion > > >>>> > > >>>> However these are just simplified examples... in reality the > dataset A > > >>>> contains millions of observations and B several thousands of rows... > > >>>> Could I still use a modified form of your suggestion? > > >>>> > > >>>> Thanks > > >>>> > > >>>> On 19 August 2012 12:51, Rui Barradas <ruipbarra...@sapo.pt> wrote: > > >>>>> Hello, > > >>>>> > > >>>>> Try the following. > > >>>>> > > >>>>> > > >>>>> A <- read.table(text=" > > >>>>> > > >>>>> TYPE DATE > > >>>>> A 2 > > >>>>> A 5 > > >>>>> A 20 > > >>>>> B 10 > > >>>>> B 2 > > >>>>> ", header = TRUE) > > >>>>> > > >>>>> > > >>>>> B <- read.table(text=" > > >>>>> > > >>>>> TYPE Special_Date > > >>>>> A 2 > > >>>>> A 6 > > >>>>> A 20 > > >>>>> A 22 > > >>>>> B 5 > > >>>>> B 6 > > >>>>> ", header = TRUE) > > >>>>> > > >>>>> result <- do.call( rbind, lapply(split(merge(A, B), list(m$DATE, > m$TYPE)), > > >>>>> function(x){ > > >>>>> a <- abs(x$DATE - x$Special_Date) > > >>>>> if(nrow(x)) x[which(min(a) == a), ] }) ) > > >>>>> result$Difference <- result$DATE - result$Special_Date > > >>>>> result$Special_Date <- NULL > > >>>>> rownames(result) <- seq_len(nrow(result)) > > >>>>> result > > >>>>> > > >>>>> > > >>>>> Also, it's a good practice to post data examples using dput(). For > instance, > > >>>>> > > >>>>> dput(A) > > >>>>> structure(list(TYPE = structure(c(1L, 1L, 1L, 2L, 2L), .Label = > c("A", > > >>>>> "B"), class = "factor"), DATE = c(2L, 5L, 20L, 10L, 2L)), .Names = > c("TYPE", > > >>>>> "DATE"), class = "data.frame", row.names = c(NA, -5L)) > > >>>>> > > >>>>> Now all we have to do is run the statement A <- structure(... > etc...) to > > >>>>> have an exact copy of the data example. > > >>>>> Anyway, your example with input and the wanted result was very > welcome. > > >>>>> > > >>>>> Hope this helps, > > >>>>> > > >>>>> Rui Barradas > > >>>>> > > >>>>> Em 19-08-2012 11:10, Francesco escreveu: > > >>>>>> Dear R-help > > >>>>>> > > >>>>>> Î would like to know if there is a short solution in R for this > > >>>>>> merging problem... > > >>>>>> > > >>>>>> Let say I have a dataset A as: > > >>>>>> > > >>>>>> TYPE DATE > > >>>>>> A 2 > > >>>>>> A 5 > > >>>>>> A 20 > > >>>>>> B 10 > > >>>>>> B 2 > > >>>>>> > > >>>>>> (there can be duplicates for the same type and date) > > >>>>>> > > >>>>>> and I have another dataset B as : > > >>>>>> > > >>>>>> TYPE Special_Date > > >>>>>> A 2 > > >>>>>> A 6 > > >>>>>> A 20 > > >>>>>> A 22 > > >>>>>> B 5 > > >>>>>> B 6 > > >>>>>> > > >>>>>> The question is : I would like to obtain the difference between > the > > >>>>>> date of each observation in A and the closest special date in B > with > > >>>>>> the same type. In case of ties I would take the latest date of the > > >>>>>> two. > > >>>>>> > > >>>>>> For example I would obtain here > > >>>>>> > > >>>>>> TYPE DATE Difference > > >>>>>> A 2 0=2-2 > > >>>>>> A 5 -1=5-6 > > >>>>>> A 20 0=20-20 > > >>>>>> B 10 +4=10-6 > > >>>>>> B 2 -3=2-5 > > >>>>>> > > >>>>>> Do you know how to (simply?) obtain this in R? > > >>>>>> > > >>>>>> Many thanks! > > >>>>>> Best Regards > > >>>>>> > > >>>>>> ______________________________________________ > > >>>>>> R-help@r-project.org mailing list > > >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help > > >>>>>> PLEASE do read the posting guide > > >>>>>> http://www.R-project.org/posting-guide.html > > >>>>>> and provide commented, minimal, self-contained, reproducible code. > > >>>>> > > >>>> ______________________________________________ > > >>>> R-help@r-project.org mailing list > > >>>> https://stat.ethz.ch/mailman/listinfo/r-help > > >>>> PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > >>>> and provide commented, minimal, self-contained, reproducible code. > > >>> ______________________________________________ > > >>> R-help@r-project.org mailing list > > >>> https://stat.ethz.ch/mailman/listinfo/r-help > > >>> PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > >>> and provide commented, minimal, self-contained, reproducible code. > > >> ______________________________________________ > > >> R-help@r-project.org mailing list > > >> https://stat.ethz.ch/mailman/listinfo/r-help > > >> PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > >> and provide commented, minimal, self-contained, reproducible code. > > > ______________________________________________ > > > R-help@r-project.org mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > > and provide commented, minimal, self-contained, reproducible code. > > [[alternative HTML version deleted]]
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.