And here is a pure R solution: > m <- merge(df1, df2, by = "category") > m$datediff <- m$date.x - m$date.y > m <- m[order(m$category, m$date.x, m$date.y), ] > m category A.x date.x A.y date.y datediff 2 1 124 2003-02-08 28 2003-05-17 -98 days 1 1 124 2003-02-08 116 2003-11-29 -294 days 6 1 22 2008-08-16 28 2003-05-17 1918 days 5 1 22 2008-08-16 116 2003-11-29 1722 days 4 1 96 2008-11-29 28 2003-05-17 2023 days 3 1 96 2008-11-29 116 2003-11-29 1827 days 10 2 18 2001-12-01 12 2005-02-26 -1183 days 9 2 18 2001-12-01 6 2008-10-25 -2520 days 8 2 150 2002-01-12 12 2005-02-26 -1141 days 7 2 150 2002-01-12 6 2008-10-25 -2478 days 14 3 24 2003-09-13 109 2005-10-01 -749 days 13 3 24 2003-09-13 92 2007-08-18 -1435 days 12 3 175 2009-08-01 109 2005-10-01 1400 days 11 3 175 2009-08-01 92 2007-08-18 714 days 24 4 126 2000-11-04 65 2000-11-18 -14 days 26 4 126 2000-11-04 91 2003-05-10 -917 days 25 4 126 2000-11-04 15 2003-07-26 -994 days 23 4 126 2000-11-04 54 2008-11-22 -2940 days 16 4 70 2004-03-13 65 2000-11-18 1211 days 18 4 70 2004-03-13 91 2003-05-10 308 days 17 4 70 2004-03-13 15 2003-07-26 231 days 15 4 70 2004-03-13 54 2008-11-22 -1715 days 20 4 64 2007-06-02 65 2000-11-18 2387 days 22 4 64 2007-06-02 91 2003-05-10 1484 days 21 4 64 2007-06-02 15 2003-07-26 1407 days 19 4 64 2007-06-02 54 2008-11-22 -539 days
On Fri, May 14, 2010 at 6:38 PM, Gabor Grothendieck <ggrothendi...@gmail.com> wrote: > Generating df1 and df2 as in your post try this (and see > http://sqldf.googlecode.com for more info): > >> library(sqldf) >> out <- sqldf("select category, > + df1.date date1, > + df2.date date2, > + df1.date - df2.date datediff > + from df1 join df2 using(category) > + order by category, date1, date2") >> >> out[[2]] <- as.Date(out[[2]], origin = "1970-01-01") >> out[[3]] <- as.Date(out[[3]], origin = "1970-01-01") >> out > category date1 date2 datediff > 1 1 2003-02-08 2003-05-17 -98 > 2 1 2003-02-08 2003-11-29 -294 > 3 1 2008-08-16 2003-05-17 1918 > 4 1 2008-08-16 2003-11-29 1722 > 5 1 2008-11-29 2003-05-17 2023 > 6 1 2008-11-29 2003-11-29 1827 > 7 2 2001-12-01 2005-02-26 -1183 > 8 2 2001-12-01 2008-10-25 -2520 > 9 2 2002-01-12 2005-02-26 -1141 > 10 2 2002-01-12 2008-10-25 -2478 > 11 3 2003-09-13 2005-10-01 -749 > 12 3 2003-09-13 2007-08-18 -1435 > 13 3 2009-08-01 2005-10-01 1400 > 14 3 2009-08-01 2007-08-18 714 > 15 4 2000-11-04 2000-11-18 -14 > 16 4 2000-11-04 2003-05-10 -917 > 17 4 2000-11-04 2003-07-26 -994 > 18 4 2000-11-04 2008-11-22 -2940 > 19 4 2004-03-13 2000-11-18 1211 > 20 4 2004-03-13 2003-05-10 308 > 21 4 2004-03-13 2003-07-26 231 > 22 4 2004-03-13 2008-11-22 -1715 > 23 4 2007-06-02 2000-11-18 2387 > 24 4 2007-06-02 2003-05-10 1484 > 25 4 2007-06-02 2003-07-26 1407 > 26 4 2007-06-02 2008-11-22 -539 > > On Fri, May 14, 2010 at 6:13 PM, Jonathan <jonsle...@gmail.com> wrote: >> Hi All, >> I've come up with a solution for this problem that relies on a for loop, >> and I was wondering if anybody had any insight into a more elegant method: >> >> I have two data frames, each has a column for categorical data and a column >> for date. What I'd like to do, ideally, is calculate the number of days >> between all pairs of dates in data frame 1 and data frame 2 (*but only for >> members of the same category*). The number of members of each category >> varies between the two data frames. >> >> For example: >> >> >>> d <- seq(as.Date("2000-02-12"), as.Date("2009-08-18"), by="weeks") >> >>> df1 <- data.frame('A'=sample(1:200,10), >>> 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE)) >> >>> df2 <- data.frame('A'=sample(1:200,10), >>> 'date'=d[sample(1:length(d),10)],'category'=sample(1:4,10,replace=TRUE)) >> >> >>> df1 >> A date category >> 1 93 2004-02-28 3 >> 2 105 2001-03-17 3 >> 3 189 2009-07-04 2 >> 4 130 2003-07-05 2 >> 5 160 2005-09-24 2 >> 6 32 2004-11-06 2 >> 7 117 2007-03-17 1 >> 8 161 2003-07-19 4 >> 9 153 2001-09-15 3 >> 10 173 2005-08-27 1 >> >> >>> df2 >> A date category >> 1 102 2006-08-19 3 >> 2 68 2004-11-27 2 >> 3 137 2003-01-11 1 >> 4 39 2002-12-28 2 >> 5 127 2004-03-06 4 >> 6 125 2002-02-23 2 >> 7 150 2002-05-18 4 >> 8 19 2003-02-22 1 >> 9 80 2000-08-05 1 >> 10 94 2003-12-27 1 >> >> >> Within a loop, I'd do the following (i is my counter; for the example, >> I set it to 1): >> >> >>> i<-1 >> >> # Create the data frames: >> >>> yeari_1 <- df1[which(df1['category']==i),]; yeari_2 <- >>> df2[which(df2['category']==i),] >> >> # Select only the data from category i >> >>> yeari_1 >> A date category >> 7 117 2007-03-17 1 >> 10 173 2005-08-27 1 >> >>> yeari_2 >> A date category >> 3 137 2003-01-11 1 >> 8 19 2003-02-22 1 >> 9 80 2000-08-05 1 >> 10 94 2003-12-27 1 >> >> # Convert dates to integers >> >> year1_i[[2]] <- as.integer(as.Date(yeari_1[[2]])); yeari_2[[2]] <- >> as.integer(as.Date(yeari_2[[2]])); >> >>> yeari_1 >> A date category >> 7 117 13589 1 >> 10 173 13022 1 >>> yeari_2 >> A date category >> 3 137 12063 1 >> 8 19 12105 1 >> 9 80 11174 1 >> 10 94 12413 1 >> >> # Get differences of all pairs: >> >>> result <- outer(yeari_1[[2]],yeari_2[[2]],'-') >>> result >> [,1] [,2] [,3] [,4] >> [1,] 1526 1484 2415 1176 >> [2,] 959 917 1848 609 >> >> # Now, merge the results with the results from all the earlier >> iterations for previous values of i, increment i to the next value, >> and repeat. >> >> >> ---- >> >> Ideally, I could accomplish this in some sort of vectorized manner, >> although the Force is not yet strong with me. Any ideas would be >> appreciated! >> >> >> Regards, >> >> Jonathan >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.