A faster solution using tapply was sent to me via email:

testtapply = function(p) {
  df = randomdf(p)
  system.time({
    res = tapply(df$x2, df$x1, min)                      # group minima (class dropped, so day counts)
    res = as.Date(res, origin = as.Date('1970-01-01'))   # convert back to Date
    df$mindate = res[as.character(df$x1)]                # map each row's x1 to its group minimum
  })
}
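For comparison, here is an untested base-R sketch along the same lines (my own addition, not from the email, and not benchmarked against the timings below; it assumes the same randomdf helper): ave() applies min within each x1 group and recycles the result back onto every row, so no lookup by name is needed.

df <- randomdf(4)
# Dates are passed as numbers and converted back explicitly,
# mirroring the as.Date() step in testtapply above.
df$mindate <- as.Date(ave(as.numeric(df$x2), df$x1, FUN = min),
                      origin = "1970-01-01")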
Thanks Phil!

Tahir

On Thu, Nov 19, 2009 at 5:19 PM, Tahir Butt <tahir.b...@gmail.com> wrote:
> I've only recently started using R. One of the problems I come up against
> is that after extracting a large dataset (>5M rows) out of a database, I
> realize I need another variable. In this case I have a data frame with
> dates, and I want to find the minimum date for each value of x1 and add
> that minimum date to my data frame.
>
> randomdf <- function(p) {
>   data.frame(x1 = sample(1:10^4, 10^p, replace = T),
>              x2 = sample(seq.Date(Sys.Date() - 356*3, Sys.Date(), by = "day"),
>                          10^p, replace = T),
>              y1 = sample(1:100, 10^p, replace = T))
> }
>
> testby <- function(p) {
>   df <- randomdf(p)
>   system.time(by(df, df$x1, function(dfi) { min(dfi$x2) }))
> }
>
> lapply(c(1, 2, 3, 4, 5), testby)
> [[1]]
>    user  system elapsed
>   0.006   0.000   0.006
>
> [[2]]
>    user  system elapsed
>   0.024   0.000   0.025
>
> [[3]]
>    user  system elapsed
>   0.233   0.000   0.234
>
> [[4]]
>    user  system elapsed
>   1.996   0.026   2.022
>
> [[5]]
>    user  system elapsed
>  11.030   0.000  11.032
>
> Strangely enough (I'm not sure why), the result of by with the min
> function is not Date objects but integers representing days since an
> origin. Is there a min function that would return a date instead of an
> integer, or is this a result of using by?
>
> I also wanted to see how ddply compares.
>
> library(plyr)
> testddply <- function(p) {
>   pdf <- randomdf(p)
>   system.time(ddply(pdf, .(x1), function(df) { data.frame(min(df$x2)) }))
> }
>
> lapply(c(1, 2, 3, 4, 5), testddply)
> [[1]]
>    user  system elapsed
>   0.020   0.000   0.021
>
> [[2]]
>    user  system elapsed
>   0.119   0.000   0.119
>
> [[3]]
>    user  system elapsed
>   1.008   0.000   1.008
>
> [[4]]
>    user  system elapsed
>   8.425   0.001   8.428
>
> [[5]]
>    user  system elapsed
>  23.070   0.000  23.075
>
> Once the data frame gets above 1M rows, the timings are too long (on a
> previous run the user time went up to 8000s). This seems quite a bit
> slower than I expected. Maybe there's a better and faster way to add
> variables like this, derived by some aggregation, to a data frame.
>
> Also, ddply seems to take twice as long as by. Are these two operations
> not equivalent?
>
> Thanks,
> Tahir
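A note on the Date question in the quoted message: it does not appear to be specific to min(). When by() or tapply() simplifies scalar results into a plain array, the "Date" class is dropped and only the underlying day counts since 1970-01-01 remain. A minimal sketch of the round trip, assuming the same randomdf helper (an illustration of mine, not from the thread):

df  <- randomdf(3)
res <- tapply(df$x2, df$x1, min)            # simplified to a plain numeric array
head(res)                                   # prints day counts, not dates
head(as.Date(res, origin = "1970-01-01"))   # "Date" class restored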