Thanks to everybody for trying to help me with this; there are a few workable options here. The most efficient option I found, however, was to avoid the join/aggregate in R altogether and do it at the database level instead. That may not be a workable solution for everybody, but it's an option worth knowing about if efficiency matters.

In Oracle this would be something like:

  select groups.rangeStart, groups.rangeEnd, avg(observations.values)
    from groups, observations
   where observations.date between groups.rangeStart and groups.rangeEnd
   group by rangeStart, rangeEnd
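If you'd rather drive that query from R than run it in the database directly, a rough, untested sketch using DBI might look like the following. It assumes an already-open connection `con` (e.g. from ROracle) and the table and column names from the snippet above; note that "values" is a reserved word in some SQL dialects and may need quoting or renaming.

  library(DBI)
  ## `con` is assumed to be an open DBI connection to the database
  ## holding the `groups` and `observations` tables
  avgs <- dbGetQuery(con, "
    select g.rangeStart, g.rangeEnd, avg(o.values) as average
      from groups g
      join observations o
        on o.date between g.rangeStart and g.rangeEnd
     group by g.rangeStart, g.rangeEnd")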
On Wed, Feb 10, 2016 at 4:32 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> Oh, you didn't say the intervals could overlap!
>
> If Bill D's suggestions don't suffice, try the following:
>
> (again assuming all dates are in a form that allows comparison
> operations, e.g. via as.POSIX**)
>
> Assume you have g intervals with start dates "starts" and end dates
> "ends", and that you have d "dates".
>
> Then:
>
> wh <- outer(dates, starts, ">=") & outer(dates, ends, "<=") ## arrange ">=" and "<=" as you wish
> ## (?outer for details.)
>
> is a d x g matrix with each column's TRUE values giving the rows =
> dates contained in that column's interval.
>
> Then if "somedat" is a data vector of length d
>
> apply(wh, 2, function(x) mean(somedat[x]))
>
> will give you the means for each interval of all somedat values whose
> dates fell into that interval.
> This last step can be repeated for as many somedats as you like.
>
> Note that this is still a loop (via apply), however, so it may not
> satisfy your efficiency needs.
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
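To make that concrete, here is a minimal, self-contained version of Bert's outer()/apply() idea, run against the example data from my original post (quoted further down); it should reproduce the numbers from David's mapply() version below:

  set.seed(345)
  date.range <- seq(as.POSIXct("2015-01-01"), as.POSIXct("2015-06-01"),
                    by = "DSTday")
  observations <- data.frame(date = date.range, values = runif(152, 1, 100))
  groups <- data.frame(start = sample(date.range[1:50], 20),
                       end   = sample(date.range[51:152], 20))

  ## d x g logical matrix: column j flags the dates falling inside interval j
  wh <- outer(observations$date, groups$start, ">=") &
        outer(observations$date, groups$end,   "<=")

  ## one mean per interval (still a loop over columns, as Bert notes)
  groups$average <- apply(wh, 2, function(x) mean(observations$values[x]))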
> On Wed, Feb 10, 2016 at 2:18 PM, Peter Lomas <peter.br.lo...@gmail.com> wrote:
>> Thanks David, Bert,
>>
>> From what I'm reading on ?findInterval, it may not be workable because
>> of overlapping date ranges. findInterval seems to take a series of
>> bin breakpoints as its argument. I'm currently exploring data.table
>> documentation and will keep thinking about this.
>>
>> Just on David's point, the extension of this with "groups" would look
>> as below. I just don't want to complicate it before I've solved the
>> simplest issue.
>>
>> set.seed(345)
>> date.range <- seq(as.POSIXct("2015-01-01"), as.POSIXct("2015-06-01"),
>>                   by = "DSTday")
>> observations <- data.frame(date = date.range, a = runif(152, 1, 100),
>>                            b = runif(152, 1, 100), c = runif(152, 1, 100))
>> groups <- data.frame(start = sample(date.range[1:50], 20),
>>                      end = sample(date.range[51:152], 20),
>>                      group = sample(letters[1:3], 20, replace = TRUE),
>>                      average = NA)
>>
>> # Potential solution (too inefficient)
>> for (i in 1:NROW(groups)) {
>>   groups[i, "average"] <- mean(observations[observations$date >= groups[i, "start"] &
>>                                             observations$date <= groups[i, "end"],
>>                                             as.character(groups[i, "group"])])
>> }
>>
>> Thanks again,
>> Peter
>>
>> On Wed, Feb 10, 2016 at 2:26 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
>>> A strategy:
>>>
>>> 1. Convert your dates and intervals to numerics that give the days
>>> since a time origin. See as.POSIXlt (or **ct) for details and an
>>> example that does this. Should be fast...
>>>
>>> 2. Use the findInterval() function to get the interval into which each
>>> date falls. This **is** "vectorized" and should be fairly fast.
>>>
>>> 3. Use the ave() function with the intervals as the factor that
>>> splits the data column(s) for which you wish to compute statistics.
>>> The basic statistics functions like mean, sum, etc. **are**
>>> vectorized, so this should be fast.
>>>
>>> As David said, the *apply functions will probably not be much, if at
>>> all, faster than an explicit for() loop. Most of the time will be
>>> spent comparing the dates to the intervals to find the one into which
>>> each falls, and findInterval is a fast way to do this.
>>>
>>> ... I think.
>>>
>>> If you try this, let me know (perhaps privately) how/if it works.
>>>
>>> Cheers,
>>> Bert
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>> On Wed, Feb 10, 2016 at 1:08 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>>>>
>>>>> On Feb 10, 2016, at 12:18 PM, Peter Lomas <peter.br.lo...@gmail.com> wrote:
>>>>>
>>>>> Hello, I have a dataframe with a date range, and another dataframe
>>>>> with observations by date. For each date range, I'd like to average
>>>>> the values within that range from the other dataframe. I've provided
>>>>> code below doing what I would like, but using a for loop is too
>>>>> inefficient for my actual case (it takes about an hour). So I'm looking
>>>>> for a way to vectorize.
>>>>>
>>>>> set.seed(345)
>>>>> date.range <- seq(as.POSIXct("2015-01-01"), as.POSIXct("2015-06-01"),
>>>>>                   by = "DSTday")
>>>>> observations <- data.frame(date = date.range, values = runif(152, 1, 100))
>>>>> groups <- data.frame(start = sample(date.range[1:50], 20),
>>>>>                      end = sample(date.range[51:152], 20),
>>>>>                      average = NA)
>>>>>
>>>>> # Potential solution (too inefficient)
>>>>> for (i in 1:NROW(groups)) {
>>>>>   groups[i, "average"] <- mean(observations[observations$date >= groups[i, "start"] &
>>>>>                                             observations$date <= groups[i, "end"],
>>>>>                                             "values"])
>>>>> }
>>>>>
>>>> The 'average' column could be added to groups with this value:
>>>>
>>>> mapply(function(start, end) {
>>>>          mean(observations[["values"]][observations$date >= start &
>>>>                                        observations$date <= end])
>>>>        },
>>>>        groups$start, groups$end)
>>>>
>>>> [1] 50.96831 49.42286 47.27240 49.07534 47.66570 49.30977 48.47503 47.74036
>>>> [9] 46.02527 58.76492 48.86580 49.90655 45.79705 48.84071 39.53846 46.44601
>>>> [17] 47.06631 47.74199 49.16980 46.85131
>>>>
>>>> I don't really think this is fully "vectorized" in the usual R meaning of
>>>> the word, and I don't expect it to be any faster than the for loop.
>>>> Perhaps some of the range functions in the data.table package could
>>>> accelerate your processing. If you don't get any volunteers on this list,
>>>> you could repost the question on StackOverflow after a suitable pause that
>>>> avoids accusations of cross-posting. SO has several skilled users of
>>>> data.table functions.
>>>>
>>>>> As an extension to this, there will end up being multiple value
>>>>> columns, and each range will also identify which column to average. I
>>>>> think if I can figure out the first problem I can try to extend it
>>>>> myself.
>>>>
>>>> Sorry, I didn't understand what was being described in that paragraph.
>>>>
>>>> --
>>>>
>>>> David Winsemius
>>>> Alameda, CA, USA
>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
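For the archives, here is a rough sketch of the findInterval()/ave() strategy Bert describes above. As noted in the thread, it only applies when the date ranges do not overlap, so the breakpoints below are invented purely for illustration:

  ## findInterval() wants one non-decreasing vector of breakpoints,
  ## i.e. non-overlapping bins; these breaks are made up for illustration
  breaks <- as.numeric(as.POSIXct(c("2015-01-01", "2015-03-01", "2015-05-01")))
  bin <- findInterval(as.numeric(observations$date), breaks)

  ## ave() returns one mean per row of `observations`, aligned with its rows;
  ## tapply() gives one mean per bin instead
  observations$binmean <- ave(observations$values, bin)
  tapply(observations$values, bin, mean)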
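And following David's data.table pointer, an untested sketch using foverlaps() (data.table's overlap join) on the same example data; the details here are my best guess rather than a verified recipe:

  library(data.table)
  obs <- as.data.table(observations)
  grp <- as.data.table(groups)[, list(start, end)]

  ## represent each date as a zero-width interval so it can fall "within" a range
  obs[, c("dstart", "dend") := list(date, date)]
  setkey(grp, start, end)            # foverlaps() needs a keyed y table

  ## one row per (observation, range) match, then average within each range;
  ## note: ranges containing no dates drop out here (vs. NaN from the loop)
  ov <- foverlaps(obs, grp, by.x = c("dstart", "dend"), type = "within")
  averages <- ov[, list(average = mean(values)), by = list(start, end)]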