Thanks to everybody for trying to help me with this; there are a few workable options here. The most efficient option I found, however, was to avoid the join/aggregate in R altogether and do it at the database level instead. That may not be a workable solution for everybody, but it's an option worth knowing about if efficiency matters.

In Oracle this would be something like:

  select groups.rangeStart, groups.rangeEnd, avg(observations.values)
    from groups, observations
   where observations.date between groups.rangeStart and groups.rangeEnd
   group by rangeStart, rangeEnd
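If you'd rather drive that query from R than run it in the database directly, a rough, untested sketch using DBI might look like the following. It assumes an already-open connection `con` (e.g. from ROracle) and the table and column names from the snippet above; note that "values" is a reserved word in some SQL dialects and may need quoting or renaming.

  library(DBI)
  ## `con` is assumed to be an open DBI connection to the database
  ## holding the `groups` and `observations` tables
  avgs <- dbGetQuery(con, "
    select g.rangeStart, g.rangeEnd, avg(o.values) as average
      from groups g
      join observations o
        on o.date between g.rangeStart and g.rangeEnd
     group by g.rangeStart, g.rangeEnd")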
On Wed, Feb 10, 2016 at 4:32 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> Oh, you didn't say the intervals could overlap!
>
> If Bill D's suggestions don't suffice, try the following:
>
> (again assuming all dates are in a form that allows comparison
> operations, e.g. via as.POSIX**)
>
> Assume you have g intervals with start dates "starts" and end dates
> "ends", and that you have d "dates".
>
> Then:
>
> wh <- outer(dates, starts, ">=") & outer(dates, ends, "<=") ## arrange ">=" and "<=" as you wish
> ## (?outer for details.)
>
> is a d x g matrix with each column's TRUE values giving the rows =
> dates contained in that column's interval.
>
> Then if "somedat" is a data vector of length d
>
> apply(wh, 2, function(x) mean(somedat[x]))
>
> will give you the means for each interval of all somedat values whose
> dates fell into that interval.
> This last step can be repeated for as many somedats as you like.
>
> Note that this is still a loop (via apply), however, so it may not
> satisfy your efficiency needs.
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
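To make that concrete, here is a minimal, self-contained version of Bert's outer()/apply() idea, run against the example data from my original post (quoted further down); it should reproduce the numbers from David's mapply() version below:

  set.seed(345)
  date.range <- seq(as.POSIXct("2015-01-01"), as.POSIXct("2015-06-01"),
                    by = "DSTday")
  observations <- data.frame(date = date.range, values = runif(152, 1, 100))
  groups <- data.frame(start = sample(date.range[1:50], 20),
                       end   = sample(date.range[51:152], 20))

  ## d x g logical matrix: column j flags the dates falling inside interval j
  wh <- outer(observations$date, groups$start, ">=") &
        outer(observations$date, groups$end,   "<=")

  ## one mean per interval (still a loop over columns, as Bert notes)
  groups$average <- apply(wh, 2, function(x) mean(observations$values[x]))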
> On Wed, Feb 10, 2016 at 2:18 PM, Peter Lomas <peter.br.lo...@gmail.com> wrote:
>> Thanks David, Bert,
>>
>> From what I'm reading on ?findInterval, it may not be workable because
>> of overlapping date ranges. findInterval seems to take a series of
>> bin breakpoints as its argument. I'm currently exploring data.table
>> documentation and will keep thinking about this.
>>
>> Just on David's point, the extension of this with "groups" would look
>> as below. I just don't want to complicate it before I've solved the
>> simplest issue.
>>
>> set.seed(345)
>> date.range <- seq(as.POSIXct("2015-01-01"), as.POSIXct("2015-06-01"),
>>                   by = "DSTday")
>> observations <- data.frame(date = date.range, a = runif(152, 1, 100),
>>                            b = runif(152, 1, 100), c = runif(152, 1, 100))
>> groups <- data.frame(start = sample(date.range[1:50], 20),
>>                      end = sample(date.range[51:152], 20),
>>                      group = sample(letters[1:3], 20, replace = TRUE),
>>                      average = NA)
>>
>> # Potential solution (too inefficient)
>> for (i in 1:NROW(groups)) {
>>   groups[i, "average"] <- mean(observations[observations$date >= groups[i, "start"] &
>>                                             observations$date <= groups[i, "end"],
>>                                             as.character(groups[i, "group"])])
>> }
>>
>> Thanks again,
>> Peter
>>
>> On Wed, Feb 10, 2016 at 2:26 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
>>> A strategy:
>>>
>>> 1. Convert your dates and intervals to numerics that give the days
>>> since a time origin. See as.POSIXlt (or **ct) for details and an
>>> example that does this. Should be fast...
>>>
>>> 2. Use the findInterval() function to get the interval into which each
>>> date falls. This **is** "vectorized" and should be fairly fast.
>>>
>>> 3. Use the ave() function with the intervals as the factor that
>>> splits the data column(s) for which you wish to compute statistics.
>>> The basic statistics functions like mean, sum, etc. **are**
>>> vectorized, so this should be fast.
>>>
>>> As David said, the *apply functions will probably not be much, if at
>>> all, faster than an explicit for() loop. Most of the time will be
>>> spent comparing the dates to the intervals to find the one into which
>>> each falls, and findInterval is a fast way to do this.
>>>
>>> ... I think.
>>>
>>> If you try this, let me know (perhaps privately) how/if it works.
>>>
>>> Cheers,
>>> Bert
>>>
>>> Bert Gunter
>>>
>>> "The trouble with having an open mind is that people keep coming along
>>> and sticking things into it."
>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>
>>> On Wed, Feb 10, 2016 at 1:08 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>>>>
>>>>> On Feb 10, 2016, at 12:18 PM, Peter Lomas <peter.br.lo...@gmail.com> wrote:
>>>>>
>>>>> Hello, I have a dataframe with a date range, and another dataframe
>>>>> with observations by date. For each date range, I'd like to average
>>>>> the values within that range from the other dataframe. I've provided
>>>>> code below doing what I would like, but using a for loop is too
>>>>> inefficient for my actual case (it takes about an hour). So I'm looking
>>>>> for a way to vectorize.
>>>>>
>>>>> set.seed(345)
>>>>> date.range <- seq(as.POSIXct("2015-01-01"), as.POSIXct("2015-06-01"),
>>>>>                   by = "DSTday")
>>>>> observations <- data.frame(date = date.range, values = runif(152, 1, 100))
>>>>> groups <- data.frame(start = sample(date.range[1:50], 20),
>>>>>                      end = sample(date.range[51:152], 20),
>>>>>                      average = NA)
>>>>>
>>>>> # Potential solution (too inefficient)
>>>>> for (i in 1:NROW(groups)) {
>>>>>   groups[i, "average"] <- mean(observations[observations$date >= groups[i, "start"] &
>>>>>                                             observations$date <= groups[i, "end"],
>>>>>                                             "values"])
>>>>> }
>>>>>
>>>> The 'average' column could be added to groups with this value:
>>>>
>>>> mapply(function(start, end) {
>>>>          mean(observations[["values"]][observations$date >= start &
>>>>                                        observations$date <= end])
>>>>        },
>>>>        groups$start, groups$end)
>>>>
>>>> [1] 50.96831 49.42286 47.27240 49.07534 47.66570 49.30977 48.47503 47.74036
>>>> [9] 46.02527 58.76492 48.86580 49.90655 45.79705 48.84071 39.53846 46.44601
>>>> [17] 47.06631 47.74199 49.16980 46.85131
>>>>
>>>> I don't really think this is fully "vectorized" in the usual R meaning of
>>>> the word, and I don't expect it to be any faster than the for loop.
>>>> Perhaps some of the range functions in the data.table package could
>>>> accelerate your processing. If you don't get any volunteers on this list,
>>>> you could repost the question on StackOverflow after a suitable pause that
>>>> avoids accusations of cross-posting. SO has several skilled users of
>>>> data.table functions.
>>>>
>>>>> As an extension to this, there will end up being multiple value
>>>>> columns, and each range will also identify which column to average. I
>>>>> think if I can figure out the first problem I can try to extend it
>>>>> myself.
>>>>
>>>> Sorry, I didn't understand what was being described in that paragraph.
>>>>
>>>> --
>>>>
>>>> David Winsemius
>>>> Alameda, CA, USA
>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
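For the archives, here is a rough sketch of the findInterval()/ave() strategy Bert describes above. As noted in the thread, it only applies when the date ranges do not overlap, so the breakpoints below are invented purely for illustration:

  ## findInterval() wants one non-decreasing vector of breakpoints,
  ## i.e. non-overlapping bins; these breaks are made up for illustration
  breaks <- as.numeric(as.POSIXct(c("2015-01-01", "2015-03-01", "2015-05-01")))
  bin <- findInterval(as.numeric(observations$date), breaks)

  ## ave() returns one mean per row of `observations`, aligned with its rows;
  ## tapply() gives one mean per bin instead
  observations$binmean <- ave(observations$values, bin)
  tapply(observations$values, bin, mean)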
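And following David's data.table pointer, an untested sketch using foverlaps() (data.table's overlap join) on the same example data; the details here are my best guess rather than a verified recipe:

  library(data.table)
  obs <- as.data.table(observations)
  grp <- as.data.table(groups)[, list(start, end)]

  ## represent each date as a zero-width interval so it can fall "within" a range
  obs[, c("dstart", "dend") := list(date, date)]
  setkey(grp, start, end)            # foverlaps() needs a keyed y table

  ## one row per (observation, range) match, then average within each range;
  ## note: ranges containing no dates drop out here (vs. NaN from the loop)
  ov <- foverlaps(obs, grp, by.x = c("dstart", "dend"), type = "within")
  averages <- ov[, list(average = mean(values)), by = list(start, end)]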