Re: Calculating Timeseries Aggregation

Tathagata Das Wed, 18 Nov 2015 17:29:18 -0800

There are different ways to do the rollups. Either update rollups from the
streaming application, or you can generate roll ups in a later process -
say periodic Spark job every hour. Or you could just generate rollups on
demand, when it is queried.
The whole thing depends on your downstream requirements - if you always to
have up to date rollups to show up in dashboard (even day-level stuff),
then the first approach is better. Otherwise, second and third approaches
are more efficient.


TD


On Wed, Nov 18, 2015 at 7:15 AM, Sandip Mehta <[email protected]>
wrote:

> TD thank you for your reply.
>
> I agree on data store requirement. I am using HBase as an underlying store.
>
> So for every batch interval of say 10 seconds
>
> - Calculate the time dimension ( minutes, hours, day, week, month and
> quarter ) along with other dimensions and metrics
> - Update relevant base table at each batch interval for relevant metrics
> for a given set of dimensions.
>
> Only caveat I see is I’ll have to update each of the different roll up
> table for each batch window.
>
> Is this a valid approach for calculating time series aggregation?
>
> Regards
> SM
>
> For minutes level aggregates I have set up a streaming window say 10
> seconds and storing minutes level aggregates across multiple dimension in
> HBase at every window interval.
>
> On 18-Nov-2015, at 7:45 AM, Tathagata Das <[email protected]> wrote:
>
> For this sort of long term aggregations you should use a dedicated data
> storage systems. Like a database, or a key-value store. Spark Streaming
> would just aggregate and push the necessary data to the data store.
>
> TD
>
> On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta <[email protected]>
> wrote:
>
>> Hi,
>>
>> I am working on requirement of calculating real time metrics and building
>> prototype  on Spark streaming. I need to build aggregate at Seconds,
>> Minutes, Hours and Day level.
>>
>> I am not sure whether I should calculate all these aggregates as
>> different Windowed function on input DStream or shall I use
>> updateStateByKey function for the same. If I have to use updateStateByKey
>> for these time series aggregation, how can I remove keys from the state
>> after different time lapsed?
>>
>> Please suggest.
>>
>> Regards
>> SM
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
>

Re: Calculating Timeseries Aggregation

Reply via email to