There are different ways to do the rollups. Either update rollups from the streaming application, or you can generate roll ups in a later process - say periodic Spark job every hour. Or you could just generate rollups on demand, when it is queried. The whole thing depends on your downstream requirements - if you always to have up to date rollups to show up in dashboard (even day-level stuff), then the first approach is better. Otherwise, second and third approaches are more efficient.
TD On Wed, Nov 18, 2015 at 7:15 AM, Sandip Mehta <[email protected]> wrote: > TD thank you for your reply. > > I agree on data store requirement. I am using HBase as an underlying store. > > So for every batch interval of say 10 seconds > > - Calculate the time dimension ( minutes, hours, day, week, month and > quarter ) along with other dimensions and metrics > - Update relevant base table at each batch interval for relevant metrics > for a given set of dimensions. > > Only caveat I see is I’ll have to update each of the different roll up > table for each batch window. > > Is this a valid approach for calculating time series aggregation? > > Regards > SM > > For minutes level aggregates I have set up a streaming window say 10 > seconds and storing minutes level aggregates across multiple dimension in > HBase at every window interval. > > On 18-Nov-2015, at 7:45 AM, Tathagata Das <[email protected]> wrote: > > For this sort of long term aggregations you should use a dedicated data > storage systems. Like a database, or a key-value store. Spark Streaming > would just aggregate and push the necessary data to the data store. > > TD > > On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta <[email protected]> > wrote: > >> Hi, >> >> I am working on requirement of calculating real time metrics and building >> prototype on Spark streaming. I need to build aggregate at Seconds, >> Minutes, Hours and Day level. >> >> I am not sure whether I should calculate all these aggregates as >> different Windowed function on input DStream or shall I use >> updateStateByKey function for the same. If I have to use updateStateByKey >> for these time series aggregation, how can I remove keys from the state >> after different time lapsed? >> >> Please suggest. >> >> Regards >> SM >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [email protected] >> For additional commands, e-mail: [email protected] >> >> > >
