You could also use something more oriented toward time-series data, like
https://github.com/rackerlabs/blueflood/. You'd then have to write output
adapters to feed your data into whatever does the additional processing
elsewhere. I think the team is working on an output adapter that publishes
the rolled-up metrics (5m, 20m, 60m, etc.) to Kafka. Blueflood can also
re-emit data when some of it arrives late.
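
If you went that route, the adapter itself wouldn't be much code. Here's a
minimal sketch against the Kafka 0.8 producer API (the topic name and the
CSV payload format are invented for illustration, not anything Blueflood
defines):

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class RollupEmitter {
        private final Producer<String, String> producer;

        public RollupEmitter(String brokerList) {
            Properties props = new Properties();
            props.put("metadata.broker.list", brokerList); // e.g. "broker1:9092"
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            producer = new Producer<String, String>(new ProducerConfig(props));
        }

        // Publish one rolled-up metric to a per-granularity topic.
        public void emit(String metric, long windowStartMillis, double value) {
            String payload = metric + "," + windowStartMillis + "," + value;
            producer.send(new KeyedMessage<String, String>("rollups.5m", metric, payload));
        }

        public void close() {
            producer.close();
        }
    }

Keying by metric name keeps all rollups for a given metric in one partition,
so downstream consumers see them in order.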

-Dan


On Wed, Aug 28, 2013 at 7:55 AM, Travis Brady <travis.br...@gmail.com> wrote:

> This is a very common problem in my experience.  Late-arriving and
> semi-ordered data make a lot of stream processing problems more difficult.
>
> Are you able to perform the analysis on part of the data, for instance by
> buffering some number of events and then analyzing those?
> And how exactly do you know definitively that you've received *everything*
> for some time window?
>
> Here's what I do (using Storm+Kafka+Redshift):
>
> A Storm topology reads tuples from a Kafka topic and aggregates them using
> a Guava Cache (which provides automatic time- and size-based eviction).
> At its simplest, the key in the cache is the minute as a Unix timestamp and
> the value is a count of events for that time window.  In more complex cases
> the key is a composite data type and the value might be a StreamSummary or
> HyperLogLog from stream-lib.  Either way, I configure the Cache to evict
> entries once they've gone untouched for 30 seconds.
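>
> A rough sketch of the simple counting case, assuming Guava's CacheBuilder
> (flushToS3 is a stand-in for our real S3 sink, and tick() exists because
> Guava only evicts lazily, on access or explicit cleanup):
>
>     import java.util.concurrent.Callable;
>     import java.util.concurrent.ExecutionException;
>     import java.util.concurrent.TimeUnit;
>     import java.util.concurrent.atomic.AtomicLong;
>
>     import com.google.common.cache.Cache;
>     import com.google.common.cache.CacheBuilder;
>     import com.google.common.cache.RemovalListener;
>     import com.google.common.cache.RemovalNotification;
>
>     public class MinuteCounter {
>         // Evict any entry untouched for 30s; the removal listener hands
>         // the finished (minute, count) pair to the next stage.
>         private final Cache<Long, AtomicLong> counts = CacheBuilder.newBuilder()
>             .expireAfterAccess(30, TimeUnit.SECONDS)
>             .removalListener(new RemovalListener<Long, AtomicLong>() {
>                 public void onRemoval(RemovalNotification<Long, AtomicLong> n) {
>                     flushToS3(n.getKey(), n.getValue().get());
>                 }
>             })
>             .build();
>
>         // Called once per incoming tuple.
>         public void onTuple(long eventTimeMillis) throws ExecutionException {
>             long minute = (eventTimeMillis / 60000L) * 60000L; // truncate to the minute
>             counts.get(minute, new Callable<AtomicLong>() {
>                 public AtomicLong call() {
>                     return new AtomicLong();
>                 }
>             }).incrementAndGet();
>         }
>
>         // Guava evicts lazily, so drive cleanup periodically (e.g. from a
>         // Storm tick tuple) or idle entries would never get flushed.
>         public void tick() {
>             counts.cleanUp();
>         }
>
>         private void flushToS3(long minuteStartMillis, long count) {
>             // placeholder: write one (minute, count) record toward S3
>         }
>     }
>
> If data for an already-evicted minute shows up later, a fresh entry gets
> created and flushed again, which is why one minute can end up in multiple
> records downstream.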
>
> On eviction the data flows to S3, and from there to Redshift, and that is
> where we're able to get our canonical answer: even if a given minute is
> spread across many records in Redshift (because some data arrived late),
> we just aggregate over those records.
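>
> That last step is just a GROUP BY over the duplicated minutes. A minimal
> sketch (the table and column names are invented; Redshift can be queried
> through the standard Postgres JDBC driver):
>
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.ResultSet;
>     import java.sql.Statement;
>
>     public class MinuteRollupQuery {
>         public static void main(String[] args) throws Exception {
>             Connection conn = DriverManager.getConnection(
>                 "jdbc:postgresql://example.redshift.amazonaws.com:5439/analytics",
>                 "user", "password");
>             Statement st = conn.createStatement();
>             // A minute may appear in several rows (late data means extra
>             // flushes); summing per minute gives the canonical count.
>             ResultSet rs = st.executeQuery(
>                 "SELECT minute_ts, SUM(cnt) AS total "
>                 + "FROM minute_counts GROUP BY minute_ts ORDER BY minute_ts");
>             while (rs.next()) {
>                 System.out.println(rs.getLong("minute_ts") + " -> " + rs.getLong("total"));
>             }
>             conn.close();
>         }
>     }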
>
> You may also want to look at Algebird; it's mostly undocumented but
> provides a lot of nice primitives for doing streaming aggregation.
>
> Good luck.
>
>
> On Wed, Aug 28, 2013 at 8:13 AM, Philip O'Toole <phi...@loggly.com> wrote:
>
> > Well, you can only store data in Kafka; you can't put application logic
> > in there.
> >
> > Storm is good for processing data, but it is not a data store, so that is
> > out. Redis might work, but it is primarily an in-memory store (it seems
> > to have persistence options, but I don't know much about them).
> >
> > You could try using Kafka and Storm to write the data to something like
> > Cassandra or Elasticsearch, and perform your analysis later on the data
> > set as it lives there.
> >
> > Philip
> >
> > On Aug 28, 2013, at 5:10 AM, Yavar Husain <yavarhus...@gmail.com> wrote:
> >
> > > I have an application where I will be getting some time-series data,
> > > which I am feeding to Kafka, and Kafka in turn is giving the data to
> > > Storm for some real-time processing.
> > >
> > > Now, one of my use cases is that there might be a certain lag in my
> > > data. For example, I might not get all the data for 2:00:00 PM all
> > > together. It is possible that the data for 2:00:00 PM does not arrive
> > > all at once, and the application has to wait for all of it before
> > > performing certain analytics.
> > >
> > > For example, say at 2:00:00 PM I get 990 points, and another 10 points
> > > (say I know beforehand that there should be 1000 points of data per
> > > millisecond) arrive at 2:00:40 PM. Now I have to wait for all the data
> > > to arrive before performing the analytics.
> > >
> > > Where should I place my application logic: (1) in Kafka, (2) in Storm,
> > > or (3) should I use something like Redis to collect all the timestamped
> > > data and hand a particular time's points to Kafka/Storm only once they
> > > have all arrived?
> > >
> > > I am confused :) Any help would be appreciated. Sorry for any
> > > grammatical errors; I was just thinking aloud and jotting down my
> > > question.
> > >
> > > Regards,
> > > Yavar
> >
>



-- 
Dan Di Spaltro
