This is a very common problem in my experience.  Late-arriving and
semi-ordered data make a lot of stream processing problems more difficult.

Are you able to perform analysis with part of the data?  For instance,
buffering some number of events and then analyzing?
How exactly do you know definitively that you've received *everything* for
some time window?

Here's what I do (using Storm+Kafka+Redshift):

A Storm topo reads tuples from a Kafka topic and aggregates them using a
Guava Cache (which provides automatic time- and size-based eviction).
At its simplest, the key in the cache is the minute as a Unix timestamp and
the value is a count of events for that time window.  In more complex cases
the key is a composite data type and the value might be a StreamSummary or
HyperLogLog class from stream-lib.  Either way, I configure the Cache to
evict entries once they've gone untouched for 30 seconds.
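
To make that concrete, here's a rough sketch of the cache setup (Scala
against the Guava API; the key layout, sizes, and helper names are
illustrative, not our exact code):

    import java.util.concurrent.TimeUnit
    import java.util.concurrent.atomic.AtomicLong
    import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

    // Key: the event's timestamp truncated to the minute (epoch seconds);
    // value: running count of events seen for that window so far.
    val counts: LoadingCache[java.lang.Long, AtomicLong] =
      CacheBuilder.newBuilder()
        .expireAfterAccess(30, TimeUnit.SECONDS) // untouched for 30s => evict
        .maximumSize(100000)                     // size bound is illustrative
        .build(new CacheLoader[java.lang.Long, AtomicLong] {
          override def load(minute: java.lang.Long) = new AtomicLong(0)
        })

    // Called from the bolt's execute() for every tuple.
    def record(epochSeconds: Long): Unit =
      counts.get(epochSeconds - (epochSeconds % 60)).incrementAndGet()

    // Guava expires entries lazily, during other cache activity, so give
    // it a periodic nudge; a Storm tick tuple works well for this.
    def onTick(): Unit = counts.cleanUp()

Note it's expireAfterAccess rather than expireAfterWrite: every straggler
for a minute touches that entry, so a window only closes once it has gone
quiet for the full 30 seconds.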

On eviction, the data flows to S3 and from there to Redshift.  That is
where we get our canonical answer: even if a given minute is spread across
many records in Redshift (because some data arrived late), we just
aggregate over those records.
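
The eviction hook itself is just a Guava RemovalListener handed to the
builder above; roughly this (flushToS3 here is a placeholder for whatever
batching/upload you use, not a real helper):

    import java.util.concurrent.atomic.AtomicLong
    import com.google.common.cache.{RemovalListener, RemovalNotification}

    // Placeholder: in reality this batches records and ships them to S3,
    // where they get loaded into Redshift.
    def flushToS3(minute: Long, count: Long): Unit = ???

    // Wire this in with .removalListener(...) before build().
    val flushListener = new RemovalListener[java.lang.Long, AtomicLong] {
      override def onRemoval(
          n: RemovalNotification[java.lang.Long, AtomicLong]): Unit = {
        // A minute that keeps receiving stragglers can be evicted and
        // written more than once; that's fine, because Redshift just
        // groups by the minute and sums the partial counts.
        flushToS3(n.getKey, n.getValue.get())
      }
    }

The key property is that partial counts for the same window are additive,
so duplicate rows downstream are harmless.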

You may want to look at algebird; it's mostly undocumented but provides a
lot of nice primitives for streaming aggregation.
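
As a taste (illustrative; this is just algebird's Operators sugar, and the
numbers echo your 990 + 10 example below):

    import com.twitter.algebird.Operators._

    // Partial counts for the same window form a Monoid: merging a late
    // batch is just map addition, keyed by the window's timestamp.
    val first  = Map(1377698400L -> 990L) // 990 points at 2:00:00 PM
    val late   = Map(1377698400L -> 10L)  // 10 stragglers at 2:00:40 PM
    val merged = first + late             // Map(1377698400L -> 1000L)

The same + works once the values are HyperLogLogs or CountMinSketches
instead of plain counts, which is what makes it pair nicely with the
cache-and-evict approach above.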

Good luck.


On Wed, Aug 28, 2013 at 8:13 AM, Philip O'Toole <phi...@loggly.com> wrote:

> Well, you can only store data in Kafka; you can't put application logic in
> there.
>
> Storm is good for processing data, but it is not a data store, so that is
> out. Redis might work, but it is only an in-memory store (it seems to have
> persistence, but I don't know much about that).
>
> You could try using Kafka and Storm to write the data to something like
> Cassandra or Elasticsearch, and perform your analysis later on the data
> set as it lives there.
>
> Philip
>
> On Aug 28, 2013, at 5:10 AM, Yavar Husain <yavarhus...@gmail.com> wrote:
>
> > I have an application where I will be getting some time-series data,
> > which I am feeding to Kafka; Kafka in turn is giving the data to Storm
> > for some real-time processing.
> >
> > Now one of my use cases is that there might be a certain lag in my
> > data. For example: I might not get all the data for 2:00:00 PM all
> > together. There is a possibility that all the data for 2:00:00 PM does
> > not arrive at once, and the application has to wait for all of it to
> > arrive before performing certain analytics.
> >
> > For example, say at 2:00:00 PM I get 990 points, and another 10 points
> > (say I know beforehand that there would be 1000 points of data per
> > millisecond) arrive at 2:00:40 PM. Now I have to wait for all the data
> > to arrive to perform analytics.
> >
> > Where should I place my application logic: (1) in Kafka, (2) in Storm,
> > or (3) should I use something like Redis to collect all the timestamped
> > data, and only when I have all the points for a particular time hand
> > them to Kafka/Storm?
> >
> > I am confused :) Any help would be appreciated. Sorry for any
> > grammatical errors, as I was just thinking aloud while jotting down my
> > question.
> >
> > Regards,
> > Yavar
>
