You could also use something more oriented toward time-series data, like https://github.com/rackerlabs/blueflood/. You'd then have to write output adapters to feed your data into whatever additional processing happens elsewhere. I think the team is working on an output adapter that pushes the rolled-up metrics (5m, 20m, 60m, etc.) to Kafka. Blueflood can also re-emit data when some of it arrives late.
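
For what it's worth, the cache-eviction pattern Travis describes below might look roughly like this. This is just an untested sketch: flushToS3() is a stand-in for whatever ships evicted entries toward S3/Redshift, and you'd call record() from your bolt's execute():

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.RemovalListener;

    public class MinuteCounter {

        private final Cache<Long, AtomicLong> counts;

        public MinuteCounter() {
            // flush a (minute, count) pair downstream whenever it is evicted
            RemovalListener<Long, AtomicLong> onEvict =
                    n -> flushToS3(n.getKey(), n.getValue().get());
            this.counts = CacheBuilder.newBuilder()
                    .expireAfterAccess(30, TimeUnit.SECONDS) // untouched for 30s
                    .removalListener(onEvict)
                    .build();
        }

        // call this for every tuple the bolt sees
        public void record(long eventTimeMillis) throws Exception {
            // truncate the event time to its minute; that minute is the key
            long minute = (eventTimeMillis / 60_000L) * 60_000L;
            counts.get(minute, AtomicLong::new).incrementAndGet();
        }

        private void flushToS3(long minute, long count) {
            // placeholder for the real S3 write + Redshift COPY
        }
    }

One gotcha: Guava cleans up lazily, so you'd want to call counts.cleanUp() on some schedule (a Storm tick tuple works); otherwise entries that stop being touched may not actually get flushed promptly.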
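And on Yavar's original question: if you genuinely know the expected number of points per timestamp up front, the gating is simple enough to live in an ordinary consumer or bolt rather than inside Kafka itself. A rough sketch, where EXPECTED and analyze() are made-up placeholders:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class CompletenessGate {

        private static final int EXPECTED = 1000; // known points per timestamp

        private final Map<Long, List<Double>> pending = new HashMap<>();

        public void onPoint(long timestamp, double value) {
            List<Double> batch =
                    pending.computeIfAbsent(timestamp, t -> new ArrayList<>());
            batch.add(value);
            if (batch.size() == EXPECTED) {  // the window is now complete
                pending.remove(timestamp);
                analyze(timestamp, batch);   // run the deferred analytics
            }
        }

        private void analyze(long timestamp, List<Double> points) {
            // placeholder: emit downstream / compute the aggregate
        }
    }

You'd still want a timeout so a point that never shows up can't wedge a window open forever; the 30-second eviction in the cache sketch above is one way to get that for free.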
-Dan

On Wed, Aug 28, 2013 at 7:55 AM, Travis Brady <travis.br...@gmail.com> wrote:

> This is a very common problem in my experience. Late-arriving and
> semi-ordered data make a lot of stream processing problems more difficult.
>
> Are you able to perform analysis with part of the data? For instance,
> buffering some number of events and then analyzing?
> How exactly do you know definitively that you've received *everything*
> for some time window?
>
> Here's what I do (using Storm + Kafka + Redshift):
>
> A Storm topo reads tuples from a Kafka topic and aggregates them using a
> Guava Cache (which provides automatic time- and size-based eviction).
> At its simplest, the key in the cache is the minute as a unix timestamp
> and the value is a count of events for that time window. In more complex
> cases the key is a composite data type and the value might be a
> StreamSummary or HyperLogLog from stream-lib. Anyway, I configure the
> Cache to evict entries once they've gone untouched for 30 seconds.
>
> On eviction the data flows to S3 and from there to Redshift, and that is
> where we get our canonical answer: even if a certain minute is spread
> across many records in Redshift (due to data arriving late), we just
> aggregate over those records.
>
> You may want to look at Algebird; it's mostly undocumented but provides
> a lot of nice primitives for streaming aggregation.
>
> Good luck.
>
>
> On Wed, Aug 28, 2013 at 8:13 AM, Philip O'Toole <phi...@loggly.com> wrote:
>
> > Well, you can only store data in Kafka; you can't put application
> > logic in there.
> >
> > Storm is good for processing data, but it is not a data store, so that
> > is out. Redis might work, but it is primarily an in-memory store (it
> > seems to have persistence options, but I don't know much about them).
> >
> > You could try using Kafka and Storm to write the data to something
> > like Cassandra or Elasticsearch, and perform your analysis later on
> > the data set as it lives there.
> >
> > Philip
> >
> > On Aug 28, 2013, at 5:10 AM, Yavar Husain <yavarhus...@gmail.com> wrote:
> >
> > > I have an application where I will be getting some time-series data,
> > > which I am feeding to Kafka, and Kafka in turn is giving the data to
> > > Storm for some real-time processing.
> > >
> > > Now one of my use cases is that there might be a certain lag in my
> > > data. For example, I might not get all the data for 2:00:00 PM
> > > together: it may not all arrive at once, and the application has to
> > > wait for all of it before performing certain analytics.
> > >
> > > Say at 2:00:00 PM I get 990 points, and another 10 points (I know
> > > beforehand that there should be 1000 points per timestamp) arrive at
> > > 2:00:40 PM. I have to wait for all the data to arrive before
> > > performing the analytics.
> > >
> > > Where should I place my application logic: (1) in Kafka, (2) in
> > > Storm, or should I use something like Redis to collect all the
> > > timestamped data, and hand it to Kafka/Storm only once I have all
> > > the points for a particular time?
> > >
> > > I am confused :) Any help would be appreciated. Sorry for any
> > > grammatical errors; I was thinking aloud and jotting down my
> > > question.
> > >
> > > Regards,
> > > Yavar

--
Dan Di Spaltro