One other option is to use something like Druid, especially if you care about doing arbitrary dimensional drilldowns.
http://druid.io

It reads from Kafka and can do simple rollups for you automatically (meaning
you don't need Storm if all you are doing with Storm is a simple "group by"
style rollup). If you need to do real-time joins of various streams, it's
fairly easy to use Storm to do that and push the data into Druid as well.

Druid handles the delayed-messages issue by allowing for a configurable time
window in which messages can be delayed (we run it with a 10 minute window).

Using Druid would be similar to Travis's setup, except it would allow you to
ingest the data in real time and query the data as it is being ingested,
instead of having to wait for the persist to S3 and the load into Redshift.

Also, I'm biased about Druid ;).

--Eric

On Fri, Aug 30, 2013 at 5:47 PM, Dan Di Spaltro <dan.dispal...@gmail.com> wrote:

> You could also use something more oriented at time-series data like
> https://github.com/rackerlabs/blueflood/. Then you'd have to write some
> output adapters to feed the additional processing of your data elsewhere.
> I think the team is working on making an output adapter for Kafka for the
> rolled-up metrics (5m, 20m, 60m, etc.). It has the capability to re-emit
> data when some arrives late.
>
> -Dan
>
>
> On Wed, Aug 28, 2013 at 7:55 AM, Travis Brady <travis.br...@gmail.com> wrote:
>
> > This is a very common problem in my experience. Late-arriving and
> > semi-ordered data make a lot of stream processing problems more
> > difficult.
> >
> > Are you able to perform analysis with part of the data? For instance,
> > buffering some number of events and then analyzing them?
> > How exactly do you know definitively that you've received *everything*
> > for some time window?
> >
> > Here's what I do (using Storm + Kafka + Redshift):
> >
> > A Storm topo reads tuples from a Kafka topic and aggregates them using
> > a Guava Cache (which provides automatic time- and size-based eviction).
> > At its simplest, the key in the cache is the minute as a Unix timestamp
> > and the value is a count of events for that time window. In more
> > complex cases the key is a composite data type and the value might be a
> > StreamSummary or HyperLogLog class from stream-lib. Anyway, I configure
> > the Cache to evict entries once they've gone untouched for 30 seconds.
> >
> > On eviction the data flows to S3 and from there to Redshift, and that
> > is where we're able to get our canonical answer: even if a certain
> > minute is composed of many records in Redshift (due to data arriving
> > late), we just aggregate over those records.
> >
> > You may want to look at algebird; it's mostly undocumented but provides
> > a lot of nice primitives for doing streaming aggregation.
> >
> > Good luck.
> >
> >
> > On Wed, Aug 28, 2013 at 8:13 AM, Philip O'Toole <phi...@loggly.com> wrote:
> >
> > > Well, you can only store data in Kafka; you can't put application
> > > logic in there.
> > >
> > > Storm is good for processing data, but it is not a data store, so
> > > that is out. Redis might work, but it is only an in-memory store (it
> > > seems like it does have persistence, but I don't know much about
> > > that).
> > >
> > > You could try using Kafka and Storm to write the data to something
> > > like Cassandra or Elasticsearch, and perform your analysis later on
> > > the data set as it lives in there.
> > >
> > > Philip
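(A minimal sketch of the eviction-based rollup Travis describes above,
assuming Guava's Cache; the MinuteRollup class, the per-minute key, and the
flush() stand-in for the S3 upload are illustrative, not code from his
actual topology.)

    import java.util.concurrent.Callable;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.RemovalListener;
    import com.google.common.cache.RemovalNotification;

    public class MinuteRollup {

        // Key: the minute as a Unix timestamp (seconds); value: event count for it.
        private final Cache<Long, AtomicLong> counts = CacheBuilder.newBuilder()
                .expireAfterAccess(30, TimeUnit.SECONDS)   // evict entries untouched for 30s
                .removalListener(new RemovalListener<Long, AtomicLong>() {
                    public void onRemoval(RemovalNotification<Long, AtomicLong> n) {
                        flush(n.getKey(), n.getValue().get());  // ship the finished rollup downstream
                    }
                })
                .build();

        // Called once per tuple; the event carries its own timestamp in millis.
        public void record(long eventTimeMillis) throws Exception {
            long minute = (eventTimeMillis / 60000L) * 60L;    // truncate to the minute
            counts.get(minute, new Callable<AtomicLong>() {
                public AtomicLong call() { return new AtomicLong(0); }
            }).incrementAndGet();
        }

        // Guava evicts lazily, so poke the cache periodically (e.g. from a tick tuple).
        public void maintenance() {
            counts.cleanUp();
        }

        // Stand-in for the real sink; in the setup above this is the S3 upload
        // that eventually lands in Redshift.
        private void flush(long minute, long count) {
            System.out.println("minute=" + minute + " count=" + count);
        }
    }

The one wrinkle worth noting is that Guava performs eviction lazily, so
something (a Storm tick tuple, a background thread) has to call cleanUp()
for idle entries to actually be flushed on time.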
> > > On Aug 28, 2013, at 5:10 AM, Yavar Husain <yavarhus...@gmail.com> wrote:
> > >
> > > > I have an application where I will be getting some time-series data
> > > > which I am feeding to Kafka, and Kafka in turn is giving the data
> > > > to Storm for doing some real-time processing.
> > > >
> > > > Now one of my use cases is that there might be a certain lag in my
> > > > data. For example, I might not get all the data for 2:00:00 PM all
> > > > together. There is a possibility that all the data for 2:00:00 PM
> > > > does not arrive at the same time, and the application has to wait
> > > > for all the data to arrive to perform certain analytics.
> > > >
> > > > For example, say at 2:00:00 PM I get 990 points and another 10
> > > > points (say I know beforehand that there would be 1000 points of
> > > > data per millisecond) arrive at 2:00:40 PM. Now I have to wait for
> > > > all the data to arrive to perform analytics.
> > > >
> > > > Where should I place my application logic: (1) in Kafka, (2) in
> > > > Storm, or should I use something like Redis to collect all the
> > > > timestamped data and, only when I have all the points for a
> > > > particular time, give it to Kafka/Storm?
> > > >
> > > > I am confused :) Any help would be appreciated. Sorry for any
> > > > grammatical errors, as I was just thinking aloud and jotting down
> > > > my question.
> > > >
> > > > Regards,
> > > > Yavar
>
> --
> Dan Di Spaltro
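(Similarly, for the original question of waiting out late points before
handing a time bucket downstream, a minimal sketch of the "configurable
delay window" idea Eric mentions, done by hand rather than by Druid; the
LateDataBuffer class, the one-minute buckets, the 10-minute window, and the
emit() stand-in are illustrative assumptions.)

    import java.util.Iterator;
    import java.util.Map;
    import java.util.TreeMap;

    public class LateDataBuffer {

        private static final long BUCKET_MILLIS = 60 * 1000L;                 // one-minute buckets
        private static final long ALLOWED_LATENESS_MILLIS = 10 * 60 * 1000L;  // 10-minute delay window

        // Bucket start time (millis) -> running count, kept in timestamp order.
        private final TreeMap<Long, Long> open = new TreeMap<Long, Long>();

        // Called for every incoming point, keyed by the event's own timestamp.
        public void add(long eventTimeMillis) {
            long bucket = (eventTimeMillis / BUCKET_MILLIS) * BUCKET_MILLIS;
            Long current = open.get(bucket);
            open.put(bucket, current == null ? 1L : current + 1L);
        }

        // Called periodically: a bucket [start, start + BUCKET_MILLIS) is closed
        // once its end is more than the allowed lateness in the past, i.e. we no
        // longer expect late points for it.
        public void flushClosedBuckets(long nowMillis) {
            long oldestOpenStart = nowMillis - ALLOWED_LATENESS_MILLIS - BUCKET_MILLIS;
            Iterator<Map.Entry<Long, Long>> it =
                    open.headMap(oldestOpenStart).entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Long, Long> e = it.next();
                emit(e.getKey(), e.getValue());   // hand the completed bucket to Kafka/Storm/etc.
                it.remove();
            }
        }

        // Stand-in for the downstream hand-off.
        private void emit(long bucketStart, long count) {
            System.out.println("bucket=" + bucketStart + " count=" + count);
        }
    }

If, as in the example above, you know a bucket should contain exactly 1000
points, you could additionally emit it as soon as that count is reached
rather than always waiting out the full window.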