Maybe Storm is what you are looking for (as well as Flume to get the messages off the network): http://www.datastax.com/events/cassandranyc2011/presentations/marz
Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/02/2012, at 2:23 AM, Alain RODRIGUEZ wrote:

> Thanks for answering.
>
> "This is a good starting point
> https://github.com/thobbs/flume-cassandra-plugin"
>
> I already saw that, but it only does a raw store of the logs. I would like
> to store them in a "smart" way; I mean I'd like to store the logs so that I
> can use the information contained in them.
>
> If I have rows like (date action/event/id_ad/id_transac):
>
> 1 - 2012-02-17 18:22:09 track/display/4/70
> 2 - 2012-02-17 18:22:09 track/display/2/70
> 3 - 2012-02-17 18:22:09 track/display/3/70
> 4 - 2012-02-17 18:22:29 track/start/3/70
> 5 - 2012-02-17 18:22:39 track/firstQuartile/3/70
> 6 - 2012-02-17 18:22:46 track/midpoint/3/70
> 7 - 2012-02-17 18:22:53 track/complete/3/70
> 8 - 2012-02-17 18:23:02 track/click/3/70
>
> then I would like to process these logs and store the following in Cassandra:
>
> 1 - Increment the display counter for ad 4, find the transaction with id "70"
> in my database to get the id_product (let's say it's 19), and then increment
> the display counter for product 19. I would also store raw data like
> event1: (event => display, ad => 4, transac => 70 ...)
>
> 2 - ...
> ...
>
> 7 - ...
>
> 8 - Increment the click counter for ad 3, find the transaction with id "70"
> in my database to get the id_product (let's say it's 19), and then increment
> the click counter for product 19. I would also store raw data like
> event8: (event => click, ad => 3, transac => 70 ...) and update the status
> of the transaction to a "finished" state.
>
> I want really custom behaviour, so I guess I'll have to build a specific
> Flume sink (or is there a generic, configurable sink available somewhere?).
>
> Or maybe I should use the flume-cassandra-plugin and process the data after
> it has been stored raw? In that case, how can I be sure I have processed all
> the data, and how can I keep the processing in real time or near real time?
> Is that performant?
>
> I hope you'll understand what I just wrote; it's not very simple, and I'm
> not fluent in English. Don't hesitate to ask for more explanation.
>
> The final goal of all this is to have statistics in near real time, on the
> same cluster as the OLTP workload, which is critical to us. The real-time
> statistics have to be slowed down (becoming near-real-time stats) during
> rush hours so that the business part stays fully performant.
>
> Alain
>
> 2012/2/10 aaron morton <aa...@thelastpickle.com>
>
>> How do I do it? Do I need to build a custom plugin/sink, or can I configure
>> an existing sink to write data in a custom way?
>
> This is a good starting point: https://github.com/thobbs/flume-cassandra-plugin
>
>> 2 - My business process also uses my Cassandra DB (without Flume, directly
>> via Thrift). How can I ensure that log writing won't overload my database
>> and introduce latency in my business process?
>
> Any time you have a data stream you don't control, it's a good idea to put
> some sort of buffer between the outside world and the database. Flume has a
> buffered sink; I think you can subclass it and aggregate the counters for a
> minute or two:
> http://archive.cloudera.com/cdh/3/flume/UserGuide/#_buffered_sink_and_decorator_semantics
>
> Hope that helps.
> A
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 10/02/2012, at 4:27 AM, Alain RODRIGUEZ wrote:
>
>> Hi,
>>
>> 1 - I would like to generate some statistics and store some raw events from
>> log files tailed with Flume. I saw some plugins providing Cassandra sinks,
>> but I would like to store the data in a custom way: storing raw data, but
>> also incrementing counters to get near-real-time statistics. How do I do it?
>> Do I need to build a custom plugin/sink, or can I configure an existing sink
>> to write data in a custom way?
>>
>> 2 - My business process also uses my Cassandra DB (without Flume, directly
>> via Thrift). How can I ensure that log writing won't overload my database
>> and introduce latency in my business process? I mean, is there a way to
>> manage the throughput sent by Flume's tails and slow them down when my
>> Cassandra cluster is overloaded? I would like to avoid building two separate
>> clusters.
>>
>> Thank you,
>>
>> Alain
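
For reference, a minimal sketch (plain Java) of the per-event processing Alain describes above: parse a tracking line, increment the per-ad counter, resolve the transaction to its product id, increment the per-product counter, and keep the raw event. CounterStore and TransactionLookup are hypothetical placeholders, not part of Flume or of any Cassandra client; in a real sink they would be backed by counter mutations and a transaction lookup against the cluster.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Sketch of the per-event processing from the thread, assuming log lines
     * of the form "2012-02-17 18:22:09 track/display/4/70"
     * (date, action, event, ad id, transaction id).
     */
    public class TrackingEventProcessor {

        /** Hypothetical abstraction over the Cassandra counter and raw-event writes. */
        public interface CounterStore {
            void incrementAdCounter(String event, int adId);
            void incrementProductCounter(String event, int productId);
            void storeRawEvent(String event, int adId, int transacId, String timestamp);
        }

        /** Hypothetical lookup of a transaction row to resolve its product id. */
        public interface TransactionLookup {
            int productIdForTransaction(int transacId);
        }

        // Matches e.g. "2012-02-17 18:22:09 track/display/4/70"
        private static final Pattern LINE = Pattern.compile(
                "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) track/(\\w+)/(\\d+)/(\\d+)");

        private final CounterStore counters;
        private final TransactionLookup transactions;

        public TrackingEventProcessor(CounterStore counters, TransactionLookup transactions) {
            this.counters = counters;
            this.transactions = transactions;
        }

        /** Parse one log line and apply the counter updates described above. */
        public void process(String line) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.matches()) {
                return; // not a tracking line, ignore (or route to an error sink)
            }
            String timestamp = m.group(1);
            String event = m.group(2);          // display, start, click, ...
            int adId = Integer.parseInt(m.group(3));
            int transacId = Integer.parseInt(m.group(4));

            // 1. Increment the per-ad counter for this event type.
            counters.incrementAdCounter(event, adId);

            // 2. Resolve the transaction to its product, increment the per-product counter.
            int productId = transactions.productIdForTransaction(transacId);
            counters.incrementProductCounter(event, productId);

            // 3. Keep the raw event as well, for later reprocessing or auditing.
            counters.storeRawEvent(event, adId, transacId, timestamp);
        }
    }

A custom Flume sink (or a Storm bolt, per the suggestion at the top of the thread) would call process() once per log line it receives.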
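
And a rough sketch of the "aggregate the counters for a minute or two" idea from the buffered-sink suggestion, assuming a hypothetical flush hook that issues the actual batch of Cassandra counter increments (with Hector at the time, a batch of counter columns): individual increments are folded into an in-memory map and written out on a timer, so a burst of log traffic becomes a small, predictable number of writes, and the flush interval gives one knob for slowing the statistics path during rush hours.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    /** Sketch: buffer counter increments in memory and flush them periodically. */
    public abstract class BufferedCounterWriter {

        private final ConcurrentHashMap<String, AtomicLong> pending =
                new ConcurrentHashMap<String, AtomicLong>();
        private final ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        public BufferedCounterWriter(long flushIntervalSeconds) {
            scheduler.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    flushPending();
                }
            }, flushIntervalSeconds, flushIntervalSeconds, TimeUnit.SECONDS);
        }

        /** Called per event, e.g. increment("ad:4:display") or increment("product:19:click"). */
        public void increment(String counterKey) {
            AtomicLong current = pending.get(counterKey);
            if (current == null) {
                AtomicLong fresh = new AtomicLong();
                current = pending.putIfAbsent(counterKey, fresh);
                if (current == null) {
                    current = fresh;
                }
            }
            current.incrementAndGet();
        }

        /** Drain the in-memory totals and hand them to the actual writer in one batch. */
        private void flushPending() {
            Map<String, Long> batch = new HashMap<String, Long>();
            for (String key : pending.keySet()) {
                // Small race window between remove() and a concurrent increment;
                // a production sink would swap the whole map under a lock instead.
                AtomicLong value = pending.remove(key);
                if (value != null) {
                    batch.put(key, value.get());
                }
            }
            if (!batch.isEmpty()) {
                flush(batch);
            }
        }

        /** Placeholder: issue the aggregated counter increments to Cassandra. */
        protected abstract void flush(Map<String, Long> aggregatedIncrements);
    }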