Maybe Storm is what you are looking for (as well as Flume to get the messages
from the network):
http://www.datastax.com/events/cassandranyc2011/presentations/marz
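
Roughly: a spout that takes the log lines Flume delivers and a bolt that does
the counter updates. A very rough sketch of the wiring -- LogLineSpout and
AdCounterBolt are hypothetical classes you would write yourself:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;

public class AdStatsTopology {
    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // LogLineSpout emits one tuple per log line; AdCounterBolt parses it
        // and updates the Cassandra counters -- both are yours to write.
        builder.setSpout("log-lines", new LogLineSpout());
        builder.setBolt("ad-counters", new AdCounterBolt(), 4)
               .shuffleGrouping("log-lines");

        new LocalCluster().submitTopology("ad-stats", new Config(), builder.createTopology());
    }
}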

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 22/02/2012, at 2:23 AM, Alain RODRIGUEZ wrote:

> Thanks for answering.
> 
> "This is a good starting point 
> https://github.com/thobbs/flume-cassandra-plugin "
> 
> I already saw that, but it only does a raw store of the logs. I would like 
> to store them in a "smart" way, i.e. store the logs so that I can use the 
> information contained in them.
> 
> If I have rows like this (date action/event/id_ad/id_transac):
> 
> 1 - 2012-02-17 18:22:09 track/display/4/70
> 2 - 2012-02-17 18:22:09 track/display/2/70
> 3 - 2012-02-17 18:22:09 track/display/3/70
> 4 - 2012-02-17 18:22:29 track/start/3/70
> 5 - 2012-02-17 18:22:39 track/firstQuartile/3/70
> 6 - 2012-02-17 18:22:46 track/midpoint/3/70
> 7 - 2012-02-17 18:22:53 track/complete/3/70
> 8 - 2012-02-17 18:23:02 track/click/3/70
> 
> I would like to process these logs and store the results in Cassandra:
> 
> 1 - increment the display counter for ad 4, find the transac with id "70" 
> in my database to get the id_product (let's say it's 19), and then increment 
> the display counter for product 19. I would also store raw data like 
> event1: (event => display, ad => 4, transac => 70 ...)
> 
> 2 - ...
> ...
> 
> 7 - ...
> 
> 8 - increment the click counter for ad 3, find the transac with id "70" 
> in my database to get the id_product (let's say it's 19), and then increment 
> the click counter for product 19. I would also store raw data like event8: 
> (event => click, ad => 3, transac => 70 ...) and update the status of the 
> transaction to a "finished" state. A rough sketch of what I mean follows 
> below.
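> 
> To make it concrete, here is a very rough sketch of step 8 using Hector 
> (the cluster, keyspace, column family and row key names are only examples, 
> and the product lookup is assumed to have already returned 19):
> 
> import me.prettyprint.cassandra.serializers.StringSerializer;
> import me.prettyprint.hector.api.Cluster;
> import me.prettyprint.hector.api.Keyspace;
> import me.prettyprint.hector.api.factory.HFactory;
> import me.prettyprint.hector.api.mutation.Mutator;
> 
> public class ClickEventSketch {
>     public static void main(String[] args) {
>         Cluster cluster = HFactory.getOrCreateCluster("stats-cluster", "localhost:9160");
>         Keyspace keyspace = HFactory.createKeyspace("stats", cluster);
>         Mutator<String> m = HFactory.createMutator(keyspace, StringSerializer.get());
> 
>         // Event 8: "2012-02-17 18:23:02 track/click/3/70" -> ad 3, transac 70, product 19
>         m.incrementCounter("ad_3", "AdCounters", "clicks", 1L);
>         m.incrementCounter("product_19", "ProductCounters", "clicks", 1L);
> 
>         // Raw event row (one column per field) and the transaction status update
>         m.addInsertion("event8", "RawEvents", HFactory.createStringColumn("event", "click"));
>         m.addInsertion("event8", "RawEvents", HFactory.createStringColumn("ad", "3"));
>         m.addInsertion("event8", "RawEvents", HFactory.createStringColumn("transac", "70"));
>         m.addInsertion("transac_70", "Transactions", HFactory.createStringColumn("status", "finished"));
>         m.execute();
>     }
> }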
> 
> I want really custom behaviour, so I guess I'll have to build a specific 
> Flume sink (or does a generic, configurable sink already exist somewhere?).
> 
> Or maybe I should use the flume-cassandra-plugin and process the data after 
> it has been stored raw? In that case, how can I be sure that I have 
> processed all the data, and that it happens in real time or near real time? 
> Would that perform well?
> 
> I hope you'll understand what I just wrote; it's not very simple, and I'm 
> not fluent in English. Don't hesitate to ask for more explanation.
> 
> The final goal of all this is to have statistics in near real time, on the 
> same cluster as the OLTP workload, which is critical to us. The real-time 
> statistics have to be slowed down (becoming near-real-time stats) during 
> rush hours so that the business part stays fully performant.
> 
> Alain
> 
> 2012/2/10 aaron morton <aa...@thelastpickle.com>
>> How can I do it? Do I need to build a custom plugin/sink, or can I 
>> configure an existing sink to write data in a custom way?
> This is a good starting point https://github.com/thobbs/flume-cassandra-plugin
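> 
> If you end up writing your own, the Flume OG plugin API (the CDH3 version 
> in the guide linked below) is small. A rough, untested skeleton, with an 
> invented class name and the Cassandra work left as comments:
> 
> import java.io.IOException;
> 
> import com.cloudera.flume.core.Event;
> import com.cloudera.flume.core.EventSink;
> 
> public class StatsCassandraSink extends EventSink.Base {
>     @Override
>     public void open() throws IOException {
>         // set up the Cassandra (Hector/Thrift) connection here
>     }
> 
>     @Override
>     public void append(Event e) throws IOException {
>         String line = new String(e.getBody());  // e.g. "2012-02-17 18:23:02 track/click/3/70"
>         // parse the line, increment the ad/product counters, write the raw event
>     }
> 
>     @Override
>     public void close() throws IOException {
>         // tear the connection down
>     }
> }
> 
> You also need to expose it through a sink builder so it can be named in the 
> Flume config; the plugin above shows how.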
> 
>> 2 - My business process also uses my Cassandra DB (without Flume, directly 
>> via Thrift). How can I ensure that log writing won't overload my database 
>> and introduce latency in my business process?
> Anytime you have a data stream you don't control, it's a good idea to put 
> some sort of buffer between the outside world and the database. Flume has a 
> buffered sink; I think you can subclass it and aggregate the counters for a 
> minute or two: 
> http://archive.cloudera.com/cdh/3/flume/UserGuide/#_buffered_sink_and_decorator_semantics
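> 
> The aggregation itself can be as simple as a map that you drain on a timer, 
> so you issue one counter increment per key per flush instead of one per raw 
> event. A plain-Java sketch (not the actual Flume decorator API):
> 
> import java.util.HashMap;
> import java.util.Map;
> 
> public class CounterBuffer {
>     private final Map<String, Long> pending = new HashMap<String, Long>();
> 
>     // called once per parsed event, e.g. increment("ad_3:clicks")
>     public synchronized void increment(String counterKey) {
>         Long current = pending.get(counterKey);
>         pending.put(counterKey, current == null ? 1L : current + 1L);
>     }
> 
>     // called from a timer every minute or two; the caller writes each
>     // entry to Cassandra as a single counter increment
>     public synchronized Map<String, Long> drain() {
>         Map<String, Long> batch = new HashMap<String, Long>(pending);
>         pending.clear();
>         return batch;
>     }
> }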
> 
> Hope that helps. 
> A
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 10/02/2012, at 4:27 AM, Alain RODRIGUEZ wrote:
> 
>> Hi,
>> 
>> 1 - I would like to generate some statistics and store some raw events 
>> from log files tailed with Flume. I saw some plugins providing Cassandra 
>> sinks, but I would like to store data in a custom way: storing raw data 
>> but also incrementing counters to get near-real-time statistics. How can I 
>> do it? Do I need to build a custom plugin/sink, or can I configure an 
>> existing sink to write data in a custom way?
>> 
>> 2 - My business process also uses my Cassandra DB (without Flume, directly 
>> via Thrift). How can I ensure that log writing won't overload my database 
>> and introduce latency in my business process? I mean, is there a way to 
>> manage the throughput sent by the Flume tails and slow them down when my 
>> Cassandra cluster is overloaded? I would like to avoid building two 
>> separate clusters.
>> 
>> Thank you,
>> 
>> Alain
>> 
> 
> 
