Thanks for answering.

"This is a good starting point
https://github.com/thobbs/flume-cassandra-plugin "

I already saw that, but it only stores the logs raw. I would like to store
them in a "smart" way, meaning I want to store the logs so that I can use
the information they contain.

If I have log lines like these (date action/event/id_ad/id_transac):

1 - 2012-02-17 18:22:09 track/display/4/70
2 - 2012-02-17 18:22:09 track/display/2/70
3 - 2012-02-17 18:22:09 track/display/3/70
4 - 2012-02-17 18:22:29 track/start/3/70
5 - 2012-02-17 18:22:39 track/firstQuartile/3/70
6 - 2012-02-17 18:22:46 track/midpoint/3/70
7 - 2012-02-17 18:22:53 track/complete/3/70
8 - 2012-02-17 18:23:02 track/click/3/70

I would like to process these logs and store the following in Cassandra (a
rough sketch of this processing follows the list):

1 - increment the display counter for ad 4, find the transaction with id
"70" in my database to get the id_product (let's say it's 19), then
increment the display counter for product 19. I would also store raw data
like event1: (event => display, ad => 4, transac => 70 ...)

2 - ...
...

7 - ...

8 - increment the click counter for ad 3, find the transaction with id "70"
in my database to get the id_product (let's say it's 19), then increment
the click counter for product 19. I would also store raw data like
event8: (event => click, ad => 3, transac => 70 ...) and update the status
of the transaction to a "finished" state.
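
To make it concrete, here is a rough sketch (plain Java, with made-up names
like TrackingEvent and StatsStore) of how I imagine parsing one line and
applying the writes; the real counter increments, transaction lookup and
raw-event insert would of course go through Hector/Thrift:

// Rough sketch (hypothetical names) of parsing one tailed log line of the
// form "2012-02-17 18:22:09 track/display/4/70" and deriving the writes
// described in the list above. The storage layer is only an interface here.
public class TrackingEvent {

    public final String timestamp; // e.g. "2012-02-17 18:22:09"
    public final String event;     // display, start, firstQuartile, midpoint, complete, click
    public final int adId;
    public final int transacId;

    public TrackingEvent(String timestamp, String event, int adId, int transacId) {
        this.timestamp = timestamp;
        this.event = event;
        this.adId = adId;
        this.transacId = transacId;
    }

    /** Parses "yyyy-MM-dd HH:mm:ss track/<event>/<id_ad>/<id_transac>". */
    public static TrackingEvent parse(String line) {
        int lastSpace = line.lastIndexOf(' ');
        String timestamp = line.substring(0, lastSpace);
        String[] path = line.substring(lastSpace + 1).split("/");
        // path[0] = "track", path[1] = event, path[2] = id_ad, path[3] = id_transac
        return new TrackingEvent(timestamp, path[1],
                Integer.parseInt(path[2]), Integer.parseInt(path[3]));
    }

    /** Hypothetical storage layer; the real one would issue Hector/Thrift mutations. */
    public interface StatsStore {
        void incrementCounter(String rowKey, String column, long delta);
        int productForTransaction(int transacId);    // lookup in the transactions CF
        void storeRawEvent(TrackingEvent e);         // raw event row
        void markTransactionFinished(int transacId); // only for "click" events
    }

    /** The per-event processing described in items 1..8 above. */
    public void applyTo(StatsStore store) {
        store.incrementCounter("ad:" + adId, event, 1);
        int productId = store.productForTransaction(transacId); // e.g. 70 -> 19
        store.incrementCounter("product:" + productId, event, 1);
        store.storeRawEvent(this);
        if ("click".equals(event)) {
            store.markTransactionFinished(transacId);
        }
    }
}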

I want really custom behaviour, so I guess I'll have to build a specific
Flume sink (or does a generic, configurable sink already exist somewhere?).
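
If a custom sink is the right approach, I imagine a skeleton roughly like
this (modelled on how the flume-cassandra-plugin is structured; I'm assuming
the Flume OG / CDH3 EventSink API here, so the exact signatures may differ
depending on the Flume version):

// Skeleton of a custom Flume (OG / CDH3) sink. The EventSink.Base/Event API
// shown here is an assumption based on the flume-cassandra-plugin; adjust it
// to the Flume version actually in use.
import java.io.IOException;

import com.cloudera.flume.core.Event;
import com.cloudera.flume.core.EventSink;

public class TrackingCassandraSink extends EventSink.Base {

    @Override
    public void open() throws IOException {
        // open the Cassandra connection here (Hector keyspace, Thrift client, ...)
    }

    @Override
    public void append(Event e) throws IOException {
        String line = new String(e.getBody());           // one tailed log line
        TrackingEvent event = TrackingEvent.parse(line); // from the sketch above
        // event.applyTo(store): increment ad/product counters, store the raw
        // event, and update the transaction state for "click" events
    }

    @Override
    public void close() throws IOException {
        // flush anything pending and close the Cassandra connection
    }
}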

Or should I use the flume-cassandra-plugin and process the data after it has
been stored raw? In that case, how can I be sure that I have processed all
the data, and how can I keep it real-time or near real-time? Does this
perform well?

I hope you'll understand what I just wrote; it's not very simple, and I'm
not fluent in English. Don't hesitate to ask for more explanation.

The final goal of all this is to have near real-time statistics on the same
cluster as the OLTP workload, which is critical for us. The real-time
statistics have to be slowed down (becoming near real-time stats) during
rush hours so that the business part stays fully performant.
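
Following your suggestion of aggregating the counters for a minute or two, I
imagine buffering the increments in memory and flushing them on a schedule,
something like this sketch (plain Java, made-up class; widening the flush
interval during rush hours would reduce the write load at the cost of
statistics freshness):

// Minimal sketch of in-memory counter aggregation: increments accumulate
// locally and are flushed to Cassandra at a fixed interval, so many log
// lines turn into few counter mutations.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class BufferedCounters {

    private final ConcurrentHashMap<String, AtomicLong> pending =
            new ConcurrentHashMap<String, AtomicLong>();
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    public BufferedCounters(long flushIntervalSeconds) {
        flusher.scheduleAtFixedRate(new Runnable() {
            public void run() { flush(); }
        }, flushIntervalSeconds, flushIntervalSeconds, TimeUnit.SECONDS);
    }

    /** Called for every parsed log line; purely in-memory, so very cheap. */
    public void increment(String rowKey, String column) {
        String key = rowKey + "|" + column;
        AtomicLong counter = pending.get(key);
        if (counter == null) {
            AtomicLong fresh = new AtomicLong();
            counter = pending.putIfAbsent(key, fresh);
            if (counter == null) {
                counter = fresh;
            }
        }
        counter.incrementAndGet();
    }

    /** One batched counter mutation per (row, column) instead of one per log line. */
    private void flush() {
        for (Map.Entry<String, AtomicLong> entry : pending.entrySet()) {
            long delta = entry.getValue().getAndSet(0);
            if (delta > 0) {
                String[] parts = entry.getKey().split("\\|");
                // here: issue a single Cassandra counter increment of `delta`
                // for row parts[0], column parts[1] (via Hector/Thrift)
            }
        }
    }

    public void shutdown() {
        flusher.shutdown();
        flush(); // push whatever is still pending before exiting
    }
}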

Alain

2012/2/10 aaron morton <aa...@thelastpickle.com>

> How to do it ? Do I need to build a custom plugin/sink or can I configure
> an existing sink to write data in a custom way ?
>
> This is a good starting point
> https://github.com/thobbs/flume-cassandra-plugin
>
> 2 - My business process also uses my Cassandra DB (without flume, directly
> via thrift), how to ensure that log writing won't overload my database and
> introduce latency in my business process ?
>
> Anytime you have a data stream you don't control it's a good idea to put
> some sort of buffer in there between the outside world and the database.
> Flume has a buffered sink, I think you can subclass it and aggregate the
> counters for a minute or two
> http://archive.cloudera.com/cdh/3/flume/UserGuide/#_buffered_sink_and_decorator_semantics
>
> Hope that helps.
> A
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 10/02/2012, at 4:27 AM, Alain RODRIGUEZ wrote:
>
> Hi,
>
> 1 - I would like to generate some statistics and store some raw events
> from log files tailed with flume. I saw some plugins giving Cassandra sinks
> but I would like to store data in a custom way, storing raw data but also
> incrementing counters to get near real-time statistics. How to do it ? Do I
> need to build a custom plugin/sink or can I configure an existing sink to
> write data in a custom way ?
>
> 2 - My business process also uses my Cassandra DB (without flume, directly
> via thrift), how to ensure that log writing won't overload my database and
> introduce latency in my business process ? I mean, is there a way to
> manage the throughput sent by the flume's tails and slow them when my
> Cassandra cluster is overloaded ? I would like to avoid building
> 2 separated clusters.
>
> Thank you,
>
> Alain
>
>
>
