We have been using Flume to solve a very similar use case. Our servers write
their log files to the local file system, and a Flume agent then ships the
data to Kafka.

You can use Flume's exec source to run tail. Although the exec source works
well with tail, there are issues if the agent goes down or the file channel
starts building up. When the agent restarts, you can tell the exec tail
source to go back n lines or to read from the beginning of the file. The
challenge is that we roll our log files daily: if the agent goes down in the
evening, we have to replay the entire day's worth of data, which slows down
the data flow. We could instead go back an arbitrary number of lines, but
then we don't know what the right number is. That is the kind of challenge
we face. We have also tried the spooling directory source, which works, but
it requires a different log rotation policy. We even considered rotating
files every minute, but that would still add a minute of delay to the
real-time data flow in our kafka ---> storm ---> Elasticsearch pipeline.
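
For reference, here is a minimal sketch of what such an agent config could
look like, assuming Flume 1.6 with the bundled Kafka sink (the sink's
property names changed in later Flume releases); the agent name, paths,
topic and broker list are purely illustrative:

agent.sources = tail-src
agent.channels = file-ch
agent.sinks = kafka-sink

# Exec source running tail; -F follows the file across rotations,
# -n 1000 replays the last 1000 lines after an agent restart
agent.sources.tail-src.type = exec
agent.sources.tail-src.command = tail -n 1000 -F /var/log/app/app.log
agent.sources.tail-src.channels = file-ch

# Durable file channel so events survive an agent crash
agent.channels.file-ch.type = file
agent.channels.file-ch.checkpointDir = /var/flume/checkpoint
agent.channels.file-ch.dataDirs = /var/flume/data

# Kafka sink (Flume 1.6 property names)
agent.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafka-sink.topic = app-logs
agent.sinks.kafka-sink.brokerList = kafka1:9092,kafka2:9092
agent.sinks.kafka-sink.channel = file-ch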

We are going to do a PoC on Logstash to see whether it solves these problems
we have with Flume.
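
Roughly, the PoC would pair Logstash's file input with its Kafka output
plugin. A minimal sketch, assuming a Logstash 1.x-era kafka output (option
names differ in newer releases), with the path, topic and broker list again
only illustrative:

input {
  file {
    path => "/var/log/app/*.log"
    # read existing content once, then follow the file like tail -F
    start_position => "beginning"
  }
}

output {
  kafka {
    broker_list => "kafka1:9092,kafka2:9092"
    topic_id => "app-logs"
  }
}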

On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. <fot...@gmail.com> wrote:

> Hi all,
>     I'm evaluating using Kafka.
>
> I liked this thing about Facebook Scribe where you log to your own machine
> and then a separate process forwards the messages to the central logger.
>
> With Kafka it seems that I have to embed the publisher in my app and deal
> with any communication problems on the producer side.
>
> I googled quite a bit trying to find a project that basically runs a
> daemon that parses a log file and sends the lines to the Kafka cluster
> (something like tail file.log, but instead of redirecting the output to
> the console, it sends it to Kafka).
>
> Does anyone know of something like that?
>
>
> Thanks!
> Fernando.
>
