We have been using Flume to solve a very similar use case. Our servers write their log files to the local file system, and then a Flume agent ships the data to Kafka.
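In case it helps, here is a minimal sketch of that kind of agent: an exec source tailing the log, a durable file channel, and Flume's Kafka sink. The paths, topic, and broker list are placeholders, and the exact property names can differ between Flume versions, so treat this as illustrative only:

    # tail the application log and ship each line to Kafka
    agent.sources  = tail-src
    agent.channels = file-ch
    agent.sinks    = kafka-sink

    # exec source: follow the current log file across rotations
    agent.sources.tail-src.type = exec
    agent.sources.tail-src.command = tail -F /var/log/app/app.log
    agent.sources.tail-src.channels = file-ch

    # durable file channel so events survive an agent restart
    agent.channels.file-ch.type = file
    agent.channels.file-ch.checkpointDir = /var/flume/checkpoint
    agent.channels.file-ch.dataDirs = /var/flume/data

    # Kafka sink (ships with newer Flume releases)
    agent.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
    agent.sinks.kafka-sink.topic = app-logs
    agent.sinks.kafka-sink.brokerList = kafka1:9092,kafka2:9092
    agent.sinks.kafka-sink.channel = file-ch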
You can use Flume with an exec source running tail. Although the exec source works well with tail, there are issues if the agent goes down or the file channel starts building up. If the agent goes down, you can ask the exec/tail source to go back n lines or to read from the beginning of the file. The challenge is that we roll our log files daily: if the agent goes down in the evening, we have to go back over the entire day's worth of data for reprocessing, which slows down the data flow. We can also go back an arbitrary number of lines, but then we don't know what the right number is. That is the challenge for us.

We have also tried the spooling directory source, which works, but it requires a different log file rotation policy. We even considered rotating files every minute, but that would still delay our real-time Kafka -> Storm -> Elasticsearch pipeline by a minute. We are going to do a PoC on Logstash to see how it solves these problems with Flume.

On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. <fot...@gmail.com> wrote:

> Hi all,
> I'm evaluating using Kafka.
>
> I liked the thing about Facebook Scribe where you log to your own machine
> and then there's a separate process that forwards messages to the central
> logger.
>
> With Kafka it seems that I have to embed the publisher in my app and deal
> with any communication problems on the producer side.
>
> I googled quite a bit trying to find a project that would basically use a
> daemon that parses a log file and sends the lines to the Kafka cluster
> (something like tail file.log, but instead of redirecting the output to
> the console, send it to Kafka).
>
> Does anyone know of something like that?
>
>
> Thanks!
> Fernando.
>
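Regarding the quoted question: the simplest version of that "tail, but send to Kafka" daemon is just piping tail into the console producer that ships with Kafka. The broker list, topic, and file path below are placeholders, and it is only a sketch:

    tail -F /var/log/app/app.log | \
      bin/kafka-console-producer.sh --broker-list kafka1:9092 --topic app-logs

It has exactly the weakness discussed above: if the process dies it has no memory of where it stopped in the file, which is why shippers like Flume or Logstash layer buffering and restart behaviour on top of this same idea.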