I would think that this is not a particularly great solution; you will end up running into quite a few edge cases, and I can't see it scaling particularly well: how do you know which server to copy logs from in a clustered and replicated environment? What happens when Kafka detects a failure and moves partition replicas to a different node? The reason the Kafka Consumer APIs exist is to shield you from having to think about these things.

In addition, you would be tightly coupling yourself to Kafka's internal log format; in my experience, that sort of thing rarely ends well.
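For illustration, here is a minimal sketch of what the consumer side looks like with the Java client that ships with newer Kafka releases (the broker list, group id, topic name, and the writeToHdfs helper are just placeholders); the older high-level consumer works the same way in spirit:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TopicMirrorSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list and consumer group
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "hdfs-mirror");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // The client handles partition assignment, leader discovery, and
            // failover, so you never need to know which broker holds which
            // log segment on disk.
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(1000);
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Hand the raw bytes to whatever writes your HDFS files
                    writeToHdfs(record.value());
                }
            }
        }
    }

    // Placeholder for the part that actually writes to HDFS
    private static void writeToHdfs(byte[] value) {
    }
}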
Depending on your use case, Flume is a reasonable solution if you don't want to use Camus; it has a Kafka source that allows you to stream data out of Kafka and into HDFS (a rough agent config sketch is appended below the quoted message):
http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/

-Will

On Fri, Mar 13, 2015 at 2:33 PM, Alberto Miorin <amiorin78+ka...@gmail.com> wrote:

> I was wondering if anybody has already tried to mirror a kafka topic to
> hdfs just copying the log files from the topic directory of the broker
> (like 00000000000023244237.log).
>
> The file format is very simple:
> https://twitter.com/amiorin/status/576448691139121152/photo/1
>
> Implementing an InputFormat should not be so difficult.
>
> Any drawbacks?
>
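For reference, a rough sketch of the kind of Flume agent config the Flafka approach uses; the ZooKeeper quorum, topic name, consumer group, and HDFS path are placeholders, and exact property names depend on your Flume release:

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Kafka source: pulls messages from the topic as a normal consumer group
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
# placeholder ZooKeeper quorum
a1.sources.r1.zookeeperConnect = zkhost1:2181
# placeholder topic and consumer group
a1.sources.r1.topic = my-topic
a1.sources.r1.groupId = flume-hdfs-mirror
a1.sources.r1.channels = c1

# Buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink: writes raw event bodies, rolling files every 5 minutes
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# placeholder HDFS path
a1.sinks.k1.hdfs.path = /data/kafka/my-topic/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.useLocalTimeStamp = true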