I don't think this is a particularly good solution; you will end up
running into quite a few edge cases, and I can't see it scaling well - how
do you know which server to copy logs from in a
clustered and replicated environment? What happens when Kafka detects a
failure and moves partition replicas to a different node? The reason that
the Kafka Consumer APIs exist is to shield you from having to think about
these things. In addition, you would be tightly coupling yourself to
Kafka's internal log format; in my experience, this sort of thing rarely
ends well.
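
To give a sense of the alternative, here is a rough sketch of reading a
topic with the Java consumer client instead (the broker address, group id,
and topic name below are placeholders, and the exact consumer API depends
on which Kafka version you're running):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TopicToHdfsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder broker
        props.put("group.id", "hdfs-mirror");              // placeholder group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            while (true) {
                // The consumer client handles partition leadership, rebalances,
                // and offsets for you; you never touch the on-disk .log segments.
                ConsumerRecords<byte[], byte[]> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // write record.value() out to HDFS here
                    // (e.g. via an FSDataOutputStream)
                }
            }
        }
    }
}

The point is that broker failures and replica moves are invisible to this
code, which is exactly what copying segment files by hand gives up.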
Depending on your use case, Flume is a reasonable solution if you don't
want to use Camus; it has a Kafka source that allows you to stream data out
of Kafka and into HDFS:
http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/
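
The wiring is roughly along these lines (untested sketch; the agent name,
topic, ZooKeeper address, and HDFS path are placeholders, and the exact
property names vary between Flume releases, so check the docs for yours):

# Illustrative Flume agent: Kafka source -> memory channel -> HDFS sink.
tier1.sources  = kafka-source-1
tier1.channels = channel-1
tier1.sinks    = hdfs-sink-1

tier1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.kafka-source-1.zookeeperConnect = zk1:2181
tier1.sources.kafka-source-1.topic = my-topic
tier1.sources.kafka-source-1.groupId = flume
tier1.sources.kafka-source-1.channels = channel-1

tier1.channels.channel-1.type = memory

tier1.sinks.hdfs-sink-1.type = hdfs
tier1.sinks.hdfs-sink-1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
tier1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
tier1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
tier1.sinks.hdfs-sink-1.channel = channel-1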

-Will

On Fri, Mar 13, 2015 at 2:33 PM, Alberto Miorin <amiorin78+ka...@gmail.com>
wrote:

> I was wondering if anybody has already tried to mirror a kafka topic to
> hdfs just copying the log files from the topic directory of the broker
> (like 00000000000023244237.log).
>
> The file format is very simple :
> https://twitter.com/amiorin/status/576448691139121152/photo/1
>
> Implementing an InputFormat should not be so difficult.
>
> Any drawbacks?
>
