Check out Camus. It was built to do parallel loads from Kafka into time-bucketed
directories in HDFS.
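A minimal sketch of the relevant camus.properties settings, in case it helps
(paths, brokers, topic, and the timestamp field are placeholders, and exact
keys can vary by Camus version):

  # Where Camus writes the time-bucketed output and its execution metadata
  etl.destination.path=hdfs:///data/topics
  etl.execution.base.path=hdfs:///camus/exec
  etl.execution.history.path=hdfs:///camus/exec/history

  # Decode JSON messages and read event time from a "timestamp" field
  camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
  camus.message.timestamp.field=timestamp
  camus.message.timestamp.format=ISO-8601

  # Bucket output into hourly directories
  etl.output.file.time.partition.mins=60

  kafka.brokers=broker1:9092,broker2:9092
  kafka.whitelist.topics=events

The job itself runs as a MapReduce job, something like:

  hadoop jar camus-example-<version>-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties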
On Oct 16, 2014, at 9:32 AM, Gwen Shapira wrote:
I assume the messages themselves contain the timestamp?
If you use Flume, you can configure a Kafka source to pull data from
Kafka, use an interceptor to pull the date out of each message and
place it in the event header, and then have the HDFS sink write to a
partition based on that timestamp.
Gwen
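A minimal Flume agent sketch along those lines, assuming the message body
starts with an ISO-8601 date that a regex_extractor interceptor can turn into
the timestamp header the HDFS sink's path escapes need (agent name, hosts,
topic, and paths are placeholders):

  a1.sources = r1
  a1.channels = c1
  a1.sinks = k1

  # Kafka source (available from Flume 1.6)
  a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
  a1.sources.r1.zookeeperConnect = zk1:2181
  a1.sources.r1.topic = events
  a1.sources.r1.channels = c1

  # Pull the date out of the message body and store it, as epoch millis,
  # in the "timestamp" event header
  a1.sources.r1.interceptors = i1
  a1.sources.r1.interceptors.i1.type = regex_extractor
  a1.sources.r1.interceptors.i1.regex = ^(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2})
  a1.sources.r1.interceptors.i1.serializers = s1
  a1.sources.r1.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
  a1.sources.r1.interceptors.i1.serializers.s1.name = timestamp
  a1.sources.r1.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd'T'HH:mm:ss

  # HDFS sink partitions by the timestamp header via path escapes
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.channel = c1
  a1.sinks.k1.hdfs.path = hdfs://namenode/data/events/%Y-%m-%d/%H

  a1.channels.c1.type = memory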
One way you can do that is to continually load data from Kafka into Hadoop.
During the load, you put data into different HDFS directories based on the
timestamp. The Hadoop admin can decide when to open those directories
for reading based on whether data from all data centers has arrived.
Thanks,
Jun
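One possible sketch of that "open for read" step, assuming a hypothetical
layout where each data center's load job drops a _SUCCESS marker under
/data/events/<hour>/<dc> when its bucket is complete (the layout, data center
names, and permission scheme are all assumptions, not something the tools
above produce for you):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.permission.FsPermission;

  public class OpenHourlyBucket {
      // Assumption: the known set of source data centers.
      static final String[] DATA_CENTERS = {"dc1", "dc2", "dc3"};

      public static void main(String[] args) throws Exception {
          String hourDir = args[0]; // e.g. /data/events/2014-10-16-09
          FileSystem fs = FileSystem.get(new Configuration());

          // Leave the bucket closed until every data center has landed.
          for (String dc : DATA_CENTERS) {
              if (!fs.exists(new Path(hourDir + "/" + dc + "/_SUCCESS"))) {
                  System.out.println(hourDir + " is not complete yet");
                  return;
              }
          }

          // All data centers arrived: make the bucket readable by consumers.
          fs.setPermission(new Path(hourDir), new FsPermission((short) 0755));
          System.out.println(hourDir + " opened for read");
      }
  }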
Hi Apache Community,
My company has the following use case. We have multiple geographically
disparate data centers, each with its own Kafka cluster, and we want to use
MirrorMaker to aggregate all of these clusters' data into one central Kafka
cluster located in a data center distinct from the rest.
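For reference, that aggregation step is usually run as one MirrorMaker
instance per source cluster, each consuming from its local cluster and
producing to the central one; a 0.8-era invocation might look like this
(config file names are placeholders):

  bin/kafka-run-class.sh kafka.tools.MirrorMaker \
    --consumer.config dc1-consumer.properties \
    --producer.config central-producer.properties \
    --whitelist=".*"

where dc1-consumer.properties points zookeeper.connect at the local cluster
and central-producer.properties points metadata.broker.list at the aggregate
cluster.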