Check out Camus. It was built to do parallel loads from Kafka into
time-bucketed directories in HDFS.
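A rough camus.properties sketch of the time bucketing (the broker list, paths,
topic, and the JSON timestamp field below are placeholders -- double-check the
property names against your Camus build):

# Where Camus writes the bucketed data and keeps its own execution state
etl.destination.path=/data/topics
etl.execution.base.path=/camus/exec
etl.execution.history.path=/camus/exec/history

# Kafka connection and topic selection
kafka.brokers=broker1:9092,broker2:9092
kafka.whitelist.topics=events

# Decoder that extracts the timestamp embedded in each message (JSON here)
camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
camus.message.timestamp.field=timestamp
camus.message.timestamp.format=yyyy-MM-dd HH:mm:ss

# Hourly output buckets
etl.output.file.time.partition.mins=60
etl.default.timezone=UTC

Camus runs as a MapReduce job, so the pull from Kafka is parallel across
mappers and the output lands in hourly directories under etl.destination.path.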



On Oct 16, 2014, at 9:32 AM, Gwen Shapira <gshap...@cloudera.com> wrote:

> I assume the messages themselves contain the timestamp?
> 
> If you use Flume, you can configure a Kafka source to pull data from
> Kafka, use an interceptor to pull the date out of your message and
> place it in the event header, and then the HDFS sink can write to a
> partition based on that timestamp.
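> 
> For example, a sketch of a single-agent config along those lines (the topic,
> regex, and paths are placeholders -- check the property names against your
> Flume version):
> 
> a1.sources = kafka-src
> a1.channels = mem-ch
> a1.sinks = hdfs-sink
> 
> # Kafka source pulling from the central cluster
> a1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
> a1.sources.kafka-src.zookeeperConnect = zk1:2181
> a1.sources.kafka-src.topic = events
> a1.sources.kafka-src.channels = mem-ch
> 
> # Interceptor: pull the date out of the message body into the
> # "timestamp" header (here assuming a JSON field called "ts")
> a1.sources.kafka-src.interceptors = ts
> a1.sources.kafka-src.interceptors.ts.type = regex_extractor
> a1.sources.kafka-src.interceptors.ts.regex = "ts":"([^"]+)"
> a1.sources.kafka-src.interceptors.ts.serializers = s1
> a1.sources.kafka-src.interceptors.ts.serializers.s1.name = timestamp
> a1.sources.kafka-src.interceptors.ts.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
> a1.sources.kafka-src.interceptors.ts.serializers.s1.pattern = yyyy-MM-dd HH:mm:ss
> 
> a1.channels.mem-ch.type = memory
> 
> # HDFS sink: the %Y/%m/%d/%H escapes resolve from the timestamp header,
> # so each event lands in the partition for its own event time
> a1.sinks.hdfs-sink.type = hdfs
> a1.sinks.hdfs-sink.channel = mem-ch
> a1.sinks.hdfs-sink.hdfs.path = hdfs://namenode/data/events/%Y-%m-%d/%H00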
> 
> Gwen
> 
> On Wed, Oct 15, 2014 at 8:47 PM, Jun Rao <jun...@gmail.com> wrote:
>> One way you can do that is to continually load data from Kafka to Hadoop.
>> During load, you put data into different HDFS directories based on the
>> timestamp. The Hadoop admin can decide when to open up those directories
>> for read based on whether data from all data centers have arrived.
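>> 
>> A minimal sketch of that gating step with the Hadoop FileSystem API (the
>> directory layout and the _DONE/_READY marker names are just assumptions for
>> illustration):
>> 
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> 
>> // Opens an hourly bucket for read only after every data center's load
>> // job has dropped its _DONE marker for that hour.
>> public class BucketGate {
>>   public static void main(String[] args) throws Exception {
>>     String hour = args[0];                        // e.g. "2014-10-15/1400"
>>     String[] dataCenters = {"dc1", "dc2", "dc3"};
>> 
>>     FileSystem fs = FileSystem.get(new Configuration());
>>     boolean allArrived = true;
>>     for (String dc : dataCenters) {
>>       // /data/events/<dc>/<hour>/_DONE is written by the per-DC load job
>>       Path marker = new Path("/data/events/" + dc + "/" + hour + "/_DONE");
>>       allArrived &= fs.exists(marker);
>>     }
>> 
>>     if (allArrived) {
>>       // Downstream jobs poll for this flag before reading the hour
>>       fs.create(new Path("/data/events/_READY/" + hour)).close();
>>     }
>>   }
>> }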
>> 
>> Thanks,
>> 
>> Jun
>> 
>> On Tue, Oct 14, 2014 at 11:54 PM, Alex Melville <amelvi...@g.hmc.edu> wrote:
>> 
>>> Hi Apache Community,
>>> 
>>> 
>>> My company has the following use case. We have multiple geographically
>>> disparate data centers, each with its own Kafka cluster, and we want to
>>> aggregate all of these centers' data into one central Kafka cluster located
>>> in a data center distinct from the rest using MirrorMaker. Once in the
>>> central cluster, most of this data will be fed into Hadoop for analytics
>>> purposes. However, with the way we have Hadoop set up right now, it must wait
>>> until it has received data from all of the remote data centers for a
>>> specific time period before it has the green light to load that data into
>>> HDFS and process it. For example, say we have 3 remote (as in not central)
>>> data centers, and DC1 has pushed to the central data center all of its data
>>> up to 4:00 PM, DC2 has pushed everything up to 3:30 PM, and DC3 is lagging
>>> behind and has only pushed data up to the 2:00 PM time period. Then Hadoop
>>> processes all data tagged with modification times before 2:00 PM, and it
>>> must wait until DC3 catches up by pushing its 2:15, 2:30, etc. data to the
>>> central cluster before it can process the 3:00 PM data.
>>> 
>>> So our question is: What is the best way to handle this time-period-ordered
>>> requirement on our data using a distributed messaging log like Kafka? We
>>> originally started using Kafka to move away from a batch-oriented backend
>>> data pipeline transport system in favor of a more streaming-focused system,
>>> but we still need to keep track of the latest common time period of data
>>> streaming in from the remote clusters.
>>> 
>>> 
>>> Cheers,
>>> 
>>> Alex M.
>>> 
