Check out Camus. It was built to do parallel loads from Kafka into time-bucketed directories in HDFS.
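For reference, a Camus run is just a periodic MapReduce job driven by a properties file. A minimal sketch is below; the property names are from the camus-example config as I remember it, so treat the exact keys and all of the values (brokers, topic, paths, timestamp field) as assumptions to check against your Camus version:

    # camus.properties (illustrative values only)
    camus.job.name=kafka-to-hdfs-hourly
    kafka.brokers=broker1:9092,broker2:9092
    kafka.whitelist.topics=events
    # final, time-bucketed output: <path>/<topic>/hourly/YYYY/MM/dd/HH
    etl.destination.path=/data/kafka/topics
    # Camus bookkeeping (consumed offsets, execution history)
    etl.execution.base.path=/data/kafka/camus/exec
    etl.execution.history.path=/data/kafka/camus/exec/history
    # decoder that pulls the event time out of each (JSON) message
    camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
    camus.message.timestamp.field=timestamp
    camus.message.timestamp.format=yyyy-MM-dd HH:mm:ss

You would then schedule something like "hadoop jar camus-example-<version>-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties" every few minutes, and each run writes new files into the hour buckets for whatever it pulled.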
On Oct 16, 2014, at 9:32 AM, Gwen Shapira <gshap...@cloudera.com> wrote:

> I assume the messages themselves contain the timestamp?
>
> If you use Flume, you can configure a Kafka source to pull data from
> Kafka, use an interceptor to pull the date out of your message and
> place it in the event header, and then the HDFS sink can write to a
> partition based on that timestamp.
>
> Gwen
>
> On Wed, Oct 15, 2014 at 8:47 PM, Jun Rao <jun...@gmail.com> wrote:
>> One way you can do that is to continually load data from Kafka into
>> Hadoop. During the load, you put the data into different HDFS
>> directories based on the timestamp. The Hadoop admin can decide when
>> to open up those directories for reading based on whether data from
>> all data centers has arrived.
>>
>> Thanks,
>>
>> Jun
>>
>> On Tue, Oct 14, 2014 at 11:54 PM, Alex Melville <amelvi...@g.hmc.edu> wrote:
>>
>>> Hi Apache Community,
>>>
>>> My company has the following use case. We have multiple geographically
>>> disparate data centers, each with its own Kafka cluster, and we want to
>>> aggregate all of these centers' data, using MirrorMaker, into one central
>>> Kafka cluster located in a data center distinct from the rest. Once in the
>>> central cluster, most of this data will be fed into Hadoop for analytics.
>>> However, the way we have Hadoop working right now, it must wait until it
>>> has received data from all of the other data centers for a specific time
>>> period before it has the green light to load that data into HDFS and
>>> process it. For example, say we have 3 remote (as in not central) data
>>> centers: DC1 has pushed to the central data center all of its data up to
>>> 4:00 PM, DC2 has pushed everything up to 3:30 PM, and DC3 is lagging
>>> behind and has only pushed data up to the 2:00 PM time period. Then Hadoop
>>> processes all data tagged with modification times before 2:00 PM, and it
>>> must wait until DC3 catches up by pushing the 2:15, 2:30, etc. data to the
>>> central cluster before it can process the 3:00 PM data.
>>>
>>> So our question is: what is the best way to handle this time-period-ordered
>>> requirement on our data using a distributed messaging log like Kafka? We
>>> originally started using Kafka to move away from a batch-oriented backend
>>> data pipeline transport system in favor of a more streaming-focused system,
>>> but we still need to keep track of the latest common time period of data
>>> streaming in from the remote clusters.
>>>
>>> Cheers,
>>>
>>> Alex M.
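For anyone comparing options, Gwen's Flume route above would look roughly like the agent config below. The component names (Kafka source, regex_extractor interceptor, HDFS sink) are stock Flume, though the Kafka source properties shown are the newer bootstrap-server style rather than the older ZooKeeper-based ones; the topic, regex, and paths are made-up placeholders, so this is a sketch rather than a drop-in config:

    agent.sources = kafka-src
    agent.channels = mem
    agent.sinks = hdfs-sink

    # Kafka source pulling from the central cluster
    agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
    agent.sources.kafka-src.kafka.bootstrap.servers = central-broker:9092
    agent.sources.kafka-src.kafka.topics = events
    agent.sources.kafka-src.channels = mem

    # interceptor: pull the event date out of the message body and put it in
    # the "timestamp" header so the sink can partition on it
    agent.sources.kafka-src.interceptors = ts
    agent.sources.kafka-src.interceptors.ts.type = regex_extractor
    agent.sources.kafka-src.interceptors.ts.regex = (\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})
    agent.sources.kafka-src.interceptors.ts.serializers = s1
    agent.sources.kafka-src.interceptors.ts.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
    agent.sources.kafka-src.interceptors.ts.serializers.s1.name = timestamp
    agent.sources.kafka-src.interceptors.ts.serializers.s1.pattern = yyyy-MM-dd HH:mm:ss

    agent.channels.mem.type = memory

    # HDFS sink writing one directory per time bucket, resolved from the
    # "timestamp" header set by the interceptor above
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.channel = mem
    agent.sinks.hdfs-sink.hdfs.path = /data/events/%Y/%m/%d/%H
    agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = false

Either way you end up with the directory-per-hour layout Jun describes: downstream Hadoop jobs only read the hours that every data center has finished delivering.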