One way to do that is to continually load data from Kafka into Hadoop. During the load, you write the data into different HDFS directories based on its timestamp. The Hadoop admin can then decide when to open a given directory for reads, based on whether data from all data centers has arrived for that time period.
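As a rough sketch of that gating logic (hypothetical class and method names, and an assumed hourly directory layout like /data/events/yyyy-MM-dd-HH/, not an existing Kafka or Hadoop API): the loader records the highest event timestamp it has written per source data center, and a directory is opened for read only once every data center's watermark has passed the end of that hour.

import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical gate deciding which hourly HDFS directories are safe to read.
public class PartitionReadinessGate {

    // Highest event timestamp loaded so far, per source data center.
    private final Map<String, Instant> highWatermarks = new ConcurrentHashMap<>();

    // Called by the loader after it writes a batch from one data center.
    public void recordProgress(String dataCenter, Instant latestEventTime) {
        highWatermarks.merge(dataCenter, latestEventTime,
                (old, latest) -> latest.isAfter(old) ? latest : old);
    }

    // The "latest common time period": the minimum watermark across all DCs.
    public Instant latestCommonWatermark() {
        return highWatermarks.values().stream()
                .min(Instant::compareTo)
                .orElse(Instant.EPOCH);
    }

    // A directory covering an hour ending at hourEnd is readable once every
    // data center has pushed data past that point.
    public boolean isHourReadable(Instant hourEnd) {
        return !hourEnd.isAfter(latestCommonWatermark());
    }
}

With the example below (DC1 at 4:00 PM, DC2 at 3:30 PM, DC3 at 2:00 PM), latestCommonWatermark() returns 2:00 PM, so only directories covering periods up to 2:00 PM would be opened for processing.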
Thanks,

Jun

On Tue, Oct 14, 2014 at 11:54 PM, Alex Melville <amelvi...@g.hmc.edu> wrote:

> Hi Apache Community,
>
> My company has the following use case. We have multiple geographically
> disparate data centers, each with its own Kafka cluster, and we want to
> aggregate all of these centers' data into one central Kafka cluster,
> located in a data center distinct from the rest, using MirrorMaker. Once
> in the central cluster, most of this data will be fed into Hadoop for
> analytics purposes. However, with how we have Hadoop working right now,
> it must wait until it has received data from all of the other data
> centers for a specific time period before it has the green light to load
> that data into HDFS and process it. For example, say we have 3 remote
> (as in not central) data centers: DC1 has pushed all of its data up to
> 4:00 PM to the central data center, DC2 has pushed everything up to
> 3:30 PM, and DC3 is lagging behind and has only pushed data up to 2:00 PM.
> Then Hadoop processes all data tagged with modification times before
> 2:00 PM, and it must wait until DC3 catches up by pushing the 2:15, 2:30,
> etc. data to the central cluster before it can process the 3:00 PM data.
>
> So our question is: what is the best way to handle this
> time-period-ordered requirement on our data using a distributed messaging
> log like Kafka? We originally started using Kafka to move away from a
> batch-oriented backend data pipeline transport system in favor of a more
> streaming-focused system, but we still need to keep track of the latest
> common time period of data streaming in from the remote clusters.
>
>
> Cheers,
>
> Alex M.