Hi Apache Community,
My company has the following use case. We have multiple geographically dispersed data centers, each with its own Kafka cluster, and we want to use MirrorMaker to aggregate all of these centers' data into one central Kafka cluster located in a separate data center. Once in the central cluster, most of this data will be fed into Hadoop for analytics purposes.

However, with how we have Hadoop working right now, it must wait until it has received data from all of the remote data centers for a specific time period before it has the green light to load that data into HDFS and process it. For example, say we have 3 remote (as in not central) data centers. DC1 has pushed all of its data up to 4:00 PM to the central data center, DC2 has pushed everything up to 3:30 PM, and DC3 is lagging behind and has only pushed data up to 2:00 PM. Then Hadoop processes all data tagged with modification times before 2:00 PM, and it must wait until DC3 catches up by pushing its 2:15, 2:30, etc. data to the central cluster before it can process the 3:00 PM data.

So our question is: what is the best way to handle this time-period-ordered requirement on our data using a distributed messaging log like Kafka? We originally started using Kafka to move away from a batch-oriented backend data pipeline transport system in favor of a more streaming-focused system, but we still need to keep track of the latest common time period of data streaming in from the remote clusters.

Cheers,
Alex M.
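
P.S. To make the requirement concrete, here is a rough Python sketch of the rule described above. It is not our actual code. The function names, the progress dictionary, and the calendar date are all made up for illustration, and it simply assumes that we can somehow learn, for each remote data center, the latest time up to which its data has fully arrived in the central cluster (which is really the part we are asking about).

```python
from datetime import datetime

def latest_common_time(progress):
    """Latest time that every remote data center has reached.

    `progress` maps a (hypothetical) data center name to the latest time
    up to which all of its data has arrived in the central cluster.
    """
    return min(progress.values())

def ready_periods(period_ends, progress):
    """Return the period end times that are complete for every data center."""
    cutoff = latest_common_time(progress)
    return [end for end in period_ends if end <= cutoff]

# The example from this email; the date itself is arbitrary.
progress = {
    "DC1": datetime(2015, 1, 1, 16, 0),   # pushed everything up to 4:00 PM
    "DC2": datetime(2015, 1, 1, 15, 30),  # pushed everything up to 3:30 PM
    "DC3": datetime(2015, 1, 1, 14, 0),   # lagging, only up to 2:00 PM
}

# Candidate period boundaries that Hadoop might want to load.
period_ends = [
    datetime(2015, 1, 1, 13, 0),
    datetime(2015, 1, 1, 14, 0),
    datetime(2015, 1, 1, 15, 0),
]

cutoff = latest_common_time(progress)
loadable = ready_periods(period_ends, progress)
```

In this example the cutoff is DC3's 2:00 PM, so only the periods ending at or before 2:00 PM may be loaded; the 3:00 PM period has to wait until DC3 catches up.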