Hi Apache Community,

My company has the following use case. We have multiple geographically
disparate data centers, each with its own Kafka cluster, and we want to
aggregate all of these centers' data into one central Kafka cluster, located
in a data center distinct from the rest, using MirrorMaker. Once in the
central cluster, most of this data will be fed into Hadoop for analytics
purposes. However, with how we have Hadoop working right now, it must wait
until it has received data from all of the other data centers for a
specific time period before it has the green light to load that data into
HDFS and process it. For example, say we have 3 remote (as in not central)
data centers, and DC1 has pushed to the central data center all of its data
up to 4:00 PM, DC2 has pushed everything up to 3:30 PM, and DC3 is lagging
behind and has only pushed data up to the 2:00 PM time period. Then Hadoop
processes all data tagged with modification times before 2:00 PM, and it
must wait until DC3 catches up by pushing its 2:15, 2:30, etc. data to the
central cluster before it can process the 3:00 PM data.
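To make the example concrete, here is a minimal sketch (not our actual code; the function and DC names are hypothetical) of the "latest common time period" logic: it is just the minimum, across all remote DCs, of the latest period each has fully pushed. Everything at or before that low watermark is safe for Hadoop to load.

```python
from datetime import datetime

def low_watermark(latest_pushed):
    """latest_pushed maps DC name -> latest fully-pushed period timestamp.

    The low watermark is the most recent period for which *every* DC has
    delivered its data; periods after it must wait for the slowest DC.
    """
    return min(latest_pushed.values())

# Hypothetical state matching the example above (dates are arbitrary):
latest = {
    "DC1": datetime(2015, 6, 1, 16, 0),   # pushed through 4:00 PM
    "DC2": datetime(2015, 6, 1, 15, 30),  # pushed through 3:30 PM
    "DC3": datetime(2015, 6, 1, 14, 0),   # lagging: only through 2:00 PM
}

# Hadoop may process everything up to 2:00 PM; the 2:15+ periods wait
# until DC3 catches up.
print(low_watermark(latest))
```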

So our question is: What is the best way to handle this time-period-ordered
requirement on our data using a distributed messaging log like Kafka? We
originally started using Kafka to move away from a batch-oriented backend
data pipeline transport system in favor of a more streaming-focused system,
but we still need to keep track of the latest common time period of data
streaming in from the remote clusters.


Cheers,

Alex M.
