Hi,

After watching the talk https://www.confluent.io/kafka-summit-lon19/disaster-recovery-with-mirrormaker-2-0 and reading https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0#KIP-382, I am trying to understand how I would use MirrorMaker 2 (MM2) in the setup that I have. For now, we just need to handle DR requirements, i.e., we would not need active-active.
My requirements, more or less, are the following:

1) Currently, we have just one Kafka cluster, "primary", where all the producers produce to and all the consumers consume from.
2) In case "primary" crashes, we need another Kafka cluster, "secondary", to which we will move all the producers and consumers and keep working.
3) Once "primary" is recovered, we need to move back to it (as we were in #1).

To fulfill #2, I have thought of setting up a new Kafka cluster "secondary" and a replication procedure using MM2. However, it is not clear to me how to proceed. I will describe the high-level details so you guys can point out my misconceptions:

A) Initial situation. As in the example of KIP-382, in the primary cluster we will have a local topic "topic1" where the producers produce to and the consumers consume from. MM2 will create in the secondary cluster the remote topic "primary.topic1", where the local topic of the primary will be replicated. In addition, the consumer group information of the primary will also be replicated.

B) The primary Kafka cluster becomes unavailable. Producers are moved to produce into the local topic "topic1" of the secondary cluster, which was manually created there beforehand. In addition, consumers need to connect to the secondary to consume from the local topic "topic1", where the producers are now producing, and from the remote topic "primary.topic1", where the producers were producing before; i.e., consumers will need to aggregate both. This is because some consumers could have lag, so they will need to consume from both topics.
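To make A) concrete, this is roughly how I imagine the MM2 configuration for the one-way DR replication would look (a sketch; the cluster aliases and bootstrap addresses are placeholders I made up):

```properties
# Cluster aliases and connection info (addresses are placeholders)
clusters = primary, secondary
primary.bootstrap.servers = primary-broker:9092
secondary.bootstrap.servers = secondary-broker:9092

# One-way replication only: primary -> secondary (DR, not active-active)
primary->secondary.enabled = true
primary->secondary.topics = topic1

# Replicate consumer group information and emit checkpoints,
# so consumer progress can be recovered on the secondary
primary->secondary.groups = .*
primary->secondary.emit.checkpoints.enabled = true
```

Please correct me if any of these settings do not match what the KIP intends.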
In this situation, the local topic "topic1" in the secondary will receive new messages and will be consumed (its consumption information will also change), but the remote topic "primary.topic1" will not receive new messages, although it will still be consumed (its consumption information will change). At this point, my conclusion is that consumers need to consume from both topics: the new messages produced in the local topic, and the old messages for consumers that had a lag.

C) The primary cluster is recovered (here is where things get complicated for me). In the talk, the recovered primary is renamed to primary-2 and MM2 is configured for active-active replication. The result is the following: the secondary cluster ends up with a new remote topic (primary-2.topic1) that contains a replica of the new "topic1" created in the primary-2 cluster. The primary-2 cluster will have three topics: "topic1", a new topic where producers will produce in the near future; "secondary.topic1", which contains the replica of the local topic "topic1" in the secondary; and "secondary.primary.topic1", which is "topic1" of the old primary (obtained through the secondary).

D) Once all the replicas are in sync, producers and consumers will be moved to primary-2. Producers will produce to the local topic "topic1" of the primary-2 cluster. Consumers will connect to primary-2 to consume from "topic1" (new messages that come in), "secondary.topic1" (messages produced during the outage) and "secondary.primary.topic1" (old messages).

If topics have a retention time, e.g. 7 days, we could remove "secondary.primary.topic1" after a few days, leaving the situation as at the beginning. However, if another problem happens in the middle, the number of topics could become a little difficult to handle.

An additional question: if the topic is compacted, i.e., the topic is kept forever, would switchover operations imply adding an additional prefix to the topic name each time?

I would appreciate some guidance with this.

Regards
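For C), my understanding is that the failback would be driven by a second MM2 configuration between the secondary and the recovered cluster, roughly like the following (again a sketch with placeholder addresses; I am assuming the recovered cluster is re-aliased as "primary-2" as in the talk, and that replicating the remote topic "primary.topic1" from the secondary is what produces "secondary.primary.topic1" on primary-2):

```properties
clusters = primary-2, secondary
primary-2.bootstrap.servers = primary2-broker:9092
secondary.bootstrap.servers = secondary-broker:9092

# Active-active during failback, as described in the talk
secondary->primary-2.enabled = true
primary-2->secondary.enabled = true

# Replicate everything back from the secondary, so primary-2
# ends up with "secondary.topic1" and "secondary.primary.topic1"
secondary->primary-2.topics = .*
primary-2->secondary.topics = topic1
```

Is this the intended way to drive the failback, or am I missing something?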