Hi,

After watching the talk
https://www.confluent.io/kafka-summit-lon19/disaster-recovery-with-mirrormaker-2-0
and reading
https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0#KIP-382
I am trying to understand how I would use MirrorMaker 2 (MM2) in my
setup. For now, we only need to handle disaster-recovery (DR)
requirements, i.e., we do not need active-active replication.

My requirements, more or less, are the following:

1) Currently, we have a single Kafka cluster, "primary", that all
producers produce to and all consumers consume from.
2) If "primary" crashes, we need a second Kafka cluster, "secondary",
to which we can move all producers and consumers and keep working.
3) Once "primary" is recovered, we need to move back to it (returning
to the situation in #1).

To fulfill #2, I plan to set up a new Kafka cluster, "secondary", and
configure a replication procedure with MM2. However, it is not clear to
me how to proceed.

Let me describe the high-level details so you can point out my
misconceptions:

A) Initial situation. As in the KIP-382 example, the primary cluster
has a local topic, "topic1", that producers produce to and consumers
consume from. MM2 will create in the secondary cluster the remote topic
"primary.topic1", into which the local "topic1" from the primary is
replicated. In addition, the consumer group information of the primary
will also be replicated.
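For reference, this is the minimal one-way (active/passive) MM2
configuration I have in mind. It is only a sketch: the cluster aliases
and bootstrap addresses are placeholders, and the property names are as
I understand them from the MM2 documentation:

```properties
# mm2.properties - one-way (active/passive) replication sketch
clusters = primary, secondary
primary.bootstrap.servers = primary-broker:9092
secondary.bootstrap.servers = secondary-broker:9092

# Replicate only primary -> secondary
primary->secondary.enabled = true
primary->secondary.topics = topic1
secondary->primary.enabled = false

# Emit checkpoints so consumer group offsets can be translated;
# sync.group.offsets.enabled requires Kafka 2.7+ (KIP-545)
primary->secondary.emit.checkpoints.enabled = true
primary->secondary.sync.group.offsets.enabled = true
```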

B) The primary Kafka cluster is not available. Producers are moved to
produce into a local "topic1" on the secondary (created manually). In
addition, consumers need to connect to the secondary and consume from
both the local topic "topic1", where producers are now producing, and
the remote topic "primary.topic1", where producers were producing
before, i.e., consumers need to aggregate the two. This is because some
consumers may have lag, so they need to consume from both. In this
situation, the local topic "topic1" on the secondary receives new
messages and is consumed (its consumption information also changes),
while the remote topic "primary.topic1" receives no new messages but is
still consumed (its consumption information changes).

At this point, my conclusion is that consumers need to consume from
both topics (the new messages produced in the local topic, plus the old
messages for consumers that had lag).
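To make that concrete, consumers could subscribe with a regex that
matches both the local topic and its prefixed remote counterparts (in
the Java client this pattern would be passed to subscribe(Pattern)).
A toy sketch of such a pattern, using the topic names from the KIP
example:

```python
import re

# Pattern matching both the local topic and any remote (prefixed)
# replicas, e.g. "topic1" and "primary.topic1". The "([a-z0-9-]+\.)*"
# part allows zero or more cluster-alias prefixes, as added by MM2's
# DefaultReplicationPolicy.
pattern = re.compile(r"([a-z0-9-]+\.)*topic1")

topics = ["topic1", "primary.topic1", "other", "topic12"]
matched = [t for t in topics if pattern.fullmatch(t)]
print(matched)  # ['topic1', 'primary.topic1']
```

Note that fullmatch (anchored matching) is what keeps "topic12" from
matching accidentally.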

C) The primary cluster is recovered (this is where things get
complicated for me). In the talk, the recovered primary is renamed
"primary-2" and MM2 is reconfigured for active-active replication. The
result is the following: the secondary cluster ends up with a new
remote topic ("primary-2.topic1") containing a replica of the new
"topic1" created on the primary-2 cluster. The primary-2 cluster has 3
topics: "topic1", a new topic that producers will produce to in the
near future; "secondary.topic1", containing the replica of the local
"topic1" on the secondary; and "secondary.primary.topic1", which is the
"topic1" of the old primary (obtained through the secondary).
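If I follow the talk correctly, the MM2 configuration at this point
would be extended to something like the following bidirectional sketch
(again, aliases and bootstrap addresses are placeholders, not a tested
configuration):

```properties
# mm2.properties after recovery - active-active sketch
clusters = primary-2, secondary
primary-2.bootstrap.servers = primary-broker:9092
secondary.bootstrap.servers = secondary-broker:9092

# Replicate in both directions
primary-2->secondary.enabled = true
primary-2->secondary.topics = .*
secondary->primary-2.enabled = true
secondary->primary-2.topics = .*
```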

D) Once all the replicas are in sync, producers and consumers are moved
to primary-2. Producers produce to the local topic "topic1" of the
primary-2 cluster. Consumers connect to primary-2 and consume from
"topic1" (new incoming messages), "secondary.topic1" (messages produced
during the outage) and "secondary.primary.topic1" (old messages).

If topics have a retention time, e.g. 7 days, we could remove
"secondary.primary.topic1" after a few days, leaving the situation as
at the beginning. However, if another problem happens in the meantime,
the number of topics could become a little difficult to handle.

An additional question: if the topic is compacted, i.e., the topic is
kept forever, would each switchover add an additional prefix to the
topic name?
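To make the question concrete, this is how I understand the
DefaultReplicationPolicy from KIP-382 to compose remote topic names,
one prefix per replication hop (the function name here is mine, just a
toy model, not the actual Kafka API):

```python
# Toy model of MM2's DefaultReplicationPolicy naming, as I read it:
# remote name = "<source_cluster_alias><separator><topic>", applied on
# every hop, so replicas of replicas accumulate one prefix per hop.
SEPARATOR = "."

def remote_topic_name(source_alias: str, topic: str) -> str:
    return f"{source_alias}{SEPARATOR}{topic}"

# One hop: "topic1" from "primary" replicated to the secondary cluster.
hop1 = remote_topic_name("primary", "topic1")
# Second hop: that remote topic replicated onward to "primary-2".
hop2 = remote_topic_name("secondary", hop1)
print(hop1)  # primary.topic1
print(hop2)  # secondary.primary.topic1
```

So, if that reading is right, every failover that replicates the
compacted topic onward would indeed add another path component to its
name.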

I would appreciate some guidance with this.

Regards
