I've been researching Kafka for our requirements and am trying to figure out the best way to implement multi-region failover (lowest complexity).
One requirement we have is that the offsets of the backup must match the primary. As I understand it, MirrorMaker does not (currently) guarantee that the target Kafka instance will have the same log offsets as the source Kafka instance. Our message processing pipeline will be strictly relying on topic-broker-partition-offset to avoid re-processing messages. Here's what I'm leaning towards, please share any crticism or thoughts: Assuming: - Two regions, Region1 (primary) and Region2 (backup) - Region2 must have the same offsets per topic-broker-partition-offset state - A few minutes of lost messages can be tolerated if Region1 is ever lost. - That it would be a mistake to attempt Kafka replication across regions and maintain a Zookeeper cluster across regions (because they weren't designed for the higher latency and link-loss issues and that there could be operational edge case bugs we won't catch/understand, etc) - That Region1 has multiple topics, brokers, partitions, replicas and a Zookeeper cluster. Only Region1 is in use operationally (gets all producer and consumer traffic). - That Region2 has the same configuration but receives no operational traffic (no producers, no consumers) but gets periodic rsync from Region1 - If Region1 is lost, we will start Kafka in Region2, it should startup at the appropriate offset (from last rysnc backup). Producers will be instructed to use Region2. - Region2 is now the new primary Kafka instance until we decide to switch back to Region1. This is quite simple and there is more data loss than I'd like, but the loss would be acceptable for our use case, considering the loss of Region1 should be a rare event (if ever). Questions: 1. Do you see any pitfalls or better ways to proceed? It seems this Kafka feature request would be a better solution (adding a MirrorMaker mode to maintain offsets https://issues.apache.org/jira/browse/KAFKA-658 ) one day. 2. What is the Rsync backup is interrupted when Region1 is lost? Is there the possibility the 2nd Kafka instance could be left in an un-workable state? For example, if a .log file is copied, but the corresponding .index is not completed. Can the .index file be re-created? It appears it can in 8.1 https://issues.apache.org/jira/browse/KAFKA-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel Thank you!