Igor, currently the best available disaster recovery strategy looks something like this:
- Use MirrorMaker to replicate data from a source cluster to a target cluster. MM should be co-located with the target cluster to minimize producer lag. If you have multiple active data centers, each would need it's own set of MM clusters to copy from the other DCs. - Use a topic naming convention and MM whitelists to prevent cycles; otherwise, messages would be replicated back-and-forth indefinitely between clusters. Alternatively you can use passive "aggregation clusters" in each DC to prevent cycles. - To failover producers, you can usually just point them at another cluster, unless your use-case would not allow this. Depending on your multi-cluster topic naming convention, you may need to change the topic that the producer sends to. - To failover consumers, you will need to point them at another cluster _and_ figure out what offsets they should resume consuming from, since offsets are not consistent between mirrored clusters. To do this, use timestamp-based offset translation as described below. Depending on your multi-cluster topic naming convention, you may need to change the topic the consumer reads from. - To failback consumers, you'll need to 1) ensure that the original cluster's MM clusters are caught up s.t. the original clusters have the latest records. Then use timestamp-based offset translation as described below. - To failback producers, just point them at their primary clusters again. Don't failback producers until after you failback consumers, or you'll have out-of-order issues. To translate consumers offsets based on timestamps, you need three pieces of information: 1) the max replication lag, i.e. how far behind MM could have been from real-time; 2) the max consumer lag, i.e. how far behind your consumers could have been from real-time; and 3) the time at which the failure occurred. Subtract the consumer lag and replication lag from the failure timestamp to get a recovery timestamp. This is the point in time you'll need to rewind your consumers to prevent them from skipping over records during failover. Then: 1) Bring up the consumer in the new cluster with auto.offset.reset=earliest. 2) Use kafka-consumer-groups.sh tool to "reset" each migrated consumer to the recovery timestamp. Something like: $ ./bin/kafka-consumer-groups.sh --reset-offsets --group foo --topic bar --to-datetime 123456 --boostrap-server xyz:9092 Do this for every consumer x every topic. Your consumers will consume any records that were replicated to the target cluster prior to the failure, and will process any new records produced to the target cluster during failover. The consumers will see records in roughly the right order -- new records won't be processed before old records. N.B. the following issues: - Your consumers will see duplicate records, since they will have been reset to a previous point in time. - The order of records in the source and target clusters will not be consistent - If you failback producers before consumers, your consumers will see new records intermixed with old records. - The # of partitions and partitioning schemes may not be consistent between source and target clusters across all topics. - There is a lot of manual intervention to get this to work. The process does not scale well when many consumers and topics are involved. I have published KIP-382 "MirrorMaker 2.0" which directly addresses this mess. With MM2, it's possible to automate failover and failback based on replication checkpoints (not timestamps) while maintaining consistent record order and partitioning, among other features. Would love your support for the KIP! Thanks. Ryanne -- www.ryannedolan.info On Mon, Nov 19, 2018 at 10:26 AM igor polyakov <ipoly...@hotmail.com> wrote: > What is the best way for Kafka multi-dc deployments to pick up processing > at another location if a primary location fails? What features of Kafka are > the most relevant for continuous processing? > > Sincerely, > Igor > Check these out ! > [Ad Placement 1]< > http://mailwords.us/mailwords/consumer/PickYourAddServlet?userID=user1&imageID=b21f8e89-4a11-4b8a-9276-66a068c4f231> > [Ad Placement 2] < > http://mailwords.us/mailwords/consumer/PickYourAddServlet?userID=user1&imageID=6588f140-01db-43c2-9c34-36edd8fcfec3> > [Ad Placement 3] < > http://mailwords.us/mailwords/consumer/PickYourAddServlet?userID=user1&imageID=c8a1ae81-ffe9-4a63-96d8-2ef1a9588e16 > > > > > >