I am looking for advice on how to handle disasters that are not covered by the official replication mechanisms, whether intra-cluster replication (via replication factor and producer acks) or multi-cluster replication (using Confluent Replicator).
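For context, the intra-cluster side of this is roughly the following; a minimal sketch only, not our actual setup (broker address, topic name and sizing are placeholders):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class ReplicationSketch {

    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        // Topic with replication factor 3 and min.insync.replicas=2:
        // an acks=all write is only acknowledged once it exists on at least two replicas.
        try (AdminClient admin = AdminClient.create(adminProps)) {
            NewTopic events = new NewTopic("events", 12, (short) 3)
                    .configs(Map.of(
                            "min.insync.replicas", "2",
                            "cleanup.policy", "compact")); // compaction so values for a key can be overwritten
            admin.createTopics(Collections.singleton(events)).all().get();
        }

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        producerProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // no duplicates on producer retry

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("events", "aggregate-42", "{\"type\":\"SomethingHappened\"}")).get();
        }
    }
}

With these settings an acknowledged write survives the loss of any single broker, but as described under (2) below it does not survive losing all in-sync replicas at once, and none of it guards against writes or deletions that are themselves wrong.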
We are looking into using Kafka not only as a broker for decoupling, but also as an event store. For us, data is mission-critical and has unlimited retention, with the option to compact (required by GDPR).

We are especially interested in two types of disasters:

1. Unintentional or malicious use of the Kafka API to create or compact messages, as well as to delete topics or logs.

2. Loss of tail data from partitions if all in-sync replicas (ISR) fail in the primary cluster. A special case of this is loss of the tail of the offset commit topic, which results in loss of consumer progress.

Ad (1)
Replication does not guard against disaster (1), as the resulting messages and deletions spread to the replicas. A naive solution is to simply secure the cluster and ensure that there are no errors in the code (...), but anyone with an ounce of experience with software knows that stuff goes wrong.

Ad (2)
The potential of disaster (2) is much worse. In case of total data center loss, the secondary cluster will lose the tail of every partition that was not fully replicated. Besides the data loss itself, there are now inconsistencies between related topics and partitions, which breaks the state of the system. Granted, the likelihood of total data center loss is low, but there is a reason we run multi-data-center setups.

The special case of losing consumer offsets results in double processing of events once we resume from an older offset. While idempotency is a solution, it might not always be possible or desirable (see the sketch in the P.S. below for the kind of deduplication we have in mind).

Once either type of disaster has occurred, there is no way to restore the lost data, and even worse, we cannot restore the cluster to a point in time at which it is consistent. We could probably look our customers in the eye and tell them that they lost a day's worth of progress, but we cannot tell them that the system is in a state of perpetual inconsistency.

The only solution we can think of right now is to shut the primary cluster down (ensuring that no new events are produced) and then copy all log files to some secure location, effectively creating a backup that allows restoration to a consistent point in time. Though the tail (backup-wise) will still be lost in case of disaster, we are at least guaranteed a consistent state to restore to. As a useful side effect, such backup files can also be used to create environments for testing or other destructive purposes.

Does anyone have a better idea?

Venlig hilsen / Best regards

Henning Røigaard-Petersen
Principal Developer
MSc in Mathematics
h...@edlund.dk
Direct phone +45 36150272

Edlund A/S
La Cours Vej 7
DK-2000 Frederiksberg
Tel +45 36150630
www.edlund.dk
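P.S. When we say idempotency might not always be possible or desirable, the kind of workaround we have in mind is deduplication along these lines; again only a sketch (topic and group id are placeholders, and it assumes the record key is a unique event id; the "seen" set would of course have to be durable and written transactionally with the downstream results):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

public class DeduplicatingConsumerSketch {

    // Stand-in for a durable store; in practice the "seen" markers must be committed
    // together with the processing results, or a replay is still unsafe.
    private static final Set<String> processedEventIds = new HashSet<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // placeholder
        props.put("group.id", "projection-1");            // placeholder
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Assumes the key is a unique event id; add() returns false if we
                    // have already processed this id, i.e. the record is a replay.
                    if (processedEventIds.add(record.key())) {
                        apply(record);
                    }
                }
            }
        }
    }

    private static void apply(ConsumerRecord<String, String> record) {
        // Apply the event to downstream state (database write, projection update, ...).
    }
}

The same idea can also be keyed on a per-aggregate sequence number instead of a global set, but either way it moves the consistency problem into whatever store holds the markers, which is why we would prefer a cluster-level answer.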