I am looking for advice on how to handle disasters that are not covered by the 
official replication mechanisms, whether intra-cluster replication (via 
replication factor and producer acks) or multi-cluster replication (using 
Confluent Replicator).
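For reference, by "replication factor and producer acks" I mean settings along 
the lines of the following Java sketch (the bootstrap address and the exact 
values are illustrative, not our actual configuration):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerDurabilitySketch {
    public static Properties durableProducerConfig() {
        // Topics are created with replication.factor=3 and min.insync.replicas=2,
        // so an acknowledged write survives the loss of a single broker.
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true"); // avoid duplicates on retry
        // serializers etc. omitted
        return props;
    }
}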

We are looking into using Kafka not only as a broker for decoupling, but also 
as an event store.
For us, data is mission critical and must be retained indefinitely, with the 
option to compact away individual records (required by GDPR).
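Concretely, we have topics along these lines in mind (a sketch only; the topic 
name, sizing, and bootstrap address are illustrative). GDPR deletion would be 
handled by publishing a tombstone for the key to be forgotten:

import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompactedEventStoreSketch {
    public static void main(String[] args) throws Exception {
        // A compacted topic: the latest record per key is retained indefinitely.
        try (Admin admin = Admin.create(Map.of("bootstrap.servers", "localhost:9092"))) {
            NewTopic topic = new NewTopic("customer-events", 12, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }

        // GDPR "right to be forgotten": a null value (tombstone) removes the key
        // once the log cleaner has run over the segment.
        Map<String, Object> producerProps = Map.of(
                "bootstrap.servers", "localhost:9092",
                "key.serializer", StringSerializer.class.getName(),
                "value.serializer", StringSerializer.class.getName(),
                "acks", "all");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("customer-events", "customer-42", null)).get();
        }
    }
}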

We are especially interested in two types of disasters:

1. Unintentional or malicious use of the Kafka API to create or compact 
messages, as well as deletion of topics or logs.

2. Loss of tail data from partitions if all in-sync replicas (ISR) fail in the 
primary cluster. A special case of this is loss of the tail of the offsets 
commit topic (__consumer_offsets), which results in loss of consumer progress.

Ad (1)
Replication does not guard against disaster (1), as the offending messages and 
deletions spread to the replicas.
A naive solution is simply to secure the cluster and ensure that there are no 
errors in the code (...), but anyone with an ounce of software experience knows 
that things go wrong.
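The most concrete mitigation we can see for (1) is broker-side authorization, 
e.g. ACLs that deny topic deletion outright. A minimal sketch, assuming an 
authorizer is configured on the brokers (principal, host, and bootstrap address 
are illustrative):

import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class LockDownTopicsSketch {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.of("bootstrap.servers", "localhost:9092"))) {
            // Deny DELETE on all topics to everyone; grant write access narrowly elsewhere.
            AclBinding denyDelete = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "*", PatternType.LITERAL),
                    new AccessControlEntry("User:*", "*", AclOperation.DELETE, AclPermissionType.DENY));
            admin.createAcls(List.of(denyDelete)).all().get();
        }
    }
}

This only restricts the API surface, of course; it does not help against bugs 
in code that legitimately holds write access.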

Ad (2)
The potential impact of disaster (2) is much worse. In case of total data 
center loss, the secondary cluster will be missing the tail of every partition 
that was not fully replicated. Besides the data loss itself, there are now 
inconsistencies between related topics and partitions, which breaks the state 
of the system.
Granted, the likelihood of total data center loss is low, but there is a reason 
we run multi-center setups.
The special case of lost consumer offsets results in double processing of 
events once we resume from an older offset. While idempotency is a solution, it 
might not always be possible or desirable.
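For completeness, this is roughly what we mean by idempotency on the consumer 
side: deduplicate by a unique event id so that a rewind to an older offset does 
not apply events twice. A sketch only; topic, group id, and the in-memory id 
set are illustrative (a real implementation would persist the ids together with 
the downstream state):

import java.time.Duration;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DedupingConsumerSketch {
    public static void main(String[] args) {
        Map<String, Object> props = Map.of(
                "bootstrap.servers", "localhost:9092",
                "group.id", "projection-builder",
                "enable.auto.commit", "false",
                "key.deserializer", StringDeserializer.class.getName(),
                "value.deserializer", StringDeserializer.class.getName());

        // In-memory for illustration only.
        Set<String> processedEventIds = new HashSet<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    String eventId = record.key(); // assumes the key carries a unique event id
                    if (processedEventIds.add(eventId)) {
                        apply(record.value()); // hypothetical side effect
                    }
                }
                consumer.commitSync();
            }
        }
    }

    private static void apply(String event) { /* update downstream state */ }
}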

Once either type of disaster has occurred, there is no way to restore the lost 
data, and even worse, we cannot restore the cluster to a point in time where it 
is consistent. We could probably look our customers in the eye and tell them 
that they lost a day's worth of progress, but we cannot tell them that the 
system is in a state of perpetual inconsistency.

The only solution we can think of right now is to shut the primary cluster down 
(ensuring that no new events are produced) and then copy all files to some 
secure location, i.e. effectively creating a backup that allows restoration to 
a consistent point in time. Though the tail (backup-wise) will be lost in case 
of disaster, we are at least assured a consistent state to restore to.
As a useful side effect, such backup files can also be used to create 
environments for testing or other destructive purposes.
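As a sketch, the backup step itself could be as simple as walking the brokers' 
log.dirs after a clean shutdown and copying the segments elsewhere. Paths below 
are illustrative, and topic/ACL metadata in ZooKeeper would presumably have to 
be captured separately for a full restore:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class ColdBackupSketch {
    public static void main(String[] args) throws IOException {
        // Assumes all brokers have been stopped cleanly first, so the log
        // segments on disk form a consistent point in time.
        Path logDir = Path.of("/var/kafka/data");              // broker log.dirs (illustrative)
        Path backupDir = Path.of("/backups/kafka/snapshot-001"); // target location (illustrative)

        try (Stream<Path> paths = Files.walk(logDir)) {
            for (Path source : (Iterable<Path>) paths::iterator) {
                Path target = backupDir.resolve(logDir.relativize(source));
                if (Files.isDirectory(source)) {
                    Files.createDirectories(target);
                } else {
                    Files.copy(source, target, StandardCopyOption.COPY_ATTRIBUTES);
                }
            }
        }
    }
}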

Does anyone have a better idea?


Venlig hilsen / Best regards

Henning Røigaard-Petersen
Principal Developer
MSc in Mathematics

h...@edlund.dk
Direct phone +45 36150272



Edlund A/S
La Cours Vej 7
DK-2000 Frederiksberg
Tel +45 36150630
www.edlund.dk
