I'm using Kafka in a multi-tenant product. For legal reasons I need to be able to delete all data for a customer on demand, which seems to be a bit challenging.
I've come up with two solutions, but both have downsides.

1) Use a unique key for each message, enable log compaction, and run a process that writes tombstones (null messages) for each message that should be deleted. Since I hope to retain non-deleted messages indefinitely, this process will take longer and longer as the topics grow. It also means that I lose the option of using the message key for anything semantically significant. (There's a sketch of such a tombstone job in the P.S. below.)

2) Encrypt all messages with a customer-specific key. If the customer demands that their data be deleted, I throw away the key. This adds overhead when writing and reading data, makes it more complicated to integrate the Kafka topics with other pieces of my infrastructure, and adds the complexity of managing a key service that can securely generate and distribute the keys between producers and consumers.

Approach 2) could be implemented as a network proxy that sits in front of the Kafka cluster, transparently encrypting and decrypting messages. You'd select a "stream" by adding a suffix to the topic of your messages, e.g. "my.topic:some.stream", which would identify the key that should either be generated or looked up and used to encrypt the data. Each message body would be wrapped in a structure that identifies the key used, and would be unwrapped and decrypted when a consumer requests the message. (The P.S. also sketches what that envelope could look like.)

Does anyone have experience with this, or do you just rely on topic retention to delete old messages? I'd much prefer keeping the data in Kafka forever, since it's ideally suited for bootstrapping new systems, e.g. search indexes, analytics, etc.

Best regards,
Daniel Schierbeck
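
P.S. Here's a minimal sketch of what the tombstone job in approach 1 might look like. It assumes the topic has cleanup.policy=compact, and the hard-coded customer keys are made up; in reality I'd have to enumerate the customer's keys from a scan of the topic or a separate index, which is exactly the part that gets slower over time.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Approach 1: write a tombstone for every key owned by the customer.
// Assumes cleanup.policy=compact on the topic.
public class CustomerTombstoneJob {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Made-up keys; in practice these would be enumerated from the
            // topic itself or from a separate index.
            String[] customerKeys = { "customer-42:order-1", "customer-42:order-2" };
            final String tombstone = null; // a null value is the tombstone

            for (String key : customerKeys) {
                // Compaction will eventually drop both the tombstone and all
                // earlier records with the same key.
                producer.send(new ProducerRecord<>("my.topic", key, tombstone));
            }
            producer.flush();
        }
    }
}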
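
And here's a rough sketch of the envelope the proxy in approach 2 could wrap around each message body. AES-GCM is just an example cipher choice, and the KeyLookup interface is a placeholder for the key service:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Approach 2: every encrypted value carries the id of the customer key that
// was used, so the proxy (or a consumer) can look the key up, or fail to
// once the key has been deleted.
public class Envelope {

    // Placeholder for the key service mentioned above.
    public interface KeyLookup {
        SecretKey lookup(String keyId);
    }

    // Layout: [4-byte key id length][key id][12-byte IV][ciphertext]
    public static byte[] wrap(String keyId, SecretKey key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);

        byte[] keyIdBytes = keyId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + keyIdBytes.length + iv.length + ciphertext.length);
        buf.putInt(keyIdBytes.length);
        buf.put(keyIdBytes);
        buf.put(iv);
        buf.put(ciphertext);
        return buf.array();
    }

    public static byte[] unwrap(byte[] envelope, KeyLookup keys) throws Exception {
        ByteBuffer buf = ByteBuffer.wrap(envelope);
        byte[] keyIdBytes = new byte[buf.getInt()];
        buf.get(keyIdBytes);
        byte[] iv = new byte[12];
        buf.get(iv);
        byte[] ciphertext = new byte[buf.remaining()];
        buf.get(ciphertext);

        // If the customer's key has been thrown away, this lookup fails and
        // the payload is unrecoverable.
        SecretKey key = keys.lookup(new String(keyIdBytes, StandardCharsets.UTF_8));

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return cipher.doFinal(ciphertext);
    }
}

Once the customer's key has been deleted from the key service, unwrap() can no longer find it and the payload is effectively gone, which is the deletion guarantee I'm after.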