I'm using Kafka in a multi-tenant product. For legal reasons I need to be
able to delete all data for a customer on demand, which seems to be a bit
challenging.

I've come up with two solutions, but both have downsides.

1) Use a unique key for each message, enable log compaction, and run a
process that writes tombstones (NULL messages) for every key that should be
deleted. Since I hope to retain non-deleted messages indefinitely, the set
of keys to tombstone will keep growing, so this process will take longer
and longer to run. It also means I lose the option of using the message key
for anything semantically significant. (There's a rough sketch of such a
tombstone writer below this list.)

2) Encrypt all messages using a customer-specific key. If the customer
demands that their data be deleted, I throw away the key. This adds
overhead when writing and reading data, makes it more complicated to
integrate the Kafka topics with other pieces of my infrastructure, and adds
the complexity of managing a key service that can securely generate the
keys and distribute them to producers and consumers.
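
For reference, here's roughly what the tombstone writer in 1) might look
like, assuming the topic is configured with cleanup.policy=compact and that
I can enumerate a customer's keys from somewhere else in my system (the
bootstrap address and String serialization are just placeholders):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.List;
    import java.util.Properties;

    public class CustomerEraser {
      // Writes a tombstone (null value) for every key belonging to the
      // customer; compaction then removes the original records over time.
      public static void erase(String topic, List<String> customerKeys) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          for (String key : customerKeys) {
            producer.send(new ProducerRecord<>(topic, key, null));
          }
          producer.flush();
        }
      }
    }

Note that the tombstones themselves are only kept for delete.retention.ms
before compaction drops them as well.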

Approach 2) could be implemented as a network proxy that sits in front of
the Kafka cluster, transparently encrypting and decrypting messages. You'd
select a "stream" by adding a suffix to the message topic, e.g.
"my.topic:some.stream", which would identify the key to either generate or
look up and use to encrypt the data. Each message body would be wrapped in
an envelope that identifies the key used; when a consumer requests the
message, the proxy would look that key up again, unwrap the envelope, and
decrypt the body.
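
To make the envelope idea concrete, here's a rough sketch of what the proxy
(or a client-side codec) might do. The AES-GCM choice, the binary layout,
and the KeyStore interface are all assumptions on my part, not anything
Kafka provides:

    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.GCMParameterSpec;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.SecureRandom;

    public class EnvelopeCodec {
      // Assumed key service lookup; returns null if the key was deleted.
      public interface KeyStore {
        SecretKey lookup(String keyId);
      }

      private static final SecureRandom RANDOM = new SecureRandom();

      // Encrypts the body and prefixes it with the key id and IV.
      public static byte[] wrap(String keyId, SecretKey key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[12];
        RANDOM.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);

        byte[] keyIdBytes = keyId.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + keyIdBytes.length + iv.length + ciphertext.length);
        buf.putInt(keyIdBytes.length).put(keyIdBytes).put(iv).put(ciphertext);
        return buf.array();
      }

      // Reads the key id back out, looks the key up, and decrypts.
      public static byte[] unwrap(byte[] envelope, KeyStore keys) throws Exception {
        ByteBuffer buf = ByteBuffer.wrap(envelope);
        byte[] keyIdBytes = new byte[buf.getInt()];
        buf.get(keyIdBytes);
        byte[] iv = new byte[12];
        buf.get(iv);
        byte[] ciphertext = new byte[buf.remaining()];
        buf.get(ciphertext);

        SecretKey key = keys.lookup(new String(keyIdBytes, StandardCharsets.UTF_8));
        if (key == null) {
          // Key has been thrown away: the data is effectively deleted.
          return null;
        }
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return cipher.doFinal(ciphertext);
      }
    }

Once the key for a stream is gone, the lookup fails and the messages become
unreadable, which is the whole point of approach 2).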

Does anyone have experience with this, or do you just let the Kafka topic
delete old messages? I'd much prefer keeping the data in Kafka forever, as
it's ideally suited for bootstrapping new systems, e.g. search indexes,
analytics, etc.

Best regards,
Daniel Schierbeck
