Hey Christian,

my understanding is that you have an upstream system publishing data via a Kafka topic to a downstream system, and your goal is to delete the PII data both from Kafka and from the downstream system via a message published through the same topic. Is my understanding correct?

Does the coordinator expect some reply message from the downstream system (e.g. "AnonymizationSuccessfulEvent")? Do you maybe also want to prevent downstream systems from accessing PII in in-flight messages if a delete request arrives in the meantime? Is your topic log-compacted or not?
Everything below is about data retention within Kafka topics; the downstream system is not in scope.

For retention in non-compacted topics, you can expect that only messages **published** within the last retention.ms are in the topic, and everything older is deleted. So you could, for example, set retention.ms to 10 seconds and have the coordinator simply assume that after 10 seconds the data has been deleted (to be honest, I'm not aware of a way to check whether a given message was deleted, other than re-reading the topic from the beginning). Naturally, this solution carries the requirement that the downstream system processes the messages within the same amount of time, so that no messages are lost. This is something that definitely requires fine-tuning.

For retention in compacted topics: compaction on its own will not remove the data for a key. To get it removed, you need to publish a tombstone record (a record with the same key and a null value). So even with low activity there must be a new message for the deletion to occur (triggering compaction). My understanding of the documentation is that with a short segment.ms configuration (something like 1 second), you should be able to assume that the compaction has occurred, so only the tombstone record remains in the topic. In this case the coordinator can also assume that, segment.ms after the tombstone record is published, the data is gone from Kafka.

Kind regards,
Sandor

On Wed, 19 Aug 2020 at 19:49, Apolloni, Christian <christian.apoll...@baloise.ch> wrote:

> Hi Sandor, thanks again for your reply.
>
> > > If you have a non-log-compacted topic, after `retention.ms` the message
> > > (along with the PII) gets deleted from the Kafka message store without any
> > > further action, which should satisfy GDPR requirements:
> > > - you are handling PII in Kafka for a limited amount of time
> > > - you are processing the data for the given purpose it was given
> > > - the data will automatically be deleted without any further steps
> > > If you have a downstream system, you should also be able to publish a
> > > message through Kafka so that the downstream system executes its delete
> > > processes - if required. We implemented a similar process where we
> > > published an AnonymizeOrder event, which instructed downstream systems to
> > > anonymize the order data in their own data store.
>
> Our problem is, the data could have been published shortly before the
> system receives a delete order from the "coordinator". This is because the
> data might have been mutated and the update needs to be propagated to
> consumer systems. If we go with a retention-period of days we would only be
> able to proceed with subsequent systems in the coordinated chain with too
> much of a delay. Going with an even shorter retention would be problematic.
>
> > > If you have a log-compacted topic:
> > > - yes, I have the same understanding as you have on the active segment.
> > > - You can set the segment.ms
> > > <https://kafka.apache.org/documentation/#segment.ms> property to force the
> > > compaction to occur within an expected timeframe.
> > >
> > > In general what I understand is true in both cases that Kafka gives you
> > > good enough guarantees to either remove the old message after retention.ms
> > > milliseconds or execute the topic compaction after segment.ms time that it
> > > is unnecessary to try to figure out more specifically in what exact moment
> > > the data is deleted.
> > > Setting these configurations should give you enough
> > > guarantee that the data removal will occur - if not, that imo should be
> > > considered a bug and reported back to the project.
>
> We investigated the max.compaction.lag.ms parameter which was introduced
> in KIP-354 and from our understanding the intent is exactly what we'd like
> to accomplish, but unless we missed something we have noticed new segments
> are rolled only if new messages are appended. If the topic has very low
> activity it can be that no new message is appended and the segment is left
> active indefinitely. This means the cleaning for that segment might remain
> also indefinitely stalled. We are unsure whether our understanding is
> correct and whether it's a bug or not.
>
> In general, I think part of the issue is that the system receives the
> delete order at the time that it has to be performed: we don't deal with
> the processing of the required waiting periods, that's what happens in the
> "coordinator system". The system with the data to be deleted receives the
> order and has to perform the deletion immediately.
>
> Kind regards,
>
> --
> Christian Apolloni
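
P.S. To make the timing assumption from my reply concrete, here is a minimal sketch in Python. It only encodes the "wait retention.ms / segment.ms and then assume the data is gone" idea; it does not talk to a broker and cannot confirm that anything was actually removed. The config keys (`cleanup.policy`, `retention.ms`, `segment.ms`) are real topic-level settings, but the values are just the illustrative ones from above, and the helper name `assumed_deletion_time` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Topic-level settings named in this thread. The values are the illustrative
# ones from the reply (10 seconds, 1 second), not recommendations.
NON_COMPACTED_TOPIC = {"cleanup.policy": "delete", "retention.ms": "10000"}
COMPACTED_TOPIC = {"cleanup.policy": "compact", "segment.ms": "1000"}


def assumed_deletion_time(topic_config: dict, published_at: datetime) -> datetime:
    """Earliest moment the coordinator would treat the data as gone.

    Hypothetical helper, not part of any Kafka client.
    For a "delete" topic, published_at is when the PII record was published.
    For a "compact" topic, published_at is when the tombstone was published.
    This only encodes the assumption from the reply; it does not ask the
    broker whether anything was actually removed.
    """
    policy = topic_config["cleanup.policy"]
    if policy == "delete":
        wait_ms = int(topic_config["retention.ms"])
    elif policy == "compact":
        wait_ms = int(topic_config["segment.ms"])
    else:
        raise ValueError(f"unsupported cleanup.policy: {policy}")
    return published_at + timedelta(milliseconds=wait_ms)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(assumed_deletion_time(NON_COMPACTED_TOPIC, now))
    print(assumed_deletion_time(COMPACTED_TOPIC, now))
```

Keep in mind the caveat you raised: if segments are only rolled when new messages are appended, the compacted-topic estimate may be too optimistic on a low-traffic topic, so the coordinator would probably want a safety margin on top of this.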