Hey Christian,

My understanding is that you have an upstream system publishing data via a
Kafka topic to a downstream system, and your goal is to delete the PII data
both from Kafka and from the downstream system via a message published
through the same topic. Is my understanding correct? Does the coordinator
expect some reply message from the downstream system (e.g.
"AnonymizationSuccessfulEvent")?
Do you maybe also want to prevent downstream systems from accessing PII in
in-flight messages if a delete request arrives in the meantime?
Is your topic log-compacted or not?

Everything below is for the data retention within Kafka topics, the
downstream system is not in scope:

For retention in non-compacted topics, you can expect that only messages
**published** within the last retention.ms are in the topic; everything
older is deleted. So you could do something like set retention.ms to 10
seconds and have the coordinator simply assume that after 10 seconds the
data has been deleted (to be honest, I'm not aware of a way to check
whether a given message was deleted - other than re-reading the topic from
the beginning). Naturally, this solution carries the requirement that the
downstream system processes the messages within the same amount of time, so
that no messages are lost. This is something that definitely requires
fine-tuning.
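
To make the timing concrete, here is a rough Python sketch of how the
coordinator could estimate when to assume the data is gone. It is only an
estimate under assumptions that are not part of the setup described above:
that time-based retention is evaluated per log segment against the newest
timestamp in that segment, and that the broker checks for expired segments
every log.retention.check.interval.ms (5 minutes by default). The names
RetentionSettings and assume_deleted_at are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetentionSettings:
    """Hypothetical holder for the two settings that drive the estimate."""
    retention_ms: int       # topic-level retention.ms
    check_interval_ms: int  # broker-level log.retention.check.interval.ms


def assume_deleted_at(newest_timestamp_in_segment_ms: int,
                      settings: RetentionSettings) -> int:
    """Estimate the time (epoch ms) after which the segment should be deleted.

    Assumption: a segment becomes eligible once its newest message is older
    than retention.ms, and the broker notices this at its next periodic check.
    """
    return (newest_timestamp_in_segment_ms
            + settings.retention_ms
            + settings.check_interval_ms)


# retention.ms = 10 seconds, default check interval of 5 minutes
settings = RetentionSettings(retention_ms=10_000, check_interval_ms=300_000)
print(assume_deleted_at(0, settings))  # 310000
```

With these numbers the check interval dominates, so a 10 second
retention.ms alone would not justify a 10 second assumption unless the
check interval on the brokers is lowered as well.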

For retention in compacted topics: Kafka will not remove the messages for a
key on its own - to trigger the deletion, you need to publish a tombstone
record (a message with the same key and a null value). So even with low
activity there must be a new message for the deletion to occur (triggering
compaction). My understanding of the documentation is that with a short
segment.ms configuration (something like 1 second), you should be able to
assume that the compaction has occurred, so only the tombstone record
remains in the topic. In this case the coordinator can also assume that
segment.ms after publishing the tombstone record, the data is gone from
Kafka.
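
The tombstone reasoning can be illustrated with a toy model in Python. This
is not Kafka code, just a sketch of the semantics as I understand them: a
record is a (key, value) pair, a value of None stands for a tombstone, the
last segment in the list plays the role of the active segment, and the
cleaner only looks at the closed segments. The compact function and the
sample records are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[str]]  # (key, value); value None marks a tombstone


def compact(segments: List[List[Record]]) -> List[List[Record]]:
    """Toy model of log compaction: keep only the newest record per key.

    The last segment is treated as the active one: it is left untouched and
    its records are not used to clean the older segments.
    """
    if not segments:
        return []
    *closed, active = segments
    # Offset of the newest record per key across the closed segments.
    latest: Dict[str, int] = {}
    offset = 0
    for segment in closed:
        for key, _ in segment:
            latest[key] = offset
            offset += 1
    # Rewrite the closed segments, dropping records that were superseded.
    cleaned: List[List[Record]] = []
    offset = 0
    for segment in closed:
        kept = []
        for record in segment:
            if latest[record[0]] == offset:
                kept.append(record)
            offset += 1
        cleaned.append(kept)
    return cleaned + [active]


pii = ("customer-42", "Jane Doe, jane@example.org")
tombstone = ("customer-42", None)

# Tombstone still sits in the active segment: the PII record survives.
before_roll = compact([[pii], [tombstone]])
# Segment holding the tombstone has been rolled: only the tombstone is left.
after_roll = compact([[pii], [tombstone], []])
```

In this model the PII only disappears once the segment containing the
tombstone is no longer the active one, which is why the segment roll
(segment.ms) matters for the coordinator's assumption.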

Kind regards,
Sandor


On Wed, 19 Aug 2020 at 19:49, Apolloni, Christian <
christian.apoll...@baloise.ch> wrote:

> Hi Sandor, thanks again for your reply.
>
> > If you have a non-log-compacted topic, after `retention.ms` the message
> > (along with the PII) gets deleted from the Kafka message store without any
> > further action, which should satisfy GDPR requirements:
> > - you are handling PII in Kafka for a limited amount of time
> > - you are processing the data for the given purpose it was given
> > - the data will automatically be deleted without any further steps
> > If you have a downstream system, you should also be able to publish a
> > message through Kafka so that the downstream system executes its delete
> > processes - if required. We implemented a similar process where we
> > published an AnonymizeOrder event, which instructed downstream systems to
> > anonymize the order data in their own data store.
>
> Our problem is, the data could have been published shortly before the
> system receives a delete order from the "coordinator". This is because the
> data might have been mutated and the update needs to be propagated to
> consumer systems. If we go with a retention-period of days we would only be
> able to proceed with subsequent systems in the coordinated chain with too
> much of a delay. Going with an even shorter retention would be problematic.
>
> > If you have a log-compacted topic:
> > - yes, I have the same understanding as you have on the active segment.
> > - You can set the segment.ms
> > <https://kafka.apache.org/documentation/#segment.ms> property to force the
> > compaction to occur within an expected timeframe.
> >
> > In general what I understand is true in both cases that Kafka gives you
> > good enough guarantees to either remove the old message after retention.ms
> > milliseconds or execute the topic compaction after segment.ms time that it
> > is unnecessary to try to figure out more specifically in what exact moment
> > the data is deleted. Setting these configurations should give you enough
> > guarantee that the data removal will occur - if not, that imo should be
> > considered a bug and reported back to the project.
>
> We investigated the max.compaction.lag.ms parameter which was introduced
> in KIP-354 and from our understanding the intent is exactly what we'd like
> to accomplish, but unless we missed something we have noticed new segments
> are rolled only if new messages are appended. If the topic has very low
> activity it can be that no new message is appended and the segment is left
> active indefinitely. This means the cleaning for that segment might remain
> also indefinitely stalled. We are unsure whether our understanding is
> correct and whether it's a bug or not.
>
> In general, I think part of the issue is that the system receives the
> delete order at the time that it has to be performed: we don't deal with
> the processing of the required waiting periods, that's what happens in the
> "coordinator system". The system with the data to be deleted receives the
> order and has to perform the deletion immediately.
>
> Kind regards,
>
>  --
>  Christian Apolloni
