Dong,

Thanks for the comment.

There are two retention policies: log compaction and time-based retention.

Log compaction:

We have use cases that keep a topic with infinite retention (compaction
only).
GDPR cares about deletion of PII (personally identifiable information) data.
Since Kafka doesn't know which records contain PII, it relies on the upper
layer to delete those records.
For those infinite-retention use cases, Kafka needs to provide a way to
enforce compaction on time. This is what we try to address in this KIP.

Time-based retention:

There are also use cases where users of Kafka want to expire all of their
data.
In those cases, they can use time-based retention on their topics.


Regarding your first question: if a user wants to delete a key in a
log-compacted topic, the user has to send a deletion (a record with a null
value) using the same key.
With this KIP, Kafka only makes sure the deletion happens within a certain
time period (like 2 days/7 days).
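
To illustrate what "send a deletion using the same key" means for a
compacted topic, here is a toy in-memory model. It is only a sketch, not
Kafka's log cleaner: the `compact` function and the sample keys are
hypothetical, a `None` value stands in for the deletion marker, and the model
ignores the period for which a real broker keeps the marker around before
dropping it.

```python
from typing import List, Optional, Tuple

# (key, value); a value of None models a deletion marker for that key.
Record = Tuple[str, Optional[str]]


def compact(log: List[Record]) -> List[Record]:
    """Keep the latest record per key; drop keys whose latest record is a deletion."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    return [(key, value) for _, key, value in survivors]


log = [
    ("user-1", "alice@example.com"),
    ("user-2", "bob@example.com"),
    ("user-1", None),  # deletion request for user-1
]
compacted = compact(log)
```

Until a compaction pass actually runs, the original `user-1` record is still
in the log, which is why the time bound on compaction matters.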

Regarding your second question: in most cases, we might want to delete all
duplicated keys at the same time.
Compaction might be more efficient since we need to scan the log and find
all duplicates. However, the expected use case is to set the time-based
compaction interval on the order of days, and larger than the "min
compaction lag". We don't want log compaction to happen frequently since
it is expensive. The purpose is to help topics with a low production rate
get compacted on time. For a topic with a "normal" incoming message rate,
the "min dirty ratio" might have triggered compaction before this
time-based compaction policy takes effect.
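
The interaction between the "min dirty ratio" and the time-based interval can
be sketched as a simple decision function. This is an illustration under
assumptions, not the cleaner's actual code: the function name, parameter
names, default values, and the exact comparisons are all made up for the
example.

```python
DAY_MS = 24 * 60 * 60 * 1000


def should_compact(dirty_ratio, oldest_dirty_age_ms, *,
                   min_dirty_ratio=0.5,
                   min_compaction_lag_ms=0,
                   compaction_interval_ms=2 * DAY_MS):
    """Decide whether a partition is due for compaction (hypothetical model)."""
    # The time-based interval is expected to be larger than the min compaction lag.
    if compaction_interval_ms <= min_compaction_lag_ms:
        raise ValueError("compaction interval must exceed min compaction lag")
    # Simplification: nothing is eligible while the oldest dirty record is
    # still younger than the min compaction lag.
    if oldest_dirty_age_ms < min_compaction_lag_ms:
        return False
    # Either the dirty ratio or the age of the oldest dirty record can trigger.
    return (dirty_ratio >= min_dirty_ratio
            or oldest_dirty_age_ms >= compaction_interval_ms)


# Low production rate: the ratio never gets there, the time bound does.
low_rate_due = should_compact(0.05, 3 * DAY_MS)
# "Normal" rate: the dirty ratio triggers long before the time bound.
busy_due = should_compact(0.6, 60 * 60 * 1000)
# Low rate, still within the interval: no compaction yet.
low_rate_not_due = should_compact(0.05, 1 * DAY_MS)
```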


Eno,

For your question: as I mentioned, we have long-retention use cases for
log-compacted topics, but we want to provide the ability to delete certain
PII records on time.
Kafka itself doesn't know whether a record contains sensitive information
and relies on the user for deletion.


On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin <lindon...@gmail.com> wrote:

> Hey Xiongqi,
>
> Thanks for the KIP. I have two questions regarding the use-case for meeting
> GDPR requirement.
>
> 1) If I recall correctly, one of the GDPR requirement is that we can not
> keep messages longer than e.g. 30 days in storage (e.g. Kafka). Say there
> exists a partition p0 which contains message1 with key1 and message2 with
> key2. And then user keeps producing messages with key=key2 to this
> partition. Since message1 with key1 is never overridden, sooner or later we
> will want to delete message1 and keep the latest message with key=key2. But
> currently it looks like log compact logic in Kafka will always put these
> messages in the same segment. Will this be an issue?
>
> 2) The current KIP intends to provide the capability to delete a given
> message in log compacted topic. Does such use-case also require Kafka to
> keep the messages produced before the given message? If yes, then we can
> probably just use AdminClient.deleteRecords() or time-based log retention
> to meet the use-case requirement. If no, do you know what is the GDPR's
> requirement on time-to-deletion after user explicitly requests the deletion
> (e.g. 1 hour, 1 day, 7 day)?
>
> Thanks,
> Dong
>
>
> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu <xiongq...@gmail.com> wrote:
>
> > Hi Eno,
> >
> > The GDPR request we are getting here at linkedin is if we get a request
> to
> > delete a record through a null key on a log compacted topic,
> > we want to delete the record via compaction in a given time period like 2
> > days (whatever is required by the policy).
> >
> > There might be other issues (such as orphan log segments under certain
> > conditions)  that lead to GDPR problem but they are more like something
> we
> > need to fix anyway regardless of GDPR.
> >
> >
> > -- Xiongqi (Wesley) Wu
> >
> > On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska <eno.there...@gmail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > Thanks for the KIP. I'd like to see a more precise definition of what
> > part
> > > of GDPR you are targeting as well as some sort of verification that
> this
> > > KIP actually addresses the problem. Right now I find this a bit vague:
> > >
> > > "Ability to delete a log message through compaction in a timely manner
> > has
> > > become an important requirement in some use cases (e.g., GDPR)"
> > >
> > >
> > > Is there any guarantee that after this KIP the GDPR problem is solved
> or
> > do
> > > we need to do something else as well, e.g., more KIPs?
> > >
> > >
> > > Thanks
> > >
> > > Eno
> > >
> > >
> > >
> > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu <xiongq...@gmail.com>
> wrote:
> > >
> > > > Hi Kafka,
> > > >
> > > > This KIP tries to address GDPR concern to fulfill deletion request on
> > > time
> > > > through time-based log compaction on a compaction enabled topic:
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> > > >
> > > > Any feedback will be appreciated.
> > > >
> > > >
> > > > Xiongqi (Wesley) Wu
> > > >
> > >
> >
>



-- 
Xiongqi (Wesley) Wu
