To clarify a bit on 1): whether there's an external storage/DB isn't
relevant here.
Compacted topics allow a tombstone record to be sent (a null value for a
key), which currently results in old values for that key being deleted if
certain conditions are met.
There are existing controls to make sure the old values will stay around
for at least a minimum time, but no dedicated control to ensure the
tombstone will trigger deletion within a maximum time.
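
For reference, a tombstone is just a normal produce of the key with a null
value; a minimal sketch with the Java client (broker address, topic name,
and key here are placeholders, not from the KIP):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TombstoneExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer =
                 new KafkaProducer<>(props)) {
            // A null value marks key "user-42" for deletion once compaction
            // runs; this KIP is about bounding *when* that happens.
            producer.send(new ProducerRecord<>("pii-topic", "user-42", null));
            producer.flush();
        }
    }
}
```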

One popular reason a maximum time for deletion is desirable right now is
GDPR and PII. But we're not proposing any GDPR awareness in Kafka, just
the ability to guarantee a maximum time within which a tombstoned key will
be removed from the compacted topic.

On 2):
Huh, I thought it kept track of the first dirty segment and didn't
recompact older "clean" ones. But I haven't looked at the code or tested
that.

On Fri, Aug 17, 2018 at 10:57 AM xiongqi wu <xiongq...@gmail.com> wrote:

> 1) The owner of the data (in this sense, Kafka is not the owner of the
> data) should keep track of the lifecycle of the data in some external
> storage/DB. The owner determines when to delete the data and sends the
> delete request to Kafka. Kafka doesn't know about the content of the data
> but provides a means for deletion.
>
> 2) Each time compaction runs, it will start from the first segment (no
> matter whether it has been compacted before or not). The time estimation
> here is only used to determine whether we should run compaction on this
> log partition, so we only need to estimate timestamps for uncompacted
> segments.
>
> On Thu, Aug 16, 2018 at 5:35 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Xiongqi,
> >
> > Thanks for the update. I have two questions for the latest KIP.
> >
> > 1) The motivation section says that one use case is to delete PII
> > (Personally Identifiable Information) data within 7 days while keeping
> > non-PII indefinitely in compacted format. I suppose the use-case depends
> > on the application to determine when to delete those PII data. Could you
> > explain how the application can reliably determine the set of keys that
> > should be deleted? Is the application required to always re-read messages
> > from the topic after every restart and determine the keys to be deleted
> > by looking at message timestamps, or is the application supposed to
> > persist the key -> timestamp information in a separate persistent
> > storage system?
> >
> > 2) It is mentioned in the KIP that "we only need to estimate earliest
> > message timestamp for un-compacted log segments because the deletion
> > requests that belong to compacted segments have already been processed".
> > Not sure if that is correct. If a segment is compacted before the user
> > sends a message to delete a key in that segment, it seems that we still
> > need to ensure that the segment will be compacted again within the given
> > time after the deletion is requested, right?
> >
> > Thanks,
> > Dong
> >
> > On Thu, Aug 16, 2018 at 10:27 AM, xiongqi wu <xiongq...@gmail.com>
> wrote:
> >
> > > Hi Xiaohe,
> > >
> > > Quick note:
> > > 1) Use the minimum of segment.ms and max.compaction.lag.ms (see the
> > > sketch after this list).
> > >
> > > 2) I am not sure if I get your second question. First, we have jitter
> > > when we roll the active segment. Second, on each compaction, we
> > > compact up to what the offset map allows. Together those will not lead
> > > to a perfect compaction storm over time. In addition, I expect
> > > max.compaction.lag.ms to be set on the order of days.
> > >
> > > 3) I don't have access to the confluent community slack for now. I am
> > > reachable via Google Hangouts.
> > > To avoid double effort, here is my plan:
> > > a) Collect more feedback and feature requirements on the KIP.
> > > b) Wait until this KIP is approved.
> > > c) I will address any additional requirements in the implementation.
> > > (My current implementation only complies with what is described in the
> > > KIP now.)
> > > d) I can share the code with you and the community to see if you want
> > > to add anything.
> > > e) Submission through the committer process.
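> > >
> > > Going back to 1), a rough sketch of the roll condition I have in mind
> > > (class, method, and variable names here are hypothetical, not from the
> > > actual patch):
> > >
> > > ```java
> > > final class RollCheck {
> > >     // Roll the active segment once the earlier of segment.ms and
> > >     // max.compaction.lag.ms has elapsed, so a tombstone cannot sit
> > >     // in the active segment beyond the max compaction lag.
> > >     static boolean shouldRoll(long segmentCreatedMs, long nowMs,
> > >                               long segmentMs, long maxCompactionLagMs) {
> > >         long effectiveRollMs = Math.min(segmentMs, maxCompactionLagMs);
> > >         return nowMs - segmentCreatedMs >= effectiveRollMs;
> > >     }
> > > }
> > > ```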
> > >
> > >
> > > On Wed, Aug 15, 2018 at 11:42 PM, XIAOHE DONG <dannyriv...@gmail.com>
> > > wrote:
> > >
> > > > Hi Xiongqi
> > > >
> > > > Thanks for thinking about implementing this as well. :)
> > > >
> > > > I was thinking about using `segment.ms` to trigger the segment roll.
> > > > Also, its value bounds the largest time bias for record deletion. For
> > > > example, if `segment.ms` is 1 day and `max.compaction.ms` is 30 days,
> > > > the compaction may happen around day 31.
> > > >
> > > > For my curiosity, is there a way we can do some performance testing
> > > > for this, and are there any tools you can recommend? As you know,
> > > > previously cleanup happened by respecting the dirty ratio, but now it
> > > > may happen anytime once the max lag has passed for a message. I
> > > > wonder what would happen if clients send a huge amount of tombstone
> > > > records at the same time.
> > > >
> > > > I am looking forward to having a quick chat with you to avoid double
> > > > effort on this. I am in the confluent community slack during work
> > > > hours. My name is Xiaohe Dong. :)
> > > >
> > > > Rgds
> > > > Xiaohe Dong
> > > >
> > > >
> > > >
> > > > On 2018/08/16 01:22:22, xiongqi wu <xiongq...@gmail.com> wrote:
> > > > > Brett,
> > > > >
> > > > > Thank you for your comments.
> > > > > I was thinking that since we already have an immediate-compaction
> > > > > setting (setting the min dirty ratio to 0), I decided to use "0" as
> > > > > the disabled state. But I am OK to go with -1 (disabled) and 0
> > > > > (immediate) options.
> > > > >
> > > > > For the implementation, there are a few differences between mine
> > > > > and Xiaohe Dong's:
> > > > > 1) I used the estimated creation time of a log segment instead of
> > > > > the largest timestamp of a log to determine compaction eligibility,
> > > > > because a log segment might stay as the active segment for up to
> > > > > the "max compaction lag" (see the KIP for details).
> > > > > 2) I measure how many bytes we must clean to follow the "max
> > > > > compaction lag" rule, and use that to determine the order of
> > > > > compaction.
> > > > > 3) I force the active segment to roll to follow the "max compaction
> > > > > lag".
> > > > >
> > > > > I can share my code so we can coordinate.
> > > > >
> > > > > I haven't thought about a new API to force a compaction. What is
> > > > > the use case for this one?
> > > > >
> > > > >
> > > > > > On Wed, Aug 15, 2018 at 5:33 PM, Brett Rann
> > > > > > <br...@zendesk.com.invalid> wrote:
> > > > >
> > > > > > We've been looking into this too.
> > > > > >
> > > > > > Mailing list:
> > > > > > https://lists.apache.org/thread.html/ed7f6a6589f94e8c2a705553f364ef599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org%3E
> > > > > > jira wish: https://issues.apache.org/jira/browse/KAFKA-7137
> > > > > > confluent slack discussion:
> > > > > > https://confluentcommunity.slack.com/archives/C49R61XMM/p1530760121000039
> > > > > >
> > > > > > A person on my team has started on code so you might want to
> > > > > > coordinate:
> > > > > > https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0
> > > > > >
> > > > > > He's been working with Jason Gustafson and James Chen around the
> > > > changes.
> > > > > > You can ping him on confluent slack as Xiaohe Dong.
> > > > > >
> > > > > > It's great to know others are thinking on it as well.
> > > > > >
> > > > > > You've added the requirement to force a segment roll, which we
> > > > > > hadn't gotten to yet; that's great. I was content with it not
> > > > > > including the active segment.
> > > > > >
> > > > > > > Adding topic level configuration "max.compaction.lag.ms", and
> > > > > > > corresponding broker configuration
> > > > > > > "log.cleaner.max.compaction.lag.ms", which is set to 0
> > > > > > > (disabled) by default.
> > > > > >
> > > > > > Glancing at some other settings, the convention seems to be -1
> > > > > > for disabled (or infinite, which is more meaningful here). 0 to
> > > > > > me implies instant, a little quicker than 1.
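> > > > > >
> > > > > > For concreteness, if the proposed topic config lands as named in
> > > > > > the KIP, setting a 7-day max lag might look something like this
> > > > > > (the config name is still only a proposal, and the broker address
> > > > > > and topic name are placeholders):
> > > > > >
> > > > > > ```java
> > > > > > import java.util.Collections;
> > > > > > import java.util.Properties;
> > > > > > import org.apache.kafka.clients.admin.AdminClient;
> > > > > > import org.apache.kafka.clients.admin.Config;
> > > > > > import org.apache.kafka.clients.admin.ConfigEntry;
> > > > > > import org.apache.kafka.common.config.ConfigResource;
> > > > > >
> > > > > > public class SetMaxCompactionLag {
> > > > > >     public static void main(String[] args) throws Exception {
> > > > > >         Properties props = new Properties();
> > > > > >         props.put("bootstrap.servers", "localhost:9092");
> > > > > >         try (AdminClient admin = AdminClient.create(props)) {
> > > > > >             ConfigResource topic = new ConfigResource(
> > > > > >                 ConfigResource.Type.TOPIC, "pii-topic");
> > > > > >             // 7 days in milliseconds.
> > > > > >             ConfigEntry entry = new ConfigEntry(
> > > > > >                 "max.compaction.lag.ms",
> > > > > >                 String.valueOf(7L * 24 * 60 * 60 * 1000));
> > > > > >             // Note: alterConfigs replaces the topic's full
> > > > > >             // override set; fine for a sketch.
> > > > > >             admin.alterConfigs(Collections.singletonMap(
> > > > > >                 topic, new Config(Collections.singleton(entry))))
> > > > > >                 .all().get();
> > > > > >         }
> > > > > >     }
> > > > > > }
> > > > > > ```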
> > > > > >
> > > > > > We've been trying to think about a way to trigger compaction as
> > > > > > well through an API call, which would need to be flagged
> > > > > > somewhere (ZK admin/space?), but we're struggling to think how
> > > > > > that would be coordinated across brokers and partitions. Have you
> > > > > > given any thought to that?
> > > > > >
> > > > > > On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu <xiongq...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Eno, Dong,
> > > > > > >
> > > > > > > I have updated the KIP. We decided not to address the issue we
> > > > > > > might have when both compaction and time-based retention are
> > > > > > > enabled on a topic (see rejected alternatives, item 2). This
> > > > > > > KIP will only ensure the log can be compacted after a specified
> > > > > > > time interval.
> > > > > > >
> > > > > > > As suggested by Dong, we will also enforce that
> > > > > > > "max.compaction.lag.ms" is not less than
> > > > > > > "min.compaction.lag.ms".
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> > > > > > > (KIP-354: Time-based log compaction policy)
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu
> > > > > > > <xiongq...@gmail.com> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > Per discussion with Dong, he made a very good point: if
> > > > > > > > compaction and time-based retention are both enabled on a
> > > > > > > > topic, the compaction might prevent records from being
> > > > > > > > deleted on time. The reason is that when compacting multiple
> > > > > > > > segments into one single segment, the newly created segment
> > > > > > > > will have the same last-modified timestamp as the latest
> > > > > > > > original segment. We lose the timestamps of all original
> > > > > > > > segments except the last one. As a result, records might not
> > > > > > > > be deleted when they should be through time-based retention.
> > > > > > > >
> > > > > > > > With the current KIP proposal, if we want to ensure timely
> > > > > > > > deletion, we have the following configurations:
> > > > > > > > 1) enable time-based log compaction only: deletion is done
> > > > > > > > through overwriting (or tombstoning) the same key
> > > > > > > > 2) enable time-based log retention only: deletion is done
> > > > > > > > through time-based retention
> > > > > > > > 3) enable both log compaction and time-based retention:
> > > > > > > > deletion is not guaranteed
> > > > > > > >
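> > > > > > > > The three cases above map onto the existing cleanup.policy
> > > > > > > > topic config; a minimal sketch (topic name and partition/
> > > > > > > > replication counts are placeholders):
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > import java.util.HashMap;
> > > > > > > > import java.util.Map;
> > > > > > > > import org.apache.kafka.clients.admin.NewTopic;
> > > > > > > >
> > > > > > > > final class CleanupPolicyExamples {
> > > > > > > >     static NewTopic compactedTopic() {
> > > > > > > >         Map<String, String> configs = new HashMap<>();
> > > > > > > >         // Case 1: compaction only. "delete" would give case
> > > > > > > >         // 2; "compact,delete" gives case 3, where timely
> > > > > > > >         // deletion is the part that is not guaranteed.
> > > > > > > >         configs.put("cleanup.policy", "compact");
> > > > > > > >         return new NewTopic("pii-topic", 3, (short) 3)
> > > > > > > >                 .configs(configs);
> > > > > > > >     }
> > > > > > > > }
> > > > > > > > ```
> > > > > > > >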
> > > > > > > > Not sure if we have use case 3 and also want deletion to
> > > > > > > > happen on time. There are several options to address the
> > > > > > > > deletion issue when both compaction and retention are
> > > > > > > > enabled:
> > > > > > > > A) During log compaction, look into record timestamps to
> > > > > > > > delete expired records. This can be done in the compaction
> > > > > > > > logic itself or by using AdminClient.deleteRecords(), but it
> > > > > > > > assumes we have record timestamps.
> > > > > > > > B) Retain the last-modified time of the original segments
> > > > > > > > during log compaction. This requires extra metadata to record
> > > > > > > > the information, or not grouping multiple segments into one
> > > > > > > > during compaction.
> > > > > > > >
> > > > > > > > If we have use case 3 in general, I would prefer solution A
> > > > > > > > and rely on record timestamps.
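> > > > > > > >
> > > > > > > > For reference, the AdminClient.deleteRecords() call mentioned
> > > > > > > > in option A already exists today; a minimal sketch (broker
> > > > > > > > address, topic, and offset are placeholders, and the
> > > > > > > > application would compute the cutoff offset from its own
> > > > > > > > timestamp bookkeeping):
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > import java.util.Collections;
> > > > > > > > import java.util.Properties;
> > > > > > > > import org.apache.kafka.clients.admin.AdminClient;
> > > > > > > > import org.apache.kafka.clients.admin.RecordsToDelete;
> > > > > > > > import org.apache.kafka.common.TopicPartition;
> > > > > > > >
> > > > > > > > public class DeleteExpiredRecords {
> > > > > > > >     public static void main(String[] args) throws Exception {
> > > > > > > >         Properties props = new Properties();
> > > > > > > >         props.put("bootstrap.servers", "localhost:9092");
> > > > > > > >         try (AdminClient admin = AdminClient.create(props)) {
> > > > > > > >             TopicPartition tp =
> > > > > > > >                 new TopicPartition("pii-topic", 0);
> > > > > > > >             // Truncate everything before offset 12345, i.e.
> > > > > > > >             // the records the application knows have expired.
> > > > > > > >             admin.deleteRecords(Collections.singletonMap(
> > > > > > > >                 tp, RecordsToDelete.beforeOffset(12345L)))
> > > > > > > >                 .all().get();
> > > > > > > >         }
> > > > > > > >     }
> > > > > > > > }
> > > > > > > > ```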
> > > > > > > >
> > > > > > > >
> > > > > > > > Two questions:
> > > > > > > > Do we have use case 3? Is it a nice-to-have or a must-have?
> > > > > > > > If we have use case 3 and want to go with solution A, should
> > > > > > > > we introduce a new configuration to enforce deletion by
> > > > > > > > timestamp?
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu
> > > > > > > > <xiongq...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> Dong,
> > > > > > > >>
> > > > > > > >> Thanks for the comment.
> > > > > > > >>
> > > > > > > >> There are two retention policies: log compaction and
> > > > > > > >> time-based retention.
> > > > > > > >>
> > > > > > > >> Log compaction:
> > > > > > > >>
> > > > > > > >> We have use cases that keep infinite retention of a topic
> > > > > > > >> (compaction only). GDPR cares about deletion of PII
> > > > > > > >> (personally identifiable information) data.
> > > > > > > >> Since Kafka doesn't know which records contain PII, it
> > > > > > > >> relies on the upper layer to delete those records.
> > > > > > > >> For those infinite-retention use cases, Kafka needs to
> > > > > > > >> provide a way to enforce compaction on time. This is what we
> > > > > > > >> try to address in this KIP.
> > > > > > > >>
> > > > > > > >> Time-based retention:
> > > > > > > >>
> > > > > > > >> There are also use cases where users of Kafka might want to
> > > > > > > >> expire all their data.
> > > > > > > >> In those cases, they can use time-based retention on their
> > > > > > > >> topics.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Regarding your first question: if a user wants to delete a
> > > > > > > >> key in a log-compacted topic, the user has to send a
> > > > > > > >> deletion (a tombstone) using the same key.
> > > > > > > >> Kafka only makes sure the deletion will happen within a
> > > > > > > >> certain time period (like 2 days/7 days).
> > > > > > > >>
> > > > > > > >> Regarding your second question: in most cases, we might
> > > > > > > >> want to delete all duplicated keys at the same time.
> > > > > > > >> Compaction might be more efficient since we need to scan the
> > > > > > > >> log and find all duplicates. However, the expected use case
> > > > > > > >> is to set the time-based compaction interval on the order of
> > > > > > > >> days, and larger than the "min compaction lag". We don't
> > > > > > > >> want log compaction to happen frequently since it is
> > > > > > > >> expensive. The purpose is to help low-production-rate topics
> > > > > > > >> get compacted on time. For topics with a "normal" incoming
> > > > > > > >> message rate, the "min dirty ratio" might have triggered the
> > > > > > > >> compaction before this time-based compaction policy takes
> > > > > > > >> effect.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Eno,
> > > > > > > >>
> > > > > > > >> For your question: like I mentioned, we have long-term
> > > > > > > >> retention use cases for log-compacted topics, but we want to
> > > > > > > >> provide the ability to delete certain PII records on time.
> > > > > > > >> Kafka itself doesn't know whether a record contains
> > > > > > > >> sensitive information and relies on the user for deletion.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin
> > > > > > > >> <lindon...@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >>> Hey Xiongqi,
> > > > > > > >>>
> > > > > > > >>> Thanks for the KIP. I have two questions regarding the
> > use-case
> > > > for
> > > > > > > >>> meeting
> > > > > > > >>> GDPR requirement.
> > > > > > > >>>
> > > > > > > >>> 1) If I recall correctly, one of the GDPR requirements is
> > > > > > > >>> that we cannot keep messages longer than e.g. 30 days in
> > > > > > > >>> storage (e.g. Kafka). Say there exists a partition p0 which
> > > > > > > >>> contains message1 with key1 and message2 with key2. And
> > > > > > > >>> then the user keeps producing messages with key=key2 to
> > > > > > > >>> this partition. Since message1 with key1 is never
> > > > > > > >>> overridden, sooner or later we will want to delete message1
> > > > > > > >>> and keep the latest message with key=key2. But currently it
> > > > > > > >>> looks like the log compaction logic in Kafka will always
> > > > > > > >>> put these messages in the same segment. Will this be an
> > > > > > > >>> issue?
> > > > > > > >>>
> > > > > > > >>> 2) The current KIP intends to provide the capability to
> > > > > > > >>> delete a given message in a log compacted topic. Does such
> > > > > > > >>> a use-case also require Kafka to keep the messages produced
> > > > > > > >>> before the given message? If yes, then we can probably just
> > > > > > > >>> use AdminClient.deleteRecords() or time-based log retention
> > > > > > > >>> to meet the use-case requirement. If no, do you know what
> > > > > > > >>> the GDPR's requirement is on time-to-deletion after the
> > > > > > > >>> user explicitly requests the deletion (e.g. 1 hour, 1 day,
> > > > > > > >>> 7 days)?
> > > > > > > >>>
> > > > > > > >>> Thanks,
> > > > > > > >>> Dong
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu
> > > > > > > >>> <xiongq...@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>> > Hi Eno,
> > > > > > > >>> >
> > > > > > > >>> > The GDPR request we are getting here at LinkedIn is: if
> > > > > > > >>> > we get a request to delete a record through a tombstone
> > > > > > > >>> > (a null value for the key) on a log compacted topic,
> > > > > > > >>> > we want to delete the record via compaction in a given
> > > > > > > >>> > time period, like 2 days (whatever is required by the
> > > > > > > >>> > policy).
> > > > > > > >>> >
> > > > > > > >>> > There might be other issues (such as orphan log segments
> > > > > > > >>> > under certain conditions) that lead to GDPR problems, but
> > > > > > > >>> > they are more like something we need to fix anyway
> > > > > > > >>> > regardless of GDPR.
> > > > > > > >>> >
> > > > > > > >>> >
> > > > > > > >>> > -- Xiongqi (Wesley) Wu
> > > > > > > >>> >
> > > > > > > >>> > On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska
> > > > > > > >>> > <eno.there...@gmail.com> wrote:
> > > > > > > >>> >
> > > > > > > >>> > > Hello,
> > > > > > > >>> > >
> > > > > > > >>> > > Thanks for the KIP. I'd like to see a more precise
> > > > definition of
> > > > > > > what
> > > > > > > >>> > part
> > > > > > > >>> > > of GDPR you are targeting as well as some sort of
> > > > verification
> > > > > > that
> > > > > > > >>> this
> > > > > > > >>> > > KIP actually addresses the problem. Right now I find
> > this a
> > > > bit
> > > > > > > >>> vague:
> > > > > > > >>> > >
> > > > > > > >>> > > "Ability to delete a log message through compaction in
> a
> > > > timely
> > > > > > > >>> manner
> > > > > > > >>> > has
> > > > > > > >>> > > become an important requirement in some use cases
> (e.g.,
> > > > GDPR)"
> > > > > > > >>> > >
> > > > > > > >>> > >
> > > > > > > >>> > > Is there any guarantee that after this KIP the GDPR
> > problem
> > > > is
> > > > > > > >>> solved or
> > > > > > > >>> > do
> > > > > > > >>> > > we need to do something else as well, e.g., more KIPs?
> > > > > > > >>> > >
> > > > > > > >>> > >
> > > > > > > >>> > > Thanks
> > > > > > > >>> > >
> > > > > > > >>> > > Eno
> > > > > > > >>> > >
> > > > > > > >>> > >
> > > > > > > >>> > >
> > > > > > > >>> > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu
> > > > > > > >>> > > <xiongq...@gmail.com> wrote:
> > > > > > > >>> > >
> > > > > > > >>> > > > Hi Kafka,
> > > > > > > >>> > > >
> > > > > > > >>> > > > This KIP tries to address GDPR concerns by fulfilling
> > > > > > > >>> > > > deletion requests on time through time-based log
> > > > > > > >>> > > > compaction on a compaction-enabled topic:
> > > > > > > >>> > > >
> > > > > > > >>> > > >
> > > > > > > >>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> > > > > > > >>> > > >
> > > > > > > >>> > > > Any feedback will be appreciated.
> > > > > > > >>> > > >
> > > > > > > >>> > > >
> > > > > > > >>> > > > Xiongqi (Wesley) Wu
> > > > > > > >>> > > >
> > > > > > > >>> > >
> > > > > > > >>> >
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> Xiongqi (Wesley) Wu
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Xiongqi (Wesley) Wu
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Xiongqi (Wesley) Wu
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Brett Rann
> > > > > >
> > > > > > Senior DevOps Engineer
> > > > > >
> > > > > >
> > > > > > Zendesk International Ltd
> > > > > >
> > > > > > 395 Collins Street, Melbourne VIC 3000 Australia
> > > > > >
> > > > > > Mobile: +61 (0) 418 826 017
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Xiongqi (Wesley) Wu
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Xiongqi (Wesley) Wu
> > >
> >
>
>
>
> --
> Xiongqi (Wesley) Wu
>


-- 

Brett Rann

Senior DevOps Engineer


Zendesk International Ltd

395 Collins Street, Melbourne VIC 3000 Australia

Mobile: +61 (0) 418 826 017
