[ https://issues.apache.org/jira/browse/KAFKA-7137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brett Rann updated KAFKA-7137:
------------------------------
Description:

Just spent some time wrapping my head around the inner workings of compaction and tombstoning, with a view to providing guarantees for deleting previous values of tombstoned keys from Kafka within a desired time.

There are a couple of good posts that touch on this:
https://www.confluent.io/blog/handling-gdpr-log-forget/
http://www.shayne.me/blog/2015/2015-06-25-everything-about-kafka-part-2/

Basically, log.cleaner.min.cleanable.ratio (or, per topic, min.cleanable.dirty.ratio) is hijacked to force aggressive compaction (by setting it to 0, or 0.000000001, depending on what you read), and along with segment.ms it can provide a timing guarantee that a tombstone will result in any other values for the key being deleted within a desired time.

But that sacrifices the utility of min.cleanable.dirty.ratio (and, to a lesser extent, control over segment sizes): any duplicate key plus a new segment roll will trigger compaction, when it might otherwise be preferable to allow a more generous dirty ratio for plain old duplicates. It would be useful to be able to trigger a compaction without losing the utility of the dirty.ratio setting.

The pure need here is to specify a minimum time for the log cleaner to run (or a maximum time where it doesn't run!) on a topic whose keys have been replaced by a tombstone message and are past the minimum retention time provided by min.compaction.lag.ms.

Something like a log.cleaner.max.delay.ms, plus an API to trigger compaction, with some nuances to be fleshed out.

Does this make sense, and does it sound like it's worth a KIP? I'd be happy to write something up.

In the meantime, this can be worked around with some duct tape (a sketch follows the list):
* make sure any values you want deleted by a tombstone have passed the minimum retention configs
* set the global log.cleaner.io.max.bytes.per.second to what you want for the compaction task
* set min.cleanable.dirty.ratio=0 for the topic
* set a small segment.ms on the topic
* wait for a new segment to roll (segment.ms elapsing plus a message coming in), then wait for compaction to kick in. GDPR met!
* undo the hacks
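For illustration, a minimal sketch of the topic-side half of that duct tape against Kafka's Java AdminClient. The topic name "my-compacted-topic" and the bootstrap address are placeholders; the broker-side log.cleaner.io.max.bytes.per.second throttle has to be set separately on the brokers. Note the legacy alterConfigs call replaces a topic's entire set of dynamic overrides, so in practice you would read the current config first and merge:

{code:java}
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;

public class ForceCompactionHack {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // placeholder bootstrap address
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(
                    ConfigResource.Type.TOPIC, "my-compacted-topic"); // placeholder

            // Hijack the dirty ratio and shrink segment.ms so the next
            // message rolls the active segment and everything is cleanable.
            Config hack = new Config(Arrays.asList(
                    new ConfigEntry("min.cleanable.dirty.ratio", "0"),
                    new ConfigEntry("segment.ms", "100")));
            admin.alterConfigs(Collections.singletonMap(topic, hack)).all().get();

            // ... produce (or wait for) one message so the segment rolls,
            // then wait for the log cleaner to pick the topic up ...

            // Undo the hacks: the legacy alterConfigs call replaces the full
            // dynamic override set, so an empty Config reverts the topic to
            // the broker defaults (merge back any overrides you want kept).
            Config undo = new Config(Collections.<ConfigEntry>emptyList());
            admin.alterConfigs(Collections.singletonMap(topic, undo)).all().get();
        }
    }
}
{code}

A segment.ms that small would churn tiny segments if left in place, which is exactly why the last step of the workaround is to undo the hacks.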
> ability to trigger compaction for tombstoning and GDPR
> ------------------------------------------------------
>
>            Key: KAFKA-7137
>            URL: https://issues.apache.org/jira/browse/KAFKA-7137
>        Project: Kafka
>     Issue Type: Wish
>       Reporter: Brett Rann
>       Priority: Minor

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)