Hi, Richard,

Thanks for drafting the KIP. A few comments below.

1. We need to provide a better motivation for the KIP. The goal of the
KIP is not to reorganize the checkpoint for log cleaning; that is just
an implementation detail. I was thinking that we could add something
like the following in the Motivation/Problem section.

"The idea of the configuration delete.retention.ms for compacted topics is
to prevent an application that has read a key to not see a subsequent
deletion of the key because it's physically removed too early. To solve
this problem, from the latest possible time (deleteHorizonMs) that an
application could have read a non tombstone key before a tombstone, we
preserve that tombstone for at least delete.retention.ms and require the
application to complete the reading of the tombstone by then.

deleteHorizonMs is no later than the time when the cleaner has cleaned up
to the tombstone. After that time, no application can read a non-tombstone
key before the tombstone because they have all been cleaned away through
compaction. Since currently we don't explicitly store the time when a round
of cleaning completes, deleteHorizonMs is estimated by the last modified
time of the segment containing firstDirtyOffset. When merging multiple log
segments into a single one, the last modified time is inherited from the
last merged segment. So the last modified time of the newly merged segment
is actually not an accurate estimate of deleteHorizonMs. It could be
arbitrarily before (KAFKA-4545 <https://issues.apache.org/jira/browse/>) or
after (KAFKA-8522 <https://issues.apache.org/jira/browse/KAFKA-8522>). The
former causes the tombstone to be deleted too early, which can cause an
application to miss the deletion of a key. The latter causes the tombstone
to be retained longer than needed and potentially forever."

We probably want to change the title of the KIP accordingly.

2. The proposed implementation in the KIP is to remember
firstDirtyOffset and the corresponding cleaning time in a per-partition
checkpoint file and then use them to estimate deleteHorizonMs. It would
be useful to document the format of the new checkpoint file and how it
will be used during cleaning. Some examples will be helpful; a rough
sketch follows to seed the discussion.
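
For instance (purely illustrative; the file layout and all names below
are hypothetical, to be defined by the KIP), a per-partition checkpoint
could hold a version line, the firstDirtyOffset at the start of the
last cleaning, and the wall-clock time at which that cleaning started.
In Java:

  // Hypothetical per-partition checkpoint (names invented for
  // discussion):
  //   line 1: version, e.g. "0"
  //   line 2: firstDirtyOffset at the start of the last cleaning
  //   line 3: wall-clock time (ms) when that cleaning started
  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.List;

  final class PartitionCleanerCheckpoint {
      final long firstDirtyOffset;
      final long cleanStartTimeMs;

      PartitionCleanerCheckpoint(long firstDirtyOffset, long cleanStartTimeMs) {
          this.firstDirtyOffset = firstDirtyOffset;
          this.cleanStartTimeMs = cleanStartTimeMs;
      }

      static PartitionCleanerCheckpoint read(Path file) throws IOException {
          List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
          if (!lines.get(0).trim().equals("0"))
              throw new IOException("unsupported version: " + lines.get(0));
          return new PartitionCleanerCheckpoint(
              Long.parseLong(lines.get(1).trim()),
              Long.parseLong(lines.get(2).trim()));
      }
  }

During cleaning, a tombstone at an offset <= firstDirtyOffset would
then take its deleteHorizonMs from cleanStartTimeMs rather than from
the last modified time of the segment.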

3. Thinking about this more, there is another way to solve this
problem. We could write the deleteHorizonMs for each tombstone as an
internal header field of the record (e.g., __deleteHorizonMs). That
timestamp could be the starting time of the log cleaner when the
tombstone's offset is <= firstDirtyOffset. We could use this timestamp
to determine whether the tombstone should be removed in subsequent
rounds of cleaning. This way, we can still keep the current per-disk
checkpoint file, which is more efficient. Personally, I think this
approach may be better; a rough sketch of the retention logic follows.
Could you document this approach in the wiki as well so that we can
discuss which one to pick?
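
To make the comparison concrete, here is a sketch of the retention
decision under the header approach (the header name, the method, and
its parameters are all hypothetical, not existing cleaner APIs):

  // Sketch of the header-based approach; every name is illustrative.
  import java.nio.ByteBuffer;
  import java.util.Map;

  final class TombstoneRetention {
      static final String DELETE_HORIZON_HEADER = "__deleteHorizonMs";

      // Called as the cleaner copies records into the cleaned segment.
      // headers is a simplified mutable view of the record's headers.
      // Returns true if the tombstone should be kept.
      static boolean shouldRetainTombstone(Map<String, ByteBuffer> headers,
                                           long recordOffset,
                                           long firstDirtyOffset,
                                           long cleanerStartMs,
                                           long deleteRetentionMs,
                                           long nowMs) {
          ByteBuffer horizon = headers.get(DELETE_HORIZON_HEADER);
          if (horizon == null) {
              // First round that reaches this tombstone: stamp the horizon
              // with the cleaner's start time and keep the tombstone.
              if (recordOffset <= firstDirtyOffset)
                  headers.put(DELETE_HORIZON_HEADER,
                              ByteBuffer.allocate(8).putLong(0, cleanerStartMs));
              return true;
          }
          // In later rounds, drop the tombstone once readers have had
          // delete.retention.ms since the horizon to observe it.
          return nowMs < horizon.getLong(0) + deleteRetentionMs;
      }
  }

A nice property is that the horizon travels with the record itself, so
no new per-partition checkpoint state is needed.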


Jun


On Sun, Sep 1, 2019 at 7:45 PM Richard Yu <yohan.richard...@gmail.com>
wrote:

> Hi all,
>
> A KIP has been written that proposes to upgrade the checkpoint file
> system in the log cleaner.
> If anybody wishes to comment, feel free to do so. :)
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-515%3A+Reorganize+checkpoint+file+system+in+log+cleaner+to+be+per+partition
> Above is the link for reference.
>
> Cheers,
> Richard
>
