Hi Richard,

Just reposting my comment from the JIRA:

The underlying problem here also impacts the cleaning of transaction
markers. We use the same delete horizon in order to tell when it is safe to
remove the marker. If all the data from a transaction has been cleaned and
enough time has elapsed past the delete horizon, then the marker is
eligible for deletion.

However, I don't think the same approach that we're proposing to fix the
problem for tombstones will work for transaction markers. What we need to track
is the timestamp when all the records from a transaction have been removed.
That is when we start the timer for deletion. But this would be different
for every transaction and there is no guarantee that earlier transactions
will be eligible for deletion before later ones. It all depends on the keys
written in the transaction. I don't see an obvious way to solve this
problem without some record-level bookkeeping, but I might be missing
something.
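
To make the bookkeeping concrete, here is a rough sketch (in Java, with
hypothetical names -- this is not actual cleaner code) of the kind of
per-transaction state that would seem to be required:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: track, per transaction, when its last data
// record was compacted away, since that is the point at which the
// marker's deletion timer would have to start.
class MarkerRetentionTracker {
    private final Map<Long, Integer> liveRecords = new HashMap<>();  // producerId -> data records left
    private final Map<Long, Long> emptySinceMs = new HashMap<>();    // producerId -> time emptied

    // Called when the cleaner first sees the transaction's records.
    void onTransactionIndexed(long producerId, int recordCount) {
        liveRecords.put(producerId, recordCount);
    }

    // Called each time compaction removes one of the transaction's records.
    void onRecordCleaned(long producerId, long nowMs) {
        int remaining = liveRecords.merge(producerId, -1, Integer::sum);
        if (remaining == 0)
            emptySinceMs.put(producerId, nowMs);  // the per-transaction timer starts here
    }

    // The marker is safe to drop only after delete.retention.ms has elapsed
    // since its transaction became empty: a different horizon per transaction.
    boolean markerEligible(long producerId, long nowMs, long deleteRetentionMs) {
        Long since = emptySinceMs.get(producerId);
        return since != null && nowMs - since >= deleteRetentionMs;
    }
}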

Thanks,
Jason

On Mon, Sep 9, 2019 at 7:21 PM Richard Yu <yohan.richard...@gmail.com>
wrote:

> Hi Jun,
>
> Thanks for chipping in. :)
>
> The description you provided is pretty apt in describing the motivation of
> the KIP, so I will add it. I've made some changes to the KIP and outlined
> the basic approaches of what we have so far (basically changing the
> checkpoint file organization or incorporating an extra internal header
> field for a record). I will expand on them shortly.
>
> Any comments are appreciated!
>
> Cheers,
> Richard
>
> On Mon, Sep 9, 2019 at 3:10 PM Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Richard,
> >
> > Thanks for drafting the KIP. A few comments below.
> >
> > 1. We need to provide a better motivation for the KIP. The goal of the
> > KIP is not to reorganize the checkpoint for log cleaning. It's just an
> > implementation detail. I was thinking that we could add something like
> > the following in the Motivation/Problem section.
> >
> > "The idea of the configuration delete.retention.ms for compacted topics
> is
> > to prevent an application that has read a key to not see a subsequent
> > deletion of the key because it's physically removed too early. To solve
> > this problem, from the latest possible time (deleteHorizonMs) that an
> > application could have read a non tombstone key before a tombstone, we
> > preserve that tombstone for at least delete.retention.ms and require the
> > application to complete the reading of the tombstone by then.
> >
> > deleteHorizonMs is no later than the time when the cleaner has cleaned
> > up to the tombstone. After that time, no application can read a
> > non-tombstone key before the tombstone because they have all been
> > cleaned away through compaction. Since currently we don't explicitly
> > store the time when a round of cleaning completes, deleteHorizonMs is
> > estimated by the last modified time of the segment containing
> > firstDirtyOffset. When merging multiple log segments into a single one,
> > the last modified time is inherited from the last merged segment. So the
> > last modified time of the newly merged segment is actually not an
> > accurate estimate of deleteHorizonMs. It could be arbitrarily before
> > (KAFKA-4545 <https://issues.apache.org/jira/browse/KAFKA-4545>) or after
> > (KAFKA-8522 <https://issues.apache.org/jira/browse/KAFKA-8522>). The
> > former causes the tombstone to be deleted too early, which can cause an
> > application to miss the deletion of a key. The latter causes the
> > tombstone to be retained longer than needed and potentially forever."
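> >
> > As a simplified illustration of the check this estimate feeds (a rough
> > sketch in Java, not the actual cleaner code):
> >
> > static boolean tombstoneStillRetained(long segmentLastModifiedMs, // ~ deleteHorizonMs
> >                                       long deleteRetentionMs,
> >                                       long nowMs) {
> >     // The horizon is approximated by the last-modified time of the
> >     // segment containing firstDirtyOffset, which drifts once segments
> >     // are merged -- hence KAFKA-4545 and KAFKA-8522.
> >     return nowMs < segmentLastModifiedMs + deleteRetentionMs;
> > }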
> >
> > We probably want to change the title of the KIP accordingly.
> >
> > 2. The proposed implementation of the KIP is to remember the
> > firstDirtyOffset and the corresponding cleaning time in a checkpoint
> > file per partition and then use them to estimate deleteHorizonMs. It
> > would be useful to document the format of the new checkpoint file and
> > how it will be used during cleaning. Some examples will be helpful.
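> >
> > For instance, an entry in such a checkpoint file might look something
> > like this (a purely hypothetical layout, just to show the fields):
> >
> >   my-topic 0 42 1568072461000  <- topic partition firstDirtyOffset cleanTimeMs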
> >
> > 3. Thinking about this more, there is another way to solve this
> > problem. We could write the deleteHorizonMs for each tombstone as an
> > internal header field of the record (e.g., __deleteHorizonMs). That
> > timestamp could be the starting time of the log cleaner when the
> > tombstone's offset is <= firstDirtyOffset. We could use this timestamp
> > to determine whether the tombstone should be removed in subsequent
> > rounds of cleaning. This way, we can still keep the current per-disk
> > checkpoint file, which is more efficient. Personally, I think this
> > approach may be better. Could you document this approach in the wiki as
> > well so that we can discuss which one to pick?
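> >
> > To make that concrete, a rough sketch of the header handling (this uses
> > the public header classes purely for illustration; the broker-side
> > record format code would differ, and the removal rule here is just one
> > plausible reading):
> >
> > import java.nio.ByteBuffer;
> > import org.apache.kafka.common.header.Header;
> > import org.apache.kafka.common.header.internals.RecordHeader;
> >
> > class DeleteHorizonHeader {
> >     static final String KEY = "__deleteHorizonMs";
> >
> >     // Stamp the tombstone with the horizon on the first cleaning pass.
> >     static Header create(long deleteHorizonMs) {
> >         return new RecordHeader(KEY,
> >             ByteBuffer.allocate(8).putLong(deleteHorizonMs).array());
> >     }
> >
> >     // On later passes, drop the tombstone once delete.retention.ms has
> >     // elapsed past the stamped horizon.
> >     static boolean tombstoneRemovable(Header header, long nowMs,
> >                                       long deleteRetentionMs) {
> >         long horizonMs = ByteBuffer.wrap(header.value()).getLong();
> >         return nowMs >= horizonMs + deleteRetentionMs;
> >     }
> > }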
> >
> >
> > Jun
> >
> >
> > On Sun, Sep 1, 2019 at 7:45 PM Richard Yu <yohan.richard...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > A KIP has been written that proposes to upgrade the checkpoint file
> > > system in the log cleaner.
> > > If anybody wishes to comment, feel free to do so. :)
> > >
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-515%3A+Reorganize+checkpoint+file+system+in+log+cleaner+to+be+per+partition
> > > Above is the link for reference.
> > >
> > > Cheers,
> > > Richard
> > >
> >
>
