Hi Guozhang,

It's a fair point. For control records, I think it's a non-issue since they are tiny and not batched, so the only case where this might matter is large batch deletions. The size difference itself is probably not a major issue, but it's worth calling out in the KIP the risk of exceeding the max message size. The code should probably treat that limit as more of a soft limit when cleaning; we have run into scenarios in the past where recompression has actually increased message size. We may also want the cleaner to be able to upconvert messages to the new format in the future.
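
To make the size concern discussed below a bit more concrete, here is a rough standalone sketch of zigzag/varint sizing (this is not Kafka's actual ByteUtils code; the class and method names are made up purely for illustration):

public class TimestampDeltaSize {

    // Zigzag-encode a signed value so that small magnitudes (positive or
    // negative) map to small unsigned values, the same scheme varints use.
    static long zigzag(long v) {
        return (v << 1) ^ (v >> 63);
    }

    // Bytes needed to varint-encode the zigzag value, 7 payload bits per byte.
    static int sizeOfVarlong(long v) {
        long z = zigzag(v);
        int bytes = 1;
        while ((z & ~0x7FL) != 0) {
            bytes++;
            z >>>= 7;
        }
        return bytes;
    }

    public static void main(String[] args) {
        long smallDelta = 500L;                      // ~0.5s within the batch
        long negativeDelta = -24L * 60 * 60 * 1000;  // one day behind the rewritten base timestamp
        System.out.println(sizeOfVarlong(smallDelta));    // 2 bytes
        System.out.println(sizeOfVarlong(negativeDelta)); // 4 bytes
    }
}

With the base timestamp rewritten to the delete horizon, a delta that was a small positive offset within the batch becomes a large negative value (a day behind in this example), so each record's delta grows by roughly two bytes here, which lines up with Guozhang's back-of-the-envelope estimate.
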
-Jason

On Thu, Oct 17, 2019 at 9:08 AM Guozhang Wang <wangg...@gmail.com> wrote:

> Here's my understanding: when log compaction kicks in, the system time at
> the moment would be larger than the message timestamp to be compacted, so
> the modification on the batch timestamp would practically be increasing
> its value, and hence the deltas for each inner message would be negative
> to maintain their actual timestamps. Depending on the time difference
> between the actual timestamp of the message and the time when log
> compaction happens, this negative delta can be large or small, since it
> depends not only on the cleaner thread wakeup frequency but also on the
> dirty ratio, etc.
>
> With varint encoding, the number of bytes needed to encode an int varies
> from 1 to 5; before compaction, the deltas should be relatively small
> positive values compared with the base timestamp, and hence most likely
> need only 1 or 2 bytes to encode. After compaction, the deltas could be
> relatively large negative values that take more bytes to encode. With a
> record batch of 512 records in practice, and supposing that after
> compaction each record takes 2 more bytes to encode its delta, that would
> be 1K more per batch. Usually it would not be too big of an issue with
> reasonably sized messages, but I just wanted to point this out as a
> potential regression.
>
>
> Guozhang
>
> On Wed, Oct 16, 2019 at 9:36 PM Richard Yu <yohan.richard...@gmail.com>
> wrote:
>
> > Hi Guozhang,
> >
> > Your understanding basically is on point.
> >
> > I haven't looked into the details of what happens if we change the base
> > timestamp and how it is calculated, so I'm not clear on how small (or
> > big) the delta is.
> > To be fair, would the delta size pose a big problem if it takes up more
> > bytes to encode?
> >
> > Cheers,
> > Richard
> >
> > On Wed, Oct 16, 2019 at 7:36 PM Guozhang Wang <wangg...@gmail.com>
> > wrote:
> >
> > > Hello Richard,
> > >
> > > Thanks for the KIP, I just have one clarification regarding "So the
> > > idea is to set the base timestamp to the delete horizon and adjust
> > > the deltas accordingly." My understanding is that during compaction,
> > > for each compacted new segment, we would set the base timestamp of
> > > each batch to the delete horizon, which is the "current system time
> > > that the cleaner has seen so far", and adjust the delta timestamps of
> > > each of the inner records of the batch (and practically the deltas
> > > will all be negative)?
> > >
> > > If that's the case, could we do some back-of-the-envelope calculation
> > > on what the smallest possible deltas could be? Note that since we use
> > > varints for the delta values of each record, the smaller the negative
> > > delta, the more bytes it takes to encode.
> > >
> > > Guozhang
> > >
> > > On Wed, Oct 16, 2019 at 6:48 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > +1. Thanks Richard.
> > > >
> > > > On Wed, Oct 16, 2019 at 10:04 AM Richard Yu <
> > > > yohan.richard...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Want to try to get this KIP wrapped up. So it would be great if
> > > > > we can get some votes.
> > > > >
> > > > > Cheers,
> > > > > Richard
> > > > >
> > > > > On Tue, Oct 15, 2019 at 12:58 PM Jun Rao <j...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Richard,
> > > > > >
> > > > > > Thanks for the updated KIP. +1 from me.
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, Oct 15, 2019 at 12:46 PM Richard Yu <
> > > > > > yohan.richard...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Jun,
> > > > > > >
> > > > > > > I've updated the link accordingly. :)
> > > > > > > Here is the updated KIP link:
> > > > > > >
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-534%3A+Retain+tombstones+and+transaction+markers+for+approximately+delete.retention.ms+milliseconds
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Richard
> > > > > > >
> > > > > > > On Mon, Oct 14, 2019 at 5:12 PM Jun Rao <j...@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi, Richard,
> > > > > > > >
> > > > > > > > Thanks for the KIP. Looks good to me overall. A few minor
> > > > > > > > comments below.
> > > > > > > >
> > > > > > > > 1. Could you change the title from "Retain tombstones" to
> > > > > > > > "Retain tombstones and transaction markers" to make it more
> > > > > > > > general?
> > > > > > > >
> > > > > > > > 2. Could you document which bit in the batch attributes will
> > > > > > > > be used for the new flag? The current format of the batch
> > > > > > > > attributes is the following.
> > > > > > > >
> > > > > > > > -------------------------------------------------------------------------------------------------
> > > > > > > > | Unused (6-15) | Control (5) | Transactional (4) | Timestamp Type (3) | Compression Type (0-2) |
> > > > > > > > -------------------------------------------------------------------------------------------------
> > > > > > > >
> > > > > > > > 3. Could you provide the reasons for the rejected proposals?
> > > > > > > > For proposal 1, one reason is that it doesn't cover the
> > > > > > > > transaction markers. For proposal 2, one reason is that the
> > > > > > > > internal record header could be exposed to the clients.
> > > > > > > >
> > > > > > > > Jun
> > > > > > > >
> > > > > > > > On Mon, Oct 14, 2019 at 4:42 PM Richard Yu <
> > > > > > > > yohan.richard...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > The discussion for KIP-534 seems to have concluded,
> > > > > > > > > so I wish to vote this in so that we can get it done.
> > > > > > > > > It's a small bug fix. :)
> > > > > > > > >
> > > > > > > > > Below is the KIP link:
> > > > > > > > >
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-534%3A+Retain+tombstones+for+approximately+delete.retention.ms+milliseconds
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Richard
> > > > > > > > >
> > >
> > > --
> > > -- Guozhang
> >
>
> --
> -- Guozhang
>