Can you take a look at KIP-280: https://cwiki.apache.org/confluence/display/KAFKA/KIP-280%3A+Enhanced+log+compaction ?
On Mon, Aug 6, 2018 at 10:55 AM, Jan Lukavský <je...@seznam.cz> wrote:

> Hi,
>
> I have a question about log compaction. LogCleaner's JavaDoc states that:
>
> {quote}
>
> A message with key K and offset O is obsolete if there exists a message
> with key K and offset O' such that O < O'.
>
> {/quote}
>
> That works fine if messages arrive "in-order", i.e. with the timestamp
> assigned by log-append time (with some possible problems with clock
> synchronization during leader rebalance), but if a topic might contain
> messages that are late (because the producer explicitly assigns a timestamp
> to each message), then compacting purely by offset might cause a message
> with an older timestamp to be kept in the log in favor of a newer message.
> Is this intentional? Would it be possible to relax this so that log
> compaction would prefer the message's timestamp instead of its offset? What
> if the behavior of the LogCleaner were changed to something like this:
>
> {quote}
>
> A message with key K, timestamp T1 and offset O1 is obsolete if there
> exists a message with key K, timestamp T2 and offset O2' such that T1 < T2
> or T1 = T2 and O1 < O2'.
>
> {/quote}
>
> I'm aware that this would be much more complicated (because of the clock
> synchronization problem that would have to be resolved), but this
> definition seems to be more aligned with the time characteristics of the
> data. Should I try to create a KIP, or was this already discussed and
> considered an unwanted (or even impossible) feature?
>
> Thanks for any comments,
>
> Jan
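
To make the difference between the two obsolescence rules quoted above concrete, here is a minimal in-memory sketch. It is not the LogCleaner implementation; the `Record` type and the `compact_by_offset` / `compact_by_timestamp` helpers are hypothetical names used only for illustration, and the log is assumed to fit in a plain list.

```python
from typing import Dict, List, NamedTuple


class Record(NamedTuple):
    """Hypothetical stand-in for a log message (not a Kafka class)."""
    key: str
    timestamp: int
    offset: int
    value: str


def compact_by_offset(log: List[Record]) -> List[Record]:
    """Current rule: for each key, keep only the record with the highest offset."""
    latest: Dict[str, Record] = {}
    for rec in log:
        cur = latest.get(rec.key)
        if cur is None or rec.offset > cur.offset:
            latest[rec.key] = rec
    return sorted(latest.values(), key=lambda r: r.offset)


def compact_by_timestamp(log: List[Record]) -> List[Record]:
    """Proposed rule: keep the record with the highest timestamp per key,
    breaking ties between equal timestamps by the higher offset."""
    latest: Dict[str, Record] = {}
    for rec in log:
        cur = latest.get(rec.key)
        if cur is None or (rec.timestamp, rec.offset) > (cur.timestamp, cur.offset):
            latest[rec.key] = rec
    return sorted(latest.values(), key=lambda r: r.offset)


# A late message: the record appended second carries an older,
# producer-assigned timestamp.
log = [
    Record("k", timestamp=200, offset=0, value="newer"),
    Record("k", timestamp=100, offset=1, value="late, older"),
]

print(compact_by_offset(log))
print(compact_by_timestamp(log))
```

With this input, the offset-based rule keeps the late record at offset 1, while the timestamp-based rule keeps the record at offset 0 with the newer timestamp, which is the discrepancy the question describes.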