Hi,

I have a question about log compaction. LogCleaner's JavaDoc states that:

{quote}

A message with key K and offset O is obsolete if there exists a message with key K and offset O' such that O < O'.

{/quote}

That works fine if messages arrive in order, i.e. with timestamps assigned at log-append time (apart from possible clock synchronization problems during leader rebalance). But if a topic may contain late messages (because the producer explicitly assigns a timestamp to each message), then compacting purely by offset can cause a message with an older timestamp to be kept in the log in place of a newer one. Is this intentional? Would it be possible to relax this so that log compaction prefers the message's timestamp over its offset? What if the behavior of the LogCleaner were changed to something like this:

{quote}

A message with key K, timestamp T1 and offset O1 is obsolete if there exists a message with key K, timestamp T2 and offset O2 such that T1 < T2, or T1 = T2 and O1 < O2.

{/quote}
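
To make the difference concrete, here is a minimal sketch of both rules in Java. The class, the `Msg` type and the method names are made up for illustration and are not part of Kafka's LogCleaner; the sketch only compares two messages and ignores how the cleaner would actually track the latest timestamp per key.

```java
import java.util.Objects;

/** Hypothetical illustration only; this is not Kafka's LogCleaner code. */
final class CompactionRuleSketch {

    /** Minimal stand-in for a log record: key, timestamp and offset only. */
    static final class Msg {
        final String key;
        final long timestamp;
        final long offset;

        Msg(String key, long timestamp, long offset) {
            this.key = key;
            this.timestamp = timestamp;
            this.offset = offset;
        }
    }

    /** Rule quoted from the JavaDoc: same key, and the other message has a larger offset. */
    static boolean obsoleteByOffset(Msg m, Msg other) {
        return Objects.equals(m.key, other.key) && m.offset < other.offset;
    }

    /** Proposed rule: same key, larger timestamp wins, offset breaks timestamp ties. */
    static boolean obsoleteByTimestamp(Msg m, Msg other) {
        if (!Objects.equals(m.key, other.key)) {
            return false;
        }
        if (m.timestamp != other.timestamp) {
            return m.timestamp < other.timestamp;
        }
        return m.offset < other.offset;
    }

    public static void main(String[] args) {
        // A message with a newer timestamp that was appended first...
        Msg onTime = new Msg("K", 200L, 10L);
        // ...and a late message with an older timestamp appended after it.
        Msg late = new Msg("K", 100L, 11L);

        System.out.println(obsoleteByOffset(onTime, late));    // true: the newer data is dropped
        System.out.println(obsoleteByTimestamp(onTime, late)); // false: the newer data is kept
        System.out.println(obsoleteByTimestamp(late, onTime)); // true: the late message is dropped
    }
}
```

With the offset rule the late message (timestamp 100) survives compaction, while with the timestamp rule the message with timestamp 200 survives regardless of arrival order.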

I'm aware that this would be much more complicated (because the clock synchronization problem would have to be resolved), but this definition seems better aligned with the time characteristics of the data. Should I try to create a KIP, or has this already been discussed and considered an unwanted (or even impossible) feature?

Thanks for any comments,

 Jan
