Hi,
I have a question about log compaction. LogCleaner's JavaDoc states that:
{quote}
A message with key K and offset O is obsolete if there exists a message
with key K and offset O' such that O < O'.
{/quote}
That works fine if messages arrive "in order", i.e. with timestamps
assigned at log-append time (with some possible problems with clock
synchronization during leader rebalance). But if a topic might contain
late messages (because the producer explicitly assigns a timestamp to
each message), then compacting purely by offset might cause a message
with an older timestamp to be kept in the log instead of a message with
a newer timestamp. Is this intentional? Would it be possible to relax
this so that log compaction prefers a message's timestamp over its
offset? What if the behavior of the LogCleaner were changed to something
like this:
{quote}
A message with key K, timestamp T1 and offset O1 is obsolete if there
exists a message with key K, timestamp T2 and offset O2 such that
T1 < T2, or T1 = T2 and O1 < O2.
{/quote}
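To make the difference concrete, here is a minimal Java sketch of both
rules as plain predicates. The names CompactionRules, Msg,
obsoleteByOffset and obsoleteByTimestamp are hypothetical and only for
illustration; this is not how the LogCleaner is implemented, and it
ignores the clock synchronization problem entirely.

```java
import java.util.Objects;

// Illustration only: hypothetical names, not part of Kafka.
final class CompactionRules {

    // Minimal view of a message: just the fields the two rules look at.
    static final class Msg {
        final String key;
        final long timestamp;
        final long offset;

        Msg(String key, long timestamp, long offset) {
            this.key = key;
            this.timestamp = timestamp;
            this.offset = offset;
        }
    }

    // Current rule from the JavaDoc: same key and a strictly higher offset exists.
    static boolean obsoleteByOffset(Msg m, Msg other) {
        return Objects.equals(m.key, other.key) && m.offset < other.offset;
    }

    // Proposed rule: compare timestamps first, fall back to offsets on a tie.
    static boolean obsoleteByTimestamp(Msg m, Msg other) {
        if (!Objects.equals(m.key, other.key)) {
            return false;
        }
        if (m.timestamp != other.timestamp) {
            return m.timestamp < other.timestamp;
        }
        return m.offset < other.offset;
    }
}
```

For example, take a message with key "k", timestamp 200 and offset 10,
followed by a late message with key "k", timestamp 100 and offset 11.
Under the offset rule the first message is obsolete and the late one
survives; under the timestamp rule it is the other way around.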
I'm aware that this would be much more complicated (because of the clock
synchronization problem that would have to be resolved), but this
definition seems to be more aligned with the time characteristics of the
data. Should I try to create a KIP, or has this already been discussed
and considered an unwanted (or even impossible) feature?
Thanks for any comments,
Jan