[
https://issues.apache.org/jira/browse/KAFKA-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607648#comment-17607648
]
Nikolay Izhikov commented on KAFKA-13866:
-----------------------------------------
KIP-870 created -
https://cwiki.apache.org/confluence/display/KAFKA/KIP-870%3A+Retention+policy+based+on+record+event+time
discussion thread -
https://lists.apache.org/thread/9njcnjd231l3l7xv121os99m3f7gggb3
> Support more advanced time retention policies
> ---------------------------------------------
>
> Key: KAFKA-13866
> URL: https://issues.apache.org/jira/browse/KAFKA-13866
> Project: Kafka
> Issue Type: Improvement
> Components: config, core, log cleaner
> Reporter: Matthias J. Sax
> Assignee: Nikolay Izhikov
> Priority: Major
> Labels: needs-kip
>
> The time-based retention policy compares the record timestamp to broker
> wall-clock time. Those semantics are questionable and also lead to issues for
> data reprocessing: if one wants to re-process data that is older than the
> retention time, it's not possible, because the broker expires those records
> aggressively and users need to increase the retention time accordingly.
> Especially for Kafka Streams, we have seen many cases in which users got bit
> by the current behavior.
> It would be best if Kafka tracked _two_ timestamps per record: the record
> event-time (as the broker does currently), plus the log append-time (which is
> currently only tracked if the topic is configured with "append-time"
> tracking, but the issue is that it overwrites the producer-provided record
> event-time).
> Tracking both timestamps would allow setting a pure wall-clock time retention
> policy plus a pure event-time retention policy:
> * Wall-clock time: keep the data (at least) X days after writing
> * Event-time: keep (at most) X days' worth of event-time data
> Comparing wall-clock time to wall-clock time and event-time to event-time
> provides much cleaner semantics. The idea is to combine both policies and
> only expire data if both policies trigger (see the configuration sketch
> below).
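> For illustration only, a minimal sketch of how a topic owner might configure
> both policies once such a feature exists. The existing "retention.ms" key
> stands in here for the wall-clock policy, while "retention.event-time.ms" is
> an invented placeholder for the event-time policy; the real config names and
> semantics would be defined by the KIP, so this call would be rejected by
> today's brokers.
> {code:java}
> import java.time.Duration;
> import java.util.List;
> import java.util.Map;
> import java.util.Properties;
>
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AlterConfigOp;
> import org.apache.kafka.clients.admin.ConfigEntry;
> import org.apache.kafka.common.config.ConfigResource;
>
> public class RetentionConfigSketch {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "localhost:9092");
>
>         try (Admin admin = Admin.create(props)) {
>             ConfigResource topic =
>                 new ConfigResource(ConfigResource.Type.TOPIC, "clicks");
>
>             // Wall-clock policy: keep data (at least) 7 days after writing.
>             AlterConfigOp wallClock = new AlterConfigOp(
>                 new ConfigEntry("retention.ms",
>                     String.valueOf(Duration.ofDays(7).toMillis())),
>                 AlterConfigOp.OpType.SET);
>
>             // Event-time policy (hypothetical config name): keep (at most)
>             // 3 days' worth of event-time history.
>             AlterConfigOp eventTime = new AlterConfigOp(
>                 new ConfigEntry("retention.event-time.ms",
>                     String.valueOf(Duration.ofDays(3).toMillis())),
>                 AlterConfigOp.OpType.SET);
>
>             admin.incrementalAlterConfigs(
>                 Map.of(topic, List.of(wallClock, eventTime))).all().get();
>         }
>     }
> }
> {code}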
> For the event-time policy, the broker would need to track "stream time" as
> the max event timestamp it has seen per partition (similar to how Kafka
> Streams tracks "stream time" client side).
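> A minimal sketch of such broker-side tracking (class and method names are
> invented for illustration, not actual broker code): per partition, the broker
> only needs to keep the maximum event timestamp it has seen so far.
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> import org.apache.kafka.common.TopicPartition;
>
> // Sketch only: per-partition "stream time" = max record event timestamp seen.
> public class PartitionStreamTime {
>     private final Map<TopicPartition, Long> streamTime = new ConcurrentHashMap<>();
>
>     // Called for every record (or batch) appended to the log.
>     public void onAppend(TopicPartition tp, long recordEventTimestampMs) {
>         streamTime.merge(tp, recordEventTimestampMs, Math::max);
>     }
>
>     // Returns -1 if no record has been seen for this partition yet.
>     public long streamTimeMs(TopicPartition tp) {
>         return streamTime.getOrDefault(tp, -1L);
>     }
> }
> {code}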
> Note the difference between "at least" and "at most" above: the max-based
> event-time policy prevents the broker from keeping a huge history in the
> data-reprocessing case.
> The details of how wall-clock/event-time and min/max policies could be
> combined would be part of a KIP discussion. For example, it might also be
> useful to have the following policy: keep at least X days' worth of
> event-time history no matter how long the data has already been stored (i.e.,
> there would only be an event-time-based expiration but no wall-clock one). It
> could also be combined with a wall-clock time expiration: delete data only
> after it's at least X days old and has been stored for at least Y days.
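> To make the combination concrete, a sketch of the "expire only if both
> policies trigger" check (method and parameter names are invented for
> illustration; this is not actual log-cleaner code):
> {code:java}
> import java.time.Duration;
>
> // Sketch only: a segment is deletable when BOTH policies have triggered.
> public class CombinedRetentionCheck {
>
>     public static boolean isDeletable(long segmentMaxAppendTimeMs,
>                                       long segmentMaxEventTimeMs,
>                                       long partitionStreamTimeMs,
>                                       long nowMs,
>                                       Duration wallClockRetention,
>                                       Duration eventTimeRetention) {
>         // Wall-clock policy: the segment was written long enough ago.
>         boolean wallClockExpired =
>             nowMs - segmentMaxAppendTimeMs >= wallClockRetention.toMillis();
>
>         // Event-time policy: the segment's newest event timestamp lags the
>         // partition's "stream time" by more than the event-time retention.
>         boolean eventTimeExpired =
>             partitionStreamTimeMs - segmentMaxEventTimeMs >= eventTimeRetention.toMillis();
>
>         return wallClockExpired && eventTimeExpired;
>     }
> }
> {code}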
--
This message was sent by Atlassian Jira
(v8.20.10#820010)