[ 
https://issues.apache.org/jira/browse/KAFKA-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608609#comment-17608609
 ] 

Nikolay Izhikov commented on KAFKA-13866:
-----------------------------------------

Hello, [~mjsax]. Could you please share your feedback on the KIP?

https://cwiki.apache.org/confluence/display/KAFKA/KIP-870%3A+Retention+policy+based+on+record+event+time

https://lists.apache.org/thread/9njcnjd231l3l7xv121os99m3f7gggb3

> Support more advanced time retention policies
> ---------------------------------------------
>
>                 Key: KAFKA-13866
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13866
>             Project: Kafka
>          Issue Type: Improvement
>          Components: config, core, log cleaner
>            Reporter: Matthias J. Sax
>            Assignee: Nikolay Izhikov
>            Priority: Major
>              Labels: needs-kip
>
> The time-based retention policy compares the record timestamp to broker
> wall-clock time. These semantics are questionable and also lead to issues for
> data reprocessing: if one wants to reprocess data older than the retention
> time, it's not possible, as the broker expires those records aggressively and
> the user needs to increase the retention time accordingly.
> Especially for Kafka Streams, we have seen many cases where users got bitten
> by the current behavior.
> It would be best if Kafka tracked _two_ timestamps per record: the
> record event-time (as the broker does currently), plus the log append-time
> (which is currently tracked only if the topic is configured with
> "append-time" tracking, but the issue is that it overwrites the
> producer-provided record event-time).
> Tracking both timestamps would allow setting a pure wall-clock time retention
> policy plus a pure event-time retention policy:
>  * Wall-clock time: keep (at least) the data X days after writing
>  * Event-time: keep (at max) X days' worth of event-time data
> Comparing wall-clock time to wall-clock time and event-time to event-time 
> provides much cleaner semantics. The idea is to combine both policies and 
> only expire data if both policies trigger.
> For the event-time policy, the broker would need to track "stream time" as
> the max event-timestamp it has seen per partition (similar to how Kafka
> Streams tracks "stream time" client side).
> Note the difference between "at least" and "at max" above: for the
> data-reprocessing case, the max-based event-time policy prevents the
> broker from keeping a huge history.
> The details of how wall-clock/event-time and min/max policies could be
> combined would be part of a KIP discussion. For example, it might also be
> useful to have the following policy: keep at least X days' worth of
> event-time history no matter how long the data has already been stored (i.e.,
> there would only be an event-time-based expiration but no wall-clock time
> expiration). It could also be combined with a wall-clock time expiration:
> delete data only after it's at least X days old and has been stored for at
> least Y days.
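
For illustration only, the combined policy from the quoted description could be sketched roughly as below. This is not Kafka code and not the KIP-870 design: the class `RetentionSketch`, its methods, and all parameter names are hypothetical. It assumes non-negative millisecond timestamps and that both the append-time and the event-time of a record are available.

```java
/**
 * Hypothetical sketch of the "expire only if both policies trigger" idea.
 * Names and structure are assumptions, not part of Kafka or KIP-870.
 */
final class RetentionSketch {
    // "Stream time": max event timestamp seen so far for one partition.
    // -1 means no record has been appended yet (timestamps assumed >= 0).
    private long streamTimeMs = -1L;

    /** Advance stream time on append; it never moves backwards. */
    void onAppend(long eventTimeMs) {
        streamTimeMs = Math.max(streamTimeMs, eventTimeMs);
    }

    long streamTimeMs() {
        return streamTimeMs;
    }

    /** A record may be deleted only if BOTH policies consider it expired. */
    boolean shouldExpire(long appendTimeMs, long eventTimeMs, long nowMs,
                         long wallClockRetentionMs, long eventTimeRetentionMs) {
        // Wall-clock policy: compare append-time to broker wall-clock time.
        boolean wallClockExpired = nowMs - appendTimeMs > wallClockRetentionMs;
        // Event-time policy: compare event-time to partition stream time.
        boolean eventTimeExpired = streamTimeMs - eventTimeMs > eventTimeRetentionMs;
        return wallClockExpired && eventTimeExpired;
    }
}
```

With both retention values set to 7 days and stream time at day 10, a record with event-time day 1 that was appended 8 days ago would be expired under this sketch, while a record with the same event-time that was appended just now (the reprocessing case) would be kept, because the wall-clock policy has not triggered yet.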



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
