[ https://issues.apache.org/jira/browse/KAFKA-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabien LD updated KAFKA-6872:
-----------------------------
    Priority: Minor  (was: Major)

> Doc for log.roll.* is wrong
> ---------------------------
>
>                 Key: KAFKA-6872
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6872
>             Project: Kafka
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 1.0.0
>            Reporter: Fabien LD
>            Priority: Minor
>
> For {{log.roll.ms}}, the doc says, for example:
> {quote}The maximum time before a new log segment is rolled out (in 
> milliseconds). If not set, the value in log.roll.hours is used
> {quote}
> In other parts (see 
> [https://kafka.apache.org/10/documentation.html#upgrade_10_1_breaking]), it 
> says:
> {quote}The log rolling time is no longer depending on log segment create 
> time. Instead it is now based on the timestamp in the messages. More 
> specifically. if the timestamp of the first message in the segment is T, the 
> log will be rolled out when a new message has a timestamp greater than or 
> equal to T + log.roll.ms
> {quote}
> which is wrong. More specifically, the wrong part is:
> {quote}if the timestamp of the +first+ message in the segment is T
> {quote}
> In reality, the correct statement is:
> {quote}if the timestamp of the +last+ message in the segment is T
> {quote}
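>
> To make the two readings concrete, here is the roll condition in pseudo-Java (a sketch, not actual Kafka code; all names here are mine):
> {code:java}
> // Documented rule: roll relative to the FIRST record in the active segment
> boolean rollPerDoc = newRecordTimestampMs >= firstRecordTimestampMs + logRollMs;
>
> // Observed rule: roll relative to the LAST record in the active segment
> boolean rollObserved = newRecordTimestampMs >= lastRecordTimestampMs + logRollMs;
> {code}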
>  
> A simple use case to reproduce this is to configure a single broker with:
> {code:java}
> # One partition ... or any small number should be fine
> num.partitions=1
> # 1 GiB segment (the default; large enough that size-based rolling never triggers in this test)
> log.segment.bytes=1073741824
> # Delete old segments when their last addition is 24h old
> log.retention.hours=24
> # Check age of segments every 5 minutes
> log.retention.check.interval.ms=300000
> # Roll a new segment every hour (supposedly)
> log.roll.hours=1
> {code}
> and loop on sending one small message (a few bytes, so the segment never approaches its 1 GiB size 
> limit during the test) to one topic every minute.
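>
> For reference, the producing loop could be as simple as this (a sketch using the standard Java client; the topic name "roll-test" and the bootstrap address are placeholders):
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerRecord;
>
> public class RollTestProducer {
>     public static void main(String[] args) throws InterruptedException {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
>         props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>         props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>         try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>             while (true) {
>                 // A few bytes per record, far below the 1 GiB segment limit
>                 producer.send(new ProducerRecord<>("roll-test", "ping"));
>                 producer.flush();
>                 Thread.sleep(60_000); // one record per minute
>             }
>         }
>     }
> }
> {code}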
> After running for at least 24h, the doc would lead one to expect ~24 segments (one new segment rolled 
> every hour).
>  But in reality there is only one log segment, containing all the records you sent. Now stop the producer 
> for a bit more than one hour and restart it: a second segment is created per partition, because at some 
> point, when a new record is appended, the previous record (the last one in what was the current segment) 
> is more than 1h old.
> This proves that the doc should say:
> {quote}if the timestamp of the +last+ message in the segment is T, the log 
> will be rolled out when a new message has a timestamp greater than or equal 
> to T + log.roll.ms
> {quote}
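>
> To count the segments, list the partition's log directory; each segment has exactly one *.log file (a sketch; the path assumes the default log.dirs and the test topic above):
> {code:java}
> import java.io.IOException;
> import java.nio.file.*;
> import java.util.stream.Stream;
>
> public class SegmentCount {
>     public static void main(String[] args) throws IOException {
>         // Default log.dirs is /tmp/kafka-logs; roll-test-0 is partition 0 of the test topic
>         Path dir = Paths.get("/tmp/kafka-logs/roll-test-0");
>         try (Stream<Path> files = Files.list(dir)) {
>             long segments = files.filter(p -> p.toString().endsWith(".log")).count();
>             System.out.println("Segments: " + segments);
>         }
>     }
> }
> {code}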
>  
> Notes:
>  * As a DevOps engineer, I would prefer the doc to stay as written and Kafka's behavior to be changed to 
> match it. But I think both should be done: first update the doc so that users of current versions know what 
> to expect (and avoid the problem we ran into), and later fix Kafka's behavior. With the current behavior and 
> the default config ({{log.roll.hours=168}} and {{log.segment.bytes=1073741824}}), Kafka can keep very old 
> records indefinitely: pushing one small (~1 KiB) record a day, roughly a million records fit in that segment 
> (see the arithmetic sketch after these notes), the 168h threshold is never crossed between consecutive 
> records, and so the segment is never rotated and retention never deletes anything.
>  * I detected this on version 1.0.0 but assume it affects many more versions than that one (very likely 
> 1.1.0 too).
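>
> To put numbers on the first note (simple arithmetic, assuming ~1 KiB per record and ignoring record overhead):
> {code:java}
> public class RetentionMath {
>     public static void main(String[] args) {
>         long segmentBytes = 1073741824L; // log.segment.bytes default (1 GiB)
>         long recordBytes = 1024L;        // assumed record size, ~1 KiB
>         // ~1,048,576 records fit: at one record per day the segment never
>         // fills, and the 24h gap between records never reaches the 168h
>         // roll threshold, so the active segment is never closed.
>         System.out.println("Records per segment: " + segmentBytes / recordBytes);
>     }
> }
> {code}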



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
