[ https://issues.apache.org/jira/browse/KAFKA-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648519#comment-13648519 ]
Jun Rao commented on KAFKA-881:
-------------------------------

If we "don't reuse files after a restart", it could fragment the log files, since it would affect all logs regardless of whether their retention is time-based (and regardless of how long the retention time is). The second option seems better. Since the major development is on 0.8 now, I suggest that we patch this in trunk instead of 0.7.

> Kafka broker not respecting log.roll.hours
> ------------------------------------------
>
>                 Key: KAFKA-881
>                 URL: https://issues.apache.org/jira/browse/KAFKA-881
>             Project: Kafka
>          Issue Type: Bug
>          Components: log
>    Affects Versions: 0.7.2
>            Reporter: Dan F
>            Assignee: Jay Kreps
>
> We are running Kafka 0.7.2. We set log.roll.hours=1. I hoped that meant logs
> would be rolled every hour, or more often. Only, sometimes logs that are many
> hours (sometimes days) old have more data added to them. This perturbs our
> systems for reasons I won't get into.
> I don't know Scala or Kafka well, but I have a proposal for why this might
> happen: upon restart, a broker forgets when its log segments were first
> appended to ("firstAppendTime"). Then, an arbitrarily long time later, the
> restarted broker receives another message for the particular (topic,
> partition) and starts the clock again. It will then roll over that log an
> hour after that.
> https://svn.apache.org/repos/asf/kafka/branches/0.7/core/src/main/scala/kafka/server/KafkaConfig.scala
> says:
> /* the maximum time before a new log segment is rolled out */
> val logRollHours = Utils.getIntInRange(props, "log.roll.hours", 24*7, (1, Int.MaxValue))
> https://svn.apache.org/repos/asf/kafka/branches/0.7/core/src/main/scala/kafka/log/Log.scala
> has maybeRoll, which needs segment.firstAppendTime defined, and
> updateFirstAppendTime(), which sets firstAppendTime only if it is currently
> empty.
> If my hypothesis about why this happens is correct, here is a case where
> rolling takes far longer than an hour, even on a high-volume topic:
> - write to a topic for 20 minutes
> - restart the broker
> - wait for 5 days
> - write to the topic for 20 minutes
> - restart the broker
> - write to the topic for an hour
> The rollover time was now 5 days, 1 hour, 40 minutes. You can make it as
> long as you want.
> Proposed solutions:
> The very easiest thing to do would be to have Kafka re-initialize
> firstAppendTime with the file creation time. Unfortunately, there is no file
> creation time in UNIX; there is only ctime, the inode change time, which is
> updated whenever a file's inode information changes.
> One solution is to embed firstAppendTime in the filename (say, as seconds
> since the epoch). Then, when you open the file, you could reset
> firstAppendTime to exactly what it really was. This ignores clock drift or
> resetting; to guard against that, one could set firstAppendTime to
> min(filename-based time, current time). (A sketch of this follows below.)
> A second solution is to make the Kafka log roll over at specific times,
> regardless of when the file was created. Conceptually, time can be divided
> into windows of size log.roll.hours since the epoch (UNIX time 0, 1970). So,
> when firstAppendTime is empty, compute the end of the current window (say,
> next = ((hours since epoch) / log.roll.hours + 1) * log.roll.hours, using
> integer division). If the file mtime (last modified) falls before the start
> of the current window ((next - log.roll.hours) .. next), roll the file over
> right away. Otherwise, roll over when you cross next, and then recompute
> next.
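> A minimal sketch of the filename-embedding idea, assuming a hypothetical
> naming scheme "<topic>-<epochSeconds>.kafka" (not Kafka's actual
> offset-based segment names):
>
>   import java.io.File
>
>   // Recover firstAppendTime from a hypothetical name like
>   // "mytopic-1367500000.kafka"; for illustration only.
>   def firstAppendTimeFromName(segment: File, nowMs: Long): Long = {
>     val base = segment.getName.stripSuffix(".kafka")
>     val epochSeconds = base.substring(base.lastIndexOf('-') + 1).toLong
>     // Guard against clock drift or resets, as suggested: take the min.
>     math.min(epochSeconds * 1000L, nowMs)
>   }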
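> And a sketch of the windowed-rollover computation from the second solution
> (names like rollHours and shouldRollNow are illustrative, not actual Kafka
> identifiers):
>
>   val MsPerHour = 60L * 60 * 1000
>
>   // End of the rollover window containing nowMs, in ms since the epoch.
>   def nextRollTime(nowMs: Long, rollHours: Int): Long = {
>     val windowMs = rollHours * MsPerHour
>     (nowMs / windowMs + 1) * windowMs
>   }
>
>   // Roll immediately if the segment was last modified before the current
>   // window began; otherwise roll when the clock crosses nextRollTime.
>   def shouldRollNow(mtimeMs: Long, nowMs: Long, rollHours: Int): Boolean =
>     mtimeMs < nextRollTime(nowMs, rollHours) - rollHours * MsPerHour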
> A third solution (not perfect, but at least an approximation) would be not
> to write to a segment if firstAppendTime is not defined and the timestamp on
> the file is more than log.roll.hours old (see the sketch below).
> There are probably other solutions.
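> A sketch of that third, approximate check (again with illustrative names
> only):
>
>   import java.io.File
>
>   // Refuse to append (forcing a roll) when firstAppendTime is unset and
>   // the file's mtime is more than log.roll.hours in the past.
>   def tooOldToAppend(segment: File, firstAppendTime: Option[Long],
>                      nowMs: Long, rollHours: Int): Boolean =
>     firstAppendTime.isEmpty &&
>       nowMs - segment.lastModified > rollHours * 60L * 60 * 1000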