[ 
https://issues.apache.org/jira/browse/KAFKA-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060456#comment-14060456
 ] 

Dmitry Bugaychenko commented on KAFKA-1539:
-------------------------------------------

This is not about the log files themselves, but about the checkpoint offset files:

{code}
-rw-r--r--  1 root root   158 Jul 14 12:11 recovery-point-offset-checkpoint
-rw-r--r--  1 root root   163 Jul 14 12:11 replication-offset-checkpoint
-rw-r--r--  1 root root     0 May 28 13:09 cleaner-offset-checkpoint
{code}

If recovery-point-offset-checkpoint gets corrupted, broker startup slows down 
dramatically (to hours). If replication-offset-checkpoint gets corrupted, the 
broker removes all the data it has and starts recovering from other replicas. 
If both get corrupted, you get both: the broker spends hours checking log 
segment files and then removes them all.


> Due to OS caching Kafka might lose offset files which causes full reset of 
> data
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-1539
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1539
>             Project: Kafka
>          Issue Type: Bug
>          Components: log
>    Affects Versions: 0.8.1.1
>            Reporter: Dmitry Bugaychenko
>            Assignee: Jay Kreps
>
> Seen this while testing power failures and disk failures. Due to caching at 
> the OS level (e.g. XFS can cache data for 30 seconds), after a failure we got 
> offset files of zero length. This dramatically slows down broker startup (it 
> has to re-check all segments), and if the high watermark offsets are lost it 
> simply erases all data and starts recovering from other brokers (looks funny: 
> first spending 2-3 hours re-checking logs and then deleting them all due to 
> the missing high watermark).
> Proposal: introduce offset file rotation. Keep two versions of the offset 
> file, write to the oldest, and read from the newest valid one. In this case 
> we would be able to configure the offset checkpoint interval so that at least 
> one file is always flushed and valid.
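
The rotation idea in the proposal could look roughly like the Python sketch 
below. It is only an illustration, not Kafka code: the class name 
RotatingCheckpoint, the ".0"/".1" file suffixes, the generation counter and 
the CRC32 header line are all assumptions made for the example, and the file 
layout is not Kafka's actual checkpoint format. A file counts as "valid" only 
if it is non-empty and its checksum matches, so a zero-length file left by a 
power failure is skipped on read.

```python
import zlib
from typing import Dict, List, Optional, Tuple

# Hypothetical on-disk layout (NOT Kafka's real checkpoint format):
#   line 1: CRC32 of everything after the first newline
#   line 2: generation counter (higher means newer)
#   rest:   "<topic-partition> <offset>" per line


def _encode(generation: int, offsets: Dict[str, int]) -> bytes:
    lines = [str(generation)]
    lines += [f"{key} {offset}" for key, offset in sorted(offsets.items())]
    payload = "\n".join(lines).encode("utf-8")
    return f"{zlib.crc32(payload)}\n".encode("ascii") + payload


def _decode(path: str) -> Optional[Tuple[int, Dict[str, int]]]:
    """Return (generation, offsets), or None if missing, empty or corrupt."""
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except FileNotFoundError:
        return None
    header, sep, payload = raw.partition(b"\n")
    if not sep:  # empty or truncated file
        return None
    try:
        if int(header) != zlib.crc32(payload):
            return None
        lines = payload.decode("utf-8").split("\n")
        generation = int(lines[0])
        offsets = {}
        for line in lines[1:]:
            key, offset = line.rsplit(" ", 1)
            offsets[key] = int(offset)
    except (ValueError, UnicodeDecodeError):
        return None
    return generation, offsets


class RotatingCheckpoint:
    """Two-slot checkpoint: write to the oldest slot, read the newest valid."""

    def __init__(self, base_path: str) -> None:
        self.paths: List[str] = [base_path + ".0", base_path + ".1"]

    def read(self) -> Optional[Dict[str, int]]:
        valid = [d for d in map(_decode, self.paths) if d is not None]
        if not valid:
            return None
        return max(valid, key=lambda d: d[0])[1]

    def write(self, offsets: Dict[str, int]) -> None:
        decoded = [_decode(p) for p in self.paths]
        # An invalid or missing slot is treated as older than any valid one.
        generations = [d[0] if d is not None else -1 for d in decoded]
        target = generations.index(min(generations))
        data = _encode(max(generations) + 1, offsets)
        # No explicit fsync here: as in the proposal, the sketch relies on the
        # checkpoint interval being longer than the OS flush delay, so the
        # slot that is NOT being overwritten has already reached the disk.
        with open(self.paths[target], "wb") as f:
            f.write(data)
```

With this layout, losing the most recently written slot costs one checkpoint 
interval of progress instead of the whole checkpoint, since read() falls back 
to the other slot.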



--
This message was sent by Atlassian JIRA
(v6.2#6252)
