[ https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046650#comment-16046650 ]
Jun Rao commented on KAFKA-5431:
--------------------------------

Interesting. First, in general, if there is an IOException while writing to the log, the broker will shut down immediately. Second, if the cleaner hits an IOException, we currently just abort the current cleaning job; the existing log should be intact.

The "invalid message" error from the follower fetch seems a bit weird. It suggests that the log segment at offset 0 is corrupted somehow. Could you run the DumpLogSegment tool (https://cwiki.apache.org/confluence/display/KAFKA/System+Tools#SystemTools-DumpLogSegment) on that segment on the leader to see whether there is any log corruption?

> LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-5431
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5431
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.2.1
>            Reporter: Carsten Rietz
>              Labels: reliability
>
> Hey all,
> I have a strange problem with our UAT cluster of 3 Kafka brokers.
> The __consumer_offsets topic was replicated to two instances, and our disks ran full due to a misconfiguration of the log cleaner. We fixed the configuration and upgraded from 0.10.1.1 to 0.10.2.1.
> Today I increased the replication of the __consumer_offsets topic to 3 and triggered replication to the third broker via kafka-reassign-partitions.sh.
> That went well, but I now get many errors like:
> {code}
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for partition [__consumer_offsets,18] offset 0 error Record size is less than the minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for partition [__consumer_offsets,24] offset 0 error Record size is less than the minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> {code}
> which I think are due to the full-disk event.
> The log cleaner threads died on these corrupt messages:
> {code}
> [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to (kafka.log.LogCleaner)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is less than the minimum record overhead (14)
> [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped (kafka.log.LogCleaner)
> {code}
> Looking at the files I see that some are truncated and some are just empty:
> $ ls -lsh 00000000000000594653.log
> 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00000000000000594653.log
> Sadly, I no longer have the logs from the disk-full event itself.
> I have three questions:
> * What is the best way to clean this up? Deleting the old log files and restarting the brokers?
> * Why did Kafka not handle the disk-full event well? Does this only affect the cleanup, or may we also lose data?
> * Could this be caused by the combination of the upgrade and the full disk?
> And last but not least: keep up the good work. Kafka performs really well while being easy to administer, and the documentation is good!
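A minimal sketch of what the DumpLogSegments check suggested above might look like, assuming the leader keeps its data under /var/kafka-logs and that partition 18 is one of the affected ones (both the directory and the segment file name are placeholders; substitute the broker's actual log.dirs and segment):

{code}
# Placeholder paths: point --files at the suspect segment of the affected
# __consumer_offsets partition on the leader broker.
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /var/kafka-logs/__consumer_offsets-18/00000000000000000000.log \
  --deep-iteration --print-data-log
{code}

If the segment at offset 0 is corrupted, the tool would be expected to fail on it or print invalid entries near that offset; adding --offsets-decoder additionally decodes the offsets-topic payloads.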