[ https://issues.apache.org/jira/browse/KAFKA-5431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047637#comment-16047637 ]
huxihx commented on KAFKA-5431:
-------------------------------

Could you run the commands below to see whether there is a corrupt record in the `__consumer_offsets` topic?

{noformat}
bin/kafka-simple-consumer-shell.sh --topic __consumer_offsets --partition 18 --broker-list *** --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter"
bin/kafka-simple-consumer-shell.sh --topic __consumer_offsets --partition 24 --broker-list *** --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter"
{noformat}

> LogCleaner stopped due to org.apache.kafka.common.errors.CorruptRecordException
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-5431
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5431
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.2.1
>            Reporter: Carsten Rietz
>              Labels: reliability
>
> Hey all,
> I have a strange problem with our UAT cluster of 3 Kafka brokers.
> The __consumer_offsets topic was replicated to two instances, and our disks ran full due to a wrong configuration of the log cleaner. We fixed the configuration and updated from 0.10.1.1 to 0.10.2.1.
> Today I increased the replication of the __consumer_offsets topic to 3 and triggered replication to the third broker via kafka-reassign-partitions.sh. That went well, but I get many errors like
> {code}
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for partition [__consumer_offsets,18] offset 0 error Record size is less than the minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> [2017-06-12 09:59:50,342] ERROR Found invalid messages during fetch for partition [__consumer_offsets,24] offset 0 error Record size is less than the minimum record overhead (14) (kafka.server.ReplicaFetcherThread)
> {code}
> which I think are due to the full-disk event.
> The log cleaner threads died on these corrupt messages:
> {code}
> [2017-06-12 09:59:50,722] ERROR [kafka-log-cleaner-thread-0], Error due to  (kafka.log.LogCleaner)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is less than the minimum record overhead (14)
> [2017-06-12 09:59:50,722] INFO [kafka-log-cleaner-thread-0], Stopped  (kafka.log.LogCleaner)
> {code}
> Looking at the files I see that some are truncated and some are just empty:
> {code}
> $ ls -lsh 00000000000000594653.log
> 0 -rw-r--r-- 1 user user 100M Jun 12 11:00 00000000000000594653.log
> {code}
> Sadly I do not have the logs any more from the disk-full event itself.
> I have three questions:
> * What is the best way to clean this up? Deleting the old log files and restarting the brokers?
> * Why did Kafka not handle the disk-full event well? Is this only affecting the cleanup, or may we also lose data?
> * Is this maybe caused by the combination of the upgrade and the full disk?
> And last but not least: keep up the good work. Kafka is really performing well while being easy to administer and has good documentation!

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
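Besides reading the partitions over the network as suggested above, a rough sketch of how the suspect segment files could be inspected directly on the broker with Kafka's bundled DumpLogSegments tool; the log directory and the assumption that the truncated segment belongs to partition 18 are illustrative, not taken from the report:

{noformat}
# Dump the records in the suspect segment; a truncated or zero-filled batch
# should surface as a CorruptRecordException or an obviously invalid entry.
# (The path below is an assumption -- substitute the broker's real log.dirs.)
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /var/kafka-logs/__consumer_offsets-18/00000000000000594653.log \
  --deep-iteration
{noformat}

If the dump fails at offset 0 with the same error the ReplicaFetcherThread reported, that would confirm the on-disk segment (rather than the fetch path) is corrupt; one commonly suggested recovery is to remove the affected replica's partition directory and let it re-sync from a healthy replica, after which a broker restart also brings back the stopped log-cleaner thread.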