[ https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17897803#comment-17897803 ]
Chia-Ping Tsai commented on KAFKA-9613:
---------------------------------------

{quote}
I did a test on this hardware issue and found an interesting problem with Kafka. It seems that a replica is itself a kind of consumer: if the leader's disk has a problem and gets stuck, the other replicas do not get the data either, and the entire system stops working. The leader does not send data directly to the replicas; it writes the data to disk first, and if that write fails, nothing works. If the disk is damaged, we can only wait for retention, and all data in the meantime is lost.
{quote}

If a follower cannot sync from the leader for any reason, it should become an out-of-sync replica. To prevent additional records from being written to the topic and to mitigate data loss, you should configure min.insync.replicas on the topic (a sketch is appended after the issue details below).

{quote}
Stopping the world might help the operator notice the problem, but we still lose data until we recover manually. My question is whether we can have some configuration to skip the erroring records ASAP and let the partition get back to work, instead of waiting for the retention timeout.
{quote}

If the server automatically skipped corrupted records and kept sending the remaining ones, it could result in inconsistent ordering when the records are later recovered. Therefore, data transmission should be halted until the root cause is identified and addressed by a human. If you determine that the corrupted records can be safely discarded, you can use Admin.deleteRecords to advance the log start offset, effectively skipping the corrupted records on read (also sketched below).

> CorruptRecordException: Found record size 0 smaller than minimum record overhead
> ---------------------------------------------------------------------------------
>
>                 Key: KAFKA-9613
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9613
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.6.2
>            Reporter: Amit Khandelwal
>            Assignee: hudeqi
>            Priority: Major
>
> 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] Error processing fetch with max size 1048576 from consumer on partition SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, maxBytes=1048576, currentLeaderEpoch=Optional.empty) (kafka.server.ReplicaManager)
> 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: Found record size 0 smaller than minimum record overhead (14) in file /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/00000000000000000000.log.
> 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager brokerId=0] Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
> 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: Member xxxxxxxx_011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f in group yyyyyyyyy_011 has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
>
> [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#]
>
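For reference, a minimal, untested sketch of raising min.insync.replicas with the Java Admin client; the bootstrap address, topic name, and value of 2 are placeholders, and the setting only rejects writes for producers that use acks=all:

{code:java}
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RaiseMinInsyncReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Raise min.insync.replicas on the affected topic (placeholder name).
            // With acks=all producers, writes then fail with
            // NotEnoughReplicasException once fewer than two replicas are in sync,
            // instead of silently piling up records that only the leader has.
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "SANDBOX.BROKER.NEWORDER");
            AlterConfigOp setMinIsr = new AlterConfigOp(
                new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setMinIsr))).all().get();
        }
    }
}
{code}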
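And a similarly hedged sketch of the Admin.deleteRecords approach mentioned above; the partition and target offset are illustrative only (the real value must be the first offset known to lie beyond the corruption), and everything below that offset becomes unreadable, so use it only after deciding the data can be discarded:

{code:java}
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DeletedRecords;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class SkipCorruptedRecords {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Illustrative values: partition 0 of the affected topic, and an offset
            // one past the failing fetch offset from the error log (211886).
            TopicPartition tp = new TopicPartition("SANDBOX.BROKER.NEWORDER", 0);
            long firstGoodOffset = 211887L;

            // deleteRecords advances the log start offset; records below it are
            // no longer served to consumers or followers.
            DeletedRecords result = admin
                .deleteRecords(Map.of(tp, RecordsToDelete.beforeOffset(firstGoodOffset)))
                .lowWatermarks().get(tp).get();
            System.out.println("New log start offset: " + result.lowWatermark());
        }
    }
}
{code}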