[ 
https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17897803#comment-17897803
 ] 

Chia-Ping Tsai commented on KAFKA-9613:
---------------------------------------

{quote}
I ran a test around this hardware issue and found an interesting behavior in Kafka. 
The replica itself is essentially a consumer: if the leader's disk has a problem and 
gets stuck, the other replicas cannot fetch the data either, and the whole system 
stops working. The leader does not send data to the replicas directly; it first 
writes it to its own disk, so if that write fails, nothing works. If the disk is 
damaged, we can only wait for retention, and all data produced in the meantime is 
lost.
{quote}

If a follower cannot sync from the leader for any reason, it drops out of the 
in-sync replica set (ISR). To prevent additional records from being written to the 
topic and to mitigate data loss, you should configure min.insync.replicas on the 
topic.
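
For illustration only, here is a minimal sketch of setting min.insync.replicas=2 on the 
affected topic through the Admin client. The bootstrap address and the chosen value are 
assumptions, not taken from this ticket, and the same change can equally be made with 
kafka-configs.sh. Note that min.insync.replicas only rejects writes for producers that 
use acks=all.

{code:java}
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RaiseMinInsyncReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            // The topic from the error log in this ticket.
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "SANDBOX.BROKER.NEWORDER");

            // With e.g. replication.factor=3 and min.insync.replicas=2, an acks=all
            // producer gets NotEnoughReplicasException instead of silently writing
            // once followers fall out of the ISR.
            AlterConfigOp setMinIsr = new AlterConfigOp(
                new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> configs =
                Map.of(topic, Collections.singleton(setMinIsr));
            admin.incrementalAlterConfigs(configs).all().get();
        }
    }
}
{code}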

{quote}
Stopping the world might help the operator notice the problem, but we still lose the 
data written before manual recovery. My question is whether we could have some 
configuration to skip past the corrupted log as soon as the errors appear and let the 
partition get back to work, instead of waiting for the retention timeout.
{quote}

If the server automatically skips corrupted records and continues sending the 
remaining records, it may result in inconsistent ordering when the records are 
recovered. Therefore, data transmission should be halted until the root cause 
is identified and addressed by a human.

If you determine that the corrupted records can be safely discarded, you can 
use Admin.deleteRecords to adjust the start offset, effectively skipping the 
corrupted records during reading.
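
As a rough sketch (not a recommendation), the deleteRecords call would look like the 
following. The partition matches the one in the error log, but the new start offset is 
purely illustrative, and everything before it is discarded permanently.

{code:java}
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

public class SkipCorruptedRecords {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            // Partition reported in the error log of this ticket.
            TopicPartition tp = new TopicPartition("SANDBOX.BROKER.NEWORDER", 0);

            // Advance the log start offset past the corrupted region. The value below
            // is only an example; use the first offset you have verified to be readable.
            long newLogStartOffset = 211887L;

            admin.deleteRecords(Map.of(tp, RecordsToDelete.beforeOffset(newLogStartOffset)))
                 .all()
                 .get();
        }
    }
}
{code}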



> CorruptRecordException: Found record size 0 smaller than minimum record 
> overhead
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-9613
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9613
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.6.2
>            Reporter: Amit Khandelwal
>            Assignee: hudeqi
>            Priority: Major
>
> 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] 
> Error processing fetch with max size 1048576 from consumer on partition 
> SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, 
> maxBytes=1048576, currentLeaderEpoch=Optional.empty) 
> (kafka.server.ReplicaManager)
> 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: 
> Found record size 0 smaller than minimum record overhead (14) in file 
> /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/00000000000000000000.log.
> 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager 
> brokerId=0] Removed 0 expired offsets in 1 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)
> 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: 
> Member 
> xxxxxxxx_011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f
>  in group yyyyyyyyy_011 has failed, removing it from the group 
> (kafka.coordinator.group.GroupCoordinator)
>  
> [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
