[ https://issues.apache.org/jira/browse/KAFKA-9613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17897635#comment-17897635 ]

Teddy Yan edited comment on KAFKA-9613 at 11/12/24 4:20 PM:
------------------------------------------------------------

Yes, it's a disk issue. I ran into this problem too.

I tested this hardware-failure scenario and found an interesting behavior in 
Kafka. The follower replicas are effectively consumers of the leader: the 
leader does not push data directly to them, but appends it to its own disk 
first, and the followers then fetch from that log. So if the leader's disk 
has a problem and writes get stuck, the other replicas cannot get the data 
either, and the entire partition stops working. If the disk is damaged, we 
can only wait for retention, and all data in the meantime is lost.
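The broker-side check that produces this error is easy to reproduce offline. Below is a hedged Python sketch (my own illustration, not Kafka code) that walks batch headers in a raw segment file the way the broker's sanity check does, assuming the on-disk layout of an 8-byte base offset plus a 4-byte batch length per batch; a zeroed region left by a damaged disk then shows up exactly as "record size 0".

```python
import struct

LOG_OVERHEAD = 12          # 8-byte base offset + 4-byte batch length
MIN_RECORD_OVERHEAD = 14   # minimum legal batch size, per the error message

def scan_segment(data: bytes):
    """Walk batch headers in a raw Kafka-style log segment.

    Returns (clean_bytes, error): clean_bytes is the length of the valid
    prefix; error describes the first corrupt header, or None.
    """
    pos = 0
    while pos + LOG_OVERHEAD <= len(data):
        # Big-endian int64 base offset followed by int32 batch length.
        base_offset, size = struct.unpack_from(">qi", data, pos)
        if size < MIN_RECORD_OVERHEAD:
            return pos, (f"Found record size {size} smaller than minimum "
                         f"record overhead ({MIN_RECORD_OVERHEAD}) at file "
                         f"position {pos} (base offset {base_offset})")
        pos += LOG_OVERHEAD + size
    return pos, None
```

For example, one valid 20-byte batch followed by a zeroed tail reports corruption at byte 32, the end of the valid prefix.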

Stopping the world might help an operator notice the problem, but we still 
lose data before we can recover manually. My question is whether we could 
have a configuration option that skips over the corrupt log data as soon as 
possible and brings the partition back into service, instead of waiting for 
the retention timeout. 



> CorruptRecordException: Found record size 0 smaller than minimum record 
> overhead
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-9613
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9613
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.6.2
>            Reporter: Amit Khandelwal
>            Assignee: hudeqi
>            Priority: Major
>
> 20200224;21:01:38: [2020-02-24 21:01:38,615] ERROR [ReplicaManager broker=0] 
> Error processing fetch with max size 1048576 from consumer on partition 
> SANDBOX.BROKER.NEWORDER-0: (fetchOffset=211886, logStartOffset=-1, 
> maxBytes=1048576, currentLeaderEpoch=Optional.empty) 
> (kafka.server.ReplicaManager)
> 20200224;21:01:38: org.apache.kafka.common.errors.CorruptRecordException: 
> Found record size 0 smaller than minimum record overhead (14) in file 
> /data/tmp/kafka-topic-logs/SANDBOX.BROKER.NEWORDER-0/00000000000000000000.log.
> 20200224;21:05:48: [2020-02-24 21:05:48,711] INFO [GroupMetadataManager 
> brokerId=0] Removed 0 expired offsets in 1 milliseconds. 
> (kafka.coordinator.group.GroupMetadataManager)
> 20200224;21:10:22: [2020-02-24 21:10:22,204] INFO [GroupCoordinator 0]: 
> Member 
> xxxxxxxx_011-9e61d2c9-ce5a-4231-bda1-f04e6c260dc0-StreamThread-1-consumer-27768816-ee87-498f-8896-191912282d4f
>  in group yyyyyyyyy_011 has failed, removing it from the group 
> (kafka.coordinator.group.GroupCoordinator)
>  
> [https://stackoverflow.com/questions/60404510/kafka-broker-issue-replica-manager-with-max-size#]
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
