Ari Uka commented on KAFKA-6679:

When I run `/usr/local/share/kafka_2.12-1.0.1/bin/kafka-run-class.sh 
kafka.tools.DumpLogSegments --files` on the leader of the partition, I get an 
exception and the dump seems to stop early. 

I wanted to dump some of the messages and check if they were corrupt, but the 
segments won't even dump properly.

```
Exception in thread "main" org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller than minimum record overhead (14).
```

What causes this exception?
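
For context, here is a minimal sketch of the kind of check that appears to produce this message. It assumes the on-disk layout I understand Kafka to use: each log entry starts with an 8-byte offset and a 4-byte size (both big-endian), and the size must be at least 14 bytes, the minimum overhead of a legacy record (CRC 4 + magic 1 + attributes 1 + key length 4 + value length 4). `scan_segment` and its return shape are hypothetical names for illustration, not part of Kafka or DumpLogSegments.

```python
import struct

LOG_OVERHEAD = 12          # assumed: 8-byte offset + 4-byte size before each entry
MIN_RECORD_OVERHEAD = 14   # the "(14)" in the exception message


def scan_segment(data: bytes):
    """Walk entry headers in raw segment bytes.

    Returns (entries, error): entries is a list of (position, offset, size)
    for every header that looked sane, and error is None or a description
    of the first header that did not.
    """
    entries = []
    pos = 0
    while pos + LOG_OVERHEAD <= len(data):
        # ">qi" = big-endian signed 64-bit offset followed by signed 32-bit size
        offset, size = struct.unpack_from(">qi", data, pos)
        if size < MIN_RECORD_OVERHEAD:
            return entries, f"position {pos}: size {size} < {MIN_RECORD_OVERHEAD}"
        if pos + LOG_OVERHEAD + size > len(data):
            return entries, f"position {pos}: size {size} runs past end of file"
        entries.append((pos, offset, size))
        pos += LOG_OVERHEAD + size
    return entries, None
```

Under these assumptions, a run of zero bytes in the middle of a segment would be read as a size of 0 and would trip the first check. Running the function over the bytes of a `.log` file gives the byte position where the headers stop making sense, which can then be inspected with a hex dump.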

> Random corruption (CRC validation issues) 
> ------------------------------------------
>                 Key: KAFKA-6679
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6679
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer, replication
>    Affects Versions: 1.0.1
>         Environment: FreeBSD 11.0-RELEASE-p8
>            Reporter: Ari Uka
>            Priority: Major
> I'm running into a really strange issue in production. I have 3 brokers, and 
> consumers will randomly start to fail with an error message saying the CRC 
> does not match. The brokers are all on 1.0.1 now; the issue started on 0.10.2, 
> and I upgraded in the hope that it would fix the issue.
> On the Kafka side, I see errors related to this across all 3 brokers:
> ```
> [2018-03-17 20:59:58,967] ERROR [ReplicaFetcher replicaId=3, leaderId=1, 
> fetcherId=0] Error for partition topic-a-0 to broker 
> 1:org.apache.kafka.common.errors.CorruptRecordException: This message has 
> failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
> (kafka.server.ReplicaFetcherThread)
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
> fetch operation on partition topic-b-0, offset 23848795 
> (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
> than minimum record overhead (14).
> [2018-03-17 20:59:59,411] ERROR [ReplicaManager broker=3] Error processing 
> fetch operation on partition topic-b-0, offset 23848795 
> (kafka.server.ReplicaManager)
> org.apache.kafka.common.errors.CorruptRecordException: Record size is smaller 
> than minimum record overhead (14)
> [2018-03-17 20:59:59,490] ERROR [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error for partition topic-c-2 to broker 
> 2:org.apache.kafka.common.errors.CorruptRecordException: This message has 
> failed its CRC checksum, exceeds the valid size, or is otherwise corrupt. 
> (kafka.server.ReplicaFetcherThread)
> ```
> To fix this, I have to use the kafka-consumer-groups.sh command line tool and 
> do a binary search until I find a non-corrupt message, then push the 
> offsets forward. It's annoying because I can't actually push to a specific 
> date: kafka-consumer-groups.sh starts to emit the same error, 
> ErrInvalidMessage, CRC does not match.
> The error popped up again the day after I fixed it, though, so I'm trying to 
> find the root cause. 
> I'm using the Go consumer [https://github.com/Shopify/sarama] and 
> [https://github.com/bsm/sarama-cluster]. 
> At first, I thought it could be the consumer libraries, but the error also 
> happens with kafka-console-consumer.sh when a specific message is corrupted 
> in Kafka. I don't think it's possible for Kafka producers to actually push 
> corrupt messages to Kafka and then cause all consumers to break, right? I 
> assume Kafka would reject corrupt messages, so I'm not sure what's going on 
> here.
> Should I just re-create the cluster? I don't think it's a hardware failure 
> across all 3 machines, though.
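
The binary search described in the quoted issue can be sketched roughly as follows. This is only an illustration: `first_readable_offset` and the `is_readable` callback are hypothetical names, and the search assumes the corrupt region is contiguous, so every offset up to some point fails and every offset after it can be fetched. If corrupt and readable offsets are interleaved, a linear scan would be needed instead.

```python
from typing import Callable


def first_readable_offset(lo: int, hi: int,
                          is_readable: Callable[[int], bool]) -> int:
    """Return the smallest readable offset in (lo, hi].

    Preconditions (assumed, not checked): `lo` is known to be corrupt,
    `hi` is known to be readable, and readability flips exactly once
    between them.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_readable(mid):
            hi = mid   # mid can be fetched: the boundary is at or before mid
        else:
            lo = mid   # mid is still corrupt: the boundary is after mid
    return hi
```

In practice `is_readable` would wrap whatever fetch is being used to probe a single offset (a hypothetical wrapper around a consumer call that returns `False` on a CRC error), and the returned offset is the one to reset the consumer group to.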

This message was sent by Atlassian JIRA
