Christopher Auston created KAFKA-13077:
------------------------------------------

             Summary: Replication failing after unclean shutdown of ZK and all brokers
                 Key: KAFKA-13077
                 URL: https://issues.apache.org/jira/browse/KAFKA-13077
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 2.8.0
            Reporter: Christopher Auston


I am submitting this in the spirit of documenting what can go wrong when an operator 
violates the constraints Kafka depends on. I don't know whether Kafka could or 
should handle this more gracefully. I decided to file this issue because it was 
easy to get into the state I'm reporting with Kubernetes StatefulSets (STS). By 
"easy" I mean that I did not go out of my way to corrupt anything; I simply was 
not careful when restarting ZK and the brokers.

I violated the constraints of keeping ZooKeeper stable and of keeping at least one 
in-sync replica running.

I am running the bitnami/kafka helm chart on Amazon EKS.
{quote}% kubectl get po kaf-kafka-0 -ojson |jq .spec.containers'[].image'
"docker.io/bitnami/kafka:2.8.0-debian-10-r43"
{quote}
I started with 3 ZK instances and 3 brokers (both STS). I changed the 
cpu/memory requests on both STS, and Kubernetes proceeded to restart the ZK and 
Kafka instances at the same time. If I recall correctly there were some crashes 
and several restarts, but eventually all the instances were running again. It's 
possible that all ZK nodes and all brokers were unavailable at various points.
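
The change itself was just to the pod templates' resource requests; a minimal sketch 
of the kind of patch involved is below (the STS/container names and values here are 
illustrative, not the exact ones I used). Under the default RollingUpdate strategy, 
any change to the pod template triggers an automatic rolling restart of the STS, and 
because both STS were changed at once, ZK and Kafka rolled at the same time.
{quote}% kubectl patch statefulset kaf-kafka -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"kafka","resources":{"requests":{"cpu":"1","memory":"2Gi"}}}]}}}}'
% kubectl patch statefulset kaf-zookeeper -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"zookeeper","resources":{"requests":{"cpu":"500m","memory":"1Gi"}}}]}}}}'
{quote}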

The problem I noticed was that two of the brokers were continuously logging 
messages like:
{quote}% kubectl logs kaf-kafka-0 --tail 10
[2021-07-13 14:26:08,871] INFO [ProducerStateManager partition=__transaction_state-0] Loading producer state from snapshot file 'SnapshotFile(/bitnami/kafka/data/__transaction_state-0/00000000000000000001.snapshot,1)' (kafka.log.ProducerStateManager)
[2021-07-13 14:26:08,871] WARN [Log partition=__transaction_state-0, dir=/bitnami/kafka/data] *Non-monotonic update of high watermark from (offset=2744 segment=[0:1048644]) to (offset=1 segment=[0:169])* (kafka.log.Log)
[2021-07-13 14:26:08,874] INFO [Log partition=__transaction_state-10, dir=/bitnami/kafka/data] Truncating to offset 2 (kafka.log.Log)
[2021-07-13 14:26:08,877] INFO [Log partition=__transaction_state-10, dir=/bitnami/kafka/data] Loading producer state till offset 2 with message format version 2 (kafka.log.Log)
[2021-07-13 14:26:08,877] INFO [ProducerStateManager partition=__transaction_state-10] Loading producer state from snapshot file 'SnapshotFile(/bitnami/kafka/data/__transaction_state-10/00000000000000000002.snapshot,2)' (kafka.log.ProducerStateManager)
[2021-07-13 14:26:08,877] WARN [Log partition=__transaction_state-10, dir=/bitnami/kafka/data] Non-monotonic update of high watermark from (offset=2930 segment=[0:1048717]) to (offset=2 segment=[0:338]) (kafka.log.Log)
[2021-07-13 14:26:08,880] INFO [Log partition=__transaction_state-20, dir=/bitnami/kafka/data] Truncating to offset 1 (kafka.log.Log)
[2021-07-13 14:26:08,882] INFO [Log partition=__transaction_state-20, dir=/bitnami/kafka/data] Loading producer state till offset 1 with message format version 2 (kafka.log.Log)
[2021-07-13 14:26:08,882] INFO [ProducerStateManager partition=__transaction_state-20] Loading producer state from snapshot file 'SnapshotFile(/bitnami/kafka/data/__transaction_state-20/00000000000000000001.snapshot,1)' (kafka.log.ProducerStateManager)
[2021-07-13 14:26:08,883] WARN [Log partition=__transaction_state-20, dir=/bitnami/kafka/data] Non-monotonic update of high watermark from (offset=2956 segment=[0:1048608]) to (offset=1 segment=[0:169]) (kafka.log.Log)
{quote}
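For anyone digging into this, the affected segments can be inspected directly on the 
broker's data volume with kafka-dump-log.sh; a sketch is below (the segment file name 
is inferred from the snapshot/segment references in the log above, and the script may 
live under /opt/bitnami/kafka/bin in this image, so adjust the path as needed).
{quote}% kubectl exec kaf-kafka-0 -- kafka-dump-log.sh \
    --files /bitnami/kafka/data/__transaction_state-0/00000000000000000000.log \
    --print-data-log
{quote}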
If I describe that topic I can see that several partitions have a leader of 2 
and an ISR of just 2. (NOTE: I added two more brokers and tried to reassign the 
topic onto brokers 2,3,4, which you can see below.) The new brokers also emit the 
"non-monotonic update" messages, just like the original followers. This describe 
output is from the following day.

{{% kafka-topics.sh ${=BS} -topic __transaction_state -describe}}
{{Topic: __transaction_state TopicId: i7bBNCeuQMWl-ZMpzrnMAw PartitionCount: 50 ReplicationFactor: 3 Configs: compression.type=uncompressed,min.insync.replicas=3,cleanup.policy=compact,flush.ms=1000,segment.bytes=104857600,flush.messages=10000,max.message.bytes=1000012,unclean.leader.election.enable=false,retention.bytes=1073741824}}
{{ Topic: __transaction_state Partition: 0 Leader: 2 Replicas: 4,3,2,1,0 Isr: 2 Adding Replicas: 4,3 Removing Replicas: 1,0}}
{{ Topic: __transaction_state Partition: 1 Leader: 2 Replicas: 2,4,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 2 Leader: 3 Replicas: 3,2,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 3 Leader: 4 Replicas: 4,2,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 4 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 5 Leader: 2 Replicas: 3,4,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 6 Leader: 4 Replicas: 4,3,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 7 Leader: 2 Replicas: 2,4,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 8 Leader: 2 Replicas: 3,2,4,0,1 Isr: 2 Adding Replicas: 3,4 Removing Replicas: 0,1}}
{{ Topic: __transaction_state Partition: 9 Leader: 2 Replicas: 4,2,3,1,0 Isr: 2 Adding Replicas: 4,3 Removing Replicas: 1,0}}
{{ Topic: __transaction_state Partition: 10 Leader: 2 Replicas: 2,3,4,1,0 Isr: 2 Adding Replicas: 3,4 Removing Replicas: 1,0}}
{{ Topic: __transaction_state Partition: 11 Leader: 3 Replicas: 3,4,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 12 Leader: 4 Replicas: 4,3,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 13 Leader: 2 Replicas: 2,4,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 14 Leader: 3 Replicas: 3,2,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 15 Leader: 4 Replicas: 4,2,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 16 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 17 Leader: 2 Replicas: 3,4,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 18 Leader: 4 Replicas: 4,3,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 19 Leader: 2 Replicas: 2,4,3,0,1 Isr: 2 Adding Replicas: 4,3 Removing Replicas: 0,1}}
{{ Topic: __transaction_state Partition: 20 Leader: 2 Replicas: 3,2,4,0,1 Isr: 2 Adding Replicas: 3,4 Removing Replicas: 0,1}}
{{ Topic: __transaction_state Partition: 21 Leader: 2 Replicas: 4,2,3,1,0 Isr: 2 Adding Replicas: 4,3 Removing Replicas: 1,0}}
{{ Topic: __transaction_state Partition: 22 Leader: 2 Replicas: 2,3,4,1,0 Isr: 2 Adding Replicas: 3,4 Removing Replicas: 1,0}}
{{ Topic: __transaction_state Partition: 23 Leader: 3 Replicas: 3,4,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 24 Leader: 4 Replicas: 4,3,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 25 Leader: 2 Replicas: 2,4,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 26 Leader: 3 Replicas: 3,2,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 27 Leader: 4 Replicas: 4,2,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 28 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 29 Leader: 3 Replicas: 3,4,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 30 Leader: 4 Replicas: 4,3,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 31 Leader: 2 Replicas: 2,4,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 32 Leader: 3 Replicas: 3,2,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 33 Leader: 4 Replicas: 4,2,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 34 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 35 Leader: 3 Replicas: 3,4,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 36 Leader: 4 Replicas: 4,3,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 37 Leader: 2 Replicas: 2,4,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 38 Leader: 3 Replicas: 3,2,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 39 Leader: 4 Replicas: 4,2,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 40 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 41 Leader: 3 Replicas: 3,4,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 42 Leader: 4 Replicas: 4,3,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 43 Leader: 2 Replicas: 2,4,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 44 Leader: 3 Replicas: 3,2,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 45 Leader: 4 Replicas: 4,2,3 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 46 Leader: 2 Replicas: 2,3,4 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 47 Leader: 3 Replicas: 3,4,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 48 Leader: 4 Replicas: 4,3,2 Isr: 2,3,4}}
{{ Topic: __transaction_state Partition: 49 Leader: 2 Replicas: 2,4,3 Isr: 2,3,4}}
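
The partitions showing "Adding Replicas"/"Removing Replicas" with an ISR of just 2 are 
the ones stuck from the reassignment I started; it never completes because the 
followers never rejoin the ISR. The in-flight reassignment can be confirmed with 
(assuming the same ${=BS} bootstrap-server shorthand as above):
{quote}% kafka-reassign-partitions.sh ${=BS} --list
{quote}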

 

It seems something got corrupted and the followers will never make progress. 
Even worse, the original followers appear to have truncated their copies, so if 
the remaining leader replica is the one that is corrupted, then followers that 
had more valid data may have been truncated?

Anyway, for what it's worth, this is something that happened to me. I plan to 
change the StatefulSets to require manual restarts so I can control rolling 
upgrades. It also seems to underscore the value of having a separate Kafka 
cluster for disaster recovery.
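
For reference, the change I have in mind is setting the StatefulSet update strategy 
to OnDelete so pods only restart when I delete them deliberately (a sketch; the STS 
names are assumed from the chart, and the bitnami chart may expose this through its 
values instead):
{quote}% kubectl patch statefulset kaf-kafka -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
% kubectl patch statefulset kaf-zookeeper -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
{quote}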


