Takao Kobayashi created KAFKA-6113:
--------------------------------------
Summary: broker failure leads to under replicated partitions
Key: KAFKA-6113
URL: https://issues.apache.org/jira/browse/KAFKA-6113
Project: Kafka
Issue Type: Bug
Affects Versions: 0.10.1.1
Reporter: Takao Kobayashi
Attachments: Screen Shot 2017-10-20 at 10.57.28 AM.png, kafka1.csv,
kafka2.csv, kafka3.csv, kafka4.csv, kafka5.csv, zookeeper2.csv
This is similar to https://issues.apache.org/jira/browse/KAFKA-2729, with
some slight differences. We're running 5 Kafka brokers and 3 ZooKeeper nodes
on Kubernetes on AWS. One node (5.kafka.production1) suddenly failed and was
offline for ~13 min.
During the outage, many partitions were under-replicated. As soon as the node
came back online, all brokers recovered.
The logs show that many partitions failed to shrink their ISR (to remove the
failed broker) because the cached zkVersion on the Kafka node did not match
the version in ZooKeeper (a screenshot of one such example is attached).
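
To illustrate the failure mode described above, here is a minimal sketch of a
version-checked (compare-and-set) update. It is a toy model only, not Kafka's
or ZooKeeper's actual code: VersionedNode, conditional_set and try_shrink_isr
are hypothetical names, and the broker IDs are made up. It assumes the ISR is
stored with a version counter and that a write is rejected when the caller's
cached version is stale.

```python
class VersionedNode:
    """Toy stand-in for a versioned node holding a partition's ISR (hypothetical)."""

    def __init__(self, isr):
        self.isr = list(isr)
        self.version = 0

    def conditional_set(self, new_isr, expected_version):
        """Write only if the caller's expected version matches the stored one."""
        if expected_version != self.version:
            return False  # caller's cached version is stale, write is skipped
        self.isr = list(new_isr)
        self.version += 1
        return True


def try_shrink_isr(node, cached_isr, cached_version, failed_broker):
    """Attempt to drop failed_broker from the ISR using the cached version."""
    new_isr = [b for b in cached_isr if b != failed_broker]
    return node.conditional_set(new_isr, cached_version)


node = VersionedNode([1, 2, 5])
# The partition leader caches the ISR and its version.
cached_isr, cached_version = list(node.isr), node.version

# Some other writer updates the node, so the leader's cache becomes stale.
node.conditional_set([1, 2, 5], node.version)

# With the stale version the shrink is rejected and broker 5 stays in the ISR.
stale_ok = try_shrink_isr(node, cached_isr, cached_version, failed_broker=5)
isr_after_stale = list(node.isr)

# After re-reading the current state, the same shrink succeeds.
cached_isr, cached_version = list(node.isr), node.version
fresh_ok = try_shrink_isr(node, cached_isr, cached_version, failed_broker=5)

print(stale_ok, isr_after_stale, fresh_ok, node.isr)
```

In this model, a leader that never refreshes its cached version keeps failing
the shrink, which would match partitions staying under-replicated until
something else changes the state. Whether that is what happened here is an
assumption, not something the attached logs are claimed to prove.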
I've attached the logs for all the Kafka nodes and one of the ZooKeeper nodes.
Any advice or insight would be much appreciated.