Kevin Li created KAFKA-7331:
-------------------------------
Summary: Kafka does not detect broker loss in the event of a
network partition within the cluster
Key: KAFKA-7331
URL: https://issues.apache.org/jira/browse/KAFKA-7331
Project: Kafka
Issue Type: Bug
Components: controller, network
Affects Versions: 1.0.1
Reporter: Kevin Li
We ran into this issue on our production cluster and had to manually remove the
broker and enable unclean leader elections to get the cluster working again.
Ideally, Kafka itself could handle network partitions without manual
intervention.
The issue is reproducible with the following cross-datacenter Kafka cluster
setup:
DC 1: Kafka brokers + ZK nodes
DC 2: Kafka brokers + ZK nodes
DC 3: Kafka brokers + ZK nodes
Introduce a network partition on a Kafka broker (brokerA) in DC 1 so that it
cannot reach any hosts (brokers and ZK nodes) in the other two datacenters. The
cluster goes into a state where the partitions that brokerA leads contain only
brokerA in their ISR. Since brokerA is still reachable by the ZK nodes in DC 1,
it still shows up when querying ZK. The controller thinks brokerA is still up
and does not elect new leaders for the partitions that brokerA leads. All of
those partitions stay down until brokerA is back or is completely removed from
the cluster (in which case unclean leader election can elect new leaders for
them).
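The failure mode above can be illustrated with a small, simplified model. This is only a sketch and not Kafka code: the host names (zk1, brokerB, ...), the one-host-per-datacenter layout, and the helper functions are all hypothetical. It assumes that a broker's ZK session stays alive as long as it can reach at least one ZK node, and that a follower stays in the ISR only if it can reach the leader.

```python
# Simplified model of the reported scenario; host names are hypothetical.
DATACENTER = {
    "zk1": "DC1", "zk2": "DC2", "zk3": "DC3",
    "brokerA": "DC1", "brokerB": "DC2", "brokerC": "DC3",
}
ZK_NODES = ["zk1", "zk2", "zk3"]


def can_reach(src, dst, isolated):
    """The isolated host can only talk to hosts in its own datacenter."""
    if isolated in (src, dst):
        return DATACENTER[src] == DATACENTER[dst]
    return True


def session_alive(broker, isolated):
    """Assumption: the ZK session survives while any ZK node is reachable."""
    return any(can_reach(broker, zk, isolated) for zk in ZK_NODES)


def isr(leader, replicas, isolated):
    """Assumption: a follower stays in sync only if it can reach the leader."""
    return [r for r in replicas if can_reach(r, leader, isolated)]


def controller_elects_new_leader(leader, isolated):
    """The controller reacts only when the leader's ZK session is gone."""
    return not session_alive(leader, isolated)


replicas = ["brokerA", "brokerB", "brokerC"]
print(isr("brokerA", replicas, isolated="brokerA"))                 # ['brokerA']
print(controller_elects_new_leader("brokerA", isolated="brokerA"))  # False
```

In this model the ISR shrinks to brokerA alone while its ZK session is still alive, so nothing triggers a new election. That matches the stuck state described above.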
A faster recovery scenario could be for a majority of hosts (zk nodes?) to
realize that brokerA is unreachable and mark it as down, so that elections for
the partitions it leads can be triggered. This avoids waiting indefinitely for
the broker to come back or taking manual action to remove the broker from the
cluster.
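As a rough sketch of the majority idea above (not an existing Kafka or ZooKeeper mechanism), the check could look like the following. The function name, the observer list, and the reachability data are all hypothetical. A real design would also need to decide who the observers are and how they probe a broker.

```python
def unreachable_by_majority(target, observers, can_reach):
    """Return True when more than half of the observers cannot reach target."""
    failures = sum(1 for o in observers if not can_reach(o, target))
    return failures > len(observers) / 2


# Hypothetical reachability: only zk1 (same datacenter) still reaches brokerA.
reachable = {
    ("zk1", "brokerA"),
    ("zk1", "brokerB"), ("zk2", "brokerB"), ("zk3", "brokerB"),
}


def probe(observer, target):
    """Stand-in for a real connectivity check between two hosts."""
    return (observer, target) in reachable


observers = ["zk1", "zk2", "zk3"]
print(unreachable_by_majority("brokerA", observers, probe))  # True
print(unreachable_by_majority("brokerB", observers, probe))  # False
```

With three observers, two failed probes are enough to mark brokerA as down, even though zk1 in the same datacenter can still reach it.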