Kevin Li created KAFKA-7331:
-------------------------------

             Summary: Kafka does not detect broker loss in the event of a 
network partition within the cluster
                 Key: KAFKA-7331
                 URL: https://issues.apache.org/jira/browse/KAFKA-7331
             Project: Kafka
          Issue Type: Bug
          Components: controller, network
    Affects Versions: 1.0.1
            Reporter: Kevin Li


We ran into this issue on our production cluster and had to manually remove the 
broker and enable unclean leader elections to get the cluster working again. 
Ideally, Kafka itself could handle network partitions without manual 
intervention.

The issue is reproducible with the following cross-datacenter Kafka cluster 
setup:
DC 1: Kafka brokers + ZK nodes
DC 2: Kafka brokers + ZK nodes
DC 3: Kafka brokers + ZK nodes

Introduce a network partition on a Kafka broker (brokerA) in DC 1 so that it 
cannot reach any hosts (brokers and ZK nodes) in the other two datacenters. The 
cluster goes into a state where the partitions that brokerA leads contain only 
brokerA in their ISRs. Since brokerA is still reachable by the ZK nodes in 
DC 1, it still shows up when querying ZK. The controller thinks brokerA is 
still up and does not elect new leaders for the partitions that brokerA leads. 
All of those partitions stay down until brokerA is back or is completely 
removed from the cluster (in which case unclean leader election can elect new 
leaders for them).
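
The failure mode above can be illustrated with a small toy model. This is only 
a sketch under stated assumptions, not Kafka or ZooKeeper code: the host names 
(zk1..zk3, brokerB, brokerC) and the helper functions are hypothetical, there is 
one broker and one ZK node per datacenter, and ISR membership is reduced to 
"the follower can reach the leader". In a real cluster a follower is dropped 
from the ISR only after it lags past replica.lag.time.max.ms.

```python
# Toy model of the reported partition. All names here are illustrative
# assumptions; none of this is Kafka or ZooKeeper API.
ZK = {"zk1": "dc1", "zk2": "dc2", "zk3": "dc3"}
BROKERS = {"brokerA": "dc1", "brokerB": "dc2", "brokerC": "dc3"}
HOSTS = {**ZK, **BROKERS}


def can_reach(a, b, isolated="brokerA"):
    """The isolated broker can only talk to hosts in its own datacenter."""
    if isolated in (a, b):
        return HOSTS[a] == HOSTS[b]
    return True


def zk_has_quorum(zk):
    """A ZK node serves sessions only if it sees a majority of the ensemble."""
    peers = sum(can_reach(zk, other) for other in ZK)
    return peers > len(ZK) // 2


def registered_in_zk(broker):
    """The broker stays registered while it reaches any ZK node with quorum."""
    return any(can_reach(broker, zk) and zk_has_quorum(zk) for zk in ZK)


def isr(leader, replicas):
    """Followers remain in sync only if they can fetch from the leader."""
    return [r for r in replicas if r == leader or can_reach(r, leader)]


replicas = ["brokerA", "brokerB", "brokerC"]
print(registered_in_zk("brokerA"))  # brokerA still looks alive to the controller
print(isr("brokerA", replicas))     # but its ISR has shrunk to itself
```

In this model brokerA keeps its ZK registration through zk1 while both 
followers fall out of the ISR, which matches the state described above: no 
controller-driven election, and no in-sync replica to fail over to.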

A faster recovery scenario could be for a majority of hosts (ZK nodes?) to 
realize that brokerA is unreachable and mark it as down, so that elections for 
the partitions it leads can be triggered. This avoids waiting indefinitely for 
the broker to come back or having to manually remove the broker from the 
cluster.
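
One way to picture the majority-based detection suggested above is the sketch 
below. It is purely hypothetical: Kafka 1.0.1 has no such mechanism, and the 
function name, the observer names, and the shape of the reports are assumptions 
made for illustration only.

```python
# Hypothetical majority vote on broker reachability. This is not an existing
# Kafka or ZooKeeper feature; names and data shapes are assumptions.
def should_mark_down(reports):
    """Return True if a strict majority of observers cannot reach the broker.

    reports maps an observer name to True (broker reachable from that
    observer) or False (broker unreachable from that observer).
    """
    unreachable = sum(1 for reachable in reports.values() if not reachable)
    return unreachable > len(reports) // 2


# Only the observer in brokerA's own datacenter can still reach it.
reports = {"zk1": True, "zk2": False, "zk3": False}
if should_mark_down(reports):
    print("mark brokerA as down and trigger leader elections")
```

Requiring a strict majority means a single observer with its own connectivity 
problem cannot cause a healthy broker to be marked down.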



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
