Tom Lee created KAFKA-2193:
------------------------------

             Summary: Intermittent network + DNS issues can cause brokers to permanently drop out of a cluster
                 Key: KAFKA-2193
                 URL: https://issues.apache.org/jira/browse/KAFKA-2193
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.8.1.1
            Reporter: Tom Lee


Our Kafka cluster recently experienced intermittent network and DNS 
resolution issues. As a result, this call to connect to ZooKeeper failed with 
an UnknownHostException:

(https://github.com/sgroschupf/zkclient/blob/0630c9c6e67ab49a51e80bfd939e4a0d01a69dfe/src/main/java/org/I0Itec/zkclient/ZkConnection.java#L67)

We observed this happen during a processStateChanged(KeeperState.Expired) call:

https://github.com/sgroschupf/zkclient/blob/0630c9c6e67ab49a51e80bfd939e4a0d01a69dfe/src/main/java/org/I0Itec/zkclient/ZkClient.java#L649

The session expiry was in turn caused by what we suspect were intermittent 
network issues.

The failed ZK reconnect seemed to put ZkClient into a state from which it 
would never recover, and the Kafka broker into a state where it needed a 
restart to rejoin the cluster. ZkConnection._zk == null, zkclient 0.3.x doesn't 
appear to make any further attempts to reconnect after the failure, and no 
further state transitions seem likely to happen without a connection to ZK.

The newer zkclient 0.4.0/0.5.0 releases will helpfully fire a notification when 
this occurs, so the brokers have an opportunity to handle this sort of failure 
in a more graceful manner (e.g. by trying to reconnect after some backoff 
period):

https://github.com/sgroschupf/zkclient/blob/0.4.0/src/main/java/org/I0Itec/zkclient/ZkClient.java#L461
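
To make the "reconnect after some backoff period" idea concrete, here is a 
minimal, library-independent sketch in Java. It is not zkclient's API and not a 
proposed Kafka patch: the Connector interface, the connectWithBackoff method 
and its parameters are all hypothetical names invented for illustration. A 
broker would presumably call something like this from whatever callback the 
newer zkclient fires when session establishment fails.

```java
import java.net.UnknownHostException;

public class ReconnectWithBackoff {

    /** Hypothetical stand-in for whatever re-establishes the ZK session. */
    public interface Connector {
        void connect() throws Exception;
    }

    /**
     * Calls connector.connect() until it succeeds or maxAttempts is reached.
     * Sleeps between attempts, doubling the delay each time up to maxDelayMs.
     * Returns the number of attempts used; rethrows the last failure if
     * every attempt fails.
     */
    public static int connectWithBackoff(Connector connector, int maxAttempts,
                                         long initialDelayMs, long maxDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                connector.connect();
                return attempt;
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                // e.g. UnknownHostException while DNS is flapping
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMs);
                    delayMs = Math.min(delayMs * 2, maxDelayMs);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulate DNS failing twice before resolution starts working again.
        final int[] calls = {0};
        Connector flaky = () -> {
            if (++calls[0] < 3) {
                throw new UnknownHostException("zk.example.invalid");
            }
        };
        int attempts = connectWithBackoff(flaky, 5, 100L, 1000L);
        System.out.println("connected after " + attempts + " attempts");
    }
}
```

The cap on the delay keeps a long outage from pushing the retry interval out 
indefinitely. Whether the broker should eventually give up and shut down, 
rather than retry forever, is a separate policy question that this sketch 
does not try to answer.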

Happy to provide more info here if I can.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
