[ https://issues.apache.org/jira/browse/KAFKA-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tom Lee updated KAFKA-2193:
---------------------------
    Description:
Our Kafka cluster recently experienced intermittent network and DNS resolution issues, such that this call to connect to ZooKeeper failed with an UnknownHostException:

https://github.com/sgroschupf/zkclient/blob/0630c9c6e67ab49a51e80bfd939e4a0d01a69dfe/src/main/java/org/I0Itec/zkclient/ZkConnection.java#L67

We observed this during a processStateChanged(KeeperState.Expired) call:

https://github.com/sgroschupf/zkclient/blob/0630c9c6e67ab49a51e80bfd939e4a0d01a69dfe/src/main/java/org/I0Itec/zkclient/ZkClient.java#L649

The session expiry was, in turn, caused by what we suspect were intermittent network issues.

The failed ZK reconnect seemed to put ZkClient into a state from which it would never recover, and the Kafka broker into a state where it needed a restart to rejoin the cluster: ZkConnection._zk == null, zkclient 0.3.x does not appear to make any further reconnect attempts after the failure, and no further state transitions seem likely to happen without a connection to ZK.

The newer zkclient 0.4.0/0.5.0 releases fire a notification when this occurs, so the brokers have an opportunity to handle this sort of failure more gracefully (e.g. by trying to reconnect after some backoff period):

https://github.com/sgroschupf/zkclient/blob/0.4.0/src/main/java/org/I0Itec/zkclient/ZkClient.java#L461

Happy to provide more info here if I can.
> Intermittent network + DNS issues can cause brokers to permanently drop out of a cluster
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-2193
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2193
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.1.1
>            Reporter: Tom Lee
>              Labels: broker
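
The "reconnect after some backoff period" idea from the description could look roughly like the sketch below. It is only an illustration under stated assumptions, not Kafka or zkclient code: `ReconnectWithBackoff`, `Connector`, `backoffMillis` and `connectWithRetry` are hypothetical names, and `Connector` stands in for whatever would re-create the ZooKeeper session once the broker is notified that session establishment failed. The retry count and delays are arbitrary.

```java
import java.io.IOException;
import java.net.UnknownHostException;

public class ReconnectWithBackoff {

    /** Hypothetical stand-in for the code that re-creates the ZooKeeper session. */
    public interface Connector {
        void connect() throws IOException;
    }

    /** Delay before the retry that follows 0-based attempt {@code attempt}: base * 2^attempt, capped at maxMs. */
    public static long backoffMillis(int attempt, long baseMs, long maxMs) {
        long delay = Math.min(baseMs, maxMs);
        for (int i = 0; i < attempt; i++) {
            if (delay > maxMs / 2) {
                return maxMs; // doubling again would exceed the cap
            }
            delay *= 2;
        }
        return delay;
    }

    /**
     * Calls connector.connect() until it succeeds or maxAttempts is reached,
     * sleeping with exponential backoff between failures.
     * Returns the number of attempts used; rethrows the last failure otherwise.
     */
    public static int connectWithRetry(Connector connector, int maxAttempts,
                                       long baseMs, long maxMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                connector.connect();
                return attempt + 1;
            } catch (IOException e) { // UnknownHostException is an IOException
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseMs, maxMs));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Simulated flaky DNS: the first two connection attempts fail.
        int[] calls = {0};
        Connector flaky = () -> {
            if (++calls[0] < 3) {
                throw new UnknownHostException("zookeeper.example.invalid");
            }
        };
        int used = connectWithRetry(flaky, 5, 10, 1000);
        System.out.println("connected after " + used + " attempt(s)");
    }
}
```

Capping the delay keeps a long outage from pushing the retry interval out indefinitely, while the attempt limit leaves room for the broker to fall back to a clean shutdown if ZooKeeper stays unreachable.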