Recently, we found a serious bug in ZkClient (in fact a bug in the underlying Apache ZooKeeper client) that can bring down brokers and consumers during a ZooKeeper push.
We run Kafka and ZooKeeper in an AWS EC2 environment. Each ZooKeeper instance is bound to an EIP to give it a static hostname, so when an EC2 instance is terminated and replaced with a new one, the new instance keeps the same hostname, but the private IP that the hostname resolves to can change.

The scenario is as follows: if we do a rolling push of all ZooKeeper server instances, terminating each one and waiting until its replacement joins the quorum before moving on to the next, ZkClient eventually tries to connect to the old IP addresses, which no longer exist, because of DNS caching on the Apache ZooKeeper client side. Please refer to https://issues.apache.org/jira/browse/ZOOKEEPER-338. As a result, we have to restart the Kafka brokers and consumers to refresh the DNS cache.

To solve this problem, I sent the following pull request to ZkClient: https://github.com/sgroschupf/zkclient/pull/26

Please review the PR. If a new version of ZkClient containing this fix is not released in time for the Kafka 0.8.2 release, I'd like Kafka to ship an internally built ZkClient with the fix. I would really appreciate it.

Thank you.

Best,
Jae
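
P.S. To make the failure mode concrete, here is a minimal Java sketch of the general idea: keep only hostnames and re-resolve them on every reconnect attempt, instead of reusing addresses that were resolved once at startup. This is not the code in the PR above. The class name `ReResolvingHostList` and its methods are hypothetical, the default port 2181 is assumed for illustration, and IPv6 literals are not handled.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: it stores the hostnames from a
// connect string and performs a fresh DNS lookup each time an address is needed.
final class ReResolvingHostList {
    // Entries are kept unresolved, so no IP address is pinned at construction time.
    private final List<InetSocketAddress> unresolved = new ArrayList<>();

    ReResolvingHostList(String connectString) {
        for (String hostPort : connectString.split(",")) {
            String[] parts = hostPort.trim().split(":");
            String host = parts[0];
            // Assumption: fall back to 2181 when no port is given.
            int port = parts.length > 1 ? Integer.parseInt(parts[1]) : 2181;
            unresolved.add(InetSocketAddress.createUnresolved(host, port));
        }
    }

    int size() {
        return unresolved.size();
    }

    // Called on every (re)connect attempt: the InetSocketAddress(String, int)
    // constructor tries to resolve the hostname again, so a changed DNS record
    // can be picked up without restarting the process.
    InetSocketAddress resolve(int attempt) {
        InetSocketAddress entry = unresolved.get(Math.floorMod(attempt, unresolved.size()));
        return new InetSocketAddress(entry.getHostString(), entry.getPort());
    }
}
```

Note that the JVM keeps its own lookup cache, controlled by the `networkaddress.cache.ttl` security property, so re-resolving in the client only helps once that cache entry has expired.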