On our production system I have observed problems sending messages to Kafka when one of the brokers is taken down (a controlled roll of an OpenShift pod containing the Kafka broker). The messages time out after 5 seconds: we have retries set up, but the 5000 millisecond wait in our Future.get() call expires before the retries are exhausted.

I have managed to reproduce this on my local machine with a cluster of 3 Kafka 0.10.2.0 brokers, one ZooKeeper and our app sending messages to the brokers. It seems to happen only when the broker taken down is the one that is the leader according to the ZooKeeper leader elector. The election happens pretty quickly, but then the rebalance takes 10 or so seconds. If the broker that is taken down is not the ZooKeeper elector leader, the rebalance happens quickly.

The log of the election is below for the 2 nodes that remained up. It shows that after the election there is a 10 to 12 second lag before the partition rebalance happens. I have also included the controller log for the node that did the successful znode creation.

logs2

[2017-06-13 14:14:28,273] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2017-06-13 14:14:28,275] INFO Result of znode creation is: OK (kafka.utils.ZKCheckedEphemeral)
[2017-06-13 14:14:28,276] INFO 2 successfully elected as leader (kafka.server.ZookeeperLeaderElector)
[2017-06-13 14:14:28,571] INFO New leader is 2 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2017-06-13 14:14:40,761] INFO Partition [__consumer_offsets,30] on broker 2: Shrinking ISR for partition [__consumer_offsets,30] from 2,0,3 to 2,0 (kafka.cluster.Partition)

logs

[2017-06-13 14:14:28,277] INFO Creating /controller (is it secure? false) (kafka.utils.ZKCheckedEphemeral)
[2017-06-13 14:14:28,279] INFO Result of znode creation is: NODEEXISTS (kafka.utils.ZKCheckedEphemeral)
[2017-06-13 14:14:28,284] INFO New leader is 2 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2017-06-13 14:14:38,518] INFO Partition [__consumer_offsets,17] on broker 0: Shrinking ISR for partition [__consumer_offsets,17] from 2,0,3 to 2,0 (kafka.cluster.Partition)
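
To make the 10 to 12 second figure concrete, here is a small sketch that computes the gap between the election line and the first ISR shrink line in each of the two logs above. The class and method names (ElectionLag, timestampOf, lagMillis) are made up for illustration, and it assumes every log line starts with a bracketed timestamp in the format shown in the logs.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Kafka: measures the time between two broker log lines.
public class ElectionLag {
    // Timestamp layout as it appears in the log lines quoted in this post.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    // Extracts the leading "[...]" timestamp from a log line.
    public static LocalDateTime timestampOf(String logLine) {
        int open = logLine.indexOf('[');
        int close = logLine.indexOf(']', open);
        return LocalDateTime.parse(logLine.substring(open + 1, close), TS);
    }

    // Milliseconds elapsed between the earlier and the later log line.
    public static long lagMillis(String earlier, String later) {
        return Duration.between(timestampOf(earlier), timestampOf(later)).toMillis();
    }

    public static void main(String[] args) {
        // Broker 2 (logs2): elected as leader, then first ISR shrink.
        String elected2 = "[2017-06-13 14:14:28,276] INFO 2 successfully elected as leader (kafka.server.ZookeeperLeaderElector)";
        String shrink2 = "[2017-06-13 14:14:40,761] INFO Partition [__consumer_offsets,30] on broker 2: Shrinking ISR for partition [__consumer_offsets,30] from 2,0,3 to 2,0 (kafka.cluster.Partition)";

        // Broker 0 (logs): sees the new leader, then first ISR shrink.
        String newLeader0 = "[2017-06-13 14:14:28,284] INFO New leader is 2 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)";
        String shrink0 = "[2017-06-13 14:14:38,518] INFO Partition [__consumer_offsets,17] on broker 0: Shrinking ISR for partition [__consumer_offsets,17] from 2,0,3 to 2,0 (kafka.cluster.Partition)";

        System.out.println("broker 2 lag: " + lagMillis(elected2, shrink2) + " ms"); // broker 2 lag: 12485 ms
        System.out.println("broker 0 lag: " + lagMillis(newLeader0, shrink0) + " ms"); // broker 0 lag: 10234 ms
    }
}
```

Both gaps are well over the 5000 millisecond Future.get() wait, which is consistent with the sends timing out before the retries can succeed.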
Attachment: controller.log.2017-06-13-14.gz (GNU Zip compressed data)
Are there any parameters that we can change to lower this time, or are there any other possible improvements that could bring it down?

Thanks,
Tom