I deployed Kafka several weeks ago, and so far I cannot get the cluster to stay healthy for more than a day!
Every time, the cluster works fine for several hours: I can send and receive messages with producers and consumers. Then, at some point within a day, the active controller count goes to 2 and stays there until I restart the Kafka cluster.

*Setup:*

1) 3 ZooKeeper Docker containers on one host.
2) 3 Kafka Docker containers on a second, single host.

Our traffic is very low right now, so we did not want to pay for 6 separate hosts.

*Timeline (see logs at the end):*

1) Broker 3 is the active controller after a restart on 06/18.
2) Everything works fine for at least 12+ hours.
3) Broker 3 stops writing to controller.log at 06:35:28 UTC on 06/19. From broker 3's server.log, I can see that starting at 06:30 UTC on 06/19, broker 3 is no longer able to talk to ZooKeeper.
4) Almost 6 hours later, at 12:31:25 UTC on 06/19, broker 2 also becomes the active controller, yet according to the metrics broker 3 somehow continues to report itself as controller as well (how I check this is shown after the logs).

*Questions:*

1) Why would the setup work fine for hours and then consistently get into this state? Since the connections to ZooKeeper are getting reset, is there something on the ZooKeeper configuration side I should look into? Anything else to check? (The settings I plan to verify are in the snippet after the logs.)
2) If there is a networking issue and the controller broker is unhealthy, shouldn't the system at least stabilize, with another broker becoming controller and the unhealthy one getting kicked out? Why does the active controller count never go back to 1?
3) Any other thoughts?

Thanks in advance!

*broker2 controller.log:*

[2018-06-18 15:50:21,159] INFO [controller-event-thread]: Starting (kafka.controller.ControllerEventManager$ControllerEventThread)
[2018-06-18 15:50:21,200] DEBUG [Controller id=2] Broker 3 has been elected as the controller, so stopping the election process. (kafka.controller.KafkaController)
[2018-06-19 12:31:25,426] INFO [controller-event-thread]: Starting (kafka.controller.ControllerEventManager$ControllerEventThread)
[2018-06-19 12:31:25,770] INFO [Controller id=2] 2 successfully elected as the controller (kafka.controller.KafkaController)
[2018-06-19 12:31:25,771] INFO [Controller id=2] Starting become controller state transition (kafka.controller.KafkaController)
[2018-06-19 12:31:25,810] INFO [Controller id=2] Initialized controller epoch to 44 and zk version 43 (kafka.controller.KafkaController)
[2018-06-19 12:31:26,171] INFO [Controller id=2] Incremented epoch to 45 (kafka.controller.KafkaController)
[2018-06-19 12:31:26,847] DEBUG [Controller id=2] Registering IsrChangeNotificationListener (kafka.controller.KafkaController)
[2018-06-19 12:31:26,940] DEBUG [Controller id=2] Registering logDirEventNotificationListener (kafka.controller.KafkaController)
[2018-06-19 12:31:27,487] INFO [Controller id=2] Partitions being reassigned: Map() (kafka.controller.KafkaController)
[2018-06-19 12:31:27,547] INFO [Controller id=2] Partitions already reassigned: Set() (kafka.controller.KafkaController)
[2018-06-19 12:31:27,548] INFO [Controller id=2] Resuming reassignment of partitions: Map() (kafka.controller.KafkaController)
[2018-06-19 12:31:27,554] INFO [Controller id=2] Currently active brokers in the cluster: Set() (kafka.controller.KafkaController)
[2018-06-19 12:31:27,555] INFO [Controller id=2] Currently shutting brokers in the cluster: Set() (kafka.controller.KafkaController)

*broker3 controller.log (nothing after the last line):*

[2018-06-19 06:35:28,729] DEBUG [Controller id=3] Topics not in preferred replica Map() (kafka.controller.KafkaController)
[2018-06-19 06:35:28,729] TRACE [Controller id=3] Leader imbalance ratio for broker 2 is 0.0 (kafka.controller.KafkaController)
[2018-06-19 06:35:28,729] DEBUG [Controller id=3] Topics not in preferred replica Map() (kafka.controller.KafkaController)
[2018-06-19 06:35:28,729] TRACE [Controller id=3] Leader imbalance ratio for broker 1 is 0.0 (kafka.controller.KafkaController)
[2018-06-19 06:35:28,729] DEBUG [Controller id=3] Topics not in preferred replica Map() (kafka.controller.KafkaController)
[2018-06-19 06:35:28,729] TRACE [Controller id=3] Leader imbalance ratio for broker 3 is 0.0 (kafka.controller.KafkaController)

*broker3 server.log (ZooKeeper client timeouts):*

[2018-06-19 06:30:03,673] WARN Client session timed out, have not heard from server in 4215ms for sessionid 0x363b74754d10026 (org.apache.zookeeper.ClientCnxn)
[2018-06-19 06:30:12,940] INFO Client session timed out, have not heard from server in 4215ms for sessionid 0x363b74754d10026, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-06-19 06:30:20,251] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2018-06-19 06:30:20,925] INFO [GroupMetadataManager brokerId=3] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
[2018-06-19 06:30:39,646] INFO Opening socket connection to server ip-172-31-26-177.us-west-2.compute.internal/172.31.26.177:32181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2018-06-19 06:30:57,159] INFO Socket connection established to ip-172-31-26-177.us-west-2.compute.internal/172.31.26.177:32181, initiating session (org.apache.zookeeper.ClientCnxn)
[2018-06-19 06:32:12,477] WARN Client session timed out, have not heard from server in 68844ms for sessionid 0x363b74754d10026 (org.apache.zookeeper.ClientCnxn)
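
In case it helps, this is roughly how I have been confirming the controller state. The "active controller count" above comes from the kafka.controller:type=KafkaController,name=ActiveControllerCount JMX metric on each broker; the commands below just read the controller znode from ZooKeeper directly (zk-host is a placeholder, 32181 is the ZooKeeper client port that shows up in broker 3's server.log):

    # /controller holds JSON naming the broker id that currently owns the
    # controller role; /controller_epoch holds the current epoch number.
    bin/zookeeper-shell.sh zk-host:32181 get /controller
    bin/zookeeper-shell.sh zk-host:32181 get /controller_epoch

The epoch going from 44 to 45 in broker 2's controller.log does look like a fresh election on 06/19, which is why the two-controller metric reading confuses me.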
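
For question 1, these are the broker-side ZooKeeper settings I was planning to double-check. The property names are the standard Kafka ones, but the file path and the default value noted in the comments are assumptions about our particular image and version:

    # Print the ZooKeeper-related settings from each broker's config.
    # /etc/kafka/server.properties is where the config lives in our containers
    # (assumption - adjust for your image). I believe zookeeper.session.timeout.ms
    # defaults to 6000 ms in this Kafka version, which would line up with the
    # few-second client timeouts in broker 3's server.log above.
    grep -E 'zookeeper\.(connect|session\.timeout\.ms|connection\.timeout\.ms)' \
      /etc/kafka/server.properties

If raising zookeeper.session.timeout.ms is the usual mitigation for flaky connectivity, I can try that, but I would still like to understand why the active controller count never drops back to 1.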