We deployed Kafka several weeks ago, and so far I cannot get the cluster to
stay healthy for more than a day!

Every time, things work fine for several hours: I can send and receive
messages with producers and consumers. But within a day the active
controller count goes to 2 and stays there until I restart the Kafka
cluster.

Setup:
1) 3 ZooKeeper Docker containers on one host.
2) 3 Kafka Docker containers on a second host. Our traffic is very low
right now, so we did not want to pay for 6 separate hosts (a rough sketch
of the ZooKeeper ensemble config is below).
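
For reference, each ZooKeeper container runs with an ensemble config along
these lines. This is a sketch from memory, not a copy of our actual file;
the hostnames, ports, and paths are placeholders:

  tickTime=2000
  initLimit=10
  syncLimit=5
  dataDir=/var/lib/zookeeper
  clientPort=2181
  # all three servers are containers on the same physical host
  server.1=zk1:2888:3888
  server.2=zk2:2888:3888
  server.3=zk3:2888:3888

As far as I understand, with tickTime=2000 the servers will bound client
session timeouts between 4 s (2 x tickTime) and 40 s (20 x tickTime) by
default.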

Timeline (see logs at the end):
1) Broker 3 is the active controller after the restart on 06/18.
2) Everything works fine for 12+ hours.
3) Broker 3's controller.log goes silent after 06:35:28 UTC on 06/19. From
broker 3's server.log I can see that, starting around 06:30 UTC on 06/19,
broker 3 is no longer able to talk to ZooKeeper.
4) Almost 6 hours later, at 12:31:25 UTC on 06/19, broker 2 also becomes
the active controller. Based on the ActiveControllerCount metric, broker 3
somehow continues to report itself as controller as well (how I read that
metric is sketched below).
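
In case it matters, this is roughly how I read that metric: a tiny JMX
client pointed at each broker in turn. The JMX port and hostnames are
placeholders for whatever the containers expose, not our exact values:

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class ControllerCountCheck {
      public static void main(String[] args) throws Exception {
          // args[0] is "host:port" of a broker's JMX endpoint, e.g. kafka1:9999
          JMXServiceURL url = new JMXServiceURL(
              "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
          try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
              MBeanServerConnection conn = connector.getMBeanServerConnection();
              ObjectName gauge = new ObjectName(
                  "kafka.controller:type=KafkaController,name=ActiveControllerCount");
              // 1 on the broker that believes it is the controller, 0 elsewhere;
              // summed across all brokers this should always be exactly 1
              System.out.println(args[0] + " ActiveControllerCount = "
                  + conn.getAttribute(gauge, "Value"));
          }
      }
  }

Summed across the three brokers this currently comes out to 2, because both
broker 2 and broker 3 report 1.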

Questions:
1) Why would the setup work fine for hours and then consistently get into
this state? Is there something on the ZooKeeper configuration side I should
look into, given that the connections to ZooKeeper keep timing out and
getting reset? Anything else to check? (The timeout-related settings I
think are in play are listed after these questions.)
2) If there is some networking issue and the controller broker is
unhealthy, shouldn't the system at least stabilize, with another broker
becoming controller and the unhealthy one getting kicked out of the
cluster? Why does the active controller count never go back to 1?
3) Any other thoughts?
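
For completeness, these are the broker-side ZooKeeper settings
(server.properties) I believe are in play. We have not overridden them as
far as I know, so the values below are the defaults rather than anything we
tuned:

  zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
  # ZooKeeper session timeout; 6000 ms is the Kafka default
  zookeeper.session.timeout.ms=6000
  # connection timeout; falls back to the session timeout when unset
  zookeeper.connection.timeout.ms=6000

If I am reading the ZooKeeper client behavior right, the "have not heard
from server in 4215ms" warning in broker 3's log is consistent with a ~6 s
session timeout.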


Thanks in advance!


*broker2 controller.log:*
[2018-06-18 15:50:21,159] INFO [controller-event-thread]: Starting
(kafka.controller.ControllerEventManager$ControllerEventThread)
[2018-06-18 15:50:21,200] DEBUG [Controller id=2] Broker 3 has been elected
as the controller, so stopping the election process.
(kafka.controller.KafkaController)
[2018-06-19 12:31:25,426] INFO [controller-event-thread]: Starting
(kafka.controller.ControllerEventManager$ControllerEventThread)
[2018-06-19 12:31:25,770] INFO [Controller id=2] 2 successfully elected as
the controller (kafka.controller.KafkaController)
[2018-06-19 12:31:25,771] INFO [Controller id=2] Starting become controller
state transition (kafka.controller.KafkaController)
[2018-06-19 12:31:25,810] INFO [Controller id=2] Initialized controller
epoch to 44 and zk version 43 (kafka.controller.KafkaController)
[2018-06-19 12:31:26,171] INFO [Controller id=2] Incremented epoch to 45
(kafka.controller.KafkaController)
[2018-06-19 12:31:26,847] DEBUG [Controller id=2] Registering
IsrChangeNotificationListener (kafka.controller.KafkaController)
[2018-06-19 12:31:26,940] DEBUG [Controller id=2] Registering
logDirEventNotificationListener (kafka.controller.KafkaController)
[2018-06-19 12:31:27,487] INFO [Controller id=2] Partitions being
reassigned: Map() (kafka.controller.KafkaController)
[2018-06-19 12:31:27,547] INFO [Controller id=2] Partitions already
reassigned: Set() (kafka.controller.KafkaController)
[2018-06-19 12:31:27,548] INFO [Controller id=2] Resuming reassignment of
partitions: Map() (kafka.controller.KafkaController)
[2018-06-19 12:31:27,554] INFO [Controller id=2] Currently active brokers
in the cluster: Set() (kafka.controller.KafkaController)
[2018-06-19 12:31:27,555] INFO [Controller id=2] Currently shutting brokers
in the cluster: Set() (kafka.controller.KafkaController)

*broker3 controller.log (nothing logged after the last line shown):*
[2018-06-19 06:35:28,729] DEBUG [Controller id=3] Topics not in preferred
replica Map() (kafka.controller.KafkaController)
[2018-06-19 06:35:28,729] TRACE [Controller id=3] Leader imbalance ratio
for broker 2 is 0.0 (kafka.controller.KafkaController)
[2018-06-19 06:35:28,729] DEBUG [Controller id=3] Topics not in preferred
replica Map() (kafka.controller.KafkaController)
[2018-06-19 06:35:28,729] TRACE [Controller id=3] Leader imbalance ratio
for broker 1 is 0.0 (kafka.controller.KafkaController)
[2018-06-19 06:35:28,729] DEBUG [Controller id=3] Topics not in preferred
replica Map() (kafka.controller.KafkaController)
[2018-06-19 06:35:28,729] TRACE [Controller id=3] Leader imbalance ratio
for broker 3 is 0.0 (kafka.controller.KafkaController)

*broker3 server.log (zookeeper client timeouts):*
[2018-06-19 06:30:03,673] WARN Client session timed out, have not heard
from server in 4215ms for sessionid 0x363b74754d10026
(org.apache.zookeeper.ClientCnxn)
[2018-06-19 06:30:12,940] INFO Client session timed out, have not heard
from server in 4215ms for sessionid 0x363b74754d10026, closing socket
connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-06-19 06:30:20,251] INFO zookeeper state changed (Disconnected)
(org.I0Itec.zkclient.ZkClient)
[2018-06-19 06:30:20,925] INFO [GroupMetadataManager brokerId=3] Removed 0
expired offsets in 0 milliseconds.
(kafka.coordinator.group.GroupMetadataManager)
[2018-06-19 06:30:39,646] INFO Opening socket connection to server
ip-172-31-26-177.us-west-2.compute.internal/172.31.26.177:32181. Will not
attempt to authenticate using SASL (unknown error)
(org.apache.zookeeper.ClientCnxn)
[2018-06-19 06:30:57,159] INFO Socket connection established to
ip-172-31-26-177.us-west-2.compute.internal/172.31.26.177:32181, initiating
session (org.apache.zookeeper.ClientCnxn)
[2018-06-19 06:32:12,477] WARN Client session timed out, have not heard
from server in 68844ms for sessionid 0x363b74754d10026
(org.apache.zookeeper.ClientCnxn)
