I'm running into a problem with a 3-broker cluster where, intermittently, one broker's controller begins to report that it cannot connect to the other brokers and repeatedly logs the failure.
Each broker runs in its own Docker container, and each container is on a separate machine. The containers expose port 9092, which I think is sufficient for operation, but I'm not sure.

The log messages are:

[2017-04-27 17:16:28,985] WARN [Controller-3-to-broker-2-send-thread], Controller 3's connection to broker 64174aa85d04:9092 (id: 2 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to 64174aa85d04:9092 (id: 2 rack: null) failed
        at kafka.utils.NetworkClientBlockingOps$.awaitReady$1(NetworkClientBlockingOps.scala:84)
        at kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:94)
        at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:232)
        at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:185)
        at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:184)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2017-04-27 17:16:28,986] WARN [Controller-3-to-broker-1-send-thread], Controller 3's connection to broker d4b8943ad4b5:9092 (id: 1 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.io.IOException: Connection to d4b8943ad4b5:9092 (id: 1 rack: null) failed
        at kafka.utils.NetworkClientBlockingOps$.awaitReady$1(NetworkClientBlockingOps.scala:84)
        at kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:94)
        at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:232)
        at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:185)
        at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:184)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

This is Kafka 0.10.2.0 (the Scala 2.12 build, 2.12-0.10.2.0).

I'm wondering:

1. How do we figure out the cause of the connection failures?
2. What is the controller, anyway?
3. Are there any command-line diagnostic tools for inspecting the health of the system?

Thanks for any help,
Chuck