Hi everyone,

Recently we had a cluster in which the controller failed to connect to one broker (broker A) for an extended period of time. I had expected the controller to identify broker A as failed and elect another broker as leader for the partitions hosted on broker A. However, that did not happen in this cluster. Instead, broker A was still considered the leader for some partitions, and those partitions were marked as under-replicated.

Is there any configuration setting in Kafka to speed up broker failure detection?

[2018-01-24 14:13:57,132] WARN [Controller-37-to-broker-4-send-thread], Controller 37's connection to broker testkafka04:9092 (id: 4 rack: null) was unsuccessful (kafka.controller.RequestSendThread)
java.net.SocketTimeoutException: Failed to connect within 30000 ms
    at kafka.controller.RequestSendThread.brokerReady(ControllerChannelManager.scala:231)
    at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:182)
    at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:181)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)

Thanks!

Regards,
-Yu