[ https://issues.apache.org/jira/browse/KAFKA-9359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lucas Bradstreet updated KAFKA-9359:
------------------------------------
    Description:
When a broker is shut down, it first attempts a controlled shutdown, resigning leadership of its partitions. It then stops the socket server from processing requests and shuts down the various data plane and control plane handlers and processors:

{noformat}
if (socketServer != null)
  CoreUtils.swallow(socketServer.stopProcessingRequests(), this)
if (dataPlaneRequestHandlerPool != null)
  CoreUtils.swallow(dataPlaneRequestHandlerPool.shutdown(), this)
if (controlPlaneRequestHandlerPool != null)
  CoreUtils.swallow(controlPlaneRequestHandlerPool.shutdown(), this)
if (kafkaScheduler != null)
  CoreUtils.swallow(kafkaScheduler.shutdown(), this)
if (dataPlaneRequestProcessor != null)
  CoreUtils.swallow(dataPlaneRequestProcessor.close(), this)
if (controlPlaneRequestProcessor != null)
  CoreUtils.swallow(controlPlaneRequestProcessor.close(), this)
{noformat}

The kafkaController component is only shut down much later, after closing the logManager, a process which may take some time because closing logs requires checkpointing state and flushing segments. If the broker being shut down is the controller, this means there is a potentially large window in which no controller is processing controller requests. The controller resigns leadership only when the controller component is shut down and the zkClient is closed.

There is a second problem: a broker that does not successfully complete a controlled shutdown also remains the leader for its partitions until the zkClient is shut down, and that window can be large because of the aforementioned log manager shutdown.

It would be ideal if:
# controller leadership is resigned early in the shutdown process, before request handling is stopped. Care will have to be taken so that the broker in question cannot regain it.
# we can reduce the window between an uncontrolled shutdown and resigning leadership of partitions through the zkClient close failsafe.

See also https://issues.apache.org/jira/browse/KAFKA-9358

  was:
When a broker is shut down, it stops accepting requests because its socket server and handler pools are shut down immediately. It does so before shutting down the controller and closing the log manager, which may take some time to complete. During this time it remains the controller, as the zkClient has not been closed. We should improve the shutdown process so that a broker does not remain the controller while it is unable to accept the requests expected of a controller.

See also https://issues.apache.org/jira/browse/KAFKA-9358

> Controller does not handle requests while broker is being shutdown
> ------------------------------------------------------------------
>
>                 Key: KAFKA-9359
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9359
>             Project: Kafka
>          Issue Type: Improvement
>          Components: controller, core
>            Reporter: Lucas Bradstreet
>            Priority: Major
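
The reordering proposed in the description could look roughly like the sketch below. It is a self-contained model, not Kafka code: the object ShutdownOrderSketch, the step names, and in particular "disableControllerReelection" and "resignControllerLeadership" are hypothetical placeholders for whatever mechanism would resign the controller role and keep the broker from regaining it. Placing the resignation after the controlled shutdown attempt is also an assumption; the issue only asks that it happen before request handling is stopped.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical model of the proposed shutdown order; none of these names are Kafka APIs.
object ShutdownOrderSketch {

  // Same intent as CoreUtils.swallow in the description: a failing step is
  // reported and the remaining steps still run.
  def swallow(step: => Unit): Unit =
    try step
    catch { case e: Exception => System.err.println(s"ignored: ${e.getMessage}") }

  // Runs the steps in the proposed order and returns their names in the order they ran.
  def shutdown(isController: Boolean): List[String] = {
    val ran = ArrayBuffer.empty[String]
    def step(name: String): Unit = swallow { ran += name; () }

    step("controlledShutdown")
    if (isController) {
      // Assumption: some guard stops this broker from being re-elected
      // before it gives up the controller role.
      step("disableControllerReelection")
      step("resignControllerLeadership")
    }
    // Request handling stops only after the controller role has been given up.
    step("socketServer.stopProcessingRequests")
    step("requestHandlerPools.shutdown")
    // The slow part (checkpointing, flushing segments) no longer delays resignation.
    step("logManager.shutdown")
    step("zkClient.close")
    ran.toList
  }
}
```

Calling ShutdownOrderSketch.shutdown(isController = true) returns the step names in execution order, which makes it easy to check that the resignation precedes both the socket server stop and the log manager shutdown.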