[ https://issues.apache.org/jira/browse/KAFKA-14287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luke Chen updated KAFKA-14287:
------------------------------
    Description: 
Multiple nodes in KRaft combined mode (i.e. process.roles='broker,controller') can start up successfully. When shutting down in combined mode, we'll unfence the broker first. When fewer controller nodes remain than the quorum size (i.e. N / 2 + 1), the unfence record will not get committed to the metadata topic. So the broker will keep waiting for the response granting the shutdown, and then fail with a timeout error:

{code:java}
2022-10-11 18:01:14,341] ERROR [kafka-raft-io-thread]: Graceful shutdown of RaftClient failed (kafka.raft.KafkaRaftManager$RaftIoThread)
java.util.concurrent.TimeoutException: Timeout expired before graceful shutdown completed
    at org.apache.kafka.raft.KafkaRaftClient$GracefulShutdown.failWithTimeout(KafkaRaftClient.java:2408)
    at org.apache.kafka.raft.KafkaRaftClient.maybeCompleteShutdown(KafkaRaftClient.java:2163)
    at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:2230)
    at kafka.raft.KafkaRaftManager$RaftIoThread.doWork(RaftManager.scala:52)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
{code}

To reproduce:
# Start up 2 KRaft combined nodes, so both nodes are needed to form a quorum.
# Shut down either node. This shutdown succeeds because both controllers are still alive when the broker shuts down, so the broker can be granted the shutdown.
# Shut down the 2nd node. This time the shutdown stays pending and then times out.

  was:
Multiple nodes in KRaft combined mode (i.e. process.roles='broker,controller') can start up successfully. When shutting down in combined mode, we'll unfence the broker first. When fewer controller nodes remain than the quorum size (i.e. N / 2 + 1), the unfence record will not get committed to the metadata topic.
So the broker will keep waiting for the response granting the shutdown, and then fail with a timeout error:

{code:java}
2022-10-11 18:01:14,341] ERROR [kafka-raft-io-thread]: Graceful shutdown of RaftClient failed (kafka.raft.KafkaRaftManager$RaftIoThread)
java.util.concurrent.TimeoutException: Timeout expired before graceful shutdown completed
    at org.apache.kafka.raft.KafkaRaftClient$GracefulShutdown.failWithTimeout(KafkaRaftClient.java:2408)
    at org.apache.kafka.raft.KafkaRaftClient.maybeCompleteShutdown(KafkaRaftClient.java:2163)
    at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:2230)
    at kafka.raft.KafkaRaftManager$RaftIoThread.doWork(RaftManager.scala:52)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
{code}


> Multi noded with kraft combined mode will fail shutdown
> -------------------------------------------------------
>
>                 Key: KAFKA-14287
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14287
>             Project: Kafka
>          Issue Type: Bug
>          Components: kraft
>    Affects Versions: 3.3.1
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>
> Multiple nodes in KRaft combined mode (i.e.
> process.roles='broker,controller') can start up successfully. When shutting
> down in combined mode, we'll unfence the broker first. When fewer controller
> nodes remain than the quorum size (i.e. N / 2 + 1), the unfence record will
> not get committed to the metadata topic.
> So the broker will keep waiting for the response granting the shutdown, and
> then fail with a timeout error:
>
> {code:java}
> 2022-10-11 18:01:14,341] ERROR [kafka-raft-io-thread]: Graceful shutdown of RaftClient failed (kafka.raft.KafkaRaftManager$RaftIoThread)
> java.util.concurrent.TimeoutException: Timeout expired before graceful shutdown completed
>     at org.apache.kafka.raft.KafkaRaftClient$GracefulShutdown.failWithTimeout(KafkaRaftClient.java:2408)
>     at org.apache.kafka.raft.KafkaRaftClient.maybeCompleteShutdown(KafkaRaftClient.java:2163)
>     at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:2230)
>     at kafka.raft.KafkaRaftManager$RaftIoThread.doWork(RaftManager.scala:52)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
> {code}
>
> To reproduce:
> # Start up 2 KRaft combined nodes, so both nodes are needed to form a quorum.
> # Shut down either node. This shutdown succeeds because both controllers are
> still alive when the broker shuts down, so the broker can be granted the
> shutdown.
> # Shut down the 2nd node. This time the shutdown stays pending and then times
> out.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
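
The quorum arithmetic behind the reproduction steps in the description can be sketched as below. This is only an illustration of the N / 2 + 1 rule applied to a 2-node combined cluster; the class and method names (QuorumSketch, majority, canCommit) are hypothetical and are not part of Kafka.

```java
// Illustrative sketch only: QuorumSketch, majority and canCommit are
// hypothetical names, not Kafka APIs.
final class QuorumSketch {
    private QuorumSketch() {}

    /** Quorum size for the given number of voters: N / 2 + 1 (integer division). */
    static int majority(int voters) {
        return voters / 2 + 1;
    }

    /** True if the controllers still alive are enough to commit a new metadata record. */
    static boolean canCommit(int voters, int alive) {
        return alive >= majority(voters);
    }

    public static void main(String[] args) {
        int voters = 2; // two combined broker,controller nodes

        // First node asks to shut down while both controllers are alive.
        System.out.println(canCommit(voters, 2)); // prints true

        // Second node asks to shut down with only one controller left.
        System.out.println(canCommit(voters, 1)); // prints false
    }
}
```

With 2 voters the quorum size is also 2, so once the first node is gone the remaining controller can no longer commit records, which matches the pending shutdown and timeout described in the report.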