José Armando García Sancio created KAFKA-14693:
--------------------------------------------------

             Summary: KRaft Controller and ProcessExitingFaultHandler can deadlock shutdown
                 Key: KAFKA-14693
                 URL: https://issues.apache.org/jira/browse/KAFKA-14693
             Project: Kafka
          Issue Type: Bug
          Components: controller
    Affects Versions: 3.4.0
            Reporter: José Armando García Sancio
             Fix For: 3.4.1


h1. Problem

When the KRaft controller encounters an error that it cannot handle, it calls `ProcessExitingFaultHandler`, which calls `Exit.exit`, which in turn calls `Runtime.exit`. According to the `Runtime.exit` documentation:

> All registered [shutdown hooks|https://docs.oracle.com/javase/8/docs/api/java/lang/Runtime.html#addShutdownHook-java.lang.Thread-], if any, are started in some unspecified order and allowed to run concurrently until they finish. Once this is done the virtual machine [halts|https://docs.oracle.com/javase/8/docs/api/java/lang/Runtime.html#halt-int-].

One of the shutdown hooks registered by Kafka is `Server.shutdown()`. This shutdown hook eventually calls `KafkaEventQueue.close`, which joins on the controller thread. Unfortunately, the controller thread is the one that called `Runtime.exit`, so it is itself joined on the shutdown hook thread, waiting for the hook to finish. Each thread waits for the other and the process never exits.
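The cycle described above can be reproduced in miniature without calling `Runtime.exit`. The following is only a sketch under stated assumptions: the class `ShutdownDeadlockSketch`, the method `simulate`, and the thread names are hypothetical and are not Kafka code. The "controller" thread starts the "hook" thread and joins it (standing in for what `Runtime.exit` does with shutdown hooks), while the hook joins the controller (standing in for `KafkaEventQueue.close`). A timed join detects that neither thread finishes, and interrupts then break the cycle so the program can return.

```java
class ShutdownDeadlockSketch {

    /**
     * Returns true if both threads were still waiting on each other after
     * timeoutMs milliseconds. timeoutMs must be greater than zero, because
     * Thread.join(0) waits forever.
     */
    static boolean simulate(long timeoutMs) {
        final Thread[] controller = new Thread[1];

        // Stand-in for the shutdown hook: it waits for the controller thread,
        // as KafkaEventQueue.close does in the report.
        Thread hook = new Thread(() -> {
            try {
                controller[0].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sketch-shutdown-hook");

        // Stand-in for the controller thread calling Runtime.exit: it starts
        // the hook and then waits for the hook to finish.
        controller[0] = new Thread(() -> {
            hook.start();
            try {
                hook.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sketch-controller");

        controller[0].start();
        try {
            controller[0].join(timeoutMs);
            boolean stuck = controller[0].isAlive() && hook.isAlive();

            // Break the cycle so the sketch itself terminates.
            controller[0].interrupt();
            hook.interrupt();
            controller[0].join();
            hook.join();
            return stuck;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + simulate(500));
    }
}
```

In the real JVM there is no outside thread to interrupt the pair, which is why the broker hangs instead of exiting.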
Here are sample thread stacks from the two threads involved. The controller thread is blocked inside `Runtime.exit`, waiting for the shutdown hooks:
{code:java}
"QuorumControllerEventHandler" #45 prio=5 os_prio=0 cpu=429352.87ms elapsed=620807.49s allocated=38544M defined_classes=353 tid=0x00007f5aeb31f800 nid=0x80c in Object.wait()  [0x00007f5a658fb000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(java.base@17.0.5/Native Method)
        - waiting on <no object reference available>
        at java.lang.Thread.join(java.base@17.0.5/Thread.java:1304)
        - locked <0x00000000a29241f8> (a org.apache.kafka.common.utils.KafkaThread)
        at java.lang.Thread.join(java.base@17.0.5/Thread.java:1372)
        at java.lang.ApplicationShutdownHooks.runHooks(java.base@17.0.5/ApplicationShutdownHooks.java:107)
        at java.lang.ApplicationShutdownHooks$1.run(java.base@17.0.5/ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(java.base@17.0.5/Shutdown.java:130)
        at java.lang.Shutdown.exit(java.base@17.0.5/Shutdown.java:173)
        - locked <0x00000000ffe020b8> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(java.base@17.0.5/Runtime.java:115)
        at java.lang.System.exit(java.base@17.0.5/System.java:1860)
        at org.apache.kafka.common.utils.Exit$2.execute(Exit.java:43)
        at org.apache.kafka.common.utils.Exit.exit(Exit.java:66)
        at org.apache.kafka.common.utils.Exit.exit(Exit.java:62)
        at org.apache.kafka.server.fault.ProcessExitingFaultHandler.handleFault(ProcessExitingFaultHandler.java:54)
        at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:891)
        at org.apache.kafka.controller.QuorumController$ControllerWriteEvent$1.apply(QuorumController.java:874)
        at org.apache.kafka.controller.QuorumController.appendRecords(QuorumController.java:969)
{code}
and the shutdown hook thread is blocked in `KafkaEventQueue.close`, waiting for the controller thread:
{code:java}
"kafka-shutdown-hook" #35 prio=5 os_prio=0 cpu=43.42ms elapsed=378593.04s allocated=4732K defined_classes=74 tid=0x00007f5a7c09d800 nid=0x4f37 in Object.wait()  [0x00007f5a47afd000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(java.base@17.0.5/Native Method)
        - waiting on <no object reference available>
        at java.lang.Thread.join(java.base@17.0.5/Thread.java:1304)
        - locked <0x00000000a272bcb0> (a org.apache.kafka.common.utils.KafkaThread)
        at java.lang.Thread.join(java.base@17.0.5/Thread.java:1372)
        at org.apache.kafka.queue.KafkaEventQueue.close(KafkaEventQueue.java:509)
        at org.apache.kafka.controller.QuorumController.close(QuorumController.java:2553)
        at kafka.server.ControllerServer.shutdown(ControllerServer.scala:521)
        at kafka.server.KafkaRaftServer.shutdown(KafkaRaftServer.scala:184)
        at kafka.Kafka$.$anonfun$main$3(Kafka.scala:99)
        at kafka.Kafka$$$Lambda$406/0x0000000800fb9730.apply$mcV$sp(Unknown Source)
        at kafka.utils.Exit$.$anonfun$addShutdownHook$1(Exit.scala:38)
        at kafka.Kafka$$$Lambda$407/0x0000000800fb9a10.run(Unknown Source)
        at java.lang.Thread.run(java.base@17.0.5/Thread.java:833)
        at org.apache.kafka.common.utils.KafkaThread.run(KafkaThread.java:64)
{code}

h1. Possible Solution

A possible solution is to have the controller's unhandled-fault handler call `Runtime.halt` instead of `Runtime.exit`. `Runtime.halt` terminates the JVM without running shutdown hooks, so the controller thread never waits on the hook thread.
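To illustrate the proposed direction, here is a minimal sketch of a fault handler that halts instead of exiting. It is an assumption-laden illustration, not the actual fix: `HaltingFaultHandlerSketch`, its constructors, and its `handleFault` signature are hypothetical and do not mirror Kafka's `ProcessExitingFaultHandler`. The halt call is injected as an `IntConsumer` so the behavior can be exercised in a test without killing the JVM; the default uses `Runtime.getRuntime().halt`.

```java
import java.util.function.IntConsumer;

/** Hypothetical handler for illustration; not Kafka's ProcessExitingFaultHandler. */
class HaltingFaultHandlerSketch {
    private final IntConsumer haltProcedure;

    /** Production-style default: terminate the JVM without running shutdown hooks. */
    HaltingFaultHandlerSketch() {
        this(status -> Runtime.getRuntime().halt(status));
    }

    /** Allows a test to substitute a stub for the real halt call. */
    HaltingFaultHandlerSketch(IntConsumer haltProcedure) {
        this.haltProcedure = haltProcedure;
    }

    RuntimeException handleFault(String message, Throwable cause) {
        // Report the fault before halting, since halt skips shutdown hooks.
        System.err.println("Encountered fatal fault: " + message);
        if (cause != null) {
            cause.printStackTrace();
        }
        haltProcedure.accept(1);
        // Only reached when haltProcedure is a stub that returns.
        return new RuntimeException(message, cause);
    }
}
```

The trade-off is that `Runtime.halt` skips every shutdown hook, so any cleanup those hooks would have performed does not happen when the handler fires.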