[ https://issues.apache.org/jira/browse/KAFKA-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040965#comment-16040965 ]
ASF GitHub Bot commented on KAFKA-5395:
---------------------------------------

Github user rajinisivaram closed the pull request at:

    https://github.com/apache/kafka/pull/3253

> Distributed Herder Deadlocks on Shutdown
> ----------------------------------------
>
>                 Key: KAFKA-5395
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5395
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 0.10.2.1
>            Reporter: Michael Jaschob
>            Assignee: Rajini Sivaram
>            Priority: Critical
>             Fix For: 0.11.0.0, 0.10.2.2
>
>         Attachments: connect_01021_shutdown_deadlock.txt
>
>
> We're trying to upgrade Kafka Connect to 0.10.2.1 and see that the process does not shut down cleanly. It hangs instead. From what I can tell [KAFKA-4786|https://github.com/apache/kafka/commit/ba4eafa7874988374abcd9f48fbab96abb2032a4] introduced this deadlock.
> [close|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L664] on the AbstractCoordinator is marked as synchronized and acquires the coordinator's monitor. The first thing it tries to do is [join|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L323] the heartbeat thread.
> Meanwhile, the heartbeat thread is [synchronized on the same monitor|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L891], which it relinquishes when it [waits|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L926]. But for the wait to return (and the run method of the heartbeat to terminate) it needs to reacquire that monitor.
> There's no way for the heartbeat thread to reacquire the monitor since it is held by the distributed herder thread.
> And the distributed herder will never relinquish the monitor since it is waiting for the heartbeat thread to join.
> I am attaching a thread dump illustrating the situation. Take note in particular of threads #178 (the heartbeat thread) and #159 (the herder thread). The former is BLOCKED trying to reacquire 0x00000007406cc0c0, and the latter is WAITING on the heartbeat thread to join, having itself acquired 0x00000007406cc0c0.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
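
The lock ordering described in the report can be reproduced in isolation. The following is a minimal sketch, not Kafka code: the `Coordinator` class, its method names, and the timeouts are hypothetical stand-ins for `AbstractCoordinator` and its heartbeat thread. It uses a bounded `join` so the demo terminates instead of hanging, and it contrasts joining while holding the monitor with one common way to avoid the problem (signal under the lock, join after releasing it). It makes no claim about how the actual fix for KAFKA-5395 was implemented.

```java
public class ShutdownDeadlockSketch {

    /** Hypothetical stand-in for a coordinator that owns a heartbeat thread. */
    static class Coordinator {
        private boolean closed = false; // guarded by this object's monitor
        private final Thread heartbeat = new Thread(this::heartbeatLoop, "heartbeat");

        void start() {
            heartbeat.start();
        }

        private void heartbeatLoop() {
            synchronized (this) {
                while (!closed) {
                    try {
                        // wait() releases the monitor, but it must reacquire
                        // the monitor before it can return.
                        wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        }

        /**
         * Problematic pattern: joins the heartbeat thread while still holding
         * the monitor. The heartbeat thread cannot reacquire the monitor, so it
         * cannot terminate until this method returns. With an unbounded join()
         * this would hang forever; the timeout here only keeps the demo finite.
         * Returns true if the heartbeat thread terminated within the timeout.
         */
        synchronized boolean closeHoldingMonitor(long joinTimeoutMs) throws InterruptedException {
            closed = true;
            notifyAll();
            heartbeat.join(joinTimeoutMs);
            return !heartbeat.isAlive();
        }

        /**
         * One way to avoid the problem: update state and notify under the
         * lock, then release the monitor before joining.
         */
        boolean closeReleasingMonitor(long joinTimeoutMs) throws InterruptedException {
            synchronized (this) {
                closed = true;
                notifyAll();
            }
            heartbeat.join(joinTimeoutMs);
            return !heartbeat.isAlive();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Coordinator broken = new Coordinator();
        broken.start();
        System.out.println("join while holding monitor: heartbeat terminated = "
                + broken.closeHoldingMonitor(500));

        Coordinator fixed = new Coordinator();
        fixed.start();
        System.out.println("join after releasing monitor: heartbeat terminated = "
                + fixed.closeReleasingMonitor(5000));
    }
}
```

In the first case the heartbeat thread stays BLOCKED on the monitor for the whole duration of the join, which matches the BLOCKED/WAITING pair in the thread dump described above. In the second case the thread is free to reacquire the monitor, observe `closed`, and exit.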