Michael Jaschob created KAFKA-5395:
--------------------------------------

             Summary: Distributed Herder Deadlocks on Shutdown
                 Key: KAFKA-5395
                 URL: https://issues.apache.org/jira/browse/KAFKA-5395
             Project: Kafka
          Issue Type: Bug
          Components: KafkaConnect
    Affects Versions: 0.10.2.1
            Reporter: Michael Jaschob
         Attachments: connect_01021_shutdown_deadlock.txt

We're trying to upgrade Kafka Connect to 0.10.2.1 and see that the process does 
not shut down cleanly. It hangs instead. From what I can tell 
[KAFKA-4786|https://github.com/apache/kafka/commit/ba4eafa7874988374abcd9f48fbab96abb2032a4]
 introduced this deadlock.

[close|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L664]
 on the AbstractCoordinator is marked as synchronized and acquires the 
coordinator's monitor. The first thing it tries to do is 
[join|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L323]
 the heartbeat thread.

Meanwhile, the heartbeat thread is [synchronized on the same 
monitor|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L891],
 which it relinquishes when it 
[waits|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L926].
 But for the wait to return (and the run method of the heartbeat to terminate) 
it needs to reacquire that monitor.

There's no way for the heartbeat thread to reacquire the monitor, since it is 
held by the distributed herder thread. And the distributed herder will never 
relinquish the monitor, since it is blocked in join() waiting for the heartbeat 
thread to terminate.
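
The cycle described above can be reproduced outside Kafka with a minimal sketch. The class and method names below (CoordinatorDeadlockSketch, Coordinator, closeHoldingMonitor, closeReleasingMonitor) are hypothetical stand-ins and are not the actual AbstractCoordinator code. The sketch uses a bounded join so that it terminates; the hang in the report is indefinite. The second variant, which releases the monitor before joining, is only one illustrative way to break the cycle and is not necessarily the fix adopted in Kafka.

```java
final class CoordinatorDeadlockSketch {

    /** Hypothetical, heavily simplified stand-in for a coordinator with a heartbeat thread. */
    static final class Coordinator {
        private boolean closed = false;
        private final Thread heartbeat = new Thread(this::heartbeatLoop, "heartbeat");

        void start() {
            heartbeat.setDaemon(true);
            heartbeat.start();
        }

        // The heartbeat loop holds the coordinator's monitor and gives it up only
        // while inside wait(). To return from wait() it must reacquire the monitor.
        private void heartbeatLoop() {
            synchronized (this) {
                while (!closed) {
                    try {
                        wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        }

        // Problematic pattern: join the heartbeat thread while still holding the
        // monitor. join() waits on the Thread object, so the coordinator's monitor
        // stays held and the heartbeat thread cannot leave wait().
        synchronized boolean closeHoldingMonitor(long joinTimeoutMs) throws InterruptedException {
            closed = true;
            notifyAll();
            heartbeat.join(joinTimeoutMs); // bounded here only so the sketch terminates
            return !heartbeat.isAlive();
        }

        // Illustrative alternative: signal under the monitor, then join after
        // the monitor has been released.
        boolean closeReleasingMonitor(long joinTimeoutMs) throws InterruptedException {
            synchronized (this) {
                closed = true;
                notifyAll();
            }
            heartbeat.join(joinTimeoutMs);
            return !heartbeat.isAlive();
        }
    }

    /** Returns true if the heartbeat thread exited while close() still held the monitor. */
    static boolean heartbeatExitsWhenJoinedUnderMonitor(long joinTimeoutMs) {
        Coordinator c = new Coordinator();
        c.start();
        try {
            return c.closeHoldingMonitor(joinTimeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Returns true if the heartbeat thread exited when joined after the monitor was released. */
    static boolean heartbeatExitsWhenJoinedAfterRelease(long joinTimeoutMs) {
        Coordinator c = new Coordinator();
        c.start();
        try {
            return c.closeReleasingMonitor(joinTimeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("join under monitor: heartbeat exited = "
                + heartbeatExitsWhenJoinedUnderMonitor(200));
        System.out.println("join after release: heartbeat exited = "
                + heartbeatExitsWhenJoinedAfterRelease(2000));
    }
}
```

In the first variant the heartbeat thread is still alive when the bounded join gives up, which corresponds to the BLOCKED/WAITING pair in the thread dump. With an unbounded join, as in the reported shutdown path, the two threads would wait on each other forever.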

I am attaching a thread dump illustrating the situation. Take note in 
particular of threads #178 (the heartbeat thread) and #159 (the herder thread). 
The former is BLOCKED trying to reacquire 0x00000007406cc0c0, and the latter is 
WAITING in join() for the heartbeat thread to terminate, having itself acquired 
0x00000007406cc0c0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
