[
https://issues.apache.org/jira/browse/KAFKA-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040257#comment-16040257
]
Michael Jaschob commented on KAFKA-5395:
----------------------------------------
The public
[close|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L664]
method was probably not meant to be synchronized. I think removing the
synchronization there would fix this (the protected version synchronizes
only after the heartbeat thread has joined).
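
To make the locking pattern concrete, here is a minimal sketch. It is not the Kafka code: the Coordinator class, its method names, and the join timeouts are hypothetical stand-ins, and the timeouts exist only so that the demo reports the hang instead of blocking forever.

```java
import java.util.concurrent.CountDownLatch;

class CoordinatorCloseSketch {

    // Hypothetical stand-in for a coordinator that owns a heartbeat thread.
    static class Coordinator {
        private boolean closed = false;
        private final CountDownLatch heartbeatStarted = new CountDownLatch(1);
        private final Thread heartbeat = new Thread(this::runHeartbeat, "heartbeat");

        void start() throws InterruptedException {
            heartbeat.start();
            // Once this returns, the heartbeat thread holds the monitor or is
            // already waiting on it, so close() can only get in during wait().
            heartbeatStarted.await();
        }

        private void runHeartbeat() {
            synchronized (this) {
                heartbeatStarted.countDown();
                while (!closed) {
                    try {
                        // wait() releases the monitor, but it has to reacquire
                        // the monitor before it can return.
                        wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        }

        // Pattern from the report: the monitor is held while joining, so the
        // heartbeat thread cannot return from wait() and the join cannot finish.
        synchronized boolean closeHoldingMonitor(long joinTimeoutMs) throws InterruptedException {
            closed = true;
            notifyAll();
            heartbeat.join(joinTimeoutMs);
            return !heartbeat.isAlive();
        }

        // Suggested shape: signal under the monitor, release it, join, and
        // only then synchronize again for whatever cleanup needs the lock.
        boolean closeAfterReleasingMonitor(long joinTimeoutMs) throws InterruptedException {
            synchronized (this) {
                closed = true;
                notifyAll();
            }
            heartbeat.join(joinTimeoutMs);
            synchronized (this) {
                // remaining cleanup would go here
            }
            return !heartbeat.isAlive();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Coordinator stuck = new Coordinator();
        stuck.start();
        System.out.println("joined while holding monitor: " + stuck.closeHoldingMonitor(200));

        Coordinator clean = new Coordinator();
        clean.start();
        System.out.println("joined after releasing monitor: " + clean.closeAfterReleasingMonitor(2000));
    }
}
```

In the first call the join should time out with the heartbeat thread still alive, because that thread is blocked on the monitor the caller holds. Without a timeout, as in the real shutdown path, that wait would never end.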
> Distributed Herder Deadlocks on Shutdown
> ----------------------------------------
>
> Key: KAFKA-5395
> URL: https://issues.apache.org/jira/browse/KAFKA-5395
> Project: Kafka
> Issue Type: Bug
> Components: KafkaConnect
> Affects Versions: 0.10.2.1
> Reporter: Michael Jaschob
> Attachments: connect_01021_shutdown_deadlock.txt
>
>
> We're trying to upgrade Kafka Connect to 0.10.2.1 and see that the process
> does not shut down cleanly. It hangs instead. From what I can tell
> [KAFKA-4786|https://github.com/apache/kafka/commit/ba4eafa7874988374abcd9f48fbab96abb2032a4]
> introduced this deadlock.
> [close|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L664]
> on the AbstractCoordinator is marked as synchronized and acquires the
> coordinator's monitor. The first thing it tries to do is
> [join|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L323]
> the heartbeat thread.
> Meanwhile, the heartbeat thread is [synchronized on the same
> monitor|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L891],
> which it relinquishes when it
> [waits|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L926].
> But for the wait to return (and the run method of the heartbeat thread to
> terminate), it must first reacquire that monitor.
> There's no way for the heartbeat thread to reacquire the monitor since it is
> held by the distributed herder thread. And the distributed herder will never
> relinquish the monitor since it is waiting for the heartbeat thread to join.
> I am attaching a thread dump illustrating the situation. Take note in
> particular of threads #178 (the heartbeat thread) and #159 (the herder
> thread). The former is BLOCKED trying to reacquire 0x00000007406cc0c0, and
> the latter is WAITING on the heartbeat thread to join, having itself acquired
> 0x00000007406cc0c0.