[jira] [Commented] (KAFKA-5395) Distributed Herder Deadlocks on Shutdown

Michael Jaschob (JIRA) Tue, 06 Jun 2017 23:11:03 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040257#comment-16040257
 ]


Michael Jaschob commented on KAFKA-5395:
----------------------------------------

The public 
[close|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L664]
 method probably wasn't meant to be synchronized. I think removing the 
synchronization there would fix this (the protected version only synchronizes 
after the heartbeat thread has joined).
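
The locking pattern and the proposed fix can be sketched in isolation. This is 
a minimal illustration under stated assumptions, not Kafka's actual code: the 
class CoordinatorSketch, its method names, and the timeouts are hypothetical, 
and a bounded join is used so the demo reports the stall instead of hanging 
forever.

```java
// Hypothetical stand-in for a coordinator that owns a heartbeat thread.
// None of these names come from Kafka; they exist only for illustration.
final class CoordinatorSketch {
    private boolean closed = false; // guarded by this object's monitor
    private final Thread heartbeat = new Thread(this::heartbeatLoop, "heartbeat-sketch");

    void start() {
        heartbeat.start();
    }

    // The heartbeat holds the coordinator's monitor and gives it up only
    // inside wait(). To return from wait() it must reacquire the monitor.
    private void heartbeatLoop() {
        synchronized (this) {
            while (!closed) {
                try {
                    wait(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    // Problematic shape: the caller joins the heartbeat thread while still
    // holding the monitor, so the heartbeat can never leave wait().
    synchronized boolean closeHoldingMonitor(long timeoutMs) {
        closed = true;
        notifyAll();
        return joinHeartbeat(timeoutMs);
    }

    // Proposed shape: take the monitor only to flip the flag, release it,
    // and join afterwards.
    boolean close(long timeoutMs) {
        synchronized (this) {
            closed = true;
            notifyAll();
        }
        return joinHeartbeat(timeoutMs);
    }

    // Returns true if the heartbeat thread terminated within the timeout.
    private boolean joinHeartbeat(long timeoutMs) {
        try {
            heartbeat.join(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !heartbeat.isAlive();
    }

    public static void main(String[] args) {
        CoordinatorSketch stuck = new CoordinatorSketch();
        stuck.start();
        System.out.println("join while holding monitor: joined="
                + stuck.closeHoldingMonitor(500));

        CoordinatorSketch ok = new CoordinatorSketch();
        ok.start();
        System.out.println("join after releasing monitor: joined="
                + ok.close(5000));
    }
}
```

In this sketch the first call should report that the join did not complete, 
since the heartbeat cannot reacquire the monitor until closeHoldingMonitor 
returns, while the second should complete promptly. With an unbounded join(), 
as in the real report, the first shape would hang indefinitely.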

> Distributed Herder Deadlocks on Shutdown
> ----------------------------------------
>
>                 Key: KAFKA-5395
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5395
>             Project: Kafka
>          Issue Type: Bug
>          Components: KafkaConnect
>    Affects Versions: 0.10.2.1
>            Reporter: Michael Jaschob
>         Attachments: connect_01021_shutdown_deadlock.txt
>
>
> We're trying to upgrade Kafka Connect to 0.10.2.1 and see that the process 
> does not shut down cleanly. It hangs instead. From what I can tell 
> [KAFKA-4786|https://github.com/apache/kafka/commit/ba4eafa7874988374abcd9f48fbab96abb2032a4]
>  introduced this deadlock.
> [close|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L664]
>  on the AbstractCoordinator is marked as synchronized and acquires the 
> coordinator's monitor. The first thing it tries to do is 
> [join|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L323]
>  the heartbeat thread.
> Meanwhile, the heartbeat thread is [synchronized on the same 
> monitor|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L891],
>  which it relinquishes when it 
> [waits|https://github.com/apache/kafka/blob/0.10.2.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/AbstractCoordinator.java#L926].
>  But for the wait to return (and the run method of the heartbeat to 
> terminate) it needs to reacquire that monitor.
> There's no way for the heartbeat thread to reacquire the monitor since it is 
> held by the distributed herder thread. And the distributed herder will never 
> relinquish the monitor since it is waiting for the heartbeat thread to join.
> I am attaching a thread dump illustrating the situation. Take note in 
> particular of threads #178 (the heartbeat thread) and #159 (the herder 
> thread). The former is BLOCKED trying to reacquire 0x00000007406cc0c0, and 
> the latter is WAITING on the heartbeat thread to join, having itself acquired 
> 0x00000007406cc0c0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
