chia7712 commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-670293057
> Could you provide a bit more detail? The new deadlock you mentioned is caused by PR. This PR introduces extra lock (```ReentrantLock```) to ```DelayedJoin``` (By contrast, previous ```DelayedJoin``` has a group lock only). The thread has to get two locks (group lock and ReentrantLock) to complete ```DelayedJoin``` and so deadlock happens when two locks are held by different thread. Hence, my approach is to remove the extra lock introduced by this PR and so deadlock will be gone. The changes to ```DelayedJoin``` in this PR is to resolve deadlock happening on ```DelayedJoin#onComplete```. ```onComplete``` has to get other group lock to resolve delayed request but it is executed within a group lock. In order to resolve deadlock, this PR tries to move ```onComplete``` out of group lock. However, after thinking about this a bit more, why ```onComplete``` is executed within a lock? It is thread-safe already and it is not always executed within a lock (if it is executed on timeout). Hence, the implementation of my approach is to refactor ```DelayedOperation``` to move ```onComplete``` out of unnecessary lock. The refactor includes following changes. 1. ```forceComplete``` should be executed by ```DelayedOperation``` other than subclasss 1. ```forceComplete``` is executed after releasing lock ```scala private[server] def maybeTryComplete(): Boolean = { lock.lock val timeToComplete = try tryComplete() finally lock.unlock timeToComplete && forceComplete() } ``` 3. ```forceComplete``` is renamed to ```canComplete``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org