zhijiangW edited a comment on pull request #11687: URL: https://github.com/apache/flink/pull/11687#issuecomment-626652730
The previous approach already had a deadlock, which `FastFailuresITCase` verified:

1. The task thread completes the future inside `RecoveredInputChannel#getNextRecoveredStateBuffer`, that is, while it holds the `receivedBuffers` lock in a synchronized block. The task thread then executes the partition request immediately, and the partition request needs to acquire `SingleInputGate#requestLock`.
2. Meanwhile, the canceler thread is calling `SingleInputGate#close` and already holds `requestLock`. When it then calls `RecoveredInputChannel#releaseAllResources`, it waits to synchronize on the `RemoteInputChannel#receivedBuffers` lock, which the task thread has held since step 1.
3. So the canceler thread waits for the task thread to release the `RemoteInputChannel#receivedBuffers` lock, while the task thread waits for the canceler thread to release `SingleInputGate#requestLock`.

Either one of the two changes I described above is enough to resolve this deadlock: completing the future outside the `RemoteInputChannel#receivedBuffers` lock, or adding the partition request as a mail in `StreamTask` so that it does not try to acquire `SingleInputGate#requestLock` immediately.

In the end I made both changes because:

1. Completing the future does not actually need the lock, so it is better to do it outside.
2. Enqueuing the mail into the mailbox avoids checking whether the future is completed by the task thread or not.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
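
The lock ordering described in the comment can be illustrated with a minimal, single-threaded sketch. This is not Flink code: `LockOrderSketch`, `stateConsumed`, and both `poll...` methods are hypothetical stand-ins, and only the names `receivedBuffers` and `requestLock` are borrowed from the comment. The sketch only shows the first of the two changes (completing the future outside the lock), and it relies on the fact that a non-async `CompletableFuture` callback registered before completion runs on the thread that calls `complete`.

```java
import java.util.ArrayDeque;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for an input channel; not the actual Flink class.
class LockOrderSketch {
    // Queue that also serves as the channel's monitor, like receivedBuffers.
    final ArrayDeque<String> receivedBuffers = new ArrayDeque<>();
    // Stand-in for SingleInputGate#requestLock.
    final Object requestLock = new Object();
    // Hypothetical future that signals "all recovered state was consumed".
    final CompletableFuture<Void> stateConsumed = new CompletableFuture<>();

    // Problematic shape: the future is completed inside the synchronized block,
    // so its callbacks run while the receivedBuffers monitor is still held.
    String pollCompletingInsideLock() {
        synchronized (receivedBuffers) {
            String next = receivedBuffers.poll();
            if (receivedBuffers.isEmpty()) {
                stateConsumed.complete(null);
            }
            return next;
        }
    }

    // Fixed shape: decide under the lock, complete after releasing it.
    String pollCompletingOutsideLock() {
        String next;
        boolean drained;
        synchronized (receivedBuffers) {
            next = receivedBuffers.poll();
            drained = receivedBuffers.isEmpty();
        }
        if (drained) {
            stateConsumed.complete(null);
        }
        return next;
    }

    // Reports whether the callback (the "partition request") ran while the
    // calling thread still held the receivedBuffers monitor.
    static boolean callbackSeesBufferLock(boolean completeInsideLock) {
        LockOrderSketch channel = new LockOrderSketch();
        channel.receivedBuffers.add("recovered-buffer");
        AtomicBoolean heldInCallback = new AtomicBoolean();
        channel.stateConsumed.thenRun(() -> {
            // The partition request acquires requestLock. If receivedBuffers is
            // also held here, the order is receivedBuffers -> requestLock, the
            // reverse of the canceler's requestLock -> receivedBuffers.
            synchronized (channel.requestLock) {
                heldInCallback.set(Thread.holdsLock(channel.receivedBuffers));
            }
        });
        if (completeInsideLock) {
            channel.pollCompletingInsideLock();
        } else {
            channel.pollCompletingOutsideLock();
        }
        return heldInCallback.get();
    }

    public static void main(String[] args) {
        System.out.println("inside lock:  " + callbackSeesBufferLock(true));
        System.out.println("outside lock: " + callbackSeesBufferLock(false));
    }
}
```

Running `main` should report `true` for the first variant and `false` for the second: only in the first does the callback try to take `requestLock` while `receivedBuffers` is held, which is the nesting that can deadlock against a thread acquiring the two locks in the opposite order.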