[GitHub] [flink] zhijiangW edited a comment on pull request #11687: [FLINK-16536][network][checkpointing] Implement InputChannel state recovery for unaligned checkpoint

GitBox Mon, 11 May 2020 04:48:09 -0700


zhijiangW edited a comment on pull request #11687:
URL: https://github.com/apache/flink/pull/11687#issuecomment-626652730



   The previous approach already had a deadlock, which `FastFailuresITCase` 
verified. 
   
   1. The future is completed by the task thread inside 
`RecoveredInputChannel#getNextRecoveredStateBuffer`, while it holds the 
`receivedBuffers` lock (a synchronized block). The task thread then executes 
the partition request immediately, and the partition request needs to acquire 
`SingleInputGate#requestLock`.
   
   2. Meanwhile, the canceler thread has called `SingleInputGate#close` and 
already holds `requestLock`. When it then calls 
`RecoveredInputChannel#releaseAllResources`, it blocks waiting for the 
`RemoteInputChannel#receivedBuffers` lock, which the task thread already holds 
from step 1.
   
   3. So the canceler thread waits for the task thread to release the 
`RemoteInputChannel#receivedBuffers` lock, while the task thread waits for 
the canceler thread to release `SingleInputGate#requestLock`.
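
   To illustrate the lock ordering in the three steps above, here is a minimal 
stand-alone sketch. It is not the Flink code: `LockOrderInversionDemo` and its 
methods are hypothetical, and the two `ReentrantLock` objects only stand in for 
`receivedBuffers` and `requestLock`. It uses `tryLock` with a timeout instead 
of a blocking lock so that the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-alone demo; none of these names are Flink classes. */
class LockOrderInversionDemo {

    /**
     * Takes {@code first}, waits until the peer thread holds its own first lock,
     * then tries {@code second} with a timeout. Returns whether it got {@code second}.
     */
    static boolean acquireInOrder(
            ReentrantLock first,
            ReentrantLock second,
            CountDownLatch bothHoldFirst,
            CountDownLatch bothTried) throws InterruptedException {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            // Keep holding `first` until the peer has finished its own attempt.
            bothTried.countDown();
            bothTried.await();
            return acquired;
        } finally {
            first.unlock();
        }
    }

    /** Returns {task thread got requestLock, canceler thread got receivedBuffers}. */
    static boolean[] run() {
        ReentrantLock receivedBuffers = new ReentrantLock(); // stand-in for the channel lock
        ReentrantLock requestLock = new ReentrantLock();     // stand-in for the gate lock
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] result = new boolean[2];

        // Task thread: receivedBuffers first, then requestLock (step 1).
        Thread task = new Thread(() -> {
            try {
                result[0] = acquireInOrder(receivedBuffers, requestLock, bothHoldFirst, bothTried);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Canceler thread: requestLock first, then receivedBuffers (step 2).
        Thread canceler = new Thread(() -> {
            try {
                result[1] = acquireInOrder(requestLock, receivedBuffers, bothHoldFirst, bothTried);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        task.start();
        canceler.start();
        try {
            task.join();
            canceler.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("task thread got requestLock: " + r[0]);
        System.out.println("canceler thread got receivedBuffers: " + r[1]);
    }
}
```

   Both attempts time out and print `false`. With blocking acquisition, as in 
the real code path, the two threads would wait on each other forever (step 3).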
   
   Actually, either one of my two changes above is enough to resolve this 
deadlock: completing the future outside the `RemoteInputChannel#receivedBuffers` 
lock, or enqueuing the partition request as a mail in `StreamTask` so that it 
does not try to acquire `SingleInputGate#requestLock` immediately. 
   
   In the end I made both changes because:
   1. Completing the future does not actually need the lock, so it is better 
done outside.
   2. Enqueuing the mail into the mailbox avoids having to check whether the 
future is completed by the task thread or not.
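
   A rough sketch of how the two changes fit together, under simplifying 
assumptions: `RecoveredChannelSketch`, `getStateConsumedFuture`, 
`holdsBufferLock` and the plain `Queue<Runnable>` mailbox are hypothetical 
stand-ins, not the actual classes or methods in this PR. The point is only 
that the future is completed after leaving the synchronized block, and that the 
callback enqueues the partition request instead of running it directly.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch; the class and its members are not Flink's actual code. */
class RecoveredChannelSketch {
    // Doubles as the monitor, like the `receivedBuffers` lock in the discussion.
    private final ArrayDeque<String> receivedBuffers = new ArrayDeque<>();
    private final CompletableFuture<Void> stateConsumed = new CompletableFuture<>();

    RecoveredChannelSketch(String... buffers) {
        for (String buffer : buffers) {
            receivedBuffers.add(buffer);
        }
    }

    CompletableFuture<Void> getStateConsumedFuture() {
        return stateConsumed;
    }

    String getNextRecoveredStateBuffer() {
        String next;
        boolean drained;
        synchronized (receivedBuffers) {
            next = receivedBuffers.poll();
            drained = receivedBuffers.isEmpty();
        }
        // Change 1: complete the future only after the lock has been released,
        // so callbacks never run while `receivedBuffers` is held.
        if (drained) {
            stateConsumed.complete(null);
        }
        return next;
    }

    boolean holdsBufferLock() {
        return Thread.holdsLock(receivedBuffers);
    }

    static List<String> demo() {
        RecoveredChannelSketch channel = new RecoveredChannelSketch("b1", "b2");
        Queue<Runnable> mailbox = new ArrayDeque<>(); // stand-in for the task mailbox
        List<String> log = new ArrayList<>();

        channel.getStateConsumedFuture().thenRun(() -> {
            log.add("callback holds receivedBuffers lock: " + channel.holdsBufferLock());
            // Change 2: enqueue the partition request instead of running it here.
            mailbox.add(() -> log.add("partition request executed from mailbox"));
        });

        channel.getNextRecoveredStateBuffer();
        channel.getNextRecoveredStateBuffer(); // drains the queue, completes the future
        log.add("mails pending: " + mailbox.size());

        // Later, the task thread processes its mailbox with no channel lock held.
        while (!mailbox.isEmpty()) {
            mailbox.poll().run();
        }
        return log;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

   In this sketch the callback reports that it does not hold the buffer lock, 
and the partition request runs only when the mailbox is drained, regardless of 
which thread completed the future.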


