Zhijiang created FLINK-16404:
--------------------------------

             Summary: Solve the potential deadlock problem when reducing 
exclusive buffers to zero
                 Key: FLINK-16404
                 URL: https://issues.apache.org/jira/browse/FLINK-16404
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Network
            Reporter: Zhijiang
             Fix For: 1.11.0


One motivation for this issue is to reduce the in-flight data under back 
pressure in order to speed up checkpoints. The current default number of 
exclusive buffers per channel is 2. If we reduce it to 0 and increase the 
number of floating buffers to compensate, a deadlock might occur, because all 
the floating buffers might be taken by blocked input channels and never 
recycled until barrier alignment completes.

To address this deadlock concern, we can make some logic changes on both the 
sender and receiver sides.
 * Sender side: it should revoke any previously received credit after sending 
a checkpoint barrier, which means it would not send any subsequent buffers 
until it receives new credits.
 * Receiver side: after processing the barrier from one channel and marking 
that channel as blocked, it should release the available floating buffers held 
by the blocked channel, and resume requesting floating buffers for it only 
after barrier alignment completes. That means the receiver would announce new 
credits to the sender side only after barrier alignment.
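
The two changes above can be sketched as a small, self-contained model. This 
is only an illustration under stated assumptions, not Flink's implementation: 
the class and method names (CreditRevocationSketch, Sender, Receiver, flush, 
request, onBarrier, onAlignmentComplete) are hypothetical, buffers are modelled 
as plain counters, and the sketch assumes that a barrier is sent without 
consuming credit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class CreditRevocationSketch {

    /** Hypothetical sender-side view of one channel. */
    static final class Sender {
        private final Queue<String> backlog = new ArrayDeque<>();
        private int credit;

        void enqueue(String element) { backlog.add(element); }
        void addCredit(int n) { credit += n; }
        int credit() { return credit; }

        /** Sends what the current credit allows and returns what was sent. */
        List<String> flush() {
            List<String> sent = new ArrayList<>();
            while (!backlog.isEmpty()) {
                boolean isBarrier = "BARRIER".equals(backlog.peek());
                if (!isBarrier && credit == 0) {
                    break; // data buffers need credit
                }
                sent.add(backlog.poll());
                if (isBarrier) {
                    credit = 0; // revoke all previously received credit
                } else {
                    credit--;
                }
            }
            return sent;
        }
    }

    /** Hypothetical receiver-side view: a shared floating pool plus per-channel state. */
    static final class Receiver {
        private int floatingPool;
        private final int[] held;
        private final boolean[] blocked;

        Receiver(int numChannels, int floatingBuffers) {
            this.floatingPool = floatingBuffers;
            this.held = new int[numChannels];
            this.blocked = new boolean[numChannels];
        }

        /** Grants up to n floating buffers; the result is the credit to announce. */
        int request(int channel, int n) {
            if (blocked[channel]) {
                return 0; // blocked channels get nothing until alignment
            }
            int granted = Math.min(n, floatingPool);
            floatingPool -= granted;
            held[channel] += granted;
            return granted;
        }

        /** Blocks the channel and returns its floating buffers to the pool. */
        void onBarrier(int channel) {
            blocked[channel] = true;
            floatingPool += held[channel];
            held[channel] = 0;
        }

        /** Unblocks all channels once barriers have arrived on every channel. */
        void onAlignmentComplete() { Arrays.fill(blocked, false); }

        int floatingPool() { return floatingPool; }
    }

    public static void main(String[] args) {
        Sender sender = new Sender();
        Receiver receiver = new Receiver(2, 4);

        sender.addCredit(receiver.request(0, 3));
        for (String e : new String[] {"d1", "BARRIER", "d2", "d3"}) {
            sender.enqueue(e);
        }
        System.out.println("before alignment: " + sender.flush());

        receiver.onBarrier(0); // channel 0 gives its floating buffers back
        System.out.println("pool after barrier: " + receiver.floatingPool());

        receiver.onAlignmentComplete();
        sender.addCredit(receiver.request(0, 2));
        System.out.println("after alignment: " + sender.flush());
    }
}
```

In this model the buffers released by the blocked channel become available to 
the channels that have not yet delivered their barrier, which is the property 
that avoids the deadlock described above.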


Another possible benefit is that the floating buffers might be used more 
effectively before barrier alignment. We can further verify the performance 
impact with the existing micro-benchmarks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
