Hi,

One of the major operational difficulties we observe with Flink is checkpoint timeouts under backpressure. I'm looking both for confirmation of my understanding of the current behavior and for pointers to future improvement work:
Prior to the introduction of credit-based flow control in the network stack [1] [2], checkpoint barriers would back up behind the data for all logical channels due to TCP backpressure. Since Flink 1.5, buffers are controlled per channel, and checkpoint barriers are only held back for channels that have backpressure, while others can continue processing normally. However, checkpoint barriers still cannot "overtake data", so checkpoint alignment remains affected for the backpressured channel, with the potential for slow checkpointing and timeouts. The delay of barriers is at least capped by the maximum number of in-transit buffers per channel, which is an improvement over previous versions of Flink. Also, the backpressure-based checkpoint alignment can help the barrier advance faster on the receiver side (by suspending channels that have already delivered the barrier).

Is that accurate as of Flink 1.8?

What appears to be missing to completely unblock checkpointing is a mechanism for checkpoints to overtake data. That would help in situations where the processing itself is the bottleneck and prioritization in the network stack alone cannot address the barrier delay. Was there any related discussion?

One possible solution would be to drain incoming data up to the barrier and make it part of the checkpoint instead of processing it. This is somewhat related to asynchronous processing, but I'm thinking more of a solution that is automated in the Flink runtime for the backpressure scenario only.

Thanks,
Thomas

[1] https://flink.apache.org/2019/06/05/flink-network-stack.html
[2] https://docs.google.com/document/d/1chTOuOqe0sBsjldA_r-wXYeSIhU2zRGpUaTaik7QZ84/edit#heading=h.pjh6mv7m2hjn
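P.S. For concreteness, here is a toy Python simulation of the two behaviors I'm describing. This is not Flink code and none of the names are Flink APIs; it's only a sketch, under the assumption that each channel eventually delivers exactly one barrier, contrasting today's aligned handling (blocked channels, data ahead of the barrier must still be processed) with the proposed drain-till-barrier variant (that data becomes checkpoint state instead):

```python
# Toy model: each channel is a list of records ending with a BARRIER sentinel.
# Illustrative only -- not Flink's actual implementation or API.

BARRIER = "BARRIER"

def aligned_checkpoint(channels):
    """Current behavior: a channel that delivered the barrier is suspended,
    while the remaining (possibly backpressured) channels keep processing
    until every channel has delivered its barrier. All records ahead of the
    barriers must be processed before the checkpoint can complete."""
    queues = [list(ch) for ch in channels]  # copy; don't mutate the input
    processed, blocked = [], set()
    while len(blocked) < len(queues):
        for i, q in enumerate(queues):
            if i in blocked:
                continue
            item = q.pop(0)  # assumes a barrier arrives on every channel
            if item == BARRIER:
                blocked.add(i)  # suspend this channel until alignment completes
            else:
                processed.append(item)  # must be processed pre-checkpoint
    return processed, []  # nothing stored as in-flight state

def drain_into_checkpoint(channels):
    """Proposed variant: instead of processing the records still ahead of the
    barrier, drain them into the checkpoint as in-flight state, so the
    barrier effectively overtakes the data and the checkpoint completes
    without waiting on the backpressured processing path."""
    in_flight = []
    for ch in channels:
        in_flight.extend(ch[: ch.index(BARRIER)])  # snapshot, don't process
    return [], in_flight

# Channel 1 is backpressured: three records are queued ahead of its barrier.
channels = [["a1", BARRIER], ["b1", "b2", "b3", BARRIER]]

a_processed, a_in_flight = aligned_checkpoint(channels)
d_processed, d_in_flight = drain_into_checkpoint(channels)

print(a_processed, a_in_flight)  # aligned: all records processed first
print(d_processed, d_in_flight)  # drained: records become checkpoint state
```

The point of the contrast: in the aligned case, checkpoint completion is gated on processing "b1".."b3", so a slow operator delays the barrier; in the drained case, those records are persisted with the checkpoint and replayed into the channel on restore, which is what I mean by letting the checkpoint overtake the data.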