Hi,

One of the major operational difficulties we observe with Flink is
checkpoint timeouts under backpressure. I'm looking both for confirmation
of my understanding of the current behavior and for pointers toward future
improvement work:

Prior to the introduction of credit-based flow control in the network stack
[1][2], checkpoint barriers would back up with the data across all logical
channels due to TCP backpressure. Since Flink 1.5, buffers are controlled
per channel, and checkpoint barriers are only held back for channels that
have backpressure, while others can continue processing normally. However,
checkpoint barriers still cannot "overtake data", so checkpoint alignment
remains affected for the backpressured channel, with the potential for slow
checkpointing and timeouts. The delay of barriers is at least capped by the
maximum number of in-transit buffers per channel, which is an improvement
over previous versions of Flink. Also, the backpressure-based checkpoint
alignment can help the barrier advance faster on the receiver side (by
suspending channels that have already delivered the barrier). Is that
accurate as of Flink 1.8?
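To make the alignment behavior concrete, here is a toy Python simulation of
the mechanism described above (a hypothetical model for illustration, not
Flink's actual implementation; the channel names and the BARRIER marker are
made up): once a channel delivers its barrier it is suspended, so the
snapshot can only be taken after the slowest, backpressured channel catches
up.

```python
from collections import deque

BARRIER = "BARRIER"

def align_checkpoint(channels):
    """Toy model of aligned checkpointing.

    channels: dict of name -> list of events (data items or BARRIER).
    Assumes every channel eventually delivers a BARRIER.
    Returns (processed_before_snapshot, held_back_post_barrier_data).
    """
    processed = []   # records processed before the snapshot
    pending = {name: deque(events) for name, events in channels.items()}
    aligned = set()  # channels whose barrier has arrived (suspended)

    # Round-robin over channels until barriers have arrived everywhere.
    while len(aligned) < len(pending):
        for name, q in pending.items():
            if name in aligned or not q:
                continue
            event = q.popleft()
            if event == BARRIER:
                # Suspend this channel until alignment completes.
                aligned.add(name)
            else:
                processed.append((name, event))

    # Snapshot would be taken here; data that arrived behind the barrier
    # on suspended channels was held back the whole time.
    held = [(name, e) for name, q in pending.items() for e in q]
    return processed, held

# A fast channel delivers its barrier early and is suspended; the slow
# (backpressured) channel delays the snapshot for everyone.
done, held = align_checkpoint({
    "fast": [1, 2, BARRIER, 3],
    "slow": ["a", "b", "c", BARRIER],
})
```

Note how the post-barrier record on the fast channel is held back until the
slow channel aligns, which is exactly where the checkpoint delay comes from.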

What appears to be missing to completely unblock checkpointing is a
mechanism that lets checkpoint barriers overtake the data. That would help
in situations where the processing itself is the bottleneck and
prioritization in the network stack alone cannot address the barrier delay.
Has there been any related discussion? One possible solution would be to
drain the incoming data up to the barrier and make it part of the
checkpoint instead of processing it. This is somewhat related to
asynchronous processing, but I'm thinking more of a solution that is
automated in the Flink runtime for the backpressure scenario only.

Thanks,
Thomas

[1] https://flink.apache.org/2019/06/05/flink-network-stack.html
[2] https://docs.google.com/document/d/1chTOuOqe0sBsjldA_r-wXYeSIhU2zRGpUaTaik7QZ84/edit#heading=h.pjh6mv7m2hjn
