Re: Checkpointing under backpressure

zhijiang Mon, 12 Aug 2019 14:23:48 -0700

Hi Thomas,

Thanks for proposing this concern. The barrier alignment takes long time in 
backpressure case which could cause several problems:
1. Checkpoint timeout as you mentioned.
2. The recovery cost is high once failover, because much data needs to be 
replayed.
3. The delay for commit-based sink is high in exactly-once.

For credit-based flow control from release-1.5, the amount of in-flighting
buffers before barrier alignment is reduced, so we could get a bit
benefits from speeding checkpoint aspect.

In release-1.8, I guess we did not suspend the channels which already received
the barrier in practice. But actually we ever did the similar thing
to speed barrier alighment before. I am not quite sure that release-1.8 covers
this feature. There were some relevant discussions under jira [1].

For release-1.10, the community is now discussing the feature of unaligned
checkpoint which is mainly for resolving above concerns. The basic idea
is to make barrier overtakes the output/input buffer queue to speed alignment,
and snapshot the input/output buffers as part of checkpoint state. The
details have not confirmed yet and is still under discussion. Wish we could
make some improvments for the release-1.10.

[1] https://issues.apache.org/jira/browse/FLINK-8523

Best,
Zhijiang
------------------------------------------------------------------
From:Thomas Weise <[email protected]>
Send Time:2019年8月12日(星期一) 21:38
To:dev <[email protected]>
Subject:Checkpointing under backpressure

Hi,

One of the major operational difficulties we observe with Flink are
checkpoint timeouts under backpressure. I'm looking for both confirmation
of my understanding of the current behavior as well as pointers for future
improvement work:

Prior to introduction of credit based flow control in the network stack [1]
[2], checkpoint barriers would back up with the data for all logical
channels due to TCP backpressure. Since Flink 1.5, the buffers are
controlled per channel, and checkpoint barriers are only held back for
channels that have backpressure, while others can continue processing
normally. However, checkpoint barriers still cannot "overtake data",
therefore checkpoint alignment remains affected for the channel with
backpressure, with the potential for slow checkpointing and timeouts.
Albeit the delay of barriers would be capped by the maximum in-transit
buffers per channel, resulting in an improvement compared to previous
versions of Flink. Also, the backpressure based checkpoint alignment can
help the barrier advance faster on the receiver side (by suspending
channels that have already delivered the barrier). Is that accurate as of
Flink 1.8?

What appears to be missing to completely unblock checkpointing is a
mechanism for checkpoints to overtake the data. That would help in
situations where the processing itself is the bottleneck and prioritization
in the network stack alone cannot address the barrier delay. Was there any
related discussion? One possible solution would be to drain incoming data
till the barrier and make it part of the checkpoint instead of processing
it. This is somewhat related to asynchronous processing, but I'm thinking
more of a solution that is automated in the Flink runtime for the
backpressure scenario only.

Thanks,
Thomas

[1] https://flink.apache.org/2019/06/05/flink-network-stack.html
[2]
https://docs.google.com/document/d/1chTOuOqe0sBsjldA_r-wXYeSIhU2zRGpUaTaik7QZ84/edit#heading=h.pjh6mv7m2hjn

Re: Checkpointing under backpressure

Reply via email to