Having a backoff while experiencing checkpointing failures

vipul singh Thu, 07 Jun 2018 21:31:23 -0700

Hello all,

Are there any recommendations on using a backoff when experiencing
checkpointing failures?
What we have seen is that when checkpoints start to expire, the next
checkpoint ignores the previous failure and starts soon after.
We experimented with *min_pause_between_checkpoints*; however, that seems
to apply only after successful checkpoints (the same is discussed in this
thread
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/minPauseBetweenCheckpoints-for-failed-checkpoints-td20152.html>).


Are there any recommendations on how to implement a backoff, or is there
something in the works to add a backoff in case of checkpointing failures? This
seems very valuable when checkpointing to an external location like
S3, where one can be throttled or get errors like
TooBusyException from S3 (for example, as in this JIRA
<https://issues.apache.org/jira/browse/FLINK-9061>).
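
To illustrate the kind of behaviour being asked about, here is a minimal
sketch of an exponential backoff calculation. It is not Flink API: the class
name CheckpointBackoff, the method delayMillis, and the base and cap values
are hypothetical and chosen only for illustration. The idea is that the pause
before the next checkpoint doubles with each consecutive failure, up to a cap.

```java
// Hypothetical helper, not part of Flink: computes how long to wait before
// triggering the next checkpoint after a run of consecutive failures.
final class CheckpointBackoff {
    private CheckpointBackoff() {}

    /**
     * Returns baseMillis * 2^consecutiveFailures, capped at maxMillis.
     * With zero failures the base pause (capped at maxMillis) is returned.
     */
    static long delayMillis(int consecutiveFailures, long baseMillis, long maxMillis) {
        if (consecutiveFailures <= 0) {
            return Math.min(baseMillis, maxMillis);
        }
        // Cap the exponent so the shift below cannot overflow a long.
        int exponent = Math.min(consecutiveFailures, 30);
        long factor = 1L << exponent;
        // Compare before multiplying so baseMillis * factor cannot overflow.
        if (baseMillis > maxMillis / factor) {
            return maxMillis;
        }
        return Math.min(maxMillis, baseMillis * factor);
    }
}
```

With an assumed base pause of 1 second and a cap of 60 seconds, three
consecutive failures would give an 8 second pause, and the pause would reach
the cap after six failures. Resetting the failure count after a successful
checkpoint would return the job to its normal interval.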

Please let us know!
Thanks,
Vipul
