Re: Having a backoff while experiencing checkpointing failures

Stefan Richter Mon, 11 Jun 2018 01:09:07 -0700

Hi,

I think the behaviour of min_pause_between_checkpoints is either a bug, or we 
should at least discuss whether the pause should also apply after failed 
checkpoints. As far as I know, there is no ongoing work to add a backoff, so 
I suggest you open a Jira issue and make a case for it.
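
To make the idea concrete, here is a rough sketch of the delay calculation 
such a backoff could use. It is plain Java and not an existing Flink API: the 
class CheckpointBackoff, the method nextPauseMillis and its parameters are 
made-up names for illustration, and the doubling policy is only an assumption 
about what a reasonable backoff might look like.

```java
// Hypothetical helper, not part of Flink: computes how long to wait before
// triggering the next checkpoint after a run of consecutive failures.
final class CheckpointBackoff {

    private CheckpointBackoff() {}

    /**
     * Returns basePauseMillis doubled once per consecutive failure,
     * capped at maxPauseMillis. Zero failures yields the base pause.
     */
    static long nextPauseMillis(long basePauseMillis, long maxPauseMillis, int consecutiveFailures) {
        if (basePauseMillis <= 0 || maxPauseMillis < basePauseMillis || consecutiveFailures < 0) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long pause = basePauseMillis;
        for (int i = 0; i < consecutiveFailures; i++) {
            // Stop before doubling would exceed the cap (this also avoids overflow).
            if (pause > maxPauseMillis / 2) {
                return maxPauseMillis;
            }
            pause *= 2;
        }
        return pause;
    }
}
```

With a base pause of 1000 ms and a cap of 60000 ms, three consecutive 
failures would give a pause of 8000 ms, and the pause would stay at the cap 
once doubling would exceed it. The failure counter would be reset after the 
first successful checkpoint.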


Best,
Stefan

> On 08.06.2018 at 06:30, vipul singh <[email protected]> wrote:
> 
> Hello all,
> 
> Are there any recommendations on using a backoff when experiencing 
> checkpointing failures?
> What we have seen is that when a checkpoint starts to expire, the next 
> checkpoint doesn't care about the previous failure and starts soon after. We 
> experimented with min_pause_between_checkpoints, but that seems to work 
> only for successful checkpoints (the same is discussed in this thread 
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/minPauseBetweenCheckpoints-for-failed-checkpoints-td20152.html>).
> 
> Are there any recommendations on how to have a backoff, or is there something 
> in the works to add a backoff in case of checkpointing failures? This seems 
> very valuable when checkpointing to an external location like S3, where one 
> can potentially be throttled or get errors like TooBusyException from S3 (for 
> example, as in this Jira issue <https://issues.apache.org/jira/browse/FLINK-9061>).
> 
> Please let us know!
> Thanks,
> Vipul
