Hi Lakshmi,

You can get close to the described behaviour by setting
setFailOnCheckpointingErrors(true) and using the FailureRateRestartStrategy
as the restart strategy. That way a checkpoint failure triggers a job
restart (this is the downside), and the restart is handled by the restart
strategy. The FailureRateRestartStrategy allows up to x failures within a
given time interval. If this number is exceeded, the job fails
terminally.
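
To make the "x failures within a given time interval" rule concrete, here is a
minimal, self-contained sketch of such a sliding-window check. It is only a
simplified model, not Flink's actual FailureRateRestartStrategy code: the
class name FailureRateSketch, the method onFailure and the millisecond
timestamps are hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of a failure-rate restart policy (assumption: this is
// an illustration only and NOT Flink's FailureRateRestartStrategy class).
class FailureRateSketch {
    private final int maxFailuresPerInterval; // the "x" from the text above
    private final long intervalMillis;        // length of the time interval
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(int maxFailuresPerInterval, long intervalMillis) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.intervalMillis = intervalMillis;
    }

    /**
     * Records a failure at nowMillis. Returns true if the job may still be
     * restarted, false if the allowed number of failures is exceeded.
     */
    boolean onFailure(long nowMillis) {
        // Forget failures that are older than the interval.
        while (!failureTimestamps.isEmpty()
                && nowMillis - failureTimestamps.peekFirst() > intervalMillis) {
            failureTimestamps.pollFirst();
        }
        failureTimestamps.addLast(nowMillis);
        return failureTimestamps.size() <= maxFailuresPerInterval;
    }

    public static void main(String[] args) {
        // Hypothetical setting: tolerate 2 failures per 60 seconds.
        FailureRateSketch policy = new FailureRateSketch(2, 60_000L);
        long[] failureTimes = {0L, 10_000L, 20_000L};
        for (long t : failureTimes) {
            System.out.println("failure at " + t + " ms -> restart allowed: "
                    + policy.onFailure(t));
        }
    }
}
```

In this model, with 2 tolerated failures per minute, the first two checkpoint
failures lead to a restart, while a third one within the same minute does not.
Failures spaced further apart than the interval never add up.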

Cheers,
Till

On Sat, Aug 4, 2018 at 4:58 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Lakshmi,
>
> Your understanding of
> "*CheckpointConfig#setFailOnCheckpointingErrors(false)*" is correct. If this
> is set to false, the task will only decline the checkpoint and continue
> running.
>
> I also think it would be a good idea to allow a number of tolerated
> failures to be set. Flink currently only lets you choose whether the task
> fails when a checkpoint fails; configuring a threshold is not supported.
>
> You can create an issue in JIRA to request this feature.
>
> Thanks, vino.
>
> 2018-08-04 4:28 GMT+08:00 Lakshmi Gururaja Rao <l...@lyft.com>:
>
> > Hi,
> >
> > We are running into intermittent checkpoint failures while checkpointing
> > to S3.
> >
> > As described in this thread -
> > <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-5-some-thing-weird-td21309.html>,
> > we see that the job restarts when it encounters such a failure.
> >
> > As mentioned in the thread, I see that there is an option to not fail
> > tasks on checkpoint errors -
> > *CheckpointConfig#setFailOnCheckpointingErrors(false)*. However, this
> > would mean that the job would continue running even in the case of
> > persistent checkpoint failures. Is my understanding here correct?
> >
> > If the above is true, is there a way to configure an allowable number of
> > checkpoint failures? I.e., something along the lines of "Don't fail the
> > job if there are <=X checkpoint failures", so that *only* transient
> > failures can be ignored.
> >
> > Thanks,
> > Lakshmi
> >
>
