Re: Why are checkpoint failures so serious?

Aljoscha Krettek Fri, 16 Feb 2018 01:07:17 -0800

Hi,

I think there's currently no option for achieving this on Flink 1.4.x.


Best,
Aljoscha

> On 15. Feb 2018, at 18:11, Ron Crocker <rcroc...@newrelic.com> wrote:
> 
> Thanks Till and Aljoscha. Are there good options for 1.4? I’d rather not fork 
> to get this, but I’ll do it if I have to.
> 
> Ron
> 
>> On Feb 14, 2018, at 2:43 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
>> 
>> Hi Ron,
>> 
>> Keep in mind, though, that this feature will only be available with the 
>> upcoming Flink 1.5. Just making sure you don't go looking for this and are 
>> surprised if you don't find it.
>> 
>> Best,
>> Aljoscha
>> 
>> 
>>> On 14. Feb 2018, at 10:20, Till Rohrmann <trohrm...@apache.org> wrote:
>>> 
>>> Hi Ron,
>>> 
>>> You should be able to stop a checkpoint failure from failing the task by
>>> setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`. With this
>>> setting, a checkpoint failure should only fail the distributed checkpoint
>>> and leave the task running.
>>> 
>>> Cheers,
>>> Till
>>> 
>>> On Tue, Feb 13, 2018 at 11:41 PM, Ron Crocker <rcroc...@newrelic.com> wrote:
>>> 
>>>> What would it take to be a little more flexible in handling checkpoint
>>>> failures?
>>>> 
>>>> Right now I have a team that’s checkpointing into S3, via the
>>>> FsStateBackend and an appropriate URL. Sometimes these checkpoints fail.
>>>> The failures are transient, though, and a retry would likely work.
>>>> 
>>>> However, when a checkpoint fails, the team’s job exits and restarts from
>>>> the last checkpoint. That’s fine, but I’d rather it retried before failing,
>>>> and even after failing just kept running and took another checkpoint. Maybe
>>>> this is something that should be configurable: number of retries, failure
>>>> strategy, …
>>>> 
>>>> Ron
>> 
>
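
The retry count and failure strategy Ron asks about can be sketched in plain Java. This is only an illustration under stated assumptions: `CheckpointRetrySketch`, `FailureStrategy`, `Outcome`, and `runCheckpoint` are hypothetical names and not Flink APIs, and the checkpoint write (for example, to S3) is simulated by a `BooleanSupplier` that returns true on success.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a checkpoint retry policy; none of these names are Flink APIs. */
public class CheckpointRetrySketch {

    /** What to do once the initial attempt and all retries have failed. */
    enum FailureStrategy { FAIL_JOB, SKIP_AND_CONTINUE }

    /** Result of one checkpoint round. */
    enum Outcome { COMPLETED, SKIPPED, JOB_FAILED }

    /**
     * Runs one checkpoint round: one initial attempt plus up to maxRetries retries.
     * The attempt supplier stands in for the real checkpoint write.
     */
    static Outcome runCheckpoint(BooleanSupplier attempt, int maxRetries, FailureStrategy strategy) {
        for (int i = 0; i <= maxRetries; i++) {
            if (attempt.getAsBoolean()) {
                return Outcome.COMPLETED;
            }
        }
        // Every attempt failed: either fail the job or skip this checkpoint and keep running.
        return strategy == FailureStrategy.FAIL_JOB ? Outcome.JOB_FAILED : Outcome.SKIPPED;
    }

    public static void main(String[] args) {
        // Simulated transient failure: the first two writes fail, the third succeeds.
        int[] calls = {0};
        BooleanSupplier flaky = () -> ++calls[0] >= 3;
        System.out.println(runCheckpoint(flaky, 2, FailureStrategy.FAIL_JOB));

        // Simulated persistent failure with the "keep running" strategy.
        System.out.println(runCheckpoint(() -> false, 2, FailureStrategy.SKIP_AND_CONTINUE));
    }
}
```

With `SKIP_AND_CONTINUE`, a round that exhausts its retries is simply dropped and the job waits for the next checkpoint interval, which is the behavior requested in the original message. `FAIL_JOB` corresponds to the restart-from-last-checkpoint behavior described there.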
