Re: Why are checkpoint failures so serious?

Ron Crocker Thu, 15 Feb 2018 09:12:13 -0800

Thanks Till and Aljoscha. Are there good options for 1.4? I’d rather not fork 
to get this, but I’ll do it if I have to.


Ron

> On Feb 14, 2018, at 2:43 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
> 
> Hi Ron,
> 
> Keep in mind, though, that this feature will only be available with the 
> upcoming Flink 1.5. Just making sure you don't go looking for this and are 
> surprised if you don't find it.
> 
> Best,
> Aljoscha
> 
> 
>> On 14. Feb 2018, at 10:20, Till Rohrmann <trohrm...@apache.org> wrote:
>> 
>> Hi Ron,
>> 
>> You should be able to turn off task failure on a checkpoint failure by
>> setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`. With this
>> setting, a checkpoint failure simply fails the distributed checkpoint
>> instead of the task.
>> 
>> Cheers,
>> Till
>> 
>> On Tue, Feb 13, 2018 at 11:41 PM, Ron Crocker <rcroc...@newrelic.com> wrote:
>> 
>>> What would it take to be a little more flexible in handling checkpoint
>>> failures?
>>> 
>>> Right now I have a team that’s checkpointing into S3, via the
>>> FsStateBackend and an appropriate URL. Sometimes these checkpoints fail.
>>> The failures are transient, though, and a retry would likely work.
>>> 
>>> However, when a checkpoint fails, the job exits and restarts from the last
>>> checkpoint. That’s fine, but I’d rather it tried again before failing, and
>>> even after failing just kept running and took another checkpoint. Maybe
>>> this is something that should be configurable: # of retries, failure
>>> strategy, …
>>> 
>>> Ron
>
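
For illustration only, the "# of retries, failure strategy" idea from the original question could be sketched roughly as below. This is not Flink code and uses no Flink API: the class `CheckpointRetrySketch`, the `CheckpointWrite` interface, and the `OnExhausted` strategies are all hypothetical names made up for this sketch. It assumes a transient failure shows up as an `IOException` from the write to S3.

```java
import java.io.IOException;
import java.util.Optional;

/** Hypothetical sketch of "retry, then apply a failure strategy". Not part of Flink. */
public final class CheckpointRetrySketch {

    /** Stand-in for whatever writes one checkpoint (assumed to throw IOException on transient errors). */
    @FunctionalInterface
    public interface CheckpointWrite<T> {
        T run() throws IOException;
    }

    /** What to do once all attempts have failed. */
    public enum OnExhausted {
        FAIL_JOB,        // rethrow, so the caller can fail and restart the job
        SKIP_CHECKPOINT  // give up on this checkpoint and keep running
    }

    private CheckpointRetrySketch() {}

    /**
     * Runs the write once, then retries up to maxRetries more times on IOException.
     * Returns the result, or Optional.empty() if every attempt failed and the
     * strategy is SKIP_CHECKPOINT.
     */
    public static <T> Optional<T> attempt(
            CheckpointWrite<T> write, int maxRetries, OnExhausted strategy) throws IOException {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException last = null;
        for (int i = 0; i <= maxRetries; i++) {
            try {
                return Optional.of(write.run());
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        if (strategy == OnExhausted.SKIP_CHECKPOINT) {
            return Optional.empty();
        }
        throw last; // FAIL_JOB: surface the last error to the caller
    }
}
```

With `SKIP_CHECKPOINT` the caller would simply carry on and wait for the next scheduled checkpoint, which is close to the "just keep running and do another checkpoint" behavior asked about; `FAIL_JOB` corresponds to the exit-and-restart behavior described in the original question. A real implementation would probably also want a delay between attempts, which is left out here.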
