Thanks Till and Aljoscha. Are there good options for 1.4? I’d rather not fork to get this, but I’ll do it if I have to.
Ron

> On Feb 14, 2018, at 2:43 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
>
> Hi Ron,
>
> Keep in mind, though, that this feature will only be available with the
> upcoming Flink 1.5. Just making sure you don't go looking for this and are
> surprised if you don't find it.
>
> Best,
> Aljoscha
>
>
>> On 14. Feb 2018, at 10:20, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>> Hi Ron,
>>
>> you should be able to turn off the Task failure in case of a checkpoint
>> failure by setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`.
>> This setting should change the behavior such that checkpoint failures will
>> simply fail the distributed checkpoint.
>>
>> Cheers,
>> Till
>>
>> On Tue, Feb 13, 2018 at 11:41 PM, Ron Crocker <rcroc...@newrelic.com> wrote:
>>
>>> What would it take to be a little more flexible in handling checkpoint
>>> failures?
>>>
>>> Right now I have a team that’s checkpointing into S3, via the
>>> FsStateBackend and an appropriate URL. Sometimes these checkpoints fail.
>>> They’re transient, though, and a retry would likely work.
>>>
>>> However, when they fail, their job exits and restarts from the last
>>> checkpoint. That’s fine, but I’d rather it tried again before failing, and
>>> even after failing just keep running and do another checkpoint. Maybe this
>>> is something that should be configurable - # of retries, failure strategy, …
>>>
>>> Ron
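
To make the "retry, then keep running" behavior from the original question concrete, here is a rough sketch in plain Java. It is not Flink code and does not use any Flink API: the class `CheckpointRetrySketch`, the method `tryWithRetries`, and the parameters `maxAttempts` and `backoffMillis` are all hypothetical names made up for illustration. The idea is only that a transient failure gets retried a bounded number of times, and that exhausting the retries is reported to the caller rather than thrown, so the caller can decline the checkpoint and carry on.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration only; not part of Flink.
final class CheckpointRetrySketch {

    /**
     * Runs the attempt up to maxAttempts times, sleeping backoffMillis
     * between tries. Returns true on the first success and false once all
     * attempts have failed (or the thread was interrupted), so the caller
     * can skip this checkpoint instead of failing the whole job.
     */
    static boolean tryWithRetries(Callable<Void> attempt, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.call();
                return true; // success, no further retries needed
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up quietly.
                Thread.currentThread().interrupt();
                return false;
            } catch (Exception e) {
                // Assumed to be transient; fall through and retry if attempts remain.
                if (i == maxAttempts) {
                    break;
                }
            }
            try {
                Thread.sleep(backoffMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // all attempts failed; caller decides what to do next
    }
}
```

A caller would wrap the write to S3 in the `Callable` and treat a `false` result as "this checkpoint is declined, wait for the next one". Whether something like this could be wired into 1.4 without a fork is exactly the open question above.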