Hi Ron,

you should be able to turn off the task failure on a checkpoint failure by setting `ExecutionConfig.setFailTaskOnCheckpointError(false)`. With this setting, a checkpoint failure simply fails the distributed checkpoint instead of failing the task.
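For illustration, a minimal sketch of where that call might go in a streaming job. The class name and the 60-second checkpoint interval are assumptions made up for this example, and whether the setter is available depends on the Flink version in use, so please check it against your release:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical job skeleton; only the setFailTaskOnCheckpointError call
// comes from the advice above, everything else is illustrative.
public class TolerateCheckpointFailures {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Assumed interval: trigger a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000L);

        // Do not fail the task when its part of a checkpoint fails;
        // the distributed checkpoint is failed instead.
        env.getConfig().setFailTaskOnCheckpointError(false);

        // Define sources, transformations and sinks here, then call
        // env.execute(...) to submit the job.
    }
}
```

Note that this only changes what happens on a failure. It does not add retries for a failed checkpoint; the next attempt is simply the next regularly triggered checkpoint.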
Cheers,
Till

On Tue, Feb 13, 2018 at 11:41 PM, Ron Crocker <rcroc...@newrelic.com> wrote:

> What would it take to be a little more flexible in handling checkpoint
> failures?
>
> Right now I have a team that’s checkpointing into S3, via the
> FsStateBackend and an appropriate URL. Sometimes these checkpoints fail.
> They’re transient, though, and a retry would likely work.
>
> However, when they fail, their job exits and restarts from the last
> checkpoint. That’s fine, but I’d rather it tried again before failing, and
> even after failing just keep running and do another checkpoint. Maybe this
> is something that should be configurable - # of retries, failure strategy, …
>
> Ron