What would it take to be a little more flexible in handling checkpoint failures?
Right now I have a team that’s checkpointing into S3 via the FsStateBackend and an appropriate URL. Sometimes these checkpoints fail. The failures are transient, though, and a retry would likely work. However, when a checkpoint fails, the job exits and restarts from the last successful checkpoint. That’s fine, but I’d rather it retried the checkpoint before failing, and even after a failed checkpoint just kept running and attempted the next one. Maybe this is something that should be configurable: number of retries, failure strategy, …

Ron
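
P.S. To make the idea concrete, here is a rough sketch of the behavior I have in mind. It is plain Python for illustration only, not Flink code. Every name in it (CheckpointFailurePolicy, max_retries, tolerable_failed_checkpoints, run_checkpoints) is made up for this example and is not an existing Flink option or API. It assumes a checkpoint attempt can be modeled as a function that returns True on success.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckpointFailurePolicy:
    """Hypothetical settings; none of these names are real Flink options."""
    max_retries: int = 2                    # extra attempts per checkpoint
    tolerable_failed_checkpoints: int = 3   # consecutive failed checkpoints tolerated


def run_checkpoints(attempt: Callable[[int], bool],
                    num_checkpoints: int,
                    policy: CheckpointFailurePolicy) -> str:
    """Drive a series of checkpoints under the given policy.

    Returns "RUNNING" if the job survives all checkpoints, or "FAIL_JOB"
    once more consecutive checkpoints have failed than the policy tolerates.
    """
    consecutive_failures = 0
    for checkpoint_id in range(num_checkpoints):
        # First attempt plus up to max_retries retries of the same checkpoint.
        succeeded = any(attempt(checkpoint_id)
                        for _ in range(1 + policy.max_retries))
        if succeeded:
            consecutive_failures = 0
            continue
        # Checkpoint abandoned: keep the job running unless the limit is exceeded.
        consecutive_failures += 1
        if consecutive_failures > policy.tolerable_failed_checkpoints:
            return "FAIL_JOB"
    return "RUNNING"
```

With tolerable_failed_checkpoints=0 and max_retries=0 this would reduce to what we see today (any failed checkpoint fails the job), so the defaults could stay backward compatible.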