Hi, I am wondering how Flink's fault tolerance works, because this page is short on details: https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/fault_tolerance.html
My environment has a backend service that may be down for a couple of hours at a time (sad, but working on fixing that). I have a sink that writes to that service, and when the service is unavailable the sink throws an exception. This brings the process down, and I need to intervene manually to get it up and running again.

I am thinking of rewriting the sink to retry in a loop until it is able to write the data (with a multi-hour tolerance before it finally throws an exception). I hope this will create backpressure on the process, "suspending" the processing and "resuming" it when the backend service comes back up. Am I right in that assumption? Is there a better way to make suspending and resuming automatic?

Thanks,
Istvan
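
P.S. To make the idea concrete, here is a minimal sketch of the retry loop I have in mind. It is plain Java with no Flink dependency; `RetryingWriter`, `Attempt` and `writeWithRetry` are hypothetical names I made up for illustration, and the time limits are placeholders.

```java
import java.time.Duration;

/** Hypothetical helper (not a Flink class): retries a write until it succeeds or a deadline passes. */
final class RetryingWriter {

    /** One write attempt against the backend; hypothetical interface, may throw on failure. */
    interface Attempt {
        void run() throws Exception;
    }

    private final Duration maxWait; // total tolerance, e.g. several hours
    private final Duration pause;   // sleep between two attempts

    RetryingWriter(Duration maxWait, Duration pause) {
        this.maxWait = maxWait;
        this.pause = pause;
    }

    /**
     * Blocks the calling thread until the attempt succeeds.
     * Returns the number of attempts made; rethrows the last failure
     * once maxWait has elapsed.
     */
    int writeWithRetry(Attempt attempt) throws Exception {
        long deadline = System.nanoTime() + maxWait.toNanos();
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                attempt.run();
                return attempts;
            } catch (InterruptedException ie) {
                // Do not retry on interruption (e.g. cancellation); restore the flag and give up.
                Thread.currentThread().interrupt();
                throw ie;
            } catch (Exception e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // tolerance exhausted, let the failure surface
                }
                Thread.sleep(pause.toMillis());
            }
        }
    }
}
```

The sink would call something like `writeWithRetry(() -> client.send(record))` from its write method (for example `writeRecord` in an `OutputFormat`), where `client` stands for whatever talks to the backend. Whether blocking there really holds back the upstream operators is exactly the assumption I am asking about.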