Hello all, I have two questions regarding savepoint fault tolerance.
Job manager restart:

- Currently, we trigger savepoints using the REST API and query the status of the savepoint using the returned handle. If a network issue prevents us from receiving the response, how can we find out whether the savepoint from the previous request was actually triggered? Is there a way to add an "idempotency-key" to each API request so that we can safely retry triggering the savepoint? We want to avoid triggering multiple consecutive savepoints during job upgrades.
- Our workflow for capturing a savepoint looks like this: call the POST /savepoint endpoint, use the returned trigger handle to periodically poll the status of the savepoint, and once the savepoint is completed, restore the job from that savepoint. We are running our Flink clusters in k8s. Since pods can get restarted / migrated quite often in k8s, it's possible that the JM pod that was used to capture the savepoint gets recycled before the savepoint completes. In that case, we can't query the status of the triggered savepoint with the previously returned handle, because neither the newly created JM pod nor any of the standby JMs has information about this savepoint. I couldn't find any config that makes Flink persist the state of ongoing savepoints to an external store, which would allow users to query the status of a savepoint via any available JM instance in an HA setup.

Task manager restart:

- If one of the TMs crashes during an ongoing checkpoint, then I believe that checkpoint is marked as failed, and at the next checkpoint interval Flink triggers a new checkpoint by looking at the previously completed checkpoint counter. The next checkpoint attempt might get acknowledged by all operators and marked as completed. Is that correct? In the case of savepoints this is not possible. So how does Flink resume the savepoint capturing process in case of job restarts or TM failures?
- I am sure this must already be handled, but I just wanted to confirm and get help finding the relevant code references so I can dig deeper and understand it in depth, from an educational point of view.

- Dhanesh Arole (Sent from mobile device. Pardon me for typos.)
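
P.S. For reference, here is a minimal sketch of the trigger-and-poll workflow described above, written in Python. The endpoint paths and JSON field names are as I understand the Flink REST API (POST /jobs/<jobid>/savepoints, then GET /jobs/<jobid>/savepoints/<triggerid>); the base URL, job id, target directory and the helper names (http_json, trigger_savepoint, wait_for_savepoint) are placeholders for illustration, not our actual code.

```python
import json
import time
import urllib.request


def http_json(method, url, body=None):
    """Send a JSON request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def trigger_savepoint(base_url, job_id, target_dir, request=http_json):
    """Trigger a savepoint and return the trigger handle.

    If the response to this call is lost, the handle is unknown to the
    client, which is the first problem described above.
    """
    resp = request(
        "POST",
        f"{base_url}/jobs/{job_id}/savepoints",
        {"target-directory": target_dir, "cancel-job": False},
    )
    return resp["request-id"]


def wait_for_savepoint(base_url, job_id, trigger_id, request=http_json,
                       poll_seconds=5.0, max_polls=120):
    """Poll the savepoint status until it completes; return its location.

    If the JM that issued the handle is replaced while we poll, the new JM
    does not know the handle, which is the second problem described above.
    """
    url = f"{base_url}/jobs/{job_id}/savepoints/{trigger_id}"
    for _ in range(max_polls):
        resp = request("GET", url)
        if resp["status"]["id"] == "COMPLETED":
            operation = resp.get("operation", {})
            if "failure-cause" in operation:
                raise RuntimeError(f"savepoint failed: {operation['failure-cause']}")
            return operation["location"]
        time.sleep(poll_seconds)
    raise TimeoutError(f"savepoint {trigger_id} did not complete in time")
```

The request parameter is only there so the HTTP call can be swapped out (for example with a stub in tests); in normal use it would be called as wait_for_savepoint(base_url, job_id, trigger_savepoint(base_url, job_id, target_dir)).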