Hello all,

I had 2 questions regarding savepoint fault tolerance.

Job manager restart:

   - Currently, we are triggering savepoints using REST apis. And query the
   status of savepoint by the returned handle. In case there is a network
   issue because of which we couldn't receive response then in that case how
   to find out if the savepoint in the previous request was triggered or not?
   Is there a way to add "idempotency-key" to each API request so that we can
   safely retry triggering savepoint? By doing this, we want to avoid multiple
   triggers of consecutive savepoints during job upgrades.
   - Our workflow for capturing savepoint looks like this - call POST
   /savepoint endpoint. Use the returned trigger handle to periodically poll
   the status of savepoint. Once the savepoint is completed then restore the
   job from that savepoint. We are running our flink clusters in k8s. Since
   pod IPs can get restarted / migrated quite often in k8s, it's possible that
   the JM pod that was used to capture the savepoint happens to be recycled
   before completion of savepoint. In that case, we can't query the status of
   triggered savepoint from the previously returned handle. As neither the
   newly created JM pod or any other standby JMs have information about this
   savepoint. I couldn't find any config that makes Flink persist state of
   ongoing savepoints to an external store which will allow users to query the
   status of savepoint via any available JM instance in HA setup.


Task manager restart:

   - If one of the TMs crashes during ongoing checkpoint then I believe
   that checkpoint is marked as failed and on the next checkpoint interval
   Flink triggers a new checkpoint by looking at the previously completed
   checkpoint counter. The next checkpoint attempt might get acknowledged by
   all operators and marked as completed. Is that correct? In case of
   savepoints this is not possible. So how does flink resume the savepoint
   capturing process in case of job restarts or TM failures?
   - I am sure this must be already handled but just wanted to confirm and
   get help in finding relevant code references for this so I can dig deeper
   for understanding it in depth from an educational point of view.


-
Dhanesh Arole ( Sent from mobile device. Pardon me for typos )

Reply via email to