Hi Dhanesh,

We recommend using savepoints only for migrations, investigations, A/B testing, and time travel, and relying entirely on checkpoints for fault tolerance. Are you using them differently?
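For reference, relying on checkpoints for fault tolerance mostly means enabling them and retaining them on cancellation. Below is a minimal sketch with the DataStream API; the class name, 60 s interval, cleanup mode, and placeholder pipeline are only example values:

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60 s; on failure Flink restarts from the latest completed
        // checkpoint, so no manual savepoint is needed for recovery.
        env.enableCheckpointing(60_000);

        // Keep the latest checkpoint around even when the job is cancelled, so it can
        // also be used to restart the job manually later.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Placeholder pipeline just to make the sketch runnable.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpointed-job");
    }
}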
> Currently, we are triggering savepoints using REST APIs and querying the
> status of the savepoint via the returned handle. In case there is a network
> issue and we don't receive the response, how can we find out whether the
> savepoint in the previous request was triggered or not? Is there a way to
> add an "idempotency-key" to each API request so that we can safely retry
> triggering the savepoint? By doing this, we want to avoid multiple triggers
> of consecutive savepoints during job upgrades.

I think you'd have to use your logging system and have a metric/trigger on the respective log line. I don't think there is any REST API for that.

> Our workflow for capturing a savepoint looks like this: call the POST
> /savepoint endpoint, use the returned trigger handle to periodically poll
> the status of the savepoint, and once the savepoint is completed, restore
> the job from that savepoint. We are running our Flink clusters in K8s.
> Since pods can get restarted / migrated quite often in K8s, it's possible
> that the JM pod that was used to capture the savepoint gets recycled before
> the savepoint completes. In that case, we can't query the status of the
> triggered savepoint via the previously returned handle, as neither the
> newly created JM pod nor any of the standby JMs has information about this
> savepoint. I couldn't find any config that makes Flink persist the state of
> ongoing savepoints to an external store, which would allow users to query
> the status of a savepoint via any available JM instance in an HA setup.

Not an expert on K8s, but couldn't you expose the JM as a K8s Service? That should follow the migration automatically.

> If one of the TMs crashes during an ongoing checkpoint, then I believe that
> checkpoint is marked as failed, and at the next checkpoint interval Flink
> triggers a new checkpoint based on the previously completed checkpoint
> counter. The next checkpoint attempt might get acknowledged by all
> operators and marked as completed. Is that correct? In the case of
> savepoints this is not possible, so how does Flink resume the savepoint
> capturing process in case of job restarts or TM failures?

Savepoints have to be triggered anew. Savepoints are meant as a purely manual feature. Again, you could automate it if you look at the logs.
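To make the trigger-and-poll workflow described above concrete, here is a minimal sketch against Flink's REST API, where POST /jobs/:jobid/savepoints returns a trigger id and GET /jobs/:jobid/savepoints/:triggerid reports its status. The JobManager address, job id, and target directory are placeholders, and the crude JSON handling is only there to keep the sketch dependency-free:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SavepointTrigger {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();
    private static final String JM = "http://flink-jobmanager:8081"; // placeholder address
    private static final String JOB_ID = "<job-id>";                 // placeholder job id

    public static void main(String[] args) throws Exception {
        // 1. Trigger the savepoint: POST /jobs/:jobid/savepoints returns a request (trigger) id.
        String body = "{\"target-directory\": \"s3://my-bucket/savepoints\", \"cancel-job\": false}";
        HttpRequest trigger = HttpRequest.newBuilder(URI.create(JM + "/jobs/" + JOB_ID + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        String triggerResponse = CLIENT.send(trigger, HttpResponse.BodyHandlers.ofString()).body();
        String triggerId = extract(triggerResponse, "request-id");

        // 2. Poll GET /jobs/:jobid/savepoints/:triggerid until the async operation completes.
        while (true) {
            HttpRequest poll = HttpRequest.newBuilder(
                    URI.create(JM + "/jobs/" + JOB_ID + "/savepoints/" + triggerId)).GET().build();
            String status = CLIENT.send(poll, HttpResponse.BodyHandlers.ofString()).body();
            if (status.contains("\"COMPLETED\"")) {
                // The completed response carries the savepoint location (or a failure cause)
                // under the "operation" field.
                System.out.println("Savepoint finished: " + status);
                break;
            }
            Thread.sleep(5_000); // back off between polls
        }
    }

    // Crude JSON field extraction to keep the sketch self-contained; use a real JSON parser in practice.
    private static String extract(String json, String field) {
        Matcher m = Pattern.compile("\"" + field + "\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        if (!m.find()) {
            throw new IllegalStateException("Field " + field + " not found in: " + json);
        }
        return m.group(1);
    }
}

Note that there is no idempotency key here: if the POST response is lost, the only way to tell that a savepoint was actually triggered is the JobManager logs, as mentioned above.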
Best,

Arvid

On Fri, Apr 16, 2021 at 12:33 PM dhanesh arole <davcdhane...@gmail.com> wrote:

> Hello all,
>
> I had 2 questions regarding savepoint fault tolerance.
>
> Job manager restart:
>
> - Currently, we are triggering savepoints using REST APIs and querying the
>   status of the savepoint via the returned handle. In case there is a
>   network issue and we don't receive the response, how can we find out
>   whether the savepoint in the previous request was triggered or not? Is
>   there a way to add an "idempotency-key" to each API request so that we
>   can safely retry triggering the savepoint? By doing this, we want to
>   avoid multiple triggers of consecutive savepoints during job upgrades.
> - Our workflow for capturing a savepoint looks like this: call the POST
>   /savepoint endpoint, use the returned trigger handle to periodically poll
>   the status of the savepoint, and once the savepoint is completed, restore
>   the job from that savepoint. We are running our Flink clusters in K8s.
>   Since pods can get restarted / migrated quite often in K8s, it's possible
>   that the JM pod that was used to capture the savepoint gets recycled
>   before the savepoint completes. In that case, we can't query the status
>   of the triggered savepoint via the previously returned handle, as neither
>   the newly created JM pod nor any of the standby JMs has information about
>   this savepoint. I couldn't find any config that makes Flink persist the
>   state of ongoing savepoints to an external store, which would allow users
>   to query the status of a savepoint via any available JM instance in an HA
>   setup.
>
> Task manager restart:
>
> - If one of the TMs crashes during an ongoing checkpoint, then I believe
>   that checkpoint is marked as failed, and at the next checkpoint interval
>   Flink triggers a new checkpoint based on the previously completed
>   checkpoint counter. The next checkpoint attempt might get acknowledged by
>   all operators and marked as completed. Is that correct? In the case of
>   savepoints this is not possible, so how does Flink resume the savepoint
>   capturing process in case of job restarts or TM failures?
> - I am sure this must be handled already, but I just wanted to confirm and
>   get help finding the relevant code references so I can dig deeper and
>   understand it in depth from an educational point of view.
>
> -
> Dhanesh Arole ( Sent from mobile device. Pardon me for typos )