Re: Flink Operator - Supporting Recovery from Snapshot

Gyula Fóra Wed, 08 Feb 2023 00:47:22 -0800

Hi Kevin!

Thanks for starting this discussion.

At a high level, what you are proposing is quite simple: if the initial
savepoint path changes, we use the new path for the upgrade.

I see a few caveats here that may be important:

 1. To use a new savepoint/checkpoint path for recovery we have to stop the
job and delete all HA metadata. This means that this operation may not be
"reversible" in some cases because we lose the checkpoint info with the HA
metadata (unless we force a savepoint on shutdown).
 2. This will break the current upgrade/checkpoint ownership model in which
the operator controls the checkpoints and ensures that you always get the
latest (or an error). It will also make the reconciliation logic more
complex.
 3. This could be a breaking change for current users (if for some reason
they rely on the current behaviour, which would be unusual but is possible).
 4. The name initialSavepointPath becomes a bit misleading
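
To make the proposal and caveat 1 concrete, here is a rough sketch of the
decision the operator would have to make. It is written in Python only for
brevity (the operator itself is Java), and JobSpec and plan_redeploy are
hypothetical names for illustration, not actual operator classes or APIs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JobSpec:
    # Hypothetical, trimmed-down stand-in for the job section of a
    # FlinkDeployment; only the field discussed in this thread is modelled.
    initial_savepoint_path: Optional[str] = None


def plan_redeploy(last_reconciled: JobSpec, desired: JobSpec) -> List[str]:
    """Return the steps the operator would take under the proposal."""
    new_path = desired.initial_savepoint_path
    if new_path is None or new_path == last_reconciled.initial_savepoint_path:
        # Path unset or unchanged: the current upgrade flow stays as it is.
        return []
    # Path changed: treat it like a nonce and restore from the new snapshot.
    return [
        "stop the job",
        "delete HA metadata (checkpoint info is lost unless a savepoint is forced)",
        f"start the job from {new_path}",
    ]
```

With an unchanged path this returns an empty plan; with a changed path the
second step is the one that may not be reversible.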

I agree that it would be nice to make this easier for the user, but the
question is whether what we gain by this is worth the extra complexity.
I think under normal circumstances the user does not really want to
suddenly redeploy the job starting from a new state. If that happens I
think it makes sense to create a new deployment resource and it's not a
very big overhead.

Currently, "manual" recovery is needed in cases where the operator
loses track of the latest checkpoint, mostly due to "incorrect" error
handling on the Flink side that also deletes the HA metadata. I think we
should strive to improve and eliminate most of these cases (as we have
already done for many of these problems).

Would be great to hear what others think about this topic!

Cheers,
Gyula

On Tue, Feb 7, 2023 at 10:43 PM Kevin Lam <[email protected]>
wrote:

> Hello,
>
> I was reading the Flink Kubernetes Operator documentation and noticed that
> if you want to redeploy a Flink job from a specific snapshot, you must
> follow these manual recovery steps. Are there plans to streamline this
> process? Deploying from a specific snapshot is a relatively common
> operation and it'd be nice to not need to delete the FlinkDeployment.
>
> I wonder if the Flink Operator could use the initialSavepointPath similar
> to the restartNonce and savepointTriggerNonce parameters, where if
> initialSavepointPath changes, the deployed job is restored from the
> specified savepoint. Any thoughts?
>
> Thanks!
>
