Hi Kevin! Thanks for starting this discussion.

At a high level, what you are proposing is quite simple: if the initial savepoint path changes, we use that for the upgrade.

I see a few caveats here that may be important:

1. To use a new savepoint/checkpoint path for recovery we have to stop the job and delete all HA metadata. This means that this operation may not be "reversible" in some cases, because we lose the checkpoint info along with the HA metadata (unless we force a savepoint on shutdown).
2. This will break the current upgrade/checkpoint ownership model, in which the operator controls the checkpoints and ensures that you always get the latest one (or an error). It will also make the reconciliation logic more complex.
3. This could be a breaking change for current users (if for some reason they rely on the current behaviour, which is weird but still true).
4. The name initialSavepointPath becomes a bit misleading.

I agree that it would be nice to make this easier for the user, but the question is whether what we gain is worth the extra complexity. I think under normal circumstances the user does not really want to suddenly redeploy the job starting from a new state. If that happens, I think it makes sense to create a new deployment resource, and that is not a very big overhead.

Currently, the cases where "manual" recovery is needed are those where the operator loses track of the latest checkpoint, mostly due to "incorrect" error handling on the Flink side that also deletes the HA metadata. I think we should strive to improve and eliminate most of these cases (as we have already done for many of these problems).

Would be great to hear what others think about this topic!

Cheers,
Gyula

On Tue, Feb 7, 2023 at 10:43 PM Kevin Lam <kevin....@shopify.com.invalid> wrote:

> Hello,
>
> I was reading the Flink Kubernetes Operator documentation and noticed that
> if you want to redeploy a Flink job from a specific snapshot, you must
> follow these manual recovery steps. Are there plans to streamline this
> process? Deploying from a specific snapshot is a relatively common
> operation and it'd be nice to not need to delete the FlinkDeployment.
>
> I wonder if the Flink Operator could use the initialSavepointPath similar
> to the restartNonce and savepointTriggerNonce parameters, where if
> initialSavepointPath changes, the deployed job is restored from the
> specified savepoint. Any thoughts?
>
> Thanks!
>
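
To make the trade-off concrete, here is a minimal sketch of the change-detection idea from the proposal, combined with caveat 1 (stop the job, delete the HA metadata, optionally force a savepoint first so the operation stays reversible). This is not the operator's actual code: `JobSpec`, `plan_actions`, and the step strings are hypothetical names made up for illustration, and a real reconciler would have to handle failures between steps.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JobSpec:
    # Hypothetical, trimmed-down stand-in for the job section of a
    # FlinkDeployment spec; only the field under discussion is modelled.
    initial_savepoint_path: Optional[str] = None


def plan_actions(last_reconciled: JobSpec, desired: JobSpec,
                 savepoint_on_shutdown: bool = False) -> List[str]:
    """Return the ordered steps a reconciler might take (illustrative only)."""
    new_path = desired.initial_savepoint_path
    # Treat the path like a nonce: act only when it is set and has changed.
    if new_path is None or new_path == last_reconciled.initial_savepoint_path:
        return []

    steps: List[str] = []
    if savepoint_on_shutdown:
        # Keeps the operation reversible: the current state survives the
        # deletion of the HA metadata in the next steps.
        steps.append("take savepoint of running job")
    steps += [
        "stop job",
        "delete HA metadata",
        f"deploy job from {new_path}",
    ]
    return steps
```

For example, changing the path from `s3://bucket/sp-1` to `s3://bucket/sp-2` (made-up paths) yields the stop, delete, and redeploy steps, while an unchanged spec yields no steps at all. Without `savepoint_on_shutdown`, the job's latest checkpoint information is gone once the HA metadata is deleted.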