Thanks for the response, Gyula! Those caveats make sense, and I see there's a fair bit of complexity to consider if the feature is implemented. I do think it would be useful, so I would also love to hear what others think!

On Wed, Feb 8, 2023 at 3:47 AM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi Kevin!
>
> Thanks for starting this discussion.
>
> On a high level what you are proposing is quite simple: if the initial
> savepoint path changes we use that for the upgrade.
>
> I see a few caveats here that may be important:
>
> 1. To use a new savepoint/checkpoint path for recovery we have to stop the
> job and delete all HA metadata. This means that this operation may not be
> "reversible" in some cases because we lose the checkpoint info with the HA
> metadata (unless we force a savepoint on shutdown).
> 2. This will break the current upgrade/checkpoint ownership model in which
> the operator controls the checkpoints and ensures that you always get the
> latest (or an error). It will also make the reconciliation logic more
> complex
> 3. This could be a breaking change for current users (if for some reason
> they rely on the current behaviour, which is weird but still true)
> 4. The name initialSavepointPath becomes a bit misleading
>
> I agree that it would be nice to make this easier for the user, but the
> question is whether what we gain by this is worth the extra complexity.
> I think under normal circumstances the user does not really want to
> suddenly redeploy the job starting from a new state. If that happens I
> think it makes sense to create a new deployment resource and it's not a
> very big overhead.
>
> Currently when "manual" recovery is needed are cases when the operator
> loses track of the latest checkpoint, mostly due to "incorrect" error
> handling on the Flink side that also deletes the HA metadata. I think we
> should strive to improve and eliminate most of these cases (as we have
> already done for many of these problems).
>
> Would be great to hear what others think about this topic!
>
> Cheers,
> Gyula
>
> On Tue, Feb 7, 2023 at 10:43 PM Kevin Lam <kevin....@shopify.com.invalid>
> wrote:
>
> > Hello,
> >
> > I was reading the Flink Kubernetes Operator documentation and noticed that
> > if you want to redeploy a Flink job from a specific snapshot, you must
> > follow these manual recovery steps. Are there plans to streamline this
> > process? Deploying from a specific snapshot is a relatively common
> > operation and it'd be nice to not need to delete the FlinkDeployment
> >
> > I wonder if the Flink Operator could use the initialSavepointPath similar
> > to the restartNonce and savepointTriggerNonce parameters, where if
> > initialSavepointPath changes, the deployed job is restored from the
> > specified savepoint. Any thoughts?
> >
> > Thanks!