Hi All!

Based on continued feedback and experience, we feel it may be a good time to introduce this functionality in a way that doesn't unexpectedly affect existing users.
Please see: https://issues.apache.org/jira/browse/FLINK-33763 for details and review.

Cheers,
Gyula

On Fri, Feb 10, 2023 at 7:27 PM Kevin Lam <kevin....@shopify.com.invalid> wrote:

> Hey Yaroslav!
>
> Awesome, good to know that approach works well for you. I think our plan as of now is to do the same: delete the current FlinkDeployment when deploying from a specific snapshot. It'll be a separate workflow from normal deployments to take advantage of the operator otherwise.
>
> Thanks!
>
> On Fri, Feb 10, 2023 at 12:23 PM Yaroslav Tkachenko <yaros...@goldsky.com.invalid> wrote:
>
> > Hi Kevin!
> >
> > In my case, I automated this workflow by first deleting the current Flink deployment and then creating a new one. So, if the initialSavepointPath is different it'll use it for recovery.
> >
> > This approach is indeed irreversible, but so far it's been working well.
> >
> > On Fri, Feb 10, 2023 at 8:17 AM Kevin Lam <kevin....@shopify.com.invalid> wrote:
> >
> > > Thanks for the response Gyula! Those caveats make sense, and I see there's a bit of complexity to consider if the feature is implemented. I do think it would be useful, so would also love to hear what others think!
> > >
> > > On Wed, Feb 8, 2023 at 3:47 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
> > >
> > > > Hi Kevin!
> > > >
> > > > Thanks for starting this discussion.
> > > >
> > > > On a high level what you are proposing is quite simple: if the initial savepoint path changes we use that for the upgrade.
> > > >
> > > > I see a few caveats here that may be important:
> > > >
> > > > 1. To use a new savepoint/checkpoint path for recovery we have to stop the job and delete all HA metadata. This means that this operation may not be "reversible" in some cases because we lose the checkpoint info with the HA metadata (unless we force a savepoint on shutdown).
> > > > 2. This will break the current upgrade/checkpoint ownership model in which the operator controls the checkpoints and ensures that you always get the latest (or an error). It will also make the reconciliation logic more complex.
> > > > 3. This could be a breaking change for current users (if for some reason they rely on the current behaviour, which is weird but still true).
> > > > 4. The name initialSavepointPath becomes a bit misleading.
> > > >
> > > > I agree that it would be nice to make this easier for the user, but the question is whether what we gain by this is worth the extra complexity. I think under normal circumstances the user does not really want to suddenly redeploy the job starting from a new state. If that happens I think it makes sense to create a new deployment resource and it's not a very big overhead.
> > > >
> > > > Currently, the cases where "manual" recovery is needed are those where the operator loses track of the latest checkpoint, mostly due to "incorrect" error handling on the Flink side that also deletes the HA metadata. I think we should strive to improve and eliminate most of these cases (as we have already done for many of these problems).
> > > >
> > > > Would be great to hear what others think about this topic!
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Tue, Feb 7, 2023 at 10:43 PM Kevin Lam <kevin....@shopify.com.invalid> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I was reading the Flink Kubernetes Operator documentation and noticed that if you want to redeploy a Flink job from a specific snapshot, you must follow these manual recovery steps. Are there plans to streamline this process? Deploying from a specific snapshot is a relatively common operation and it'd be nice to not need to delete the FlinkDeployment.
> > > > >
> > > > > I wonder if the Flink Operator could use the initialSavepointPath similar to the restartNonce and savepointTriggerNonce parameters, where if initialSavepointPath changes, the deployed job is restored from the specified savepoint. Any thoughts?
> > > > >
> > > > > Thanks!
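
For readers skimming the thread, the proposed behaviour can be sketched roughly as below. This is only an illustration of the idea in the original message combined with caveat 1 from Gyula's reply, not the operator's actual reconciliation code. `JobSpec`, `plan_reconcile`, and the step strings are hypothetical names made up for this sketch. Only the field names `initialSavepointPath` and `restartNonce` come from the thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical, heavily simplified stand-in for the job section of a FlinkDeployment."""
    initial_savepoint_path: Optional[str] = None  # mirrors initialSavepointPath
    restart_nonce: Optional[int] = None           # mirrors restartNonce


def plan_reconcile(last_reconciled: JobSpec, desired: JobSpec) -> List[str]:
    """Return illustrative steps for moving from the last reconciled spec to the desired one."""
    path_changed = (
        desired.initial_savepoint_path is not None
        and desired.initial_savepoint_path != last_reconciled.initial_savepoint_path
    )
    if path_changed:
        # Caveat 1 in the thread: restoring from a new path means stopping the job
        # and deleting the HA metadata, so the step may not be reversible.
        return [
            "stop job",
            "delete HA metadata",
            f"start job from {desired.initial_savepoint_path}",
        ]
    if desired.restart_nonce != last_reconciled.restart_nonce:
        # Nonce-style trigger: any change to the value requests a restart.
        return ["restart job"]
    return []  # nothing relevant changed


if __name__ == "__main__":
    old = JobSpec(initial_savepoint_path="s3://bucket/savepoint-1")
    new = JobSpec(initial_savepoint_path="s3://bucket/savepoint-2")
    for step in plan_reconcile(old, new):
        print(step)
```

The sketch treats a changed path exactly like a nonce. That is the part that conflicts with the checkpoint ownership model described in caveat 2, because the operator would no longer be guaranteed to restore from the latest checkpoint it tracks.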