Hi Kevin! In my case, I automated this workflow by first deleting the current FlinkDeployment and then creating a new one, so if the new spec has a different initialSavepointPath, that savepoint is used for recovery. This approach is indeed irreversible, but so far it has been working well.
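
Roughly, the automation boils down to something like the sketch below. The resource name, namespace, manifest file and savepoint path are just placeholders from my setup, not anything the operator prescribes:

    # Delete the existing FlinkDeployment so the operator tears down the running job:
    kubectl delete flinkdeployment my-job -n flink

    # Edit the manifest (my-job.yaml) so the job spec points at the snapshot to restore from:
    #   spec:
    #     job:
    #       initialSavepointPath: s3://my-bucket/savepoints/savepoint-1234

    # Re-create the deployment; the new job starts from that savepoint:
    kubectl apply -f my-job.yaml -n flink

Because the old deployment is gone, the operator no longer has any record of the latest checkpoint, and the new job starts purely from whatever initialSavepointPath points at, which is exactly why the operation is not reversible. Below the quoted thread I have also sketched how I read the nonce-style behaviour Kevin is proposing, for comparison.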
On Fri, Feb 10, 2023 at 8:17 AM Kevin Lam <kevin....@shopify.com.invalid> wrote:

> Thanks for the response Gyula! Those caveats make sense, and I see there's a bit of complexity to consider if the feature is implemented. I do think it would be useful, so I would also love to hear what others think!
>
> On Wed, Feb 8, 2023 at 3:47 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
> > Hi Kevin!
> >
> > Thanks for starting this discussion.
> >
> > On a high level what you are proposing is quite simple: if the initial savepoint path changes, we use that for the upgrade.
> >
> > I see a few caveats here that may be important:
> >
> > 1. To use a new savepoint/checkpoint path for recovery we have to stop the job and delete all HA metadata. This means that this operation may not be "reversible" in some cases, because we lose the checkpoint info along with the HA metadata (unless we force a savepoint on shutdown).
> > 2. This will break the current upgrade/checkpoint ownership model, in which the operator controls the checkpoints and ensures that you always get the latest (or an error). It will also make the reconciliation logic more complex.
> > 3. This could be a breaking change for current users (if for some reason they rely on the current behaviour, which would be odd but is still possible).
> > 4. The name initialSavepointPath would become a bit misleading.
> >
> > I agree that it would be nice to make this easier for the user, but the question is whether what we gain by this is worth the extra complexity. I think under normal circumstances the user does not really want to suddenly redeploy the job starting from a new state. If that happens, I think it makes sense to create a new deployment resource, and that is not a very big overhead.
> >
> > Currently, the cases where "manual" recovery is needed are those where the operator loses track of the latest checkpoint, mostly due to "incorrect" error handling on the Flink side that also deletes the HA metadata. I think we should strive to improve and eliminate most of these cases (as we have already done for many of these problems).
> >
> > Would be great to hear what others think about this topic!
> >
> > Cheers,
> > Gyula
> >
> > On Tue, Feb 7, 2023 at 10:43 PM Kevin Lam <kevin....@shopify.com.invalid> wrote:
> >
> > > Hello,
> > >
> > > I was reading the Flink Kubernetes Operator documentation and noticed that if you want to redeploy a Flink job from a specific snapshot, you must follow these manual recovery steps. Are there plans to streamline this process? Deploying from a specific snapshot is a relatively common operation, and it would be nice to not need to delete the FlinkDeployment.
> > >
> > > I wonder if the Flink Operator could use the initialSavepointPath similarly to the restartNonce and savepointTriggerNonce parameters, where if initialSavepointPath changes, the deployed job is restored from the specified savepoint. Any thoughts?
> > >
> > > Thanks!
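
For concreteness, my reading of Kevin's proposal is that a spec update like the one below (the path is just an example) would be enough to have the operator redeploy the running job from that savepoint, without deleting the FlinkDeployment first:

    spec:
      job:
        # Today this field is only picked up on the initial deployment; under the
        # proposal, changing it would act like restartNonce/savepointTriggerNonce
        # and trigger a restore of the running job from the new path.
        initialSavepointPath: s3://my-bucket/savepoints/savepoint-5678

As Gyula points out above, the operator would then have to stop the job and discard the HA metadata to honour the new path, which is where the caveats come from.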