Re: Flink Operator - Supporting Recovery from Snapshot

Kevin Lam Fri, 10 Feb 2023 10:26:57 -0800

Hey Yaroslav!

Awesome, good to know that approach works well for you. I think our plan as
of now is to do the same: delete the current FlinkDeployment when deploying
from a specific snapshot. It'll be a separate workflow from normal
deployments, so we can still take advantage of the operator otherwise.
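
For illustration, the delete-then-recreate workflow could be sketched
roughly as below. This is only a sketch under assumptions: the helper name
plan_redeploy_from_snapshot is hypothetical, the manifest is assumed to be
a FlinkDeployment already loaded as a dict, and the commands are built but
not executed.

```python
import copy


def plan_redeploy_from_snapshot(manifest, savepoint_path):
    """Plan a redeploy of a FlinkDeployment from a specific snapshot.

    Hypothetical helper: returns (delete_cmd, apply_cmd, new_manifest)
    and leaves running the commands to the caller.
    """
    name = manifest["metadata"]["name"]
    namespace = manifest["metadata"].get("namespace", "default")

    new_manifest = copy.deepcopy(manifest)

    # Point the fresh deployment at the snapshot to restore from.
    job = new_manifest["spec"].setdefault("job", {})
    job["initialSavepointPath"] = savepoint_path

    # Drop server-populated fields so the manifest can be created anew.
    new_manifest.pop("status", None)
    for key in ("resourceVersion", "uid", "creationTimestamp", "generation"):
        new_manifest["metadata"].pop(key, None)

    # Step 1: delete the current deployment. This must finish before step 2.
    delete_cmd = ["kubectl", "delete", "flinkdeployment", name, "-n", namespace]
    # Step 2: create the new deployment; the manifest is sent as JSON on stdin.
    apply_cmd = ["kubectl", "apply", "-n", namespace, "-f", "-"]
    return delete_cmd, apply_cmd, new_manifest
```

A caller would run delete_cmd, then pipe json.dumps(new_manifest) into
apply_cmd. As noted in the thread, this is irreversible once the old
deployment is gone.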


Thanks!

On Fri, Feb 10, 2023 at 12:23 PM Yaroslav Tkachenko
<[email protected]> wrote:

> Hi Kevin!
>
> In my case, I automated this workflow by first deleting the current Flink
> deployment and then creating a new one. So if the new deployment has a
> different initialSavepointPath, it'll use that path for recovery.
>
> This approach is indeed irreversible, but so far it's been working well.
>
> On Fri, Feb 10, 2023 at 8:17 AM Kevin Lam <[email protected]>
> wrote:
>
> > Thanks for the response Gyula! Those caveats make sense, and I see there's
> > a bit of complexity to consider if the feature is implemented. I do think
> > it would be useful, so I would also love to hear what others think!
> >
> > On Wed, Feb 8, 2023 at 3:47 AM Gyula Fóra <[email protected]> wrote:
> >
> > > Hi Kevin!
> > >
> > > Thanks for starting this discussion.
> > >
> > > On a high level what you are proposing is quite simple: if the initial
> > > savepoint path changes we use that for the upgrade.
> > >
> > > I see a few caveats here that may be important:
> > >
> > >  1. To use a new savepoint/checkpoint path for recovery we have to stop
> > > the job and delete all HA metadata. This means that this operation may
> > > not be "reversible" in some cases because we lose the checkpoint info
> > > with the HA metadata (unless we force a savepoint on shutdown).
> > >  2. This will break the current upgrade/checkpoint ownership model in
> > > which the operator controls the checkpoints and ensures that you always
> > > get the latest (or an error). It will also make the reconciliation logic
> > > more complex.
> > >  3. This could be a breaking change for current users (if for some reason
> > > they rely on the current behaviour, which is weird but still true).
> > >  4. The name initialSavepointPath becomes a bit misleading.
> > >
> > > I agree that it would be nice to make this easier for the user, but the
> > > question is whether what we gain by this is worth the extra complexity.
> > > I think under normal circumstances the user does not really want to
> > > suddenly redeploy the job starting from a new state. If that happens I
> > > think it makes sense to create a new deployment resource and it's not a
> > > very big overhead.
> > >
> > > Currently, the cases where "manual" recovery is needed are those in
> > > which the operator loses track of the latest checkpoint, mostly due to
> > > "incorrect" error handling on the Flink side that also deletes the HA
> > > metadata. I think we should strive to improve and eliminate most of
> > > these cases (as we have already done for many of these problems).
> > >
> > > Would be great to hear what others think about this topic!
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Tue, Feb 7, 2023 at 10:43 PM Kevin Lam <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I was reading the Flink Kubernetes Operator documentation and noticed
> > > > that if you want to redeploy a Flink job from a specific snapshot, you
> > > > must follow these manual recovery steps. Are there plans to streamline
> > > > this process? Deploying from a specific snapshot is a relatively common
> > > > operation and it'd be nice to not need to delete the FlinkDeployment.
> > > >
> > > > I wonder if the Flink Operator could use the initialSavepointPath
> > > > similar to the restartNonce and savepointTriggerNonce parameters, where
> > > > if initialSavepointPath changes, the deployed job is restored from the
> > > > specified savepoint. Any thoughts?
> > > >
> > > > Thanks!
> > > >
> > >
> >
>
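
The proposal in the original message (treating a change in
initialSavepointPath the way a change in restartNonce is treated) could be
sketched as a comparison between the last reconciled spec and the desired
spec. This is a hypothetical illustration, not the operator's actual
reconciliation code: the function name and the dict-shaped specs are
assumptions made for the example.

```python
def wants_restore_from_new_savepoint(last_reconciled_spec, desired_spec):
    """Return True if initialSavepointPath is set and has changed.

    Hypothetical sketch of the proposed trigger; both arguments are
    assumed to be the `spec` section of a FlinkDeployment as a dict.
    """
    old_path = last_reconciled_spec.get("job", {}).get("initialSavepointPath")
    new_path = desired_spec.get("job", {}).get("initialSavepointPath")
    # Only a newly set or changed path counts as a trigger; clearing the
    # field is treated as "no request" in this sketch.
    return new_path is not None and new_path != old_path
```

Per the caveats raised in the thread, acting on such a trigger would mean
stopping the job and deleting the HA metadata, so the check alone does not
make the operation reversible.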
