Re: Flink Operator - Supporting Recovery from Snapshot

Gyula Fóra Wed, 06 Dec 2023 07:02:41 -0800

Hi All!

Based on continued feedback and experience, we feel that it may be a
good time to introduce this functionality in a way that doesn't
unexpectedly affect existing users.


Please see: https://issues.apache.org/jira/browse/FLINK-33763 for details
and review.

Cheers,
Gyula

On Fri, Feb 10, 2023 at 7:27 PM Kevin Lam <kevin....@shopify.com.invalid>
wrote:

> Hey Yaroslav!
>
> Awesome, good to know that approach works well for you. I think our plan as
> of now is to do the same--delete the current FlinkDeployment when deploying
> from a specific snapshot. It'll be a separate workflow from normal
> deployments, so we can still take advantage of the operator otherwise.
>
> Thanks!
>
> On Fri, Feb 10, 2023 at 12:23 PM Yaroslav Tkachenko
> <yaros...@goldsky.com.invalid> wrote:
>
> > Hi Kevin!
> >
> > In my case, I automated this workflow by first deleting the current Flink
> > deployment and then creating a new one. So, if the initialSavepointPath is
> > different it'll use it for recovery.
> >
> > This approach is indeed irreversible, but so far it's been working well.
> >
> > On Fri, Feb 10, 2023 at 8:17 AM Kevin Lam <kevin....@shopify.com.invalid>
> > wrote:
> >
> > > Thanks for the response, Gyula! Those caveats make sense, and I see
> > > there's some complexity to consider if the feature is implemented. I do
> > > think it would be useful, so I would also love to hear what others think!
> > >
> > > On Wed, Feb 8, 2023 at 3:47 AM Gyula Fóra <gyula.f...@gmail.com>
> > > wrote:
> > >
> > > > Hi Kevin!
> > > >
> > > > Thanks for starting this discussion.
> > > >
> > > > On a high level what you are proposing is quite simple: if the initial
> > > > savepoint path changes we use that for the upgrade.
> > > >
> > > > I see a few caveats here that may be important:
> > > >
> > > >  1. To use a new savepoint/checkpoint path for recovery we have to stop
> > > > the job and delete all HA metadata. This means that this operation may
> > > > not be "reversible" in some cases because we lose the checkpoint info
> > > > with the HA metadata (unless we force a savepoint on shutdown).
> > > >  2. This will break the current upgrade/checkpoint ownership model in
> > > > which the operator controls the checkpoints and ensures that you always
> > > > get the latest (or an error). It will also make the reconciliation
> > > > logic more complex.
> > > >  3. This could be a breaking change for current users (if for some
> > > > reason they rely on the current behaviour, which would be unusual but
> > > > is still possible).
> > > >  4. The name initialSavepointPath becomes a bit misleading.
> > > >
> > > > I agree that it would be nice to make this easier for the user, but the
> > > > question is whether what we gain by this is worth the extra complexity.
> > > > I think under normal circumstances the user does not really want to
> > > > suddenly redeploy the job starting from a new state. If that happens I
> > > > think it makes sense to create a new deployment resource and it's not a
> > > > very big overhead.
> > > >
> > > > Currently, the cases where "manual" recovery is needed are those where
> > > > the operator loses track of the latest checkpoint, mostly due to
> > > > "incorrect" error handling on the Flink side that also deletes the HA
> > > > metadata. I think we should strive to improve and eliminate most of
> > > > these cases (as we have already done for many of these problems).
> > > >
> > > > Would be great to hear what others think about this topic!
> > > >
> > > > Cheers,
> > > > Gyula
> > > >
> > > > On Tue, Feb 7, 2023 at 10:43 PM Kevin Lam
> > > > <kevin....@shopify.com.invalid> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I was reading the Flink Kubernetes Operator documentation and noticed
> > > > > that if you want to redeploy a Flink job from a specific snapshot, you
> > > > > must follow these manual recovery steps. Are there plans to streamline
> > > > > this process? Deploying from a specific snapshot is a relatively common
> > > > > operation and it'd be nice to not need to delete the FlinkDeployment.
> > > > >
> > > > > I wonder if the Flink Operator could treat the initialSavepointPath
> > > > > similarly to the restartNonce and savepointTriggerNonce parameters,
> > > > > where if initialSavepointPath changes, the deployed job is restored
> > > > > from the specified savepoint. Any thoughts?
> > > > >
> > > > > Thanks!
> > > > >
> > > >
> > >
> >
>
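
For illustration, the idea discussed in this thread (treating a change to
initialSavepointPath as a trigger, together with the stop / delete HA
metadata steps from caveat 1) could be sketched roughly as below. This is
a minimal Python sketch under stated assumptions, not the operator's
actual reconciliation code: JobSpec, plan_actions, and the action names
are hypothetical, and the s3:// paths are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical, trimmed-down view of a job spec (not the real CRD)."""
    initial_savepoint_path: Optional[str] = None
    restart_nonce: Optional[int] = None


def plan_actions(last_reconciled: JobSpec, desired: JobSpec) -> List[str]:
    """Return illustrative action names for one reconcile pass.

    Assumption: a changed, non-empty initial_savepoint_path is treated like
    a nonce change and forces a restore from that path.
    """
    path_changed = (
        desired.initial_savepoint_path is not None
        and desired.initial_savepoint_path
        != last_reconciled.initial_savepoint_path
    )
    if path_changed:
        # Caveat 1 from the thread: the job must be stopped and the HA
        # metadata deleted, so the step may not be reversible.
        return [
            "stop_job",
            "delete_ha_metadata",
            f"start_from:{desired.initial_savepoint_path}",
        ]
    if desired.restart_nonce != last_reconciled.restart_nonce:
        # A changed nonce requests a plain restart.
        return ["restart_job"]
    return []


if __name__ == "__main__":
    old = JobSpec(initial_savepoint_path="s3://bucket/sp-1", restart_nonce=1)
    new = JobSpec(initial_savepoint_path="s3://bucket/sp-2", restart_nonce=1)
    print(plan_actions(old, new))
```

Comparing against the last reconciled spec, rather than the live job, is
what makes the path behave like a nonce here; it also shows why the name
initialSavepointPath would become misleading (caveat 4).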
