Re: Flink Operator - Supporting Recovery from Snapshot

Kevin Lam Fri, 10 Feb 2023 08:17:51 -0800

Thanks for the response, Gyula! Those caveats make sense, and I see that
there is some complexity to consider if the feature is implemented. I do
think it would be useful, so I would also love to hear what others think!
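
To make the proposal concrete, a minimal sketch of the change detection
might look like the following. This is only an illustration under stated
assumptions, not how the operator works today: JobSpec, plan_actions, the
step names and the example paths are all hypothetical, and the stop/delete
steps simply mirror caveat 1 from your reply.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical, trimmed-down stand-in for the relevant FlinkDeployment fields."""
    initial_savepoint_path: Optional[str] = None
    restart_nonce: Optional[int] = None


def plan_actions(last_reconciled: JobSpec, desired: JobSpec) -> List[str]:
    """Return the (hypothetical) steps to take when the desired spec changes."""
    path_changed = (
        desired.initial_savepoint_path != last_reconciled.initial_savepoint_path
    )
    if path_changed and desired.initial_savepoint_path is not None:
        # Caveat 1: restoring from a new path means stopping the job and
        # deleting the HA metadata first, which may not be reversible.
        return [
            "stop_job",
            "delete_ha_metadata",
            f"start_from:{desired.initial_savepoint_path}",
        ]
    if desired.restart_nonce != last_reconciled.restart_nonce:
        # Analogous to the existing nonce fields: a changed value triggers an action.
        return ["restart"]
    return []


# Example with made-up paths: only the savepoint path differs.
old = JobSpec(initial_savepoint_path="s3://bucket/sp-1")
new = JobSpec(initial_savepoint_path="s3://bucket/sp-2")
print(plan_actions(old, new))
```

The point of the sketch is that the path would be compared against the last
reconciled spec, the same way a nonce is, rather than being read only once
at first deployment.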


On Wed, Feb 8, 2023 at 3:47 AM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi Kevin!
>
> Thanks for starting this discussion.
>
> On a high level what you are proposing is quite simple: if the initial
> savepoint path changes we use that for the upgrade.
>
> I see a few caveats here that may be important:
>
>  1. To use a new savepoint/checkpoint path for recovery we have to stop the
> job and delete all HA metadata. This means that this operation may not be
> "reversible" in some cases because we lose the checkpoint info with the HA
> metadata (unless we force a savepoint on shutdown).
>  2. This will break the current upgrade/checkpoint ownership model in which
> the operator controls the checkpoints and ensures that you always get the
> latest (or an error). It will also make the reconciliation logic more
> complex.
>  3. This could be a breaking change for current users (if for some reason
> they rely on the current behaviour, which would be unusual but is possible).
>  4. The name initialSavepointPath becomes a bit misleading.
>
> I agree that it would be nice to make this easier for the user, but the
> question is whether what we gain by this is worth the extra complexity.
> I think under normal circumstances the user does not really want to
> suddenly redeploy the job starting from a new state. If that happens I
> think it makes sense to create a new deployment resource and it's not a
> very big overhead.
>
> Currently, the cases where "manual" recovery is needed are those in which
> the operator loses track of the latest checkpoint, mostly due to
> "incorrect" error handling on the Flink side that also deletes the HA
> metadata. I think we should strive to improve and eliminate most of these
> cases (as we have already done for many of these problems).
>
> Would be great to hear what others think about this topic!
>
> Cheers,
> Gyula
>
> On Tue, Feb 7, 2023 at 10:43 PM Kevin Lam <kevin....@shopify.com.invalid>
> wrote:
>
> > Hello,
> >
> > I was reading the Flink Kubernetes Operator documentation and noticed
> > that if you want to redeploy a Flink job from a specific snapshot, you
> > must
> > follow these manual recovery steps. Are there plans to streamline this
> > process? Deploying from a specific snapshot is a relatively common
> > operation, and it'd be nice not to need to delete the FlinkDeployment.
> >
> > I wonder if the Flink Operator could treat initialSavepointPath similarly
> > to the restartNonce and savepointTriggerNonce parameters, so that if
> > initialSavepointPath changes, the deployed job is restored from the
> > specified savepoint. Any thoughts?
> >
> > Thanks!
> >
>
