Hi Kevin,

In theory, as long as you move over all k8s resources, failover
should work fine on both the Flink and the Flink Operator side. The
tricky part is the handover. You will need to back up all resources
from the old cluster, shut down the old cluster, and then re-create the
resources on the new cluster. The operator deployment and the Flink
cluster should then recover fine (assuming that high availability has
been configured and that checkpoints are written to persistent storage
which is reachable from the new cluster). The operator state / Flink
state is actually kept in ConfigMaps, which would be part of the
resource dump.
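
To make the "back up, then re-create" step a bit more concrete, here is
a rough sketch of cleaning an exported resource dump before applying it
to the new cluster. It assumes the dump has already been loaded as a
list of Python dicts (for example from the output of an export in JSON
form); the function names are made up for illustration, and the list of
fields is just the usual server-assigned Kubernetes metadata, not
anything specific to the operator. Whether more fields need special
handling in your setup is something you would have to test.

```python
import copy

# Metadata fields that the API server of the old cluster assigned.
# They are meaningless (or rejected) on a different cluster.
SERVER_ASSIGNED_METADATA = (
    "resourceVersion",
    "uid",
    "creationTimestamp",
    "generation",
    "managedFields",
    "selfLink",
)


def strip_cluster_metadata(manifest):
    """Return a copy of one exported manifest without server-assigned metadata.

    Hypothetical helper: the input dict is left untouched, and everything
    outside the listed metadata fields (spec, data, labels, ...) is kept.
    """
    cleaned = copy.deepcopy(manifest)
    metadata = cleaned.get("metadata", {})
    for field in SERVER_ASSIGNED_METADATA:
        metadata.pop(field, None)
    return cleaned


def prepare_dump(manifests):
    """Clean every manifest in a resource dump (a list of dicts)."""
    return [strip_cluster_metadata(m) for m in manifests]
```

The cleaned manifests (FlinkDeployments, ConfigMaps, and so on) could
then be written back out and applied to the new cluster once the old
one has been shut down.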

This method has proven to work for Kubernetes cluster upgrades.
Moving to an entirely new cluster is a bit more involved, but exporting
all resource definitions and re-importing them into the new cluster
should yield the same result, as long as the checkpoint paths do not
change.
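
Since everything hinges on the paths staying the same, a small check
before re-importing might be worthwhile. The sketch below compares the
storage-related entries in spec.flinkConfiguration between the exported
FlinkDeployment and the one you intend to apply. The helper name is
hypothetical, and the three keys are simply the standard Flink options
for checkpoint, savepoint and HA storage; if your jobs configure these
paths elsewhere, the check would need to be adapted.

```python
# Flink options that point at persistent storage. If any of these differ
# between the old and the new definition, recovery is likely to miss state.
PATH_KEYS = (
    "state.checkpoints.dir",
    "state.savepoints.dir",
    "high-availability.storageDir",
)


def changed_paths(old_deployment, new_deployment):
    """Return {key: (old_value, new_value)} for every path option that differs.

    Both arguments are FlinkDeployment manifests loaded as dicts.
    An empty result means the storage paths are unchanged.
    """
    old_conf = old_deployment.get("spec", {}).get("flinkConfiguration", {})
    new_conf = new_deployment.get("spec", {}).get("flinkConfiguration", {})
    return {
        key: (old_conf.get(key), new_conf.get(key))
        for key in PATH_KEYS
        if old_conf.get(key) != new_conf.get(key)
    }
```

An empty dict means the paths match; anything else lists the option
together with its old and new value.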

Probably something worth trying :)

-Max



On Wed, Mar 6, 2024 at 9:09 PM Kevin Lam <kevin....@shopify.com.invalid> wrote:
>
> Another thought could be modifying the operator to have a behaviour where
> upon first deploy, it optionally (flag/param enabled) finds the most recent
> snapshot and uses that as the initialSavepointPath to restore and run the
> Flink job.
>
> On Wed, Mar 6, 2024 at 2:07 PM Kevin Lam <kevin....@shopify.com> wrote:
>
> > Hi there,
> >
> > We use the Flink Kubernetes Operator, and I am investigating how we can
> > easily support failing over a FlinkDeployment from one Kubernetes Cluster
> > to another in the case of an outage that requires us to migrate a large
> > number of FlinkDeployments from one K8s cluster to another.
> >
> > I understand one way to do this is to set `initialSavepointPath` on all the
> > FlinkDeployments to the most recent/appropriate snapshot so the jobs
> > continue from where they left off, but for a large number of jobs, this
> > would be quite a bit of manual labor.
> >
> > Do others have an approach they are using? Any advice?
> >
> > Could this be something addressed in a future FLIP? Perhaps we could store
> > some kind of metadata in object storage so that the Flink Kubernetes
> > Operator can restore a FlinkDeployment from where it left off, even if the
> > job is shifted to another Kubernetes Cluster.
> >
> > Looking forward to hearing folks' thoughts!
> >
