Hi Kevin,

Theoretically, as long as you move over all k8s resources, failover should work fine on the Flink and Flink Operator side. The tricky part is the handover. You will need to back up all resources from the old cluster, shut down the old cluster, then re-create the resources on the new cluster. The operator deployment and the Flink cluster should then recover fine (assuming that high availability has been configured and checkpointing is done to persistent storage that is available in the new cluster). The operator state / Flink state is actually kept in ConfigMaps, which would be part of the resource dump.
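
To make the export / re-import step a bit more concrete, here is a rough sketch of cleaning up a resource dump before applying it to the new cluster. It assumes the resources were already exported as plain dicts (for example parsed from a YAML or JSON dump), and that dropping the metadata fields the API server fills in is enough for the new cluster to accept them. The names `portable_manifest`, `portable_dump` and `SERVER_SET_METADATA` are made up for illustration and are not part of Flink or the operator. Whether other fields (for example `ownerReferences` or `status`) need extra handling is not covered here.

```python
import copy

# Metadata fields populated by the Kubernetes API server. Assumption: these
# should be dropped so the new cluster can assign its own values.
SERVER_SET_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "generation",
    "managedFields",
    "selfLink",
)


def portable_manifest(resource: dict) -> dict:
    """Return a copy of an exported resource without server-set metadata."""
    cleaned = copy.deepcopy(resource)  # leave the original dump untouched
    metadata = cleaned.get("metadata", {})
    for field in SERVER_SET_METADATA:
        metadata.pop(field, None)
    return cleaned


def portable_dump(resources: list) -> list:
    """Clean every resource in a dump (FlinkDeployments, ConfigMaps, ...)."""
    return [portable_manifest(r) for r in resources]
```

The cleaned dicts can then be written back out and applied to the new cluster with whatever tooling you already use. Names, namespaces, labels and the ConfigMap data are left as they were.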

This method has proven to work in case of Kubernetes cluster upgrades. Moving to an entirely new cluster is a bit more involved but exporting all resource definitions and re-importing them into the new cluster should yield the same result as long as the checkpoint paths do not change.

Probably something worth trying :)

-Max

On Wed, Mar 6, 2024 at 9:09 PM Kevin Lam <kevin....@shopify.com.invalid> wrote:
>
> Another thought could be modifying the operator to have a behaviour where
> upon first deploy, it optionally (flag/param enabled) finds the most recent
> snapshot and uses that as the initialSavepointPath to restore and run the
> Flink job.
>
> On Wed, Mar 6, 2024 at 2:07 PM Kevin Lam <kevin....@shopify.com> wrote:
>
> > Hi there,
> >
> > We use the Flink Kubernetes Operator, and I am investigating how we can
> > easily support failing over a FlinkDeployment from one Kubernetes Cluster
> > to another in the case of an outage that requires us to migrate a large
> > number of FlinkDeployments from one K8s cluster to another.
> >
> > I understand one way to do this is to set `initialSavepoint` on all the
> > FlinkDeployments to the most recent/appropriate snapshot so the jobs
> > continue from where they left off, but for a large number of jobs, this
> > would be quite a bit of manual labor.
> >
> > Do others have an approach they are using? Any advice?
> >
> > Could this be something addressed in a future FLIP? Perhaps we could store
> > some kind of metadata in object storage so that the Flink Kubernetes
> > Operator can restore a FlinkDeployment from where it left off, even if the
> > job is shifted to another Kubernetes Cluster.
> >
> > Looking forward to hearing folks' thoughts!
> >