Another thought: the operator could be modified so that, on first deploy, it optionally (enabled by a flag or parameter) finds the most recent snapshot and uses it as the `initialSavepointPath` to restore and run the Flink job.
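
To make the idea concrete, here is a rough sketch of the selection step. This is not operator code. It assumes the snapshot locations and their completion times have already been discovered somehow (for example by listing object storage), which is the part that does not exist today. The function names and the shape of the `snapshots` mapping are hypothetical; the only real name is the `spec.job.initialSavepointPath` field of the FlinkDeployment.

```python
import copy
from typing import Optional


def latest_snapshot(snapshots: dict[str, int]) -> Optional[str]:
    """Return the snapshot path with the greatest completion timestamp.

    `snapshots` maps a snapshot path to its completion time (epoch millis).
    This mapping is a hypothetical input; how it is built is out of scope.
    """
    if not snapshots:
        return None
    return max(snapshots, key=snapshots.get)


def with_initial_savepoint(deployment: dict, snapshots: dict[str, int]) -> dict:
    """Return a copy of a FlinkDeployment manifest that restores from the
    newest known snapshot. The input manifest is left unmodified."""
    patched = copy.deepcopy(deployment)
    path = latest_snapshot(snapshots)
    if path is not None:
        # spec.job.initialSavepointPath is the existing FlinkDeployment field.
        job = patched.setdefault("spec", {}).setdefault("job", {})
        job["initialSavepointPath"] = path
    return patched
```

If no snapshot is known, the manifest comes back unchanged, so a brand-new job would still start from empty state.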

On Wed, Mar 6, 2024 at 2:07 PM Kevin Lam <kevin....@shopify.com> wrote:

> Hi there,
>
> We use the Flink Kubernetes Operator, and I am investigating how we can
> easily support failing over a FlinkDeployment from one Kubernetes Cluster
> to another in the case of an outage that requires us to migrate a large
> number of FlinkDeployments from one K8s cluster to another.
>
> I understand one way to do this is to set `initialSavepoint` on all the
> FlinkDeployments to the most recent/appropriate snapshot so the jobs
> continue from where they left off, but for a large number of jobs, this
> would be quite a bit of manual labor.
>
> Do others have an approach they are using? Any advice?
>
> Could this be something addressed in a future FLIP? Perhaps we could store
> some kind of metadata in object storage so that the Flink Kubernetes
> Operator can restore a FlinkDeployment from where it left off, even if the
> job is shifted to another Kubernetes Cluster.
>
> Looking forward to hearing folks' thoughts!