One thing you could consider is a mutator that detects when a failover is happening, and then updates the CR to point to the right snapshot to restore from:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.12/docs/operations/plugins/#custom-flink-resource-mutators
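Very roughly, and assuming the FlinkResourceMutator SPI described on that page (the interface shape, package names, and the failover annotation below are illustrative only and should be checked against the operator version you actually run), such a mutator could look something like this:

// Illustrative sketch only: verify the mutator interface and the api/crd
// package names against the operator plugin docs linked above.
package com.example.failover;

import java.util.Map;
import java.util.Optional;

import org.apache.flink.kubernetes.operator.api.FlinkDeployment;
import org.apache.flink.kubernetes.operator.api.FlinkSessionJob;
import org.apache.flink.kubernetes.operator.mutator.FlinkResourceMutator;

public class FailoverSavepointMutator implements FlinkResourceMutator {

    // Hypothetical annotation that your failover tooling would set on the CR
    // when it re-creates the deployment in the standby cluster; its value is
    // the full S3 path (including the old jobID) of the snapshot to restore.
    private static final String FAILOVER_SNAPSHOT = "example.com/failover-snapshot";

    @Override
    public FlinkDeployment mutateDeployment(FlinkDeployment deployment) {
        Map<String, String> annotations = deployment.getMetadata().getAnnotations();
        String snapshotPath =
                annotations == null ? null : annotations.get(FAILOVER_SNAPSHOT);
        if (snapshotPath != null
                && !snapshotPath.isEmpty()
                && deployment.getSpec().getJob() != null) {
            // Point the new job at the last savepoint/checkpoint written by the
            // original cluster instead of starting from an empty state.
            deployment.getSpec().getJob().setInitialSavepointPath(snapshotPath);
        }
        return deployment;
    }

    @Override
    public FlinkSessionJob mutateSessionJob(
            FlinkSessionJob sessionJob, Optional<FlinkDeployment> session) {
        // No changes for session jobs in this sketch.
        return sessionJob;
    }
}

Something (your deployment tooling, or a small controller watching S3) would still have to decide which snapshot path to put in that annotation, but the spec you submit to the standby cluster stays the same, which is closer to the zero-touch failover you're after.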
On Thu, Jun 12, 2025 at 11:00 AM Pedro Mázala <pedroh.maz...@gmail.com> wrote:

> Got it.
>
> I'd say what you want to achieve is something like having a
> "latest-savepoint" symlink always pointing to the latest written savepoint
> file, and always starting from that.
>
> To achieve this, you'd need some manual work. IIRC, you cannot set a
> jobID via the operator.
>
> Att,
> Pedro Mázala
> Be awesome
>
> On Thu, 12 Jun 2025 at 16:48, gustavo panizzo <g...@zumbi.com.ar> wrote:
>
>> Hello,
>>
>> That would indeed work, but it requires knowing in advance the last jobID
>> for that particular job and changing the spec submitted to the destination
>> cluster. We aim to have zero-touch job failover from one k8s cluster to
>> another.
>>
>> Our clusters are multi-node and multi-AZ, but they run critical business
>> processes, hence we want to protect against region failure.
>>
>> On Thu, Jun 12, 2025, at 4:30 PM, Pedro Mázala wrote:
>>
>> Using the Flink k8s operator, you may use the YAML property
>> job.initialSavepointPath to set the path that you want to start your
>> pipeline from. This would be the full path, including the jobID. A new
>> ID will then be generated for the restored job, and so on.
>>
>> To avoid maintenance issues like this one, a multi-node cluster may help
>> you. k8s will try to spread the deployments among the different nodes.
>> Even if one dies, it will make sure everything is there thanks to the k8s
>> desired-state mechanism.
>>
>> Att,
>> Pedro Mázala
>> Be awesome
>>
>> On Thu, 12 Jun 2025 at 15:52, gustavo panizzo <g...@zumbi.com.ar> wrote:
>>
>> Hello,
>>
>> I run Flink (v 1.20) on k8s using the native integration and the k8s
>> operator (v 1.30); we keep savepoints and checkpoints in S3.
>>
>> We'd like to be able to continue running the same jobs (with the same
>> config, same image, using the same sinks and sources, connecting to Kafka
>> with the same credentials and groups, and restoring the state from where
>> the previous job left off) from another k8s cluster in the event of
>> maintenance or simply failure of the k8s cluster, hence we need to restore
>> the state from a savepoint or checkpoint.
>>
>> However, the problem we face is that the jobID is part of the path where
>> checkpoints and savepoints are stored in S3, and it is generated
>> dynamically every time a job (kind: flinkdeployments) is deployed into k8s.
>>
>> So I cannot re-create the same job in another k8s cluster to pick up where
>> the previous job left off.
>>
>> I could copy files around in S3, but that feels racy and not really great.
>> How do others move stateful jobs from one k8s cluster to another?
>>
>> Cheers