Re: Resume running a stateful job in a different k8s cluster

2025-06-13 Thread Kevin Lam via user
One thing you could consider is a mutator that detects when a failover is happening, and then updates the CR to point to the right snapshot to restore from. https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.12/docs/operations/plugins/#custom-flink-resource-mutators …
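For illustration, a minimal sketch of the effect such a mutator could have on the FlinkDeployment CR during a failover; the deployment name, bucket, and savepoint path below are hypothetical, and the spec.job.initialSavepointPath field is the one discussed later in this thread:

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-critical-job          # hypothetical name
spec:
  job:
    upgradeMode: savepoint
    # Injected by the mutator when it detects a failover: it resolves
    # the latest snapshot for this job in S3 and rewrites the CR
    # before the operator reconciles it.
    initialSavepointPath: s3://my-bucket/savepoints/savepoint-1a2b3c-0f1e2d/   # hypothetical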

Re: Resume running a stateful job in a different k8s cluster

2025-06-12 Thread gustavo panizzo
Hello, that would indeed work, but it requires knowing in advance the last jobID for that particular job and changing the spec submitted to the destination cluster. We aim to have zero-touch job failover from k8s cluster to k8s cluster. Our clusters are multi-node and multi-AZ, but they run critical busi…

Re: Resume running a stateful job in a different k8s cluster

2025-06-12 Thread Pedro Mázala
Got it. I'd say what you want to achieve is something like having a "latest-savepoint" symlink always pointing to the latest written savepoint, and always starting from that. To achieve this, you'd need some manual work. IIRC, you cannot set a jobID via the operator. Best regards, Pedro Mázala
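A sketch of that idea expressed in the CR; note that S3 has no real symlinks, so the stable "latest" prefix below is an assumption that external tooling would have to maintain (e.g. a small job that copies or re-points the newest savepoint after each write):

spec:
  job:
    # Stable alias maintained outside Flink; something must copy or
    # re-point the newest savepoint to this prefix after each run.
    initialSavepointPath: s3://my-bucket/savepoints/my-job/latest/   # hypothetical alias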

Re: Resume running a stateful job in a different k8s cluster

2025-06-12 Thread Pedro Mázala
Using the Flink k8s operator, you may use the YAML property job.initialSavepointPath to set the path you want to start your pipeline from. This would be the full path, including the jobid; a new job ID will then be generated for the restored job. To avoid maintenance issues like this one, a multi-node c…
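A minimal FlinkDeployment fragment showing that property; the jar URI, bucket, and savepoint directory name are placeholders (savepoint directory names normally embed a prefix of the old job ID):

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-job
spec:
  job:
    jarURI: local:///opt/flink/usrlib/my-job.jar   # placeholder
    upgradeMode: savepoint
    # Full path to the savepoint to restore from, including the
    # directory whose name embeds (a prefix of) the old job ID:
    initialSavepointPath: s3://my-bucket/savepoints/savepoint-1a2b3c-9d8e7f/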

Resume running a stateful job in a different k8s cluster

2025-06-12 Thread gustavo panizzo
Hello, I run Flink (v1.20) on k8s using the native integration and the k8s operator (v1.30); we keep savepoints and checkpoints in S3. We'd like to be able to continue running the same jobs (with the same config, same image, using the same sinks and sources, connecting to Kafka using the same…
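For context, a hedged sketch of the S3 state configuration such a deployment would typically carry; state.checkpoints.dir and state.savepoints.dir are standard Flink options, while the bucket names and checkpoint interval are placeholders:

spec:
  flinkConfiguration:
    # Where completed checkpoints and savepoints land in S3:
    state.checkpoints.dir: s3://my-bucket/checkpoints    # placeholder bucket
    state.savepoints.dir: s3://my-bucket/savepoints      # placeholder bucket
    execution.checkpointing.interval: "60s"              # placeholder interval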