One thing you could consider is a mutator that detects when a failover is happening, and then updates the CR to point to the right snapshot to restore from:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.12/docs/operations/plugins/#custom-flink-resource-mutators
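Very roughly, and assuming the FlinkResourceMutator SPI described on that page (the interface shape, package names, and the failover annotation below are illustrative only and should be checked against the operator version you actually run), such a mutator could look something like this:

// Illustrative sketch only: verify the mutator interface and the api/crd
// package names against the operator plugin docs linked above.
package com.example.failover;

import java.util.Map;
import java.util.Optional;

import org.apache.flink.kubernetes.operator.api.FlinkDeployment;
import org.apache.flink.kubernetes.operator.api.FlinkSessionJob;
import org.apache.flink.kubernetes.operator.mutator.FlinkResourceMutator;

public class FailoverSavepointMutator implements FlinkResourceMutator {

    // Hypothetical annotation that your failover tooling would set on the CR
    // when it re-creates the deployment in the standby cluster; its value is
    // the full S3 path (including the old jobID) of the snapshot to restore.
    private static final String FAILOVER_SNAPSHOT = "example.com/failover-snapshot";

    @Override
    public FlinkDeployment mutateDeployment(FlinkDeployment deployment) {
        Map<String, String> annotations = deployment.getMetadata().getAnnotations();
        String snapshotPath =
                annotations == null ? null : annotations.get(FAILOVER_SNAPSHOT);
        if (snapshotPath != null
                && !snapshotPath.isEmpty()
                && deployment.getSpec().getJob() != null) {
            // Point the new job at the last savepoint/checkpoint written by the
            // original cluster instead of starting from an empty state.
            deployment.getSpec().getJob().setInitialSavepointPath(snapshotPath);
        }
        return deployment;
    }

    @Override
    public FlinkSessionJob mutateSessionJob(
            FlinkSessionJob sessionJob, Optional<FlinkDeployment> session) {
        // No changes for session jobs in this sketch.
        return sessionJob;
    }
}

Something (your deployment tooling, or a small controller watching S3) would still have to decide which snapshot path to put in that annotation, but the spec you submit to the standby cluster stays the same, which is closer to the zero-touch failover you're after.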
On Thu, Jun 12, 2025 at 11:00 AM Pedro Mázala <pedroh.maz...@gmail.com> wrote:

> Got it.
>
> I'd say what you want to achieve is something like having a
> "latest-savepoint" symlink always pointing to the latest written savepoint
> file, and always starting from that.
>
> To achieve this, you'd need some manual work. IIRC, you cannot set a
> jobID via the operator.
>
> Att,
> Pedro Mázala
> Be awesome
>
> On Thu, 12 Jun 2025 at 16:48, gustavo panizzo <g...@zumbi.com.ar> wrote:
>
>> Hello,
>>
>> That would indeed work, but it requires knowing in advance the last jobID
>> for that particular job and changing the spec submitted to the destination
>> cluster. We aim to have zero-touch job failover from one k8s cluster to
>> another.
>>
>> Our clusters are multi-node and multi-AZ, but they run critical business
>> processes, hence we want to protect against region failure.
>>
>> On Thu, Jun 12, 2025, at 4:30 PM, Pedro Mázala wrote:
>>
>> Using the Flink k8s operator, you may use the YAML property
>> job.initialSavepointPath to set the path that you want to start your
>> pipeline from. This would be the full path, including the jobID. A new
>> ID will then be generated for the restored job, and so on.
>>
>> To avoid maintenance issues like this one, a multi-node cluster may help
>> you. k8s will try to spread the deployments among the different nodes.
>> Even if one dies, it will make sure everything is there thanks to the k8s
>> desired-state mechanism.
>>
>> Att,
>> Pedro Mázala
>> Be awesome
>>
>> On Thu, 12 Jun 2025 at 15:52, gustavo panizzo <g...@zumbi.com.ar> wrote:
>>
>> Hello,
>>
>> I run Flink (v 1.20) on k8s using the native integration and the k8s
>> operator (v 1.30); we keep savepoints and checkpoints in S3.
>>
>> We'd like to be able to continue running the same jobs (with the same
>> config, same image, using the same sinks and sources, connecting to Kafka
>> with the same credentials and groups, and restoring the state from where
>> the previous job left off) from another k8s cluster in the event of
>> maintenance or simply failure of the k8s cluster, hence we need to restore
>> the state from a savepoint or checkpoint.
>>
>> However, the problem we face is that the jobID is part of the path where
>> checkpoints and savepoints are stored in S3, and it is generated
>> dynamically every time a job (kind: flinkdeployments) is deployed into k8s.
>>
>> So I cannot re-create the same job in another k8s cluster to pick up where
>> the previous job left off.
>>
>> I could copy files around in S3, but that feels racy and not really great.
>> How do others move stateful jobs from one k8s cluster to another?
>>
>> Cheers