Got it.

I'd say what you want to achieve is something like a "latest-savepoint"
symlink that always points to the most recently written savepoint, and to
always start from that.

To achieve this, you'd need some manual work. IIRC, you cannot set a
jobID via the operator.
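
Since S3 has no real symlinks, one way to get that "latest" pointer is a
small helper that resolves the most recent savepoint before submitting the
FlinkDeployment. A minimal sketch in Python (the bucket and prefix names are
made up, and it assumes savepoints are written under a common prefix with
the usual _metadata object per savepoint directory):

    import boto3

    s3 = boto3.client("s3")

    def latest_savepoint(bucket: str, prefix: str) -> str | None:
        """Return the directory of the most recently written savepoint under prefix."""
        newest = None
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                # Every completed savepoint directory contains a _metadata object.
                if obj["Key"].endswith("/_metadata"):
                    if newest is None or obj["LastModified"] > newest["LastModified"]:
                        newest = obj
        if newest is None:
            return None
        # initialSavepointPath expects the savepoint directory, not the _metadata file.
        return f"s3://{bucket}/{newest['Key'].rsplit('/', 1)[0]}"

    print(latest_savepoint("my-bucket", "savepoints/my-job/"))

The resolved path could then be templated into job.initialSavepointPath of
the spec submitted to the destination cluster.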




Att,
Pedro Mázala
Be awesome


On Thu, 12 Jun 2025 at 16:48, gustavo panizzo <g...@zumbi.com.ar> wrote:

> hello
>
> That would indeed work, but it requires knowing the last jobID for that
> particular job in advance and changing the spec submitted to the destination
> cluster. We aim to have zero-touch job failover from one k8s cluster to
> another.
>
> Our clusters are multi-node and multi-AZ, but they run critical business
> processes, so we also want to protect against region failure.
>
> On Thu, Jun 12, 2025, at 4:30 PM, Pedro Mázala wrote:
>
> Using the Flink k8s operator, you may use the YAML property
> job.initialSavepointPath to set the path you want to start your pipeline
> from. This would be the full path, including the jobID. The restored job
> will then get a newly generated ID, and so on.
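>
> For instance, a minimal sketch of the relevant part of a FlinkDeployment
> spec (the bucket, path and names here are made up for illustration):
>
>     apiVersion: flink.apache.org/v1beta1
>     kind: FlinkDeployment
>     metadata:
>       name: my-job
>     spec:
>       job:
>         jarURI: local:///opt/flink/usrlib/my-job.jar
>         upgradeMode: savepoint
>         initialSavepointPath: s3://my-bucket/savepoints/savepoint-abc123-0123456789ab
>         state: running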
>
> To avoid maintenance issues like this one, a multi-node cluster may help
> you. k8s will try to spread the deployments across the different nodes, and
> even if one node dies, the desired-state mechanism will make sure everything
> is brought back up.
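>
> If you want to be explicit about that spreading, the FlinkDeployment spec
> accepts a podTemplate, so a topology spread constraint can go there. A rough
> sketch (the label used for matching is an assumption and would need to match
> what your pods actually carry):
>
>     spec:
>       podTemplate:
>         spec:
>           topologySpreadConstraints:
>             - maxSkew: 1
>               topologyKey: topology.kubernetes.io/zone
>               whenUnsatisfiable: ScheduleAnyway
>               labelSelector:
>                 matchLabels:
>                   app: my-job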
>
>
>
> Att,
> Pedro Mázala
> Be awesome
>
>
> On Thu, 12 Jun 2025 at 15:52, gustavo panizzo <g...@zumbi.com.ar> wrote:
>
> Hello
>
> I run Flink (v 1.20) on k8s using the native integration and the k8s
> operator (v 1.30); we keep savepoints and checkpoints in S3.
>
> We'd like to be able to continue running the same jobs (with the same
> config, same image, using the same sinks and sources, connecting to Kafka
> using the same credentials and groups, restoring the state from where the
> previous job left off) from another k8s cluster in the event of maintenance
> or simply failure of the k8s cluster, hence we need to restore the state
> from a savepoint or checkpoint.
>
> However, the problem we face is that the jobID is part of the path where
> checkpoints and savepoints are stored in S3, and it is generated dynamically
> every time a job (kind: FlinkDeployment) is deployed into k8s.
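>
> For illustration (bucket name made up), the layout we end up with looks
> roughly like:
>
>     s3://my-bucket/checkpoints/<jobID>/chk-<n>/
>     s3://my-bucket/savepoints/savepoint-<jobID prefix>-<random>/
>
> so the paths are only predictable once that jobID is known.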
>
> So I cannot re-create the same job in another k8s cluster to pick up where
> the previous job left off.
>
> I could copy files around in S3, but that feels racy and not really great.
> How do others move stateful jobs from one k8s cluster to another?
>
>
> cheers
>
>
>
