First, the JobManager does not store any persistent data locally when
Kubernetes HA with S3 is used.
This means that you do not need to mount a PV for the JobManager
deployment.
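
For reference, a minimal flink-conf.yaml sketch of such a setup could look
like the following (the cluster-id and bucket names are placeholders, and
the "high-availability: kubernetes" shorthand assumes Flink 1.15):

      # Enable the Kubernetes HA services; leader election and pointers to
      # the HA metadata are kept in ConfigMaps owned by this cluster-id.
      kubernetes.cluster-id: <cluster-id>
      high-availability: kubernetes
      # Job graphs, checkpoint pointers, etc. are persisted here rather
      # than on a local PV, so no volume mount is required.
      high-availability.storageDir: s3://<bucket-name>/flink/<app-name>/recovery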

Secondly, node failures or terminations should not cause
the CrashLoopBackOff status.
One possible cause I can imagine is the bug FLINK-28265 [1], which is
fixed in 1.15.3.

BTW, it would be great if you could share the logs of both the initial
JobManager pod and the crashed JobManager pod.
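
In case it helps, one way to collect them (the pod name is a placeholder;
add "-n <namespace>" if the cluster is not in the default namespace):

      # logs of the current (restarted) JobManager container
      kubectl logs <jobmanager-pod-name>
      # logs of the previous, crashed container of the same pod
      kubectl logs <jobmanager-pod-name> --previous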

[1]. https://issues.apache.org/jira/browse/FLINK-28265


Best,
Yang


Vijay Jammi <vjammi.apa...@gmail.com> wrote on Fri, Jan 6, 2023, 04:24:

> Hi,
>
> Have a query on the Job Manager HA for flink 1.15.
>
> We currently run a standalone Flink cluster with a single JobManager and
> multiple TaskManagers, deployed on top of a Kubernetes (EKS) cluster in
> application mode (reactive mode).
>
> The Task Managers are deployed as a ReplicaSet and the single Job Manager
> is configured to be highly available using the Kubernetes HA services with
> recovery data being written to S3:
>       high-availability.storageDir: s3://<bucket-name>/flink/<app-name>/recovery
>
> We also have configured our cluster for the rocksdb state backend with
> checkpoints being written to S3:
>       state.backend: rocksdb
>       state.checkpoints.dir: s3://<bucket-name>/flink/<app-name>/checkpoints
>
> Now, to test the Job Manager HA, when we delete the Job Manager deployment
> (to simulate a Job Manager crash), we see that Kubernetes (EKS) detects
> the failure, launches a new Job Manager pod, and is able to recover the
> application cluster from the last successful checkpoint ("Restoring job
> 000....0000 from Checkpoint 5 @ 167...3692 for 000....0000 located at
> s3://.../checkpoints/00000...0000/chk-5").
>
> However, if we terminate the underlying node (EC2 instance) on which the
> Job Manager pod is scheduled, the cluster is unable to recover from this
> scenario. What we are seeing is that Kubernetes, as usual, repeatedly
> retries launching a new Job Manager pod, but this time the Job Manager is
> unable to find the checkpoint to recover from ("No checkpoint found during
> restore"), eventually going into a CrashLoopBackOff status after the
> maximum number of restart attempts.
>
> Now the question is: does the Job Manager need to be configured to store
> its state in a local working directory backed by persistent volumes? Any
> pointers on how we can recover the cluster from such node failures or
> terminations?
>
> Vijay Jammi
>
