First, the JobManager does not store any persistent data locally when Kubernetes HA with S3 is used. This means you do not need to mount a PV for the JobManager deployment.
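For reference, a minimal flink-conf.yaml sketch for standalone Kubernetes HA under this setup might look like the following (the cluster-id, bucket, and app names are placeholders, not taken from your configuration):

    # Use the Kubernetes HA services; leader election and HA metadata pointers
    # live in ConfigMaps, while job graphs and checkpoint metadata go to S3.
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    kubernetes.cluster-id: <cluster-id>
    high-availability.storageDir: s3://<bucket-name>/flink/<app-name>/recovery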
Secondly, node failures or terminations should not cause the CrashLoopBackOff status. One possible cause I can imagine is the bug FLINK-28265 [1], which is fixed in 1.15.3. BTW, it would be great if you could share the logs of the initial JobManager pod and the crashed JobManager pod (a sketch of the kubectl commands for this follows the quoted message below).

[1] https://issues.apache.org/jira/browse/FLINK-28265

Best,
Yang

Vijay Jammi <vjammi.apa...@gmail.com> wrote on Fri, Jan 6, 2023 at 04:24:

> Hi,
>
> Have a query on Job Manager HA for Flink 1.15.
>
> We currently run a standalone Flink cluster with a single JobManager and
> multiple TaskManagers, deployed on top of a Kubernetes cluster (EKS
> cluster) in application mode (reactive mode).
>
> The TaskManagers are deployed as a ReplicaSet, and the single JobManager
> is configured to be highly available using the Kubernetes HA services,
> with recovery data written to S3:
>     high-availability.storageDir: s3://<bucket-name>/flink/<app-name>/recovery
>
> We have also configured the cluster with the RocksDB state backend, with
> checkpoints written to S3:
>     state.backend: rocksdb
>     state.checkpoints.dir: s3://<bucket-name>/flink/<app-name>/checkpoints
>
> Now, to test the JobManager HA, when we delete the JobManager deployment
> (to simulate a JobManager crash), Kubernetes (EKS) detects the failure,
> launches a new JobManager pod, and recovers the application cluster from
> the last successful checkpoint (Restoring job 000....0000 from
> Checkpoint 5 @ 167...3692 for 000....0000 located at
> s3://.../checkpoints/00000...0000/chk-5).
>
> However, if we terminate the underlying node (EC2 instance) on which the
> JobManager pod is scheduled, the cluster is unable to recover. Kubernetes
> tries repeatedly to launch a new JobManager pod, but this time the
> JobManager is unable to find the checkpoint to recover from (No
> checkpoint found during restore), eventually going into a
> CrashLoopBackOff status after the maximum number of restart attempts.
>
> So the query is: does the JobManager need to be configured to store its
> state in a local working directory on persistent volumes? Any pointers
> on how we can recover the cluster from such node failures or
> terminations?
>
> Vijay Jammi
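A minimal sketch for collecting those logs, assuming kubectl access to the cluster; the label selector and pod name are placeholders, adjust them to whatever your deployment actually uses:

    # List the JobManager pods and their restart counts
    kubectl get pods -l app=<jobmanager-label>
    # Logs of the current (possibly crash-looping) JobManager container
    kubectl logs <jobmanager-pod-name>
    # Logs of the previous container instance, if the pod restarted in place
    # (a pod rescheduled after node termination gets a new name instead)
    kubectl logs <jobmanager-pod-name> --previous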