Flink High-Availability and Job-Manager recovery

Koffman, Noa (Nokia - IL/Kfar Sava) Thu, 03 Feb 2022 06:11:58 -0800

Hi all,
We are currently deploying Flink on a 3-node k8s cluster, with 1 job-manager 
and 3 task managers.
We are trying to understand the recommended deployment, more 
specifically for recovery from job-manager failure, and have some questions 
about that:



  1.  If we use a Flink HA solution (either Kubernetes HA or ZooKeeper), the 
documentation states we should define ‘high-availability.storageDir’.

In the examples we found, this is mostly HDFS or S3 storage.

We were wondering if we could use Kubernetes PersistentVolumes and 
PersistentVolumeClaims instead. If we do, can each job-manager have its own 
volume, or must it be shared?

  2.  Is there a solution for job-manager recovery without HA? With the way our 
Flink is currently configured, killing the job-manager pod causes all the jobs 
to be lost.

Is there a way to configure the job-manager so that if it goes down and k8s 
restarts it, it will continue from the same state (restart all the tasks, etc…)?

For this, can a Persistent Volume be used, without HDFS or external solutions?

  3.  Regarding the deployment mode: we are working with Beam + Flink, and 
Flink is running in session mode. We have a few long-running streaming 
pipelines deployed (fewer than 10).

Is ‘session’ mode the right deployment mode for our type of deployment, or 
should we consider switching to something different (per-job or application 
mode)?



Thanks
