Hi all,

We are currently deploying Flink on a 3-node Kubernetes cluster, with 1 JobManager and 3 TaskManagers. We are trying to understand the recommended deployment setup, specifically for recovery from a JobManager failure, and have some questions about that:
1. If we use a Flink HA solution (either Kubernetes HA or ZooKeeper), the documentation states that we should define ‘high-availability.storageDir’. In the examples we found, it mostly points to HDFS or S3 storage. We were wondering whether we could use Kubernetes PersistentVolumes and PersistentVolumeClaims instead. If we do, can each JobManager have its own volume, or must the volume be shared?
2. Is there a solution for JobManager recovery without HA? With the way our Flink is currently configured, killing the JobManager pod loses all the jobs. Is there a way to configure the JobManager so that, if it goes down and k8s restarts it, it continues from the same state (restarts all the tasks, etc…)? Can a Persistent Volume be used for this, without HDFS or other external solutions?
3. Regarding the deployment mode: we are working with Beam + Flink, and Flink is running in session mode. We have a few long-running streaming pipelines deployed (fewer than 10). Is ‘session’ mode the right deployment mode for our type of deployment, or should we consider switching to something different (per-job/application)?

Thanks