Hello,

On k8s the current recommendation is to set up one JobManager with HA enabled, so that the cluster does not lose its state upon a crash.
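For illustration, such a setup could look like the following flink-conf.yaml fragment (a hedged sketch: the cluster id and storage path are example values, and it assumes the Kubernetes HA services available since Flink 1.12):

```yaml
# Sketch of a flink-conf.yaml fragment for Kubernetes HA.
# kubernetes.cluster-id and the storage path are example values.
kubernetes.cluster-id: my-flink-cluster
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Every JobManager must see the same directory at the same path,
# e.g. a shared PV mounted at /data on each pod:
high-availability.storageDir: file:///data/flink-ha
```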
1. The storage dir can certainly be on a Kubernetes PV. The directory must be shared among all JMs: map the volume to the same local directory (e.g. /data) on each pod, so that the configuration is identical across JMs.

2. You can run only one JM, but you still need to enable HA, since HA is what writes the cluster state into ZooKeeper (or the Kubernetes ConfigMaps) and the storage dir.

3. I don't know anything about Beam, so I can't help you there. But per-job mode is not available on k8s (neither native nor standalone Kubernetes):
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#per-job-mode
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/#per-job-mode
You would need YARN for that (I think Mesos is deprecated). Application mode can be a bit tricky to understand: it "moves" the submission of the job inside the JM.

The right choice will depend on your deployment needs; I can't tell without knowing more. But session mode with streaming job deployment is pretty standard, and you can easily emulate "one cluster per job" with it (better for ops and tuning than one cluster running multiple jobs).

Hope this helps,
Regards,
Bastien

On Thu, Feb 3, 2022 at 15:12, Koffman, Noa (Nokia - IL/Kfar Sava) <noa.koff...@nokia.com> wrote:

> Hi all,
>
> We are currently deploying Flink on a 3-node k8s cluster, with 1 job-manager and 3 task managers.
>
> We are trying to understand the recommendation for deployment, more specifically for recovery from job-manager failure, and have some questions about that:
>
> 1. If we use the Flink HA solution (either Kubernetes HA or ZooKeeper), the documentation states we should define 'high-availability.storageDir'.
> In the examples we found, there is mostly HDFS or S3 storage.
> We were wondering if we could use Kubernetes PersistentVolumes and PersistentVolumeClaims; if we do use that, can each job-manager have its own volume?
> Or must it be shared?
>
> 2. Is there a solution for job-manager recovery without HA? With the way our Flink is currently configured, killing the job-manager pod loses all the jobs.
> Is there a way to configure the job-manager so that if it goes down and k8s restarts it, it will continue from the same state (restart all the tasks, etc.)?
> For this, can a PersistentVolume be used, without HDFS or external solutions?
>
> 3. Regarding the deployment mode: we are working with Beam + Flink, and Flink is running in session mode; we have a few long-running streaming pipelines deployed (fewer than 10).
> Is session mode the right deployment mode for our type of deployment, or should we consider switching to something different (per-job/application)?
>
> Thanks
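For reference, sharing one volume across all JobManagers could be sketched like this (hedged: the PVC name, storage class, and mount path are hypothetical, and this assumes a storage backend that supports ReadWriteMany):

```yaml
# Sketch: one PVC shared by every JobManager pod, mounted at the same path.
# Names and storage class are hypothetical examples.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flink-ha-pvc
spec:
  accessModes: ["ReadWriteMany"]   # shared by all JMs, so RWX is required
  storageClassName: nfs-client     # hypothetical RWX-capable class
  resources:
    requests:
      storage: 1Gi
---
# In each JobManager pod spec, mount the same claim at the same path,
# e.g. /data, so that high-availability.storageDir can point to
# file:///data/flink-ha identically on every JM:
#   volumes:
#   - name: ha-storage
#     persistentVolumeClaim:
#       claimName: flink-ha-pvc
```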