Sounds good. Thank you!

Hao Sun
On Thu, Feb 27, 2020 at 6:52 PM Yang Wang <danrtsey...@gmail.com> wrote:

> Hi Hao Sun,
>
> I just posted the explanation to the user ML so that others who run into
> the same problem can also see it.
>
>> Given the job graph is fetched from the jar, do we still need Zookeeper
>> for HA? Maybe we still need it for checkpoint locations?
>
> Yes, we still need ZooKeeper for a complete recovery (maybe in the future
> we will have a native K8s HA based on etcd). You are right, we still need
> it for finding the checkpoint locations. ZooKeeper is also used for leader
> election and leader retrieval.
>
> Best,
> Yang
>
> On Fri, Feb 28, 2020 at 1:41 AM Hao Sun <ha...@zendesk.com> wrote:
>
>> Hi Yang, given the job graph is fetched from the jar, do we still need
>> Zookeeper for HA? Maybe we still need it for checkpoint locations?
>>
>> Hao Sun
>>
>> On Thu, Feb 27, 2020 at 5:13 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>
>>> Hi Jin Yi,
>>>
>>> For a standalone per-job cluster, recovery works a little differently.
>>> Just as you say, the user jar is built into the image, so when the
>>> JobManager fails and is relaunched by K8s, the user `main()` will be
>>> executed again to get the job graph, unlike a session cluster, which
>>> gets the job graph from high-availability storage.
>>> The job will then be submitted again and can recover from the latest
>>> checkpoint (assuming you have configured high availability).
>>>
>>> Best,
>>> Yang
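A rough sketch of the HA setup being described here, as it could look in
flink-conf.yaml. The ZooKeeper quorum, S3 bucket, and cluster id below are
placeholders, not values taken from this thread:

    # HA backed by ZooKeeper; HA metadata and checkpoint pointers go to S3
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
    high-availability.storageDir: s3://my-bucket/flink/ha/
    high-availability.cluster-id: /my-flink-job   # keep stable across JM restarts
    # Checkpoints and savepoints also live on S3, so no persistent volume is needed
    state.backend: filesystem
    state.checkpoints.dir: s3://my-bucket/flink/checkpoints/
    state.savepoints.dir: s3://my-bucket/flink/savepoints/

With something like this in place, a relaunched JobManager can locate the
latest completed checkpoint through ZooKeeper and the metadata stored under
high-availability.storageDir, which is the recovery path described above.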
>>> On Thu, Feb 27, 2020 at 2:50 PM Jin Yi <eleanore....@gmail.com> wrote:
>>>
>>>> Hi Yang,
>>>>
>>>> regarding your statement below:
>>>>
>>>> Since you are starting JM/TM with a K8s deployment, when they fail new
>>>> JM/TM will be created. If you do not set the high availability
>>>> configuration, your jobs can recover when a TM fails. However, they
>>>> cannot recover when the JM fails. With HA configured, the jobs can
>>>> always be recovered and you do not need to re-submit them.
>>>>
>>>> Does it also apply to a Flink Job Cluster? When the JM pod is restarted
>>>> by Kubernetes, the image also contains the application jar, so if the
>>>> statement also applies to the Flink Job Cluster mode, can you please
>>>> elaborate why?
>>>>
>>>> Thanks a lot!
>>>> Eleanore
>>>>
>>>> On Mon, Feb 24, 2020 at 6:36 PM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>
>>>>> Hi M Singh,
>>>>>
>>>>>> Mans - If we use the session based deployment option for K8 - I
>>>>>> thought K8 will automatically restart any failed TM or JM.
>>>>>> In the case of a failed TM - the job will probably recover, but in
>>>>>> the case of a failed JM - perhaps we need to resubmit all jobs.
>>>>>> Let me know if I have misunderstood anything.
>>>>>
>>>>> Since you are starting JM/TM with a K8s deployment, when they fail new
>>>>> JM/TM will be created. If you do not set the high availability
>>>>> configuration, your jobs can recover when a TM fails. However, they
>>>>> cannot recover when the JM fails. With HA configured, the jobs can
>>>>> always be recovered and you do not need to re-submit them.
>>>>>
>>>>>> Mans - Is there any safe way of passing creds?
>>>>>
>>>>> Yes, you are right, using a ConfigMap to pass the credentials is not
>>>>> safe. On K8s, I think you could use Secrets instead[1].
>>>>>
>>>>>> Mans - Does a task manager failure cause the job to fail? My
>>>>>> understanding is that JM failures are catastrophic while TM failures
>>>>>> are recoverable.
>>>>>
>>>>> What I mean is that the job fails, and it can then be restarted by
>>>>> your configured restart strategy[2].
>>>>>
>>>>>> Mans - So if we are saving checkpoints in S3 then there is no need
>>>>>> for disks - should we use emptyDir?
>>>>>
>>>>> Yes, if you are saving the checkpoints in S3 and also set
>>>>> `high-availability.storageDir` to S3, then you do not need a
>>>>> persistent volume. Since the local directory is only used as a local
>>>>> cache, you can directly use the overlay filesystem or emptyDir
>>>>> (better IO performance).
>>>>>
>>>>> [1].
>>>>> https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/
>>>>> [2].
>>>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#fault-tolerance
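A minimal sketch of the Secret-based approach pointed to in [1] above,
assuming AWS-style credentials injected as environment variables. The secret
name, key names, and placeholder values are illustrative only:

    apiVersion: v1
    kind: Secret
    metadata:
      name: flink-aws-credentials   # hypothetical name
    type: Opaque
    stringData:
      AWS_ACCESS_KEY_ID: "<access-key-id>"         # placeholder, never commit real values
      AWS_SECRET_ACCESS_KEY: "<secret-access-key>" # placeholder

and then, in the jobmanager/taskmanager container spec (fragment, other
fields omitted):

      containers:
        - name: taskmanager
          image: flink:1.10          # placeholder tag
          envFrom:
            - secretRef:
                name: flink-aws-credentials   # exposes the keys above as env vars

Exposed this way the credentials never sit in a ConfigMap or in the image,
and the S3 filesystem can usually pick them up through the standard AWS
credential provider chain.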
>>>>> On Tue, Feb 25, 2020 at 5:53 AM M Singh <mans2si...@yahoo.com> wrote:
>>>>>
>>>>>> Thanks Wang for your detailed answers.
>>>>>>
>>>>>> From what I understand the native_kubernetes also leans towards
>>>>>> creating a session and submitting a job to it.
>>>>>>
>>>>>> Regarding other recommendations, please see my inline comments and
>>>>>> advice.
>>>>>>
>>>>>> On Sunday, February 23, 2020, 10:01:10 PM EST, Yang Wang
>>>>>> <danrtsey...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Singh,
>>>>>>
>>>>>> Glad to hear that you are looking to run Flink on Kubernetes. I am
>>>>>> trying to answer your questions based on my limited knowledge, and
>>>>>> others could correct me and add some more supplements.
>>>>>>
>>>>>> I think the biggest difference between a session cluster and a
>>>>>> per-job cluster on Kubernetes is the isolation. For per-job, a
>>>>>> dedicated Flink cluster is started for that one job and no other
>>>>>> jobs can be submitted. Once the job is finished, the Flink cluster
>>>>>> is destroyed immediately.
>>>>>> The second point is one-step submission. You do not need to start a
>>>>>> Flink cluster first and then submit a job to the existing session.
>>>>>>
>>>>>> > Are there any benefits with regards to
>>>>>> 1. Configuring the jobs
>>>>>> No matter whether you use a per-job cluster or submit to an existing
>>>>>> session cluster, they share the same configuration mechanism. You do
>>>>>> not have to change any code or configuration.
>>>>>>
>>>>>> 2. Scaling the taskmanager
>>>>>> Since you are using the standalone cluster on Kubernetes, it does not
>>>>>> provide an active resourcemanager. You need to use external tools to
>>>>>> monitor and scale up the taskmanagers. The active integration is
>>>>>> still evolving and you could have a taste[1].
>>>>>>
>>>>>> Mans - If we use the session based deployment option for K8 - I
>>>>>> thought K8 will automatically restart any failed TM or JM.
>>>>>> In the case of a failed TM - the job will probably recover, but in
>>>>>> the case of a failed JM - perhaps we need to resubmit all jobs.
>>>>>> Let me know if I have misunderstood anything.
>>>>>>
>>>>>> 3. Restarting jobs
>>>>>> For the session cluster, you can directly cancel the job and
>>>>>> re-submit. For a per-job cluster, when the job is canceled, you need
>>>>>> to start a new per-job cluster from the latest savepoint.
>>>>>>
>>>>>> 4. Managing the flink jobs
>>>>>> The REST API and the flink command line can be used to manage the
>>>>>> jobs (e.g. flink cancel, etc.). I think there is no difference
>>>>>> between session and per-job here.
>>>>>>
>>>>>> 5. Passing credentials (in case of AWS, etc)
>>>>>> I am not sure how you provide your credentials. If you put them in a
>>>>>> config map and then mount it into the jobmanager/taskmanager pod,
>>>>>> then both session and per-job can support this.
>>>>>>
>>>>>> Mans - Is there any safe way of passing creds?
>>>>>>
>>>>>> 6. Fault tolerance and recovery of jobs from failure
>>>>>> For a session cluster, if one taskmanager crashes, then all the jobs
>>>>>> which have tasks on this taskmanager will fail.
>>>>>> Both session and per-job can be configured with high availability and
>>>>>> recover from the latest checkpoint.
>>>>>>
>>>>>> Mans - Does a task manager failure cause the job to fail? My
>>>>>> understanding is that JM failures are catastrophic while TM failures
>>>>>> are recoverable.
>>>>>>
>>>>>> > Is there any need for specifying volume for the pods?
>>>>>> No, you do not need to specify a volume for the pods. All the data in
>>>>>> the pod's local directory is temporary. When a pod crashes and is
>>>>>> relaunched, the taskmanager will retrieve the checkpoint from
>>>>>> ZooKeeper + S3 and resume from the latest checkpoint.
>>>>>>
>>>>>> Mans - So if we are saving checkpoints in S3 then there is no need
>>>>>> for disks - should we use emptyDir?
>>>>>>
>>>>>> [1].
>>>>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/native_kubernetes.html
>>>>>>
>>>>>> On Sun, Feb 23, 2020 at 2:28 AM M Singh <mans2si...@yahoo.com> wrote:
>>>>>>
>>>>>> Hey Folks:
>>>>>>
>>>>>> I am trying to figure out the options for running Flink on Kubernetes
>>>>>> and am trying to find out the pros and cons of running in Flink
>>>>>> Session vs Flink Cluster mode (
>>>>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#flink-session-cluster-on-kubernetes
>>>>>> ).
>>>>>>
>>>>>> I understand that in job mode there is no need to submit the job
>>>>>> since it is part of the job image. But what are the other pros and
>>>>>> cons of this approach vs session mode, where a job manager is
>>>>>> deployed and flink jobs can be submitted to it? Are there any
>>>>>> benefits with regards to:
>>>>>>
>>>>>> 1. Configuring the jobs
>>>>>> 2. Scaling the taskmanager
>>>>>> 3. Restarting jobs
>>>>>> 4. Managing the flink jobs
>>>>>> 5. Passing credentials (in case of AWS, etc)
>>>>>> 6. Fault tolerance and recovery of jobs from failure
>>>>>>
>>>>>> Also, we will be keeping the checkpoints for the jobs on S3. Is there
>>>>>> any need for specifying volume for the pods? If a volume is required,
>>>>>> do we need a provisioned volume, and what are the recommended
>>>>>> alternatives/considerations, especially with AWS.
>>>>>>
>>>>>> If there are any other considerations, please let me know.
>>>>>>
>>>>>> Thanks for your advice.
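For the volume question above, a sketch of what the emptyDir approach that
was suggested could look like in the taskmanager deployment's pod spec. This
is only the volume-related fragment; the container name, image tag, and
mount path are placeholders, and ports, config, and jobmanager wiring are
omitted:

      spec:
        containers:
          - name: taskmanager
            image: flink:1.10        # placeholder tag
            args: ["taskmanager"]
            volumeMounts:
              - name: flink-local-dir
                mountPath: /tmp      # scratch space for local cache only
        volumes:
          - name: flink-local-dir
            emptyDir: {}             # node-local, discarded when the pod is rescheduled

Because the checkpoints and HA metadata live in S3 (plus ZooKeeper), losing
this directory on a pod restart only costs the local cache, which matches
the advice given earlier in the thread.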