Hi Tamir,

Thanks for trying the native K8s integration.

1. We do not have a timeout for creating the Flink application cluster. The
reason is that the job submission happens on the JobManager side,
so the Flink client does not need to wait for the JobManager to be running
before it exits.

Even if the Flink client had an internal timeout, we would still have the
same problem: when the Flink client crashes, the timeout is gone with it.

Let me share another solution for the timeout. In our deployer, when a new
Flink application is created, the deployer periodically checks the
accessibility of the Flink REST endpoint. If it is not ready within the
timeout (e.g. 120s), the deployer deletes the Flink JobManager deployment
and tries to create a new one.
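The check-then-recreate loop above can be sketched roughly as follows. This
is a minimal illustration, not the deployer's actual code: the class name
`ClusterReadinessGuard` and the probe/recreate hooks are hypothetical
stand-ins — in a real deployer the probe would poll the Flink REST endpoint
over HTTP, and the recreate hook would delete and re-create the JobManager
deployment via a Kubernetes client.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the deployer-side timeout described above:
// poll a readiness probe until it succeeds or the timeout expires,
// then fall back to a recreate action.
public class ClusterReadinessGuard {

    /**
     * Polls {@code probe} every {@code pollMillis} until it returns true
     * or {@code timeoutMillis} elapses. On timeout, runs {@code recreate}
     * (e.g. delete the JobManager deployment and create a new one).
     *
     * @return true if the cluster became ready within the timeout
     */
    public static boolean awaitReadyOrRecreate(
            BooleanSupplier probe,
            Runnable recreate,
            long timeoutMillis,
            long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (probe.getAsBoolean()) {
                return true; // e.g. the REST endpoint answered
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // give up without recreating on interrupt
            }
        }
        // Not ready within the timeout: delete and recreate the
        // JobManager deployment (stubbed here as a Runnable).
        recreate.run();
        return false;
    }
}
```

Because the timeout state lives in the deployer rather than in the Flink
client, a client crash no longer loses the timeout — the deployer keeps
watching and can recover on its own.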

2. Actually, the current "flink run-application" does not support a real
attached mode (waiting until all the jobs in the application have finished).
I am curious why you have "infinite job execution" in your Flink
application cluster. Once all the jobs in the application have finished,
Flink deregisters the application and all the K8s resources should be
cleaned up.


Best,
Yang


Tamir Sagi <tamir.s...@niceactimize.com> 于2021年4月5日周一 下午11:24写道:

> Hey all,
>
> We deploy application cluster natively on Kubernetes.
>
> Are there any timeouts for job execution and cluster creation?
>
> I went over the configuration page here
> <https://ci.apache.org/projects/flink/flink-docs-stable/deployment/config.html>
> but did not find anything relevant.
>
> In order to get an indication about the cluster, we leverage the k8s
> client
> <https://github.com/fabric8io/kubernetes-client/blob/master/doc/CHEATSHEET.md#pods>
>  to watch the deployment
> <https://github.com/fabric8io/kubernetes-client/blob/master/doc/CHEATSHEET.md#deployment%23:~:text=Watching%20a%20Deployment%3A>
>  in a namespace with a specific cluster name, and respond accordingly.
>
> We define two timeouts:
>
>    1. Creating the application cluster (i.e., at present, if there are
>    errors in the pods, the k8s deployment is up but the application
>    cluster is not running).
>    2. Until the application cluster's resources get cleaned up (upon
>    completion), which guards against infinite job execution or k8s glitches.
>
>
> However, this solution is not ideal: if the client library crashes, the
> timeouts are gone with it.
> We don't want to manage these timeout states ourselves.
>
> Any suggestion or better way?
>
> Thanks,
> Tamir.
>
