Hey Yang,

Thank you for your response.
We run the application cluster programmatically. I discussed it here, with an example of how to run it from Java rather than the CLI: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Application-cluster-Best-Practice-td42011.html

Following your comment about checking the accessibility of the Flink REST endpoint ("When it is not ready within the timeout (e.g. 120s), the deployer will delete the Flink JobManager deployment and try to create a new one"): I have not actually seen that in action. I gave a non-existing image; the deployer did start the k8s deployment, but the pods failed to start (as expected), and the k8s deployment kept running indefinitely. What configuration is that? Is it possible to override it?

I delved into the flink-core and flink-kubernetes jars. Since Flink depends on Kubernetes, we both need to leverage the Kubernetes client (which Flink uses internally) to manage and inspect the resources.

Regarding your question about why we have "infinite job execution" in our Flink application cluster: my thought was about what happens if there is a bug and the job runs indefinitely, or the JobManager crashes over and over again. What happens if the resources don't get cleaned up properly? We don't want to keep the cluster up and running in that case, and we would like to get feedback about it. Since Flink does not support that, we have to inspect it externally (which makes things more complex).

We could also poll the job status using the Flink client, but that becomes useless if the job runs indefinitely.

What do you think?

Best,
Tamir.
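For what it's worth, the deployer-side readiness timeout could be sketched roughly as below. This is only an illustrative sketch: the probe is a placeholder (in practice it would issue an HTTP GET against the JobManager REST endpoint), and the delete/recreate step on timeout is left to the caller.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Minimal sketch of a deployer-side readiness timeout: poll a probe until it
// succeeds or the deadline passes. The probe is a placeholder; a real one
// would issue an HTTP GET against the JobManager REST endpoint. On a false
// result, the caller would delete the JobManager deployment and create a new one.
public final class ReadinessWait {

    public static boolean waitUntilReady(BooleanSupplier probe,
                                         Duration timeout,
                                         Duration pollInterval) throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (probe.getAsBoolean()) {
                return true;   // REST endpoint answered: cluster is up
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return false;          // timed out: recreate the deployment
    }
}
```

Keeping this loop in an external deployer (rather than the submitting client) avoids the problem Yang mentions of the timeout disappearing when the client crashes.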
________________________________
From: Yang Wang <danrtsey...@gmail.com>
Sent: Tuesday, April 6, 2021 10:36 AM
To: Tamir Sagi <tamir.s...@niceactimize.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Application cluster - Job execution and cluster creation timeouts

EXTERNAL EMAIL

Hi Tamir,

Thanks for trying the native K8s integration.

1. We do not have a timeout for creating the Flink application cluster. The reason is that the job submission happens on the JobManager side, so the Flink client does not need to wait for the JobManager to be running before exiting. I think even if the Flink client internally had a timeout, we would still have the same problem: when the Flink client crashes, the timeout is gone. I want to share another solution for the timeout. In our deployer, when a new Flink application is created, the deployer periodically checks the accessibility of the Flink REST endpoint. When it is not ready within the timeout (e.g. 120s), the deployer will delete the Flink JobManager deployment and try to create a new one.

2. Actually, the current "flink run-application" does not support a real attached mode (waiting for all the jobs in the application to finish). I am curious why you have "infinite job execution" in your Flink application cluster. If all the jobs in the application finished, Flink will deregister the application and all the K8s resources should be cleaned up.

Best,
Yang

Tamir Sagi <tamir.s...@niceactimize.com<mailto:tamir.s...@niceactimize.com>> wrote on Mon, Apr 5, 2021 at 11:24 PM:

Hey all,

We deploy an application cluster natively on Kubernetes. Are there any timeouts for job execution and cluster creation? I went over the configuration page here<https://ci.apache.org/projects/flink/flink-docs-stable/deployment/config.html> but did not find anything relevant.
In order to get an indication about the cluster, we leverage the k8s client<https://github.com/fabric8io/kubernetes-client/blob/master/doc/CHEATSHEET.md#pods> to watch the deployment<https://github.com/fabric8io/kubernetes-client/blob/master/doc/CHEATSHEET.md#deployment%23:~:text=Watching%20a%20Deployment%3A> in a namespace with a specific cluster name and respond accordingly. We define two timeouts:

1. Creating the application cluster (to date, if there are errors in the pods, the k8s deployment is up but the application cluster is not running).
2. Until the application cluster resources get cleaned up (upon completion), which guards against infinite job execution or k8s glitches.

However, this solution is not ideal: if this client library crashes, the timeouts are gone, and we don't want to manage these timeout states ourselves. Any suggestion or better way?

Thanks,
Tamir.

Confidentiality: This communication and any attachments are intended for the above-named persons only and may be confidential and/or legally privileged. Any opinions expressed in this communication are not necessarily those of NICE Actimize. If this communication has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender by e-mail immediately. Monitoring: NICE Actimize may monitor incoming and outgoing e-mails. Viruses: Although we have taken steps toward ensuring that this e-mail and attachments are free from any virus, we advise that in keeping with good computing practice the recipient should ensure they are actually virus free.
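The two external timeouts described in the thread above (cluster creation, and job execution / resource cleanup) could be tracked with a small scheduler sketch like the following. This is only an illustration under stated assumptions: the `Runnable` actions are placeholders that would delete the JobManager deployment via the k8s client, and `clusterReady()` / `clusterFinished()` are assumed to be driven by the deployment watch callbacks.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the two external timeouts: (1) cluster creation and (2) job
// execution / resource cleanup. The Runnables are placeholders that would
// delete the JobManager deployment via the k8s client; clusterReady() and
// clusterFinished() would be called from the deployment watch callbacks.
public final class ClusterTimeouts implements AutoCloseable {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final ScheduledFuture<?> creationTimeout;
    private final ScheduledFuture<?> executionTimeout;

    public ClusterTimeouts(long creationTimeoutMs, Runnable onCreationTimeout,
                           long executionTimeoutMs, Runnable onExecutionTimeout) {
        this.creationTimeout =
                scheduler.schedule(onCreationTimeout, creationTimeoutMs, TimeUnit.MILLISECONDS);
        this.executionTimeout =
                scheduler.schedule(onExecutionTimeout, executionTimeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Call when the watch reports the JobManager deployment as running. */
    public void clusterReady() {
        creationTimeout.cancel(false);
    }

    /** Call when the application finishes and its resources are cleaned up. */
    public void clusterFinished() {
        executionTimeout.cancel(false);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Note that this state lives in the watching process, so it shares the weakness raised in the thread: if that process crashes, the timers are lost.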