Hey Yang,

Thank you for your response.
We run the application cluster programmatically. I discussed it here, with an example of how to run it from Java rather than the CLI: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Application-cluster-Best-Practice-td42011.html

Following your comment about checking the accessibility of the Flink REST endpoint ("When it is not ready within the timeout (e.g. 120s), the deployer will delete the Flink JobManager deployment and try to create a new one"): I have not actually seen that in action. I gave a non-existing image; the deployer did start the k8s deployment, but the pods failed to start (as expected), and the k8s deployment kept running indefinitely. What configuration is that? Is it possible to override it?

I delved into the flink-core and flink-kubernetes jars. Since Flink depends on Kubernetes, we both need to leverage the Kubernetes client (which Flink uses internally) to manage and inspect the resources.

Regarding your question about why we have "infinite job execution" in our Flink application cluster: my thought was about what happens if there is a bug and the job runs indefinitely, or the JobManager crashes over and over again. What happens if the resources don't get cleaned up properly? We don't want to keep the cluster up and running in that case, and we would like to get feedback about it. Since Flink does not support that, we have to inspect it externally (which makes things more complex).

We could also poll the job status using the Flink client, but that becomes useless if the job runs indefinitely.

What do you think?

Best,
Tamir.
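For what it's worth, the deployer-side readiness timeout could be sketched roughly as below. This is only an illustrative sketch: the probe is a placeholder (in practice it would issue an HTTP GET against the JobManager REST endpoint), and the delete/recreate step on timeout is left to the caller.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Minimal sketch of a deployer-side readiness timeout: poll a probe until it
// succeeds or the deadline passes. The probe is a placeholder; a real one
// would issue an HTTP GET against the JobManager REST endpoint. On a false
// result, the caller would delete the JobManager deployment and create a new one.
public final class ReadinessWait {

    public static boolean waitUntilReady(BooleanSupplier probe,
                                         Duration timeout,
                                         Duration pollInterval) throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (probe.getAsBoolean()) {
                return true;   // REST endpoint answered: cluster is up
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return false;          // timed out: recreate the deployment
    }
}
```

Keeping this loop in an external deployer (rather than the submitting client) avoids the problem Yang mentions of the timeout disappearing when the client crashes.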
________________________________
From: Yang Wang <danrtsey...@gmail.com>
Sent: Tuesday, April 6, 2021 10:36 AM
To: Tamir Sagi <tamir.s...@niceactimize.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Application cluster - Job execution and cluster creation timeouts

EXTERNAL EMAIL

Hi Tamir,

Thanks for trying the native K8s integration.

1. We do not have a timeout for creating the Flink application cluster. The reason is that the job submission happens on the JobManager side, so the Flink client does not need to wait for the JobManager to be running before exiting. I think even if the Flink client internally had a timeout, we would still have the same problem: when the Flink client crashes, the timeout is gone. I want to share another solution for the timeout. In our deployer, when a new Flink application is created, the deployer periodically checks the accessibility of the Flink REST endpoint. When it is not ready within the timeout (e.g. 120s), the deployer will delete the Flink JobManager deployment and try to create a new one.

2. Actually, the current "flink run-application" does not support a real attached mode (waiting for all the jobs in the application to finish). I am curious why you have "infinite job execution" in your Flink application cluster. If all the jobs in the application finished, Flink will deregister the application and all the K8s resources should be cleaned up.

Best,
Yang

Tamir Sagi <tamir.s...@niceactimize.com<mailto:tamir.s...@niceactimize.com>> wrote on Mon, Apr 5, 2021 at 11:24 PM:

Hey all,

We deploy an application cluster natively on Kubernetes. Are there any timeouts for job execution and cluster creation? I went over the configuration page here<https://ci.apache.org/projects/flink/flink-docs-stable/deployment/config.html> but did not find anything relevant.
In order to get an indication about the cluster, we leverage the k8s client<https://github.com/fabric8io/kubernetes-client/blob/master/doc/CHEATSHEET.md#pods> to watch the deployment<https://github.com/fabric8io/kubernetes-client/blob/master/doc/CHEATSHEET.md#deployment%23:~:text=Watching%20a%20Deployment%3A> in a namespace with a specific cluster name and respond accordingly. We define two timeouts:

1. Creating the application cluster (to date, if there are errors in the pods, the k8s deployment is up but the application cluster is not running).
2. Until the application cluster resources get cleaned up (upon completion), which guards against infinite job execution or k8s glitches.

However, this solution is not ideal: if this client library crashes, the timeouts are gone, and we don't want to manage these timeout states ourselves. Any suggestion or better way?

Thanks,
Tamir.

Confidentiality: This communication and any attachments are intended for the above-named persons only and may be confidential and/or legally privileged. Any opinions expressed in this communication are not necessarily those of NICE Actimize. If this communication has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender by e-mail immediately. Monitoring: NICE Actimize may monitor incoming and outgoing e-mails. Viruses: Although we have taken steps toward ensuring that this e-mail and attachments are free from any virus, we advise that in keeping with good computing practice the recipient should ensure they are actually virus free.
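The two external timeouts described in the thread above (cluster creation, and job execution / resource cleanup) could be tracked with a small scheduler sketch like the following. This is only an illustration under stated assumptions: the `Runnable` actions are placeholders that would delete the JobManager deployment via the k8s client, and `clusterReady()` / `clusterFinished()` are assumed to be driven by the deployment watch callbacks.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the two external timeouts: (1) cluster creation and (2) job
// execution / resource cleanup. The Runnables are placeholders that would
// delete the JobManager deployment via the k8s client; clusterReady() and
// clusterFinished() would be called from the deployment watch callbacks.
public final class ClusterTimeouts implements AutoCloseable {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final ScheduledFuture<?> creationTimeout;
    private final ScheduledFuture<?> executionTimeout;

    public ClusterTimeouts(long creationTimeoutMs, Runnable onCreationTimeout,
                           long executionTimeoutMs, Runnable onExecutionTimeout) {
        this.creationTimeout =
                scheduler.schedule(onCreationTimeout, creationTimeoutMs, TimeUnit.MILLISECONDS);
        this.executionTimeout =
                scheduler.schedule(onExecutionTimeout, executionTimeoutMs, TimeUnit.MILLISECONDS);
    }

    /** Call when the watch reports the JobManager deployment as running. */
    public void clusterReady() {
        creationTimeout.cancel(false);
    }

    /** Call when the application finishes and its resources are cleaned up. */
    public void clusterFinished() {
        executionTimeout.cancel(false);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Note that this state lives in the watching process, so it shares the weakness raised in the thread: if that process crashes, the timers are lost.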