Hi Tamir,

Maybe I did not make myself clear. Here the "deployer" means our internal
Flink application deployer (actually it is Ververica Platform), not the
*ApplicationDeployer* interface in Flink. It helps with managing the
lifecycle of every Flink application, and it uses the same native K8s
integration mechanism you have mentioned.
In my opinion, cleaning up a Flink application that fails over infinitely
(e.g. because of a wrong image) is the responsibility of your own deployer,
not the Flink client. In such a case, the JobManager usually cannot run
normally. However, if the JobManager does start successfully, it will clean
up all the K8s resources once all the jobs reach a terminal status
(e.g. FAILED, CANCELED, FINISHED). And even if the JobManager crashes, it
can recover the jobs from the latest checkpoint, as long as HA [1] is
enabled.

[1]. https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/ha/overview/
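Since you already build the configuration programmatically, enabling the
Kubernetes HA services is only a handful of options. A minimal sketch
(the cluster id and storage path are placeholders for your environment,
and the exact option set may differ slightly between Flink versions):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.HighAvailabilityOptions;
    import org.apache.flink.kubernetes.configuration.KubernetesConfigOptions;

    Configuration conf = new Configuration();
    // Placeholder cluster id; must match the id the application was deployed with
    conf.set(KubernetesConfigOptions.CLUSTER_ID, "my-flink-app");
    // Use the Kubernetes HA services (leader election and job metadata in ConfigMaps)
    conf.set(HighAvailabilityOptions.HA_MODE,
            "org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory");
    // Durable storage for HA metadata and checkpoint pointers; placeholder path
    conf.set(HighAvailabilityOptions.HA_STORAGE_PATH, "s3://my-bucket/flink-ha");

With this enabled, a restarted JobManager pod re-acquires leadership,
recovers the submitted jobs, and resumes them from the latest successful
checkpoint.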
Best,
Yang

Tamir Sagi <tamir.s...@niceactimize.com> 于2021年4月6日周二 下午6:43写道:

> Hey Yang,
>
> Thank you for your response.
>
> We run the application cluster programmatically. I discussed it here,
> with an example of how to run it from Java rather than the CLI:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Application-cluster-Best-Practice-td42011.html
>
> Following your comment:
>
> "accessibility of Flink rest endpoint. When it is not ready in the
> timeout (e.g. 120s), the deployer will delete the Flink JobManager
> deployment and try to create a new one."
>
> I have not actually seen it in action. I gave a non-existing image; the
> deployer started the k8s deployment, but the pods failed to start
> (expected), and the k8s deployment kept running indefinitely.
>
> *What configuration is that? Is it possible to override it?*
>
> I delved into the flink-core and flink-kubernetes jars. Since Flink
> depends on Kubernetes, we both need to leverage the Kubernetes client
> (which Flink does internally) to manage and inspect the resources.
>
> "I am curious why you have "infinite job execution" in your Flink
> application cluster. If all the jobs in the application finished, Flink
> will deregister the application and all the K8s resources should be
> cleaned up."
>
> My thought was: what happens if there is a bug and the job runs
> indefinitely, or the JobManager crashes over and over again? What happens
> if resources don't get cleaned up properly? We don't want to keep the
> cluster up and running in that case, and we would like to get feedback.
> Since Flink does not support that, we have to inspect it externally
> (which makes it more complex). We could also poll the job status using
> the Flink client, but that becomes useless if the job runs indefinitely.
>
> What do you think?
>
> Best,
> Tamir.
>
> ------------------------------
> *From:* Yang Wang <danrtsey...@gmail.com>
> *Sent:* Tuesday, April 6, 2021 10:36 AM
> *To:* Tamir Sagi <tamir.s...@niceactimize.com>
> *Cc:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* Re: Application cluster - Job execution and cluster creation
> timeouts
>
> Hi Tamir,
>
> Thanks for trying the native K8s integration.
>
> 1. We do not have a timeout for creating the Flink application cluster.
> The reason is that the job submission happens on the JobManager side, so
> the Flink client does not need to wait for the JobManager to be running
> before it exits.
>
> I think even if the Flink client internally had such a timeout, we would
> still have the same problem: when the Flink client crashes, the timeout
> is gone with it.
>
> I want to share another solution for the timeout. In our deployer, when
> a new Flink application is created, the deployer periodically checks the
> accessibility of the Flink rest endpoint. When it is not ready within
> the timeout (e.g. 120s), the deployer deletes the Flink JobManager
> deployment and tries to create a new one.
>
> 2. Actually, the current "flink run-application" does not support a real
> attached mode (waiting for all the jobs in the application to finish).
> I am curious why you have "infinite job execution" in your Flink
> application cluster. If all the jobs in the application finished, Flink
> will deregister the application and all the K8s resources should be
> cleaned up.
>
> Best,
> Yang
>
> Tamir Sagi <tamir.s...@niceactimize.com> 于2021年4月5日周一 下午11:24写道:
>
> Hey all,
>
> We deploy the application cluster natively on Kubernetes.
>
> Are there any timeouts for job execution and cluster creation? I went
> over the configuration page here
> <https://ci.apache.org/projects/flink/flink-docs-stable/deployment/config.html>
> but did not find anything relevant.
>
> To get an indication about the cluster, we leverage the k8s client
> <https://github.com/fabric8io/kubernetes-client/blob/master/doc/CHEATSHEET.md#pods>
> to watch the deployment
> <https://github.com/fabric8io/kubernetes-client/blob/master/doc/CHEATSHEET.md#deployment%23:~:text=Watching%20a%20Deployment%3A>
> in a namespace with a specific cluster name, and respond accordingly.
>
> We define two timeouts:
>
> 1. Creating the application cluster (i.e. when there are errors in the
> pods, the k8s deployment is up but the application cluster is not
> running).
> 2. Until the application cluster resources get cleaned up (upon
> completion), which guards against an infinite job execution or k8s
> glitches.
>
> However, this solution is not ideal, because if this client lib crashes,
> the timeouts are gone. We don't want to manage these timeout states
> ourselves.
>
> Any suggestion or better way?
>
> Thanks,
> Tamir.
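The readiness timeout discussed above (the deployer's check Yang
describes, and Tamir's timeout #1) boils down to a small loop: poll the
Flink rest endpoint until a deadline passes, and if it never becomes
reachable, delete the JobManager deployment so it can be recreated. A
rough sketch using java.net.http and the fabric8 client — the namespace,
deployment name, rest URL, and 120s timeout are placeholders, and fabric8
method signatures vary somewhat between client versions:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.time.Instant;

    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;

    public class JobManagerReadinessCheck {

        public static void main(String[] args) throws Exception {
            // Placeholders: adjust to your namespace, cluster id and rest service
            String namespace = "flink-apps";
            String deploymentName = "my-flink-app";
            URI restEndpoint =
                    URI.create("http://my-flink-app-rest.flink-apps:8081/overview");
            Duration timeout = Duration.ofSeconds(120);

            HttpClient http = HttpClient.newHttpClient();
            HttpRequest probe = HttpRequest.newBuilder(restEndpoint)
                    .timeout(Duration.ofSeconds(5))
                    .build();

            Instant deadline = Instant.now().plus(timeout);
            boolean ready = false;
            while (Instant.now().isBefore(deadline)) {
                try {
                    HttpResponse<Void> resp =
                            http.send(probe, HttpResponse.BodyHandlers.discarding());
                    if (resp.statusCode() == 200) {
                        ready = true;
                        break;
                    }
                } catch (Exception ignored) {
                    // Rest endpoint not reachable yet; keep polling
                }
                Thread.sleep(5_000);
            }

            if (!ready) {
                // The JobManager never became reachable within the timeout:
                // delete the deployment instead of letting it crash-loop forever
                try (KubernetesClient k8s = new DefaultKubernetesClient()) {
                    k8s.apps().deployments()
                            .inNamespace(namespace)
                            .withName(deploymentName)
                            .delete();
                }
                // ... trigger the redeployment from here
            }
        }
    }

Running this check from something that is itself restarted by Kubernetes
(a small controller, or a Job/CronJob) also addresses the "client lib
crashes and the timeouts are gone" concern, since every run re-evaluates
the cluster's actual state instead of relying on in-memory timers.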