Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-06 Thread Eleanore Jin
Hi Yang, Thanks a lot for the information! Eleanore

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-06 Thread Yang Wang
Hi Eleanore, From my experience, collecting the Flink metrics into Prometheus via a metrics collector is the better approach. It is also easier to configure alerts on top of it. Maybe you could use "fullRestarts" or "numRestarts" to monitor job restarts. More metrics can be found here[2]. [1]. https://ci.
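A minimal flink-conf.yaml sketch for exposing metrics to Prometheus, assuming the flink-metrics-prometheus jar is on the classpath (the reporter name "prom" and the port are illustrative):

    metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
    metrics.reporter.prom.port: 9249

The restart metrics mentioned above are then exported under the jobmanager's job scope.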

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Eleanore Jin
Hi Yang and Till, Thanks a lot for the help! I have the same question Till raised: if we do not fail the Flink pods when the restart strategy is exhausted, it might be hard to monitor such failures. Today I get alerted if the k8s pods are restarted or in a crash loop, but if this will no longer
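A sketch of a Prometheus alert rule along these lines. It assumes the PrometheusReporter naming convention, under which the job-scope gauge "fullRestarts" is exported as flink_jobmanager_job_fullRestarts; the metric and label names here are assumptions, not verified against this particular setup:

    groups:
    - name: flink-job-alerts
      rules:
      - alert: FlinkJobRestarting
        # fires when the restart count grew within the last 5 minutes
        expr: delta(flink_jobmanager_job_fullRestarts[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Flink job {{ $labels.job_name }} restarted recently"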

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Till Rohrmann
You are right, Yang Wang. Thanks for creating this issue. Cheers, Till

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Yang Wang
Actually, the application status shown in the YARN web UI is not determined by the jobmanager process exit code. Instead, we use "resourceManagerClient.unregisterApplicationMaster" to control the final status of the YARN application. So although the jobmanager exits with a zero code, it could still show a failed status

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Till Rohrmann
Yes, for the other deployments it is not a problem. A reason why people preferred non-zero exit codes for FAILED jobs is that they are easier to monitor than having to inspect the actual job result. Moreover, in the YARN web UI the application shows as failed, if I am not mistaken. However

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Yang Wang
Hi Eleanore, Yes, I suggest using a Job to replace the Deployment. It can run the jobmanager once and finish after a successful/failed completion. However, using a Job still does not solve your problem completely. Just as Till said, when a job exhausts the restart strategy, the jobmanager

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-04 Thread Eleanore Jin
Hi Yang & Till, Thanks for your prompt reply! Yang, regarding your question: I am actually not using a k8s Job, as I put my app.jar and its dependencies under flink's lib directory. I have 1 k8s deployment for the job manager, 1 k8s deployment for the task manager, and 1 k8s service for the job manager.
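For context, the jobmanager half of such a standalone setup is typically a Deployment along the following lines (a sketch with placeholder names and image), which the Job suggested above would replace:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: flink-jobmanager
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: flink
          component: jobmanager
      template:
        metadata:
          labels:
            app: flink
            component: jobmanager
        spec:
          containers:
          - name: jobmanager
            image: my-flink-image:1.8.2   # placeholder: app jar and deps baked into /flink/lib
            args: ["jobmanager"]
            ports:
            - containerPort: 8081   # web UI / REST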

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-04 Thread Yang Wang
@Till Rohrmann In native mode, when a Flink application terminates in the FAILED state, all the resources will be cleaned up. However, in standalone mode, I agree with you that we need to rethink the exit code of Flink. When a job exhausts the restart strategy, we should terminate the pod and not restart it

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-04 Thread Till Rohrmann
@Yang Wang I believe that we should rethink the exit codes of Flink. In general you want K8s to restart a failed Flink process. Hence, an application which terminates in state FAILED should not return a non-zero exit code, because FAILED is a valid termination state. Cheers, Till

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-03 Thread Yang Wang
Hi Eleanore, I think you are using the K8s resource "Job" to deploy the jobmanager. Please set .spec.template.spec.restartPolicy = "Never" and .spec.backoffLimit = 0; refer here[1] for more information. Then, when the jobmanager fails for any reason, the K8s Job will be marked failed, and K8s will not start a replacement pod.
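A minimal sketch of such a Job manifest with those two settings (image and args are placeholders):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: flink-jobmanager
    spec:
      backoffLimit: 0            # no replacement pods after a failure
      template:
        spec:
          restartPolicy: Never   # do not restart the failed container in place
          containers:
          - name: jobmanager
            image: my-flink-image:1.8.2   # placeholder
            args: ["jobmanager"]

With these settings, a non-zero jobmanager exit code marks the Job object itself as failed, which is straightforward to alert on.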

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-03 Thread Eleanore Jin
Hi Till, Thanks for the reply! I deploy manually in per-job mode [1] and I am using Flink 1.8.2. Specifically, I build a custom docker image into which I copy the app jar (not an uber jar) and all its dependencies under /flink/lib. So my question is more like: in this case, if the job is marked as FAILED

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-03 Thread Till Rohrmann
Hi Eleanore, how are you deploying Flink exactly? Are you using the application mode with native K8s support to deploy a cluster [1], or are you manually deploying in per-job mode [2]? I believe the problem might be that we terminate the Flink process with a non-zero exit code if the job reaches the FAILED state

Behavior for flink job running on K8S failed after restart strategy exhausted

2020-07-31 Thread Eleanore Jin
Hi Experts, I have a flink cluster (per-job mode) running on kubernetes. The job is configured with the restart strategy

    restart-strategy.fixed-delay.attempts: 3
    restart-strategy.fixed-delay.delay: 10 s

So after 3 retries, the job will be marked as FAILED, hence the pods are not running. However