Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-06 Thread Eleanore Jin
Hi Yang, Thanks a lot for the information! Eleanore

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-06 Thread Yang Wang
Hi Eleanore, From my experience, collecting the Flink metrics into Prometheus via a metrics collector is the better approach. It is also easier to configure alerts on top of it. Maybe you could use "fullRestarts" or "numRestarts" to monitor job restarts. More metrics can be found here[2]. [1]. https://ci.
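A minimal flink-conf.yaml sketch for exposing metrics to Prometheus, assuming the flink-metrics-prometheus jar is on the classpath (the reporter name "prom" and the port are illustrative):

    metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
    metrics.reporter.prom.port: 9249

The restart metrics mentioned above are then exported under the jobmanager's job scope.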

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Eleanore Jin
Hi Yang and Till, Thanks a lot for the help! I have the same question Till raised: if we do not fail the Flink pods when the restart strategy is exhausted, it might be hard to monitor such failures. Today I get alerted if the k8s pods are restarted or in a crash loop, but if this will no longer
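A sketch of a Prometheus alert rule along these lines. It assumes the PrometheusReporter naming convention, under which the job-scope gauge "fullRestarts" is exported as flink_jobmanager_job_fullRestarts; the metric and label names here are assumptions, not verified against this particular setup:

    groups:
    - name: flink-job-alerts
      rules:
      - alert: FlinkJobRestarting
        # fires when the restart count grew within the last 5 minutes
        expr: delta(flink_jobmanager_job_fullRestarts[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Flink job {{ $labels.job_name }} restarted recently"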

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Till Rohrmann
You are right, Yang Wang. Thanks for creating this issue. Cheers, Till

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Yang Wang
Actually, the application status shown in the YARN web UI is not determined by the jobmanager process exit code. Instead, we use "resourceManagerClient.unregisterApplicationMaster" to control the final status of the YARN application. So although the jobmanager exits with a zero code, it could still show a failed status

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Till Rohrmann
Yes, for the other deployments it is not a problem. A reason why people preferred non-zero exit codes for FAILED jobs is that they are easier to monitor than having to inspect the actual job result. Moreover, in the YARN web UI the application shows as failed, if I am not mistaken. However

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 Thread Yang Wang
Hi Eleanore, Yes, I suggest using a Job to replace the Deployment. It can run the jobmanager once and finish after a successful/failed completion. However, using a Job still does not solve your problem completely. Just as Till said, when a job exhausts the restart strategy, the jobmanager

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-04 Thread Eleanore Jin
Hi Yang & Till, Thanks for your prompt reply! Yang, regarding your question: I am actually not using a k8s Job, as I put my app.jar and its dependencies under flink's lib directory. I have 1 k8s deployment for the job manager, 1 k8s deployment for the task manager, and 1 k8s service for the job manager.
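For context, the jobmanager half of such a standalone setup is typically a Deployment along the following lines (a sketch with placeholder names and image), which the Job suggested above would replace:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: flink-jobmanager
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: flink
          component: jobmanager
      template:
        metadata:
          labels:
            app: flink
            component: jobmanager
        spec:
          containers:
          - name: jobmanager
            image: my-flink-image:1.8.2   # placeholder: app jar and deps baked into /flink/lib
            args: ["jobmanager"]
            ports:
            - containerPort: 8081   # web UI / REST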

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-04 Thread Yang Wang
@Till Rohrmann In native mode, when a Flink application terminates in the FAILED state, all the resources will be cleaned up. However, in standalone mode, I agree with you that we need to rethink the exit code of Flink. When a job exhausts the restart strategy, we should terminate the pod and not restart it

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-04 Thread Till Rohrmann
@Yang Wang I believe that we should rethink the exit codes of Flink. In general you want K8s to restart a failed Flink process. Hence, an application which terminates in state FAILED should not return a non-zero exit code, because FAILED is a valid termination state. Cheers, Till

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-03 Thread Yang Wang
Hi Eleanore, I think you are using the K8s resource "Job" to deploy the jobmanager. Please set .spec.template.spec.restartPolicy = "Never" and .spec.backoffLimit = 0; refer here[1] for more information. Then, when the jobmanager fails for any reason, the K8s Job will be marked failed, and K8s will not start a replacement pod.
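A minimal sketch of such a Job manifest with those two settings (image and args are placeholders):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: flink-jobmanager
    spec:
      backoffLimit: 0            # no replacement pods after a failure
      template:
        spec:
          restartPolicy: Never   # do not restart the failed container in place
          containers:
          - name: jobmanager
            image: my-flink-image:1.8.2   # placeholder
            args: ["jobmanager"]

With these settings, a non-zero jobmanager exit code marks the Job object itself as failed, which is straightforward to alert on.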

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-03 Thread Eleanore Jin
Hi Till, Thanks for the reply! I deploy manually in per-job mode [1] and I am using Flink 1.8.2. Specifically, I build a custom docker image into which I copy the app jar (not an uber jar) and all its dependencies under /flink/lib. So my question is more like: in this case, if the job is marked as FAILED

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-03 Thread Till Rohrmann
Hi Eleanore, how are you deploying Flink exactly? Are you using the application mode with native K8s support to deploy a cluster [1], or are you manually deploying in per-job mode [2]? I believe the problem might be that we terminate the Flink process with a non-zero exit code if the job reaches the FAILED state

Behavior for flink job running on K8S failed after restart strategy exhausted

2020-07-31 Thread Eleanore Jin
Hi Experts, I have a flink cluster (per-job mode) running on kubernetes. The job is configured with the restart strategy

    restart-strategy.fixed-delay.attempts: 3
    restart-strategy.fixed-delay.delay: 10 s

So after 3 retries, the job will be marked as FAILED, hence the pods are not running. However