You are right, Yang Wang. Thanks for creating this issue.
Cheers,
Till

On Wed, Aug 5, 2020 at 1:33 PM Yang Wang <danrtsey...@gmail.com> wrote:

Actually, the application status shown in the YARN web UI is not determined by the jobmanager process exit code. Instead, we use "resourceManagerClient.unregisterApplicationMaster" to control the final status of the YARN application. So even though the jobmanager exits with a zero code, the application could still show a failed status in the YARN web UI.

I have created a ticket to track this improvement[1].

[1]. https://issues.apache.org/jira/browse/FLINK-18828

Best,
Yang

On Wed, Aug 5, 2020 at 3:56 PM Till Rohrmann <trohrm...@apache.org> wrote:

Yes, for the other deployments it is not a problem. A reason why people preferred non-zero exit codes in case of FAILED jobs is that this is easier to monitor than having to take a look at the actual job result. Moreover, in the YARN web UI the application shows as failed, if I am not mistaken. However, from a framework's perspective, a FAILED job does not mean that Flink has failed, and hence the return code could still be 0 in my opinion.

Cheers,
Till

On Wed, Aug 5, 2020 at 9:30 AM Yang Wang <danrtsey...@gmail.com> wrote:

Hi Eleanore,

Yes, I suggest using a Job to replace the Deployment. It could be used to run the jobmanager once and finish after a successful/failed completion.

However, using a Job still could not solve your problem completely. Just as Till said, when a job exhausts the restart strategy, the jobmanager pod will terminate with a non-zero exit code, which will cause K8s to restart it again. Even though we could set the restartPolicy and backoffLimit, this is not a clean and correct way to go. We should terminate the jobmanager process with a zero exit code in such a situation.

@Till Rohrmann <trohrm...@apache.org> I just have one concern: is it a special case for the K8s deployment? For standalone/Yarn/Mesos, it seems that terminating with a non-zero exit code is harmless.

Best,
Yang

On Tue, Aug 4, 2020 at 11:54 PM Eleanore Jin <eleanore....@gmail.com> wrote:

Hi Yang & Till,

Thanks for your prompt reply!

Yang, regarding your question, I am actually not using a k8s Job, as I put my app.jar and its dependencies under flink's lib directory. I have 1 k8s deployment for the job manager, 1 k8s deployment for the task manager, and 1 k8s service for the job manager.

As you mentioned above, if the flink job is marked as failed, it will cause the job manager pod to be restarted, which is not the ideal behavior.

Do you suggest that I should change the deployment strategy from using a k8s deployment to a k8s job? In case the flink program exits with a non-zero code (e.g. after exhausting the configured number of restarts), the pod can be marked as complete and hence the job is not restarted again?

Thanks a lot!
Eleanore
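[Editor's note: for reference, a minimal sketch of the kind of jobmanager Deployment described above; the image, labels, args, and ports are illustrative assumptions, not taken from this thread.]

# Hypothetical jobmanager Deployment, roughly matching the setup described above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1                  # the Deployment controller always recreates pods to match this count
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      restartPolicy: Always    # the only restart policy a Deployment allows
      containers:
        - name: jobmanager
          image: my-flink-job:1.8.2   # custom image with app.jar and dependencies under /flink/lib
          args: ["job-cluster"]       # entrypoint args depend on the image
          ports:
            - containerPort: 8081
              name: ui

Because a Deployment reconciles spec.replicas regardless of how the container exited, the jobmanager pod is recreated even after the Flink job has terminally FAILED, which is exactly the behavior described in the original question.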
On Tue, Aug 4, 2020 at 2:49 AM Yang Wang <danrtsey...@gmail.com> wrote:

@Till Rohrmann <trohrm...@apache.org> In native mode, when a Flink application terminates with the FAILED state, all the resources will be cleaned up.

However, in standalone mode, I agree with you that we need to rethink the exit code of Flink. When a job exhausts the restart strategy, we should terminate the pod and not restart it again. After googling, it seems that we could not specify the restartPolicy based on the exit code[1]. So maybe we need to return a zero exit code to avoid the restart by K8s.

[1]. https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code

Best,
Yang

On Tue, Aug 4, 2020 at 3:48 PM Till Rohrmann <trohrm...@apache.org> wrote:

@Yang Wang <danrtsey...@gmail.com> I believe that we should rethink the exit codes of Flink. In general you want K8s to restart a failed Flink process. Hence, an application which terminates in state FAILED should not return a non-zero exit code, because FAILED is a valid termination state.

Cheers,
Till

On Tue, Aug 4, 2020 at 8:55 AM Yang Wang <danrtsey...@gmail.com> wrote:

Hi Eleanore,

I think you are using the K8s resource "Job" to deploy the jobmanager. Please set .spec.template.spec.restartPolicy = "Never" and spec.backoffLimit = 0. Refer here[1] for more information.

Then, when the jobmanager fails for any reason, the K8s Job will be marked as failed, and K8s will not restart it again.

[1]. https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup

Best,
Yang
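[Editor's note: concretely, Yang's suggestion corresponds to something like the following Job manifest. This is a minimal sketch: only restartPolicy and backoffLimit come from the advice above; the image, labels, and args are illustrative assumptions.]

# Hypothetical Job-based jobmanager deployment, per the suggestion above.
apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager
spec:
  backoffLimit: 0              # do not let the Job controller create replacement pods on failure
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      restartPolicy: Never     # do not let the kubelet restart the container in place
      containers:
        - name: jobmanager
          image: my-flink-job:1.8.2   # custom image with app.jar and dependencies under /flink/lib
          args: ["job-cluster"]       # entrypoint args depend on the image

With these two settings a failed jobmanager pod stays terminated instead of being restarted; the trade-off, as discussed above, is that the K8s Job is then marked failed even when the Flink job reached its terminal FAILED state cleanly.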
On Tue, Aug 4, 2020 at 12:05 AM Eleanore Jin <eleanore....@gmail.com> wrote:

Hi Till,

Thanks for the reply!

I manually deploy in per-job mode [1] and I am using Flink 1.8.2. Specifically, I build a custom docker image, into which I copied the app jar (not an uber jar) and all its dependencies under /flink/lib.

So my question is more like: in this case, if the job is marked as FAILED, which causes k8s to restart the pod, this does not seem to help at all. What are the suggestions for such a scenario?

Thanks a lot!
Eleanore

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes

On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <trohrm...@apache.org> wrote:

Hi Eleanore,

how are you deploying Flink exactly? Are you using the application mode with native K8s support to deploy a cluster [1], or are you manually deploying in per-job mode [2]?

I believe the problem might be that we terminate the Flink process with a non-zero exit code if the job reaches ApplicationStatus.FAILED [3].

cc Yang Wang: have you observed a similar behavior when running Flink in per-job mode on K8s?

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
[3] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32

On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com> wrote:

Hi Experts,

I have a flink cluster (per-job mode) running on kubernetes. The job is configured with the restart strategy

restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

So after 3 retries, the job will be marked as FAILED and the pods are no longer running. However, kubernetes will then restart the job again, as the available replicas do not match the desired ones.

I wonder what the suggestions are for such a scenario? How should I configure the flink job running on k8s?

Thanks a lot!
Eleanore
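[Editor's note: for completeness, the restart strategy from the original question as it would look in flink-conf.yaml. This is a sketch; the thread only shows the two fixed-delay keys, and the strategy may equally be configured in code.]

# flink-conf.yaml (sketch): fixed-delay restart strategy, selected explicitly.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s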