Re: Behavior for flink job running on K8S failed after restart strategy exhausted

Yang Wang Tue, 04 Aug 2020 02:50:46 -0700

@Till Rohrmann <trohrm...@apache.org> In native mode, when a Flink
application terminates with FAILED state, all the resources will be cleaned
up.


However, in standalone mode, I agree with you that we need to rethink the
exit code of Flink. When a job exhausts the restart
strategy, we should terminate the pod and not restart it again. After
some searching, it seems that we cannot specify the restartPolicy
based on the exit code[1]. So maybe we need to return a zero exit code to
avoid a restart by K8s.

[1].
https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code
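
To illustrate the exit-code point, here is a minimal sketch, assuming the
jobmanager is deployed as a K8s Job. The names flink-jobmanager and
my-flink-job:latest are hypothetical placeholders. With restartPolicy
OnFailure, the container is restarted only when it exits with a non-zero
code, so a zero exit code would end the pod without a restart.

```yaml
# Sketch only: resource and image names are assumptions, not from this thread.
apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager            # hypothetical name
spec:
  template:
    spec:
      # OnFailure: restart the container only on a non-zero exit code.
      # An exit code of 0 marks the pod as Succeeded and the Job as complete.
      restartPolicy: OnFailure
      containers:
        - name: jobmanager
          image: my-flink-job:latest   # hypothetical custom image
```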

Best,
Yang

On Tue, Aug 4, 2020 at 3:48 PM Till Rohrmann <trohrm...@apache.org> wrote:

> @Yang Wang <danrtsey...@gmail.com> I believe that we should rethink the
> exit codes of Flink. In general you want K8s to restart a failed Flink
> process. Hence, an application which terminates in state FAILED should not
> return a non-zero exit code because it is a valid termination state.
>
> Cheers,
> Till
>
> On Tue, Aug 4, 2020 at 8:55 AM Yang Wang <danrtsey...@gmail.com> wrote:
>
>> Hi Eleanore,
>>
>> I think you are using the K8s resource "Job" to deploy the jobmanager.
>> Please set .spec.template.spec.restartPolicy = "Never" and
>> .spec.backoffLimit = 0. Refer here[1] for more information.
>>
>> Then, if the jobmanager fails for any reason, the K8s Job will be marked
>> as failed, and K8s will not restart it.
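>>
>> A minimal sketch of such a Job manifest, for illustration only. The names
>> flink-jobmanager and my-flink-job:latest are hypothetical, and the
>> container arguments depend on how the custom image was built, so they are
>> left out here.
>>
>> ```yaml
>> # Sketch only: resource and image names are assumptions.
>> apiVersion: batch/v1
>> kind: Job
>> metadata:
>>   name: flink-jobmanager            # hypothetical name
>> spec:
>>   backoffLimit: 0                   # do not retry the Job after a failure
>>   template:
>>     spec:
>>       restartPolicy: Never          # do not restart the failed container
>>       containers:
>>         - name: jobmanager
>>           image: my-flink-job:latest   # hypothetical custom image
>> ```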
>>
>> [1].
>> https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
>>
>>
>> Best,
>> Yang
>>
>> On Tue, Aug 4, 2020 at 12:05 AM Eleanore Jin <eleanore....@gmail.com> wrote:
>>
>>> Hi Till,
>>>
>>> Thanks for the reply!
>>>
>>> I deploy manually in per-job mode [1], and I am using Flink 1.8.2.
>>> Specifically, I build a custom docker image into which I copy the app jar
>>> (not an uber jar) and all its dependencies under /flink/lib.
>>>
>>> So my question is more like this: if the job is marked as FAILED and
>>> that causes k8s to restart the pod, the restart does not seem to help at
>>> all. What do you suggest for such a scenario?
>>>
>>> Thanks a lot!
>>> Eleanore
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
>>>
>>> On Mon, Aug 3, 2020 at 2:13 AM Till Rohrmann <trohrm...@apache.org>
>>> wrote:
>>>
>>>> Hi Eleanore,
>>>>
>>>> how are you deploying Flink exactly? Are you using the application mode
>>>> with native K8s support to deploy a cluster [1] or are you manually
>>>> deploying a per-job mode [2]?
>>>>
>>>> I believe the problem might be that we terminate the Flink process with
>>>> a non-zero exit code if the job reaches the ApplicationStatus.FAILED [3].
>>>>
>>>> cc Yang Wang: have you observed similar behavior when running Flink in
>>>> per-job mode on K8s?
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
>>>> [2]
>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
>>>> [3]
>>>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
>>>>
>>>> On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin <eleanore....@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Experts,
>>>>>
>>>>> I have a Flink cluster (per-job mode) running on Kubernetes. The job
>>>>> is configured with the following restart strategy:
>>>>>
>>>>> restart-strategy.fixed-delay.attempts: 3
>>>>> restart-strategy.fixed-delay.delay: 10 s
>>>>>
>>>>>
>>>>> So after 3 retries, the job is marked as FAILED, and the pods stop
>>>>> running. However, Kubernetes then restarts the job again because the
>>>>> number of available replicas does not match the desired number.
>>>>>
>>>>> What do you suggest for such a scenario? How should I configure the
>>>>> Flink job running on k8s?
>>>>>
>>>>> Thanks a lot!
>>>>> Eleanore
>>>>>
>>>>
