[ https://issues.apache.org/jira/browse/FLINK-24063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17407082#comment-17407082 ]
Aitozi edited comment on FLINK-24063 at 8/31/21, 5:40 AM:
----------------------------------------------------------

Looking forward to your opinion on this [~trohrm...@apache.org]

was (Author: aitozi): Looking forward to your opinion on this [~trohrmann]

> Reconsider the behavior of ClusterEntrypoint#startCluster failure handler
> -------------------------------------------------------------------------
>
>                 Key: FLINK-24063
>                 URL: https://issues.apache.org/jira/browse/FLINK-24063
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Aitozi
>            Priority: Minor
>
> If ClusterEntrypoint#runCluster fails, it triggers the STOP_APPLICATION behavior.
> But consider a case like this:
> # A job has been running for a long time.
> # The JobManager then encounters a fatal error, such as a network problem, which may bring the JobManager process down.
> # A new process is started by the resource framework (YARN or Kubernetes), but it also fails, in ClusterEntrypoint#startCluster, due to the same network problem.
> # The job then ends up in the FAILED status.
>
> This means a streaming job can stop running permanently because of a transient fatal error, which is somewhat fragile. I think we should add a retry mechanism so the job does not fail fast twice, allowing it to ride out external errors that persist for a period of time.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
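The retry mechanism proposed in the issue could be sketched as a small bounded retry-with-backoff wrapper around the startup call. This is a hypothetical illustration in plain Java, not Flink's actual API; the names `StartupRetry` and `retryWithBackoff` are invented for this sketch:

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/**
 * Hypothetical sketch: retry a startup action a bounded number of times
 * with exponential backoff before giving up. In the scenario from the
 * issue, the action would be the cluster startup; retrying lets the
 * process survive an external error (e.g. a network outage) that clears
 * up within the retry window, instead of failing fast and marking the
 * job FAILED.
 */
public final class StartupRetry {

    private StartupRetry() {}

    public static <T> T retryWithBackoff(
            Callable<T> action, int maxAttempts, Duration initialBackoff)
            throws Exception {
        Duration backoff = initialBackoff;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait before the next attempt, doubling the delay
                    // each time (exponential backoff).
                    Thread.sleep(backoff.toMillis());
                    backoff = backoff.multipliedBy(2);
                }
            }
        }
        // All attempts failed: propagate the last error so the existing
        // STOP_APPLICATION handling can still run as a final fallback.
        throw last;
    }
}
```

Only after the retry budget is exhausted would the existing fast-fail path (STOP_APPLICATION) take over, so a persistent error still terminates the process as before.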