[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424035#comment-16424035 ]
Till Rohrmann commented on FLINK-9120:
--------------------------------------

Hi [~dhirajpraj],

the logs show that the JobManager (JM) had not yet recognized the killed TaskManager (TM) as dead when it tried to restart the job. Thus, it tried to re-deploy tasks to this machine. When it finally realized that the TM had been killed, it failed the job. At this point it would normally try to recover the job; however, since the number of restart attempts (set to 3) was already depleted, it failed the job terminally.

Please try raising the number of restart attempts. This should hopefully fix your problem.

> Task Manager Fault Tolerance issue
> ----------------------------------
>
>                 Key: FLINK-9120
>                 URL: https://issues.apache.org/jira/browse/FLINK-9120
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, Configuration, Core
>    Affects Versions: 1.4.2
>            Reporter: dhiraj prajapati
>            Priority: Critical
>         Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log,
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log,
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log,
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log,
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log,
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> Hi,
> I have set up a Flink 1.4 cluster with one job manager and two task managers.
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set
> to 2 on each node. I submitted a job to this cluster and it ran fine. To
> test fault tolerance, I killed one task manager. I was expecting the job to
> keep running because one of the two task managers was still up.
> However, the job failed. Am I missing something?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
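
For reference, the suggestion in the comment above (raising the number of restart attempts) could look roughly like the following flink-conf.yaml fragment. This is only a sketch: it assumes the job uses the fixed-delay restart strategy and takes its settings from the cluster configuration rather than from the job code. The values 10 and "30 s" are illustrative and do not come from the issue; the only figure reported there is the current limit of 3 attempts.

```yaml
# Illustrative values, not taken from the issue.
restart-strategy: fixed-delay
# More attempts than the 3 that were exhausted before the JM noticed the dead TM.
restart-strategy.fixed-delay.attempts: 10
# A longer pause between attempts gives the JM more time to detect the lost TM.
restart-strategy.fixed-delay.delay: 30 s
```

If the job sets a restart strategy programmatically, that setting presumably takes precedence over this file and would need to be changed in the job itself.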