[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423753#comment-16423753 ]
dhiraj prajapati commented on FLINK-9120:
-----------------------------------------

Hi [~till.rohrmann], with MemoryStateBackend, the state should be accessible to all TMs as long as the JM is up and running, right? Even so, with MemoryStateBackend the TM fault tolerance behaviour is not consistent: sometimes it works and sometimes it doesn't.

Hi [~sihuazhou], can you please elaborate on "TM doesn't unregister from JM properly in standalone model"? If one of the TMs is terminated by a machine crash or for any other reason, it will obviously not be able to unregister from the JM. But the other TM should pick up the job, and the job shouldn't fail, right?

> Task Manager Fault Tolerance issue
> ----------------------------------
>
>                 Key: FLINK-9120
>                 URL: https://issues.apache.org/jira/browse/FLINK-9120
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, Configuration, Core
>    Affects Versions: 1.4.2
>            Reporter: dhiraj prajapati
>            Priority: Critical
>         Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log,
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log,
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log,
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log,
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log,
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> Hi,
> I have set up a Flink 1.4 cluster with 1 job manager and two task managers.
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set
> to 2 on each node. I submitted a job to this cluster and it runs fine. To
> test fault tolerance, I killed one task manager. I was expecting the job to
> keep running because one of the 2 task managers was still up and running.
> However, the job failed. Am I missing something?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)