[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue

dhiraj prajapati (JIRA) Tue, 03 Apr 2018 02:45:02 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423753#comment-16423753
 ]


dhiraj prajapati commented on FLINK-9120:
-----------------------------------------

Hi [~till.rohrmann], with MemoryStateBackend, the state should be accessible to 
all TMs as long as the JM is up and running, right? Even with 
MemoryStateBackend, the TM fault tolerance behaviour is not consistent: 
sometimes it works and sometimes it doesn't.

 

Hi [~sihuazhou], can you please elaborate on "TM doesn't unregister from JM 
properly in standalone model"? If one of the TMs gets terminated due to a 
machine crash or any other reason, it will obviously not be able to unregister 
from the JM. But the other TM should pick up the job and the job shouldn't 
fail, right?
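
To make that expectation concrete, here is a minimal sketch of the slot arithmetic behind it, assuming the configuration from the issue description quoted below (two TMs, taskmanager.numberOfTaskSlots = 2, parallelism.default = 2) and default slot sharing. The class and method names are hypothetical and this is not Flink's scheduler code; it only counts slots, and says nothing about whether the JM actually restarts the job.

```java
// Hypothetical slot-counting sketch; NOT Flink's scheduler logic.
class SlotCheck {
    // Total slots offered by the TaskManagers that are still alive.
    static int availableSlots(int liveTaskManagers, int slotsPerTaskManager) {
        return liveTaskManagers * slotsPerTaskManager;
    }

    // Assumption: with default slot sharing, a job needs at least as many
    // slots as its parallelism in order to be (re)scheduled.
    static boolean canSchedule(int parallelism, int liveTaskManagers, int slotsPerTaskManager) {
        return availableSlots(liveTaskManagers, slotsPerTaskManager) >= parallelism;
    }

    public static void main(String[] args) {
        int slotsPerTm = 2;   // taskmanager.numberOfTaskSlots
        int parallelism = 2;  // parallelism.default

        // Both TMs up: 4 slots for parallelism 2.
        System.out.println("2 TMs: " + canSchedule(parallelism, 2, slotsPerTm));
        // One TM killed: 2 slots remain for parallelism 2.
        System.out.println("1 TM:  " + canSchedule(parallelism, 1, slotsPerTm));
        // No TM left: nothing to schedule on.
        System.out.println("0 TMs: " + canSchedule(parallelism, 0, slotsPerTm));
    }
}
```

By this count the surviving TM still has enough slots for the job, which is why I expected it to keep running.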

> Task Manager Fault Tolerance issue
> ----------------------------------
>
>                 Key: FLINK-9120
>                 URL: https://issues.apache.org/jira/browse/FLINK-9120
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, Configuration, Core
>    Affects Versions: 1.4.2
>            Reporter: dhiraj prajapati
>            Priority: Critical
>         Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> Hi, 
> I have set up a Flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
