[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue

Till Rohrmann (JIRA) Tue, 03 Apr 2018 06:42:20 -0700

    [ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424035#comment-16424035 ]


Till Rohrmann commented on FLINK-9120:
--------------------------------------

Hi [~dhirajpraj],

the logs show that the JM had not yet detected the killed TM as dead when it 
tried to restart the job, so it tried to re-deploy tasks to that machine. Once 
it finally realized that the TM had been killed, it failed the job again. At 
that point it would normally try to recover the job; however, since the 
configured number of restart attempts (set to 3) was already exhausted, it 
failed the job terminally. Please try raising the number of restart attempts. 
This should hopefully fix your problem.
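
As a rough sketch, assuming the job uses the fixed-delay restart strategy 
configured in flink-conf.yaml, raising the attempts could look like the 
following. The values 10 and 10 s are only illustrative assumptions, not taken 
from your setup. The idea is to allow enough attempts, spaced far enough apart, 
that the JM has time to notice the dead TM before the attempts run out.

```yaml
# flink-conf.yaml (illustrative values, adjust to your environment)
restart-strategy: fixed-delay
# Previously 3 according to the logs; 10 is an assumed example value.
restart-strategy.fixed-delay.attempts: 10
# Assumed delay between restart attempts.
restart-strategy.fixed-delay.delay: 10 s
```

If the restart strategy is set in the job code instead, that setting takes 
precedence over flink-conf.yaml and the attempts would need to be raised there.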

> Task Manager Fault Tolerance issue
> ----------------------------------
>
>                 Key: FLINK-9120
>                 URL: https://issues.apache.org/jira/browse/FLINK-9120
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, Configuration, Core
>    Affects Versions: 1.4.2
>            Reporter: dhiraj prajapati
>            Priority: Critical
>         Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> Hi, 
> I have set up a Flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
