Re: Task Manager fault tolerance does not work

2018-04-05 Thread Fabian Hueske
Hi, thanks for the feedback! As Till explained, the problem is that the JM first tries to schedule the job to the failed TM (which hasn't been detected as failed yet). The configured three restart attempts are "consumed" by these attempts and the job fails afterwards. Best, Fabian 2018-04-05 8:1

Re: Task Manager fault tolerance does not work

2018-04-04 Thread dhirajpraj
Just for the record, it did not work with RestartStrategies.fixedDelayRestart(3, 5000) but worked with RestartStrategies.fixedDelayRestart(20, 5000). -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
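The difference between 3 and 20 attempts can be illustrated with a small back-of-the-envelope model. This is only a sketch, not Flink code: the class RestartBudget, the method firstSuccessfulAttempt, and the 50-second detection time are hypothetical and chosen for illustration. It assumes that restart attempt n happens n * delay after the failure and that an attempt can only succeed once the JM has noticed that the TM is gone.

```java
public class RestartBudget {

    /**
     * Simplified model (an assumption, not Flink's actual scheduler logic):
     * restart attempt n is made n * delayMillis after the TM was killed and
     * succeeds only if the JM has detected the dead TM by then.
     *
     * @return the 1-based number of the first successful attempt,
     *         or -1 if all attempts are used up before detection
     */
    public static int firstSuccessfulAttempt(int maxAttempts, long delayMillis, long detectionMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long attemptTime = attempt * delayMillis;
            if (attemptTime >= detectionMillis) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical detection time; the real value depends on the cluster setup.
        long detectionMillis = 50_000L;

        // fixedDelayRestart(3, 5000): attempts at 5 s, 10 s, 15 s, all before detection.
        System.out.println(firstSuccessfulAttempt(3, 5_000L, detectionMillis));  // prints -1

        // fixedDelayRestart(20, 5000): the attempt at 50 s is the first that can succeed.
        System.out.println(firstSuccessfulAttempt(20, 5_000L, detectionMillis)); // prints 10
    }
}
```

Under these assumptions, three attempts cover only a 15-second window, while twenty attempts cover 100 seconds, which leaves enough time for the dead TM to be detected.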

Re: Task Manager fault tolerance does not work

2018-04-04 Thread dhirajpraj
As suggested by Till, it works perfectly fine after increasing the number of retries. Thanks, people.

Re: Task Manager fault tolerance does not work

2018-04-03 Thread Till Rohrmann
There is a JIRA issue for the problem: https://issues.apache.org/jira/browse/FLINK-9120. Mirroring my response to this thread: The logs (attached to the JIRA ticket) show that the JM did not yet recognize the killed TM as killed when trying to restart. Thus, it tries to re-deploy tasks to this mac

Re: Task Manager fault tolerance does not work

2018-04-03 Thread Timo Walther
@Till: Do you have any advice for this issue? On 03.04.18 at 11:54, dhirajpraj wrote: What I have found is that the TM fault tolerance behaviour is not consistent. Sometimes it works and sometimes it doesn't. I am attaching my Java code file (which is the main class). What I did was: 1) Run cl

Re: Task Manager fault tolerance does not work

2018-04-03 Thread dhirajpraj
What I have found is that the TM fault tolerance behaviour is not consistent. Sometimes it works and sometimes it doesn't. I am attaching my Java code file (which is the main class). What I did was: 1) Run cluster with JM on machine A, one TM on machine B and one TM on machine C 2) Submit a job to

Re: Task Manager fault tolerance does not work

2018-04-03 Thread Timo Walther
Could you provide a small reproducible example? Which file system are you using? This sounds like a bug to me that should be fixed if valid. On 03.04.18 at 11:28, dhirajpraj wrote: I have not specified any parallelism in the job code. So I guess, the parallelism should be set to parallelism.d

Re: Task Manager fault tolerance does not work

2018-04-03 Thread dhirajpraj
I have not specified any parallelism in the job code. So I guess, the parallelism should be set to parallelism.default defined in the flinkConfig.yaml. An update: The TMs were on different machines and I was using FsStateBackend with state backend directories pointing to instance specific file pa

Re: Task Manager fault tolerance does not work

2018-04-03 Thread Timo Walther
Hi, does your job code declare a higher parallelism than 2? Or is it submitted with a higher parallelism? What is the Web UI displaying? Regards, Timo On 03.04.18 at 10:48, dhirajpraj wrote: Hi, I have done that env.enableCheckpointing(5000L); env.setRestartStrategy(RestartStrategies.fixedDela

Re: Task Manager fault tolerance does not work

2018-04-03 Thread dhirajpraj
Hi, I have done that: env.enableCheckpointing(5000L); env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000));

Re: Task Manager fault tolerance does not work

2018-04-03 Thread Stephan Ewen
Please make sure you have set a number of retries and have checkpointing activated if you use streaming. On Fri, Mar 30, 2018 at 1:59 PM, dhirajpraj wrote: > Hi, > I have set up a Flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and paralle