Re: JobManager trying to re-submit jobs after failover

2016-07-27 Thread Hironori Ogibayashi
Thank you for telling me about the cause. The cluster recovered after I restarted jobmanager-5 and jobmanager-1. I restarted jobmanager-1 because, when I restarted jobmanager-5, checkpointing started to fail with the following message. 2016-07-28 10:42:28,217 WARN org.apache.flink.runtime.checkpoint.Checkp

Re: JobManager trying to re-submit jobs after failover

2016-07-27 Thread Ufuk Celebi
Thanks for the logs. Looking through them, the problem is caused by this issue: https://issues.apache.org/jira/browse/FLINK-3800. The ExecutionGraph (Flink's internal scheduling structure) is not terminated properly and tries to restart the job over and over again. This is fixed for 1.1.0. Is it an option fo

Re: JobManager trying to re-submit jobs after failover

2016-07-27 Thread Ufuk Celebi
Which version of Flink are you running on? I think this might have been fixed for the 1.1 release (http://people.apache.org/~uce/flink-1.1.0-rc1/). It looks like the ExecutionGraph is still trying to restart although the JobManager is not the leader anymore. If you could provide the complete logs

JobManager trying to re-submit jobs after failover

2016-07-27 Thread Hironori Ogibayashi
Hello, I have a standalone Flink cluster with JobManager HA. Last night, the JobManager failed over because of a connection timeout to ZooKeeper. The job is running successfully under the new leader JobManager, but when I look at the old leader JobManager's log, it is trying to re-submit the job and getting errors. ( fo