Thank you for telling me about the cause.
The cluster recovered after I restarted jobmanager-5 and jobmanager-1.
I restarted jobmanager-1 because, after I restarted jobmanager-5,
checkpointing started to
fail with the following message.
2016-07-28 10:42:28,217 WARN
org.apache.flink.runtime.checkpoint.Checkp
Thanks for the logs. Looking through them, the problem is caused by this issue:
https://issues.apache.org/jira/browse/FLINK-3800. The ExecutionGraph
(Flink's internal scheduling structure) is not terminated properly and
tries to restart the job over and over again.
This is fixed for 1.1.0. Is it an option fo
Which version of Flink are you running? I think this might have
been fixed for the 1.1 release
(http://people.apache.org/~uce/flink-1.1.0-rc1/).
It looks like the ExecutionGraph is still trying to restart although
the JobManager is not the leader anymore. If you could provide the
complete logs
Hello,
I have a standalone Flink cluster with JobManager HA.
Last night, the JobManager failed over because of a connection timeout to
ZooKeeper.
The job is running successfully under the new leader JobManager, but when
I look at the old leader JobManager's log, it is trying to resubmit the job and
getting errors. ( fo