Hi Gerard,

From the information you provided, do you mean that the ZooKeeper path "/jobgraphs" contains more jobs than you actually submitted, and that the jobs cannot be restarted because the blob files cannot be found?
Can you provide more details: the stack trace, the logs, and which version of Flink? Normally, a jobgraph cannot be added to ZooKeeper except by submitting a job manually.

Thanks, vino.

2018-07-16 21:19 GMT+08:00 gerardg <ger...@talaia.io>:
> Hi,
>
> Our deployment consists of a standalone HA cluster of 8 machines with an
> external ZooKeeper cluster. We have observed several times that when a
> jobmanager fails and a new one is elected, the new one tries to restart
> more jobs than the ones that were running, and since it can't find some
> files, it fails and gets stuck in a restart loop. This is the error that we
> see in the logs:
>
>
> These are the contents of /home/nas/flink/ha/default/blob/:
>
>
> We've checked ZooKeeper and there are actually a lot of jobgraphs in
> /flink/default/jobgraphs
>
> There were only three jobs running, so neither ZooKeeper nor the Flink 'ha'
> folder seems to have the correct number of jobgraphs stored.
>
> The only way we have to solve this is to remove everything at path /flink
> in ZooKeeper and the 'ha' Flink folder and restart the jobs manually.
>
> I'll try to monitor whether some action (e.g. we have been canceling and
> restoring jobs from savepoints quite often lately) leaves an entry in
> ZooKeeper's path /flink/default/jobgraphs for a job that is not running, but
> maybe someone can point us to some configuration problem that could cause
> this behavior.
>
> Thanks,
>
> Gerard
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
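For anyone hitting the same state, the manual cleanup Gerard describes could be sketched roughly as the dry-run script below. The ZooKeeper address and the zkCli.sh location are assumptions (adjust them to your deployment); the HA directory is the one mentioned in the thread. The script only prints the commands rather than running them, since the real versions are destructive and must only be run with the cluster stopped:

```shell
#!/bin/sh
# Dry-run sketch of the manual HA-state cleanup described above.
# ASSUMPTIONS: ZooKeeper at localhost:2181 and zkCli.sh at bin/zkCli.sh;
# change both to match your setup. Stop the Flink cluster before doing
# any of this for real.

ZK_CLI="bin/zkCli.sh -server localhost:2181"   # assumed ZooKeeper CLI + address
HA_DIR="/home/nas/flink/ha"                    # HA storage dir from the thread

# First inspect how many jobgraph znodes ZooKeeper currently holds:
echo "Would run: $ZK_CLI ls /flink/default/jobgraphs"

# Then wipe the stale HA state (destructive; cluster must be stopped):
echo "Would run: $ZK_CLI rmr /flink"
echo "Would run: rm -rf $HA_DIR/default/*"

# Afterwards, restart the cluster and resubmit the jobs from savepoints.
```

After the cleanup, the jobs have to be restored manually from their latest savepoints, as noted in the thread.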