Hi vino,

It seems that job ids stay in /jobgraphs when we cancel the jobs manually. For example, after cancelling the job with id 75e16686cb4fe0d33ead8e29af131d09, its entry is still in zookeeper's path /flink/default/jobgraphs, but the job has disappeared from /home/nas/flink/ha/default/blob/.
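In case it helps to reproduce the check, here is a minimal sketch of how the two sets of ids could be compared. It assumes the children of /flink/default/jobgraphs were listed by hand (for example with `ls` in zkCli.sh) and the running job ids were copied from the output of `bin/flink list`. The function name and the second id are made up for illustration and are not part of Flink.

```python
def stale_jobgraph_ids(zk_jobgraph_ids, running_job_ids):
    """Return ids that have a jobgraph entry in ZooKeeper but are not running."""
    # Normalise so stray whitespace or upper-case hex does not hide a match.
    running = {job_id.strip().lower() for job_id in running_job_ids}
    stored = {job_id.strip().lower() for job_id in zk_jobgraph_ids}
    return sorted(stored - running)


# Assumed inputs, pasted by hand. The first id is the cancelled job from this
# mail; the second one is a hypothetical placeholder for a job still running.
zk_ids = [
    "75e16686cb4fe0d33ead8e29af131d09",
    "00000000000000000000000000000001",
]
running_ids = ["00000000000000000000000000000001"]

leftovers = stale_jobgraph_ids(zk_ids, running_ids)
print(leftovers)
```

Any id it reports would be a candidate for an entry left behind after a cancel, under the assumption that /jobgraphs should only hold running jobs.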
This is the client log from the cancellation:

09:20:58.492 [main] INFO org.apache.flink.client.cli.CliFrontend - Cancelling job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.503 [main] INFO org.apache.flink.runtime.blob.FileSystemBlobStore - Creating highly available BLOB storage directory at file:///home/nas/flink/ha//default/blob
09:20:58.505 [main] INFO org.apache.flink.runtime.util.ZooKeeperUtils - Enforcing default ACL for ZK connections
09:20:58.505 [main] INFO org.apache.flink.runtime.util.ZooKeeperUtils - Using '/flink-eur/default' as Zookeeper namespace.
09:20:58.539 [main] INFO o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl - Starting
09:20:58.543 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
09:20:58.543 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:host.name=flink-eur-production1
09:20:58.543 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.8.0_131
09:20:58.544 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Oracle Corporation
09:20:58.546 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.home=/opt/jdk/jdk1.8.0_131/jre
09:20:58.546 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=/opt/flink/flink-1.5.0/lib/commons-httpclient-3.1.jar:/opt/flink/flink-1.5.0/lib/flink-metrics-statsd-1.5.0.jar:/opt/flink/flink-1.5.0/lib/flink-python_2.11-1.5.0.jar:/opt/flink/flink-1.5.0/lib/fluency-1.8.0.jar:/opt/flink/flink-1.5.0/lib/gcs-connector-latest-hadoop2.jar:/opt/flink/flink-1.5.0/lib/hadoop-openstack-2.7.1.jar:/opt/flink/flink-1.5.0/lib/jackson-annotations-2.8.0.jar:/opt/flink/flink-1.5.0/lib/jackson-core-2.8.10.jar:/opt/flink/flink-1.5.0/lib/jackson-databind-2.8.11.1.jar:/opt/flink/flink-1.5.0/lib/jackson-dataformat-msgpack-0.8.15.jar:/opt/flink/flink-1.5.0/lib/log4j-1.2.17.jar:/opt/flink/flink-1.5.0/lib/log4j-over-slf4j-1.7.25.jar:/opt/flink/flink-1.5.0/lib/logback-classic-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-core-1.2.3.jar:/opt/flink/flink-1.5.0/lib/logback-more-appenders-1.4.2.jar:/opt/flink/flink-1.5.0/lib/msgpack-0.6.12.jar:/opt/flink/flink-1.5.0/lib/msgpack-core-0.8.15.jar:/opt/flink/flink-1.5.0/lib/phi-accural-failure-detector-0.0.4.jar:/opt/flink/flink-1.5.0/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/flink-1.5.0/lib/flink-dist_2.11-1.5.0.jar:::
09:20:58.546 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
09:20:58.546 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp
09:20:58.546 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA>
09:20:58.547 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
09:20:58.547 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
09:20:58.547 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.version=4.9.87-xxxx-std-ipv6-64
09:20:58.547 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.name=root
09:20:58.547 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.home=/root
09:20:58.547 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/opt/flink/flink-1.5.0/bin
09:20:58.548 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=10.1.1.5:2181,10.1.1.6:2181,10.1.1.7:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@4a003cbe
09:20:58.555 [main-SendThread(10.1.1.5:2181)] WARN o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-9143038863636945274.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
09:20:58.556 [main-SendThread(10.1.1.5:2181)] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server 10.1.1.5/10.1.1.5:2181
09:20:58.556 [main-EventThread] ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
09:20:58.569 [main-SendThread(10.1.1.5:2181)] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to 10.1.1.5/10.1.1.5:2181, initiating session
09:20:58.592 [main-SendThread(10.1.1.5:2181)] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server 10.1.1.5/10.1.1.5:2181, sessionid = 0x100571bda1903b7, negotiated timeout = 40000
09:20:58.593 [main-EventThread] INFO o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager - State change: CONNECTED
09:20:58.711 [main] INFO org.apache.flink.runtime.rest.RestClient - Rest client endpoint started.
09:20:58.713 [main] INFO o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.755 [main] INFO o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.946 [main] INFO org.apache.flink.runtime.rest.RestClient - Shutting down rest endpoint.
09:20:58.946 [main] INFO org.apache.flink.runtime.rest.RestClient - Rest endpoint shutdown complete.
09:20:58.947 [main] INFO o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
09:20:58.948 [main] INFO o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
09:20:58.949 [Curator-Framework-0] INFO o.a.f.s.c.o.a.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
09:20:58.968 [main] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session: 0x100571bda1903b7 closed
09:20:58.968 [main] INFO org.apache.flink.client.cli.CliFrontend - Cancelled job 75e16686cb4fe0d33ead8e29af131d09.
09:20:58.969 [main-EventThread] INFO o.a.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x100571bda1903b7

I'm assuming that /jobgraphs should only contain the ids of jobs that are currently running (at least, when the jobmanager restarted it seemed to try to restart the job ids stored there). Is that correct?

Gerard

On Wed, Jul 18, 2018 at 9:17 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Gerard,
>
> From the information you provided, do you mean that the Zookeeper path
> "/jobgraphs" contains more jobs than you submitted, and that they cannot
> be restarted because the blob files cannot be found?
>
> Can you provide more details: the stack trace, the logs, and which version
> of Flink you are running? Normally, a jobgraph is only added to Zookeeper
> when a job is submitted manually.
>
> Thanks, vino.
>
> 2018-07-16 21:19 GMT+08:00 gerardg <ger...@talaia.io>:
>
>> Hi,
>>
>> Our deployment consists of a standalone HA cluster of 8 machines with an
>> external Zookeeper cluster. We have observed several times that when a
>> jobmanager fails and a new one is elected, the new one tries to restart
>> more jobs than the ones that were running, and since it can't find some
>> files, it fails and gets stuck in a restart loop. That is the error that
>> we see in the logs:
>>
>>
>>
>> These are the contents of /home/nas/flink/ha/default/blob/:
>>
>>
>>
>> We've checked zookeeper and there are actually a lot of jobgraphs in
>> /flink/default/jobgraphs
>>
>>
>>
>> There were only three jobs running, so neither zookeeper nor the flink
>> 'ha' folder seems to have the correct number of jobgraphs stored.
>>
>> The only way we have found to solve this is to remove everything at path
>> /flink in zookeeper and the flink 'ha' folder, and restart the jobs
>> manually.
>>
>> I'll try to monitor whether some action (e.g. we have been canceling and
>> restoring jobs from savepoints quite often lately) leaves an entry in
>> zookeeper's path /flink/default/jobgraphs for a job that is not running,
>> but maybe someone can point us to some configuration problem that could
>> cause this behavior.
>>
>> Thanks,
>>
>> Gerard
>>
>> --
>> Sent from:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>