[ https://issues.apache.org/jira/browse/FLINK-21980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309604#comment-17309604 ]
Till Rohrmann commented on FLINK-21980:
---------------------------------------

Sounds like a good idea [~rburnett]. We could also use {{client.createContainers}}. Do you wanna take a stab at it?

> ZooKeeperRunningJobsRegistry creates an empty znode
> ---------------------------------------------------
>
>                 Key: FLINK-21980
>                 URL: https://issues.apache.org/jira/browse/FLINK-21980
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.3, 1.10.3, 1.11.3, 1.12.2
>            Reporter: Ricky Burnett
>            Priority: Critical
>             Fix For: 1.12.3
>
>
> ZooKeeperRunningJobsRegistry#writeEnumToZooKeeper calls
> {code:java}
> this.client.newNamespaceAwareEnsurePath(zkPath).ensure(client.getZookeeperClient());{code}
> This creates an empty znode in zookeeper. If the job manager is interrupted
> at this point the job manager cannot recover. When trying to restore jobs on
> a restarted job manager, ZooKeeperRunningJobsRegistry#getJobSchedulingStatus
> will throw an exception due to the empty znode.
> Behavior was verified in a test environment where the job manager was
> interrupted at that point in execution leaving ZK in the following state:
> {code:java}
> [zk: localhost:2181(CONNECTED) 2] ls /flink/default
> [checkpoint-counter, checkpoints, jobgraphs, leader, leaderlatch, running_job_registry]
> [zk: localhost:2181(CONNECTED) 3] ls /flink/default/running_job_registry
> [c982053dd0b9100967e6a9d89202f2a5]
> [zk: localhost:2181(CONNECTED) 4] get /flink/default/running_job_registry/c982053dd0b9100967e6a9d89202f2a5
> [zk: localhost:2181(CONNECTED) 5]
> {code}
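
To illustrate the failure mode described in the issue, here is a minimal, self-contained sketch of why an empty znode payload might break recovery. It is not Flink's actual code: the class `EmptyZnodeSketch`, the enum constants, and the `decode` helper are all hypothetical stand-ins, and the sketch simply assumes the registry stores the status as the enum name encoded as UTF-8 bytes. Under that assumption, a znode that was created but never written holds zero bytes, and zero bytes do not decode to any status.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EmptyZnodeSketch {

    /** Hypothetical stand-in for the status stored in the registry znode. */
    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    /**
     * Decodes a znode payload into a status (assumed encoding: enum name as UTF-8).
     * A null payload is treated here as "no entry yet"; any other payload must
     * match an enum constant exactly, otherwise it is reported as corrupt.
     */
    static JobSchedulingStatus decode(byte[] data) throws IOException {
        if (data == null) {
            return JobSchedulingStatus.PENDING;
        }
        String name = new String(data, StandardCharsets.UTF_8);
        try {
            return JobSchedulingStatus.valueOf(name);
        } catch (IllegalArgumentException e) {
            throw new IOException(
                "Corrupt registry entry: " + Arrays.toString(data) + " is not a valid job status", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // A fully written znode decodes as expected.
        System.out.println(decode("RUNNING".getBytes(StandardCharsets.UTF_8)));

        // A znode created by an "ensure path" step but never written holds zero bytes.
        try {
            decode(new byte[0]);
        } catch (IOException e) {
            System.out.println("recovery fails: " + e.getMessage());
        }
    }
}
```

The sketch suggests two directions: avoid ever leaving the job's znode without data (for example by not pre-creating the leaf node separately from writing it), or have the reader tolerate an empty payload. Which of these fits the registry is for the actual fix to decide.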