[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587119#comment-16587119 ]
Thomas Wozniakowski edited comment on FLINK-10184 at 8/21/18 7:52 AM:
----------------------------------------------------------------------

Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be {{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}} or {{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove}}


was (Author: jamalarm):
Hey [~wcummings],

I'm not 100% sure what is wrong, but I believe a good starting point would be {{org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore#removeJobGraph}} or {{org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore#releaseAndTryRemove)}}


> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-10184
>                 URL: https://issues.apache.org/jira/browse/FLINK-10184
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.2
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a
> savepoint), the JobGraphs are NOT removed from the ZooKeeper {{jobgraphs}}
> node.
> This means that if you start a job, cancel it, restart it, cancel it, and so
> on, you will end up with many job graphs stored in ZooKeeper, but none of the
> corresponding blobs in the Flink HA directory.
> When an HA failover occurs, the newly elected leader retrieves all of those
> old JobGraph objects from ZooKeeper, then goes looking for the corresponding
> blobs in the HA directory. The blobs are not there, so the JobManager explodes
> and the process dies.
> At this point the cluster has to be fully stopped, the ZooKeeper jobgraphs
> cleared out by hand, and all the JobManagers restarted.
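
The failure sequence described above can be modelled in a few lines. This is only a toy sketch in Python, not Flink code: `ToyHaCluster`, `MissingBlobError`, and the `remove_graph_on_cancel` switch are hypothetical names invented for illustration. Two in-memory sets stand in for the ZooKeeper {{jobgraphs}} node and the Flink HA directory. It assumes, as the report suggests, that cancel deletes the blob but leaves the job graph entry behind.

```python
class MissingBlobError(Exception):
    """Raised when a job graph is recovered but its blob is gone."""


class ToyHaCluster:
    """Toy stand-in for the HA state; all names here are hypothetical."""

    def __init__(self, remove_graph_on_cancel):
        self.zk_jobgraphs = set()  # models the ZooKeeper jobgraphs node
        self.ha_blobs = set()      # models the blobs in the HA directory
        self.remove_graph_on_cancel = remove_graph_on_cancel

    def submit(self, job_id):
        # Submitting a job stores both its graph and its blob.
        self.zk_jobgraphs.add(job_id)
        self.ha_blobs.add(job_id)

    def cancel(self, job_id):
        # The blob is always cleaned up; the graph entry only if the
        # switch is on (off models the behaviour reported in this issue).
        self.ha_blobs.discard(job_id)
        if self.remove_graph_on_cancel:
            self.zk_jobgraphs.discard(job_id)

    def recover_on_failover(self):
        # A newly elected leader walks every stored graph and needs its blob.
        recovered = []
        for job_id in sorted(self.zk_jobgraphs):
            if job_id not in self.ha_blobs:
                raise MissingBlobError(job_id)
            recovered.append(job_id)
        return recovered
```

With `remove_graph_on_cancel=False`, one submit followed by one cancel is enough to make `recover_on_failover()` raise, which mirrors the JobManager dying on failover. With `True`, recovery returns only the jobs that are still running.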
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore -
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is
> still very much there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)