[ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586326#comment-16586326 ]

William Cummings edited comment on FLINK-10184 at 8/20/18 6:07 PM:
-------------------------------------------------------------------

I am also experiencing this issue in 1.5.2 (and 1.6). The artifact is removed 
from S3, but the jobgraph key is not removed from ZK. When the new JobManager 
comes up, it tries to recover the jobs listed in ZK but fails because it cannot 
locate the artifacts. As a result, a new leader is never elected.

This may or may not be related, but I observed a large number of errors in the 
ZK log relating to jobgraph keys already existing:

>Got user-level KeeperException when processing sessionid:0x164d2d530410335 
>type:create cxid:0x95 zxid:0x2bf6 txntype:-1 reqpath:n/a Error 
>Path:/flink/REDACTED/jobgraphs/0077a3266584e8e77b3bce81b1a586d8/27b2b7f9-a1ac-4974-8d78-3afe07c4638f
> Error:KeeperErrorCode = NodeExists for 
>/flink/REDACTED/jobgraphs/0077a3266584e8e77b3bce81b1a586d8/27b2b7f9-a1ac-4974-8d78-3afe07c4638f

These errors seemed to persist long after the job was submitted.
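
For anyone who wants to check whether they are hitting the same thing, here is a 
minimal standalone sketch (plain Curator, nothing Flink-internal; the connect 
string and HA path are placeholders, so adjust them to your 
high-availability.zookeeper.quorum and HA root/cluster-id settings) that simply 
lists whatever is still sitting under the jobgraphs node:

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ListStaleJobGraphs {
    public static void main(String[] args) throws Exception {
        // Placeholders: substitute your ZooKeeper quorum and the jobgraphs path
        // derived from your HA root / cluster-id configuration.
        String quorum = "zk-1:2181,zk-2:2181,zk-3:2181";
        String jobGraphsPath = "/flink/REDACTED/jobgraphs";

        CuratorFramework client = CuratorFrameworkFactory.newClient(
                quorum, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            // Each child znode is named after a JobID; anything still listed here
            // for a job that was already cancelled is one of the stale entries.
            for (String jobId : client.getChildren().forPath(jobGraphsPath)) {
                System.out.println(jobId);
            }
        } finally {
            client.close();
        }
    }
}
{code}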

I've got Flink building locally; I'd be happy to write a patch for this if 
someone could point me in the right direction.
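
In the meantime, the manual workaround described in the issue (clearing the 
stale jobgraph znodes by hand, with the cluster fully stopped) could be scripted 
along these lines; again just a sketch with placeholder connection details, not 
anything official:

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class RemoveStaleJobGraph {
    public static void main(String[] args) throws Exception {
        // args[0] is the JobID of a jobgraph znode you have confirmed is stale
        // (i.e. the job was cancelled and its blobs are gone from the HA dir).
        // Placeholder quorum and path -- adjust to your own setup, and only run
        // this with the cluster stopped, as described in the issue.
        String quorum = "zk-1:2181,zk-2:2181,zk-3:2181";
        String stalePath = "/flink/REDACTED/jobgraphs/" + args[0];

        CuratorFramework client = CuratorFrameworkFactory.newClient(
                quorum, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            // Recursively delete the jobgraph znode together with its lock children.
            client.delete().deletingChildrenIfNeeded().forPath(stalePath);
        } finally {
            client.close();
        }
    }
}
{code}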


> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-10184
>                 URL: https://issues.apache.org/jira/browse/FLINK-10184
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.2
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a 
> savepoint), the JobGraphs are NOT removed from the Zookeeper {{jobgraphs}} 
> node.
> This means that if you start a job, cancel it, restart it, cancel it, and so 
> on, you will end up with many job graphs stored in ZooKeeper but none of the 
> corresponding blobs in the Flink HA directory.
> When an HA failover occurs, the newly elected leader retrieves all of those 
> old JobGraph objects from ZooKeeper, then goes looking for the corresponding 
> blobs in the HA directory. The blobs are not there, so the JobManager explodes 
> and the process dies.
> At this point the cluster has to be fully stopped, the ZooKeeper jobgraphs 
> cleared out by hand, and all the JobManagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  
> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
> Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in Zookeeper the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is 
> still very much there.


