Xiangyu Zhu created FLINK-10133:
-----------------------------------

             Summary: Finished job's jobgraph is never cleaned up in 
ZooKeeper for standalone clusters (HA mode with multiple masters)
                 Key: FLINK-10133
                 URL: https://issues.apache.org/jira/browse/FLINK-10133
             Project: Flink
          Issue Type: Bug
          Components: JobManager
    Affects Versions: 1.6.0, 1.5.2, 1.5.0
            Reporter: Xiangyu Zhu


Hi,

We have 3 servers in our test environment, referred to as node1-3. The setup is 
as follows:
 * hadoop hdfs: node1 as namenode, node2,3 as datanode
 * zookeeper: node1-3 as a quorum (but also tried node1 alone)
 * flink: node1,2 as masters, node2,3 as slaves

As I understand it, when a job finishes, the job's blob data should be deleted 
from the HDFS path, and the node under the ZooKeeper path 
`/{zk path root}/{cluster-id}/jobgraphs/{job id}` should be deleted after that. 
However, we observe that whenever we submit a job and it finishes (via 
`bin/flink run WordCount.jar`), the blob data is gone, whereas the job id node 
in ZooKeeper is still there, with a UUID-style lock node inside it. In the 
ZooKeeper debug output we observed something like "cannot be deleted because 
non empty". Because of this, once a job has finished and its jobgraph node 
persists, restarting the cluster or killing one job manager (to test HA mode) 
makes it try to recover the finished job; it cannot find the blob data in HDFS, 
and the whole cluster goes down.
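
To illustrate the "cannot be deleted because non empty" symptom, here is a 
minimal Python sketch of a non-recursive delete on a node that still has a 
child. It is an in-memory model only, not Flink's code and not a real ZooKeeper 
client. The class `ZNodeTree`, the helper `release_and_remove`, the example 
path and the lock names are all hypothetical, and the idea that each master 
holds its own lock child is an assumption, not something we have confirmed.

```python
class NotEmptyError(Exception):
    """Raised when deleting a node that still has children."""


class ZNodeTree:
    """Tiny in-memory stand-in for a ZooKeeper namespace (not a real client)."""

    def __init__(self):
        self.nodes = {"/"}

    def create(self, path):
        # Create the node together with any missing parent nodes.
        parts = [p for p in path.split("/") if p]
        for i in range(1, len(parts) + 1):
            self.nodes.add("/" + "/".join(parts[:i]))

    def children(self, path):
        # Direct children only, returned as full paths in sorted order.
        prefix = path.rstrip("/") + "/"
        return sorted(
            n for n in self.nodes
            if n != path and n.startswith(prefix) and "/" not in n[len(prefix):]
        )

    def delete(self, path):
        # Non-recursive delete: refuse while the node still has children.
        if self.children(path):
            raise NotEmptyError(path)
        self.nodes.remove(path)


def release_and_remove(tree, job_path, lock_name):
    """Release one lock child, then try to remove the job node itself.

    Returns True if the job node was removed, False if other children
    (for example a lock held by another master) are still present.
    """
    tree.delete(job_path + "/" + lock_name)
    try:
        tree.delete(job_path)
        return True
    except NotEmptyError:
        return False


# Hypothetical layout: two masters, each assumed to hold one lock child.
tree = ZNodeTree()
job = "/flink/default/jobgraphs/job-1"
tree.create(job + "/lock-master-1")
tree.create(job + "/lock-master-2")

removed = release_and_remove(tree, job, "lock-master-1")
print(removed, tree.children(job))
```

In this model the job node survives as long as any lock child remains, which 
would match what we see with two masters but not with one. Whether this is what 
actually happens inside Flink is only a guess on our side.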

If we use only node1 as master and node2,3 as slaves, the jobgraphs node 
is deleted successfully. If the jobgraphs path is clean, killing one job manager 
makes the stand-by JM take over as leader, so it is only this jobgraphs issue 
that prevents HA from working.

I'm not sure whether something is wrong with our configs, because this happens 
every time a job finishes (we only tested with WordCount.jar, though). I'm 
aware of #10011 and #10129, but unlike #10011 this happens every time, which 
renders HA mode unusable for us.

Any idea what might cause this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
