[ https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangyu Zhu updated FLINK-10133:
--------------------------------
    Attachment: client.log
                namenode.log
                zookeeper.log
                standalonesession.log

> finished job's jobgraph never been cleaned up in zookeeper for standalone clusters (HA mode with multiple masters)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10133
>                 URL: https://issues.apache.org/jira/browse/FLINK-10133
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.5.0, 1.5.2, 1.6.0
>            Reporter: Xiangyu Zhu
>            Priority: Major
>         Attachments: client.log, namenode.log, standalonesession.log, zookeeper.log
>
> Hi,
>
> We have 3 servers in our test environment, noted as node1-3. The setup is as follows:
> * hadoop hdfs: node1 as namenode, node2 and node3 as datanodes
> * zookeeper: node1-3 as a quorum (but we also tried node1 alone)
> * flink: node1 and node2 as masters, node2 and node3 as slaves
>
> As I understand it, when a job finishes, the job's blob data is expected to be deleted from the hdfs path, and the node under zookeeper's path `/{zk path root}/{cluster-id}/jobgraphs/{job id}` should be deleted after that. However, we observe that whenever we submit a job and it finishes (via `bin/flink run WordCount.jar`), the blob data is gone whereas the job id node under zookeeper is still there, with a uuid-style lock node inside it. In zookeeper's debug log we see something like "cannot be deleted because non empty". Because of this, as long as a finished job's jobgraph node persists, restarting the cluster or killing one job manager (to test HA mode) makes the cluster try to recover the finished job; it cannot find the blob data under hdfs, and the whole cluster goes down.
>
> If we use only node1 as master and node2 and node3 as slaves, the jobgraphs node is deleted successfully. When the jobgraphs path is clean, killing one job manager makes a stand-by JM take over as leader, so it is only this jobgraphs issue that prevents HA from working.
>
> I'm not sure whether something is wrong with our configs, because this happens every time a job finishes (we have only tested with WordCount.jar though). I'm aware of FLINK-10011 and FLINK-10029, but unlike FLINK-10011 this happens every time, rendering HA mode unusable for us.
>
> Any idea what might cause this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
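
For reference, a minimal sketch of how the leftover jobgraph nodes and their lock children can be inspected with the plain ZooKeeper Java client. It assumes the default high-availability.zookeeper.path.root of /flink and a cluster-id of "default"; the quorum address is hypothetical and should match high-availability.zookeeper.quorum in flink-conf.yaml.

    import org.apache.zookeeper.ZooKeeper;
    import java.util.List;

    public class JobGraphNodeCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical quorum address; use the value of high-availability.zookeeper.quorum.
            ZooKeeper zk = new ZooKeeper("node1:2181,node2:2181,node3:2181", 30000, event -> { });
            // Assumed HA layout: /{zk path root}/{cluster-id}/jobgraphs/{job id}
            String jobGraphsPath = "/flink/default/jobgraphs";
            for (String jobId : zk.getChildren(jobGraphsPath, false)) {
                String jobNode = jobGraphsPath + "/" + jobId;
                // A jobgraph node that still has children (uuid-style lock nodes) cannot be
                // removed by a plain delete; ZooKeeper rejects the delete as "not empty".
                List<String> locks = zk.getChildren(jobNode, false);
                System.out.println(jobNode + " -> lock children: " + locks);
            }
            zk.close();
        }
    }

A job id listed here with a remaining lock child after its job has finished matches the behaviour described above.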