[ https://issues.apache.org/jira/browse/FLINK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204747#comment-17204747 ]
Kostas Kloudas commented on FLINK-19154:
----------------------------------------

Disclaimer: I am not super-familiar with ZK and the steps involved in HA failover, so what I am saying may be wrong.

From the discussion here I understand that there is a transient ZK issue, and Flink detects it and restarts the affected jobs. During the restarting process, Flink tells ZK to delete the HA data. In this last part, there seems to be a race condition, right? ZK goes down (triggering the failure) and then comes back up. In the meantime, jobs are shutting down and telling ZK to delete their data. Could it be that for most of the jobs ZK is still down when they try to delete their HA data, so the request fails, while for the "unlucky one" this is not the case?

[~casidiablo] Is there anything in the logs of the jobs that successfully restarted that could justify such an explanation?

> Application mode deletes HA data in case of suspended ZooKeeper connection
> --------------------------------------------------------------------------
>
>                 Key: FLINK-19154
>                 URL: https://issues.apache.org/jira/browse/FLINK-19154
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.12.0, 1.11.1
>         Environment: Run a stand-alone cluster that runs a single job (if you are familiar with the way Ververica Platform runs Flink jobs, we use a very similar approach). It runs Flink 1.11.1 straight from the official Docker image.
>            Reporter: Husky Zeng
>            Priority: Blocker
>             Fix For: 1.12.0, 1.11.3
>
> A user reported that Flink's application mode deletes HA data in case of a suspended ZooKeeper connection [1].
>
> The problem seems to be that the {{ApplicationDispatcherBootstrap}} class produces an exception (that the requested job can no longer be found because of a lost ZooKeeper connection) which is interpreted as a job failure. Due to this interpretation, the cluster is shut down with a terminal state of FAILED, which causes the HA data to be cleaned up. The exact problem occurs in {{JobStatusPollingUtils.getJobResult}}, which is called by {{ApplicationDispatcherBootstrap.getJobResult()}}.
>
> The behaviour described above can be seen in this log [2].
>
> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-metadata-deleted-by-Flink-after-ZK-connection-issues-td37937.html
> [2] https://pastebin.com/raw/uH9KDU2L

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
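The misclassification pattern described in the issue (an exception during job-result polling being treated as a terminal job failure, which in turn triggers HA cleanup) can be sketched roughly as follows. This is a hypothetical illustration, not Flink's actual implementation; all class, method, and enum names here are made up except where they echo identifiers quoted in the issue.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of the FLINK-19154 bug pattern, not real Flink code:
// any exceptional completion while polling the job result is mapped to a
// terminal FAILED state, and reaching a terminal state deletes the HA data.
public class HaCleanupSketch {

    enum ApplicationStatus { SUCCEEDED, FAILED }

    // Stand-in for a job-result poll (cf. JobStatusPollingUtils.getJobResult):
    // when the ZK connection is merely suspended, the lookup fails and the
    // future completes exceptionally even though the job may still be healthy.
    static CompletableFuture<ApplicationStatus> pollJobResult(boolean zkSuspended) {
        CompletableFuture<ApplicationStatus> result = new CompletableFuture<>();
        if (zkSuspended) {
            // Transient ZK outage surfaces as "job can no longer be found".
            result.completeExceptionally(new IllegalStateException("job not found"));
        } else {
            result.complete(ApplicationStatus.SUCCEEDED);
        }
        return result;
    }

    // The problematic interpretation: exceptional completion == job failure.
    static ApplicationStatus classify(CompletableFuture<ApplicationStatus> poll) {
        return poll
            .handle((status, error) -> error == null ? status : ApplicationStatus.FAILED)
            .join();
    }

    // Shutting down with any terminal status cleans up HA data (job graphs,
    // checkpoint pointers) -- fatal if the job was actually still running
    // behind a suspended ZK connection.
    static boolean shutDownAndCleanUp(ApplicationStatus status) {
        return true; // returns whether HA data was deleted
    }

    public static void main(String[] args) {
        ApplicationStatus s = classify(pollJobResult(true));
        System.out.println("status=" + s + ", haDataDeleted=" + shutDownAndCleanUp(s));
    }
}
```

In this sketch the fix direction discussed on the issue corresponds to distinguishing, inside `classify`, a suspended-connection error from a genuine job failure instead of collapsing both into FAILED.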