[ https://issues.apache.org/jira/browse/FLINK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204790#comment-17204790 ]
Till Rohrmann commented on FLINK-19154:
---------------------------------------

I think the underlying problem is

{code}
2020-09-04 17:32:07,950 WARN org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application FAILED:
java.util.concurrent.CompletionException: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job (00000000000000000000000000000000)
	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJobStatus$17(Dispatcher.java:529) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]
	at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]
	at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]
	at org.apache.flink.runtime.dispatcher.Dispatcher.requestJobStatus(Dispatcher.java:523) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
	at org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$getJobResult$0(JobStatusPollingUtils.java:57) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
	at org.apache.flink.client.deployment.application.JobStatusPollingUtils.pollJobResultAsync(JobStatusPollingUtils.java:81) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
	at org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$null$3(JobStatusPollingUtils.java:96) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_262]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_262]
	at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154) [flink-dist_2.11-1.11.1.jar:1.11.1]
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) [usercode.jar:?]
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) [usercode.jar:?]
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [usercode.jar:?]
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [usercode.jar:?]
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [usercode.jar:?]
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [usercode.jar:?]
Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job (00000000000000000000000000000000)
	at org.apache.flink.runtime.dispatcher.Dispatcher.getJobMasterGatewayFuture(Dispatcher.java:807) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
	at org.apache.flink.runtime.dispatcher.Dispatcher.requestJobStatus(Dispatcher.java:518) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
	... 12 more
{code}

which is caused by ZooKeeper being down. Since we treat exceptions coming from {{ApplicationDispatcherBootstrap.fixJobIdAndRunApplicationAsync}} as a {{FAILED}} job state, Flink will clean up the HA data.

> Application mode deletes HA data in case of suspended ZooKeeper connection
> --------------------------------------------------------------------------
>
>                 Key: FLINK-19154
>                 URL: https://issues.apache.org/jira/browse/FLINK-19154
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.12.0, 1.11.1
>         Environment: Run a stand-alone cluster that runs a single job (if you
> are familiar with the way Ververica Platform runs Flink jobs, we use a very
> similar approach). It runs Flink 1.11.1 straight from the official docker
> image.
>            Reporter: Husky Zeng
>            Priority: Blocker
>             Fix For: 1.12.0, 1.11.3
>
>
> A user reported that Flink's application mode deletes HA data in case of a
> suspended ZooKeeper connection [1].
> The problem seems to be that the {{ApplicationDispatcherBootstrap}} class
> produces an exception (that the requested job can no longer be found because
> of a lost ZooKeeper connection) which is interpreted as a job failure. Due
> to this interpretation, the cluster is shut down with a terminal state
> of FAILED, which causes the HA data to be cleaned up. The exact problem
> occurs in {{JobStatusPollingUtils.getJobResult}}, which is called by
> {{ApplicationDispatcherBootstrap.getJobResult()}}.
> The behaviour described above can be found in this log [2].
> [1]
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-metadata-deleted-by-Flink-after-ZK-connection-issues-td37937.html
> [2] https://pastebin.com/raw/uH9KDU2L

--
This message was sent by Atlassian Jira
(v8.3.4#803005)