[jira] [Commented] (FLINK-19154) Application mode deletes HA data in case of suspended ZooKeeper connection

Till Rohrmann (Jira) Wed, 30 Sep 2020 07:53:03 -0700


    [ https://issues.apache.org/jira/browse/FLINK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204790#comment-17204790 ]


Till Rohrmann commented on FLINK-19154:
---------------------------------------

I think the underlying problem is 

{code}
2020-09-04 17:32:07,950 WARN  org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application FAILED: 
java.util.concurrent.CompletionException: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job (00000000000000000000000000000000)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJobStatus$17(Dispatcher.java:529) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]
        at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]
        at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]
        at org.apache.flink.runtime.dispatcher.Dispatcher.requestJobStatus(Dispatcher.java:523) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
        at org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$getJobResult$0(JobStatusPollingUtils.java:57) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
        at org.apache.flink.client.deployment.application.JobStatusPollingUtils.pollJobResultAsync(JobStatusPollingUtils.java:81) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
        at org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$null$3(JobStatusPollingUtils.java:96) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_262]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_262]
        at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154) [flink-dist_2.11-1.11.1.jar:1.11.1]
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) [usercode.jar:?]
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) [usercode.jar:?]
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [usercode.jar:?]
        at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [usercode.jar:?]
        at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [usercode.jar:?]
        at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [usercode.jar:?]
Caused by: org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find Flink job (00000000000000000000000000000000)
        at org.apache.flink.runtime.dispatcher.Dispatcher.getJobMasterGatewayFuture(Dispatcher.java:807) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
        at org.apache.flink.runtime.dispatcher.Dispatcher.requestJobStatus(Dispatcher.java:518) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
        ... 12 more
{code}

which is caused by ZooKeeper being down. Since we treat any exception coming 
from {{ApplicationDispatcherBootstrap.fixJobIdAndRunApplicationAsync}} as a 
{{FAILED}} job state, Flink cleans up the HA data.
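
To illustrate the distinction, here is a minimal, self-contained sketch. It is 
not Flink code: {{JobNotFoundException}}, {{Outcome}} and all method names are 
hypothetical stand-ins, and the "cautious" variant is only one possible 
direction, not necessarily the fix that will be applied. It assumes that HA 
data is cleaned up only for a terminal outcome.

```java
import java.util.concurrent.CompletionException;

public class TerminalStateSketch {

    /** Hypothetical stand-in for Flink's FlinkJobNotFoundException. */
    static class JobNotFoundException extends Exception {
        JobNotFoundException(String jobId) {
            super("Could not find Flink job (" + jobId + ")");
        }
    }

    /** Hypothetical outcome type; not Flink's own status enum. */
    enum Outcome { SUCCEEDED, FAILED, UNKNOWN }

    /** Behaviour described above: every exception is mapped to FAILED. */
    static Outcome classifyAllAsFailed(Throwable error) {
        return error == null ? Outcome.SUCCEEDED : Outcome.FAILED;
    }

    /** Possible alternative: "job not found" says nothing about the job's final state. */
    static Outcome classifyCautiously(Throwable error) {
        if (error == null) {
            return Outcome.SUCCEEDED;
        }
        Throwable cause = unwrap(error);
        return cause instanceof JobNotFoundException ? Outcome.UNKNOWN : Outcome.FAILED;
    }

    /** Strips CompletionException wrappers, as seen in the stack trace. */
    static Throwable unwrap(Throwable t) {
        while (t instanceof CompletionException && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    /** Assumption: HA data is removed only once the outcome is terminal. */
    static boolean shouldCleanUpHaData(Outcome outcome) {
        return outcome != Outcome.UNKNOWN;
    }

    public static void main(String[] args) {
        Throwable error = new CompletionException(
                new JobNotFoundException("00000000000000000000000000000000"));

        Outcome current = classifyAllAsFailed(error);
        Outcome cautious = classifyCautiously(error);

        System.out.println(current + " -> clean up HA data: " + shouldCleanUpHaData(current));
        System.out.println(cautious + " -> clean up HA data: " + shouldCleanUpHaData(cautious));
    }
}
```

With the first mapping, a lost ZooKeeper connection is indistinguishable from 
a genuinely failed job, which is why the HA data ends up being removed.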

> Application mode deletes HA data in case of suspended ZooKeeper connection
> --------------------------------------------------------------------------
>
>                 Key: FLINK-19154
>                 URL: https://issues.apache.org/jira/browse/FLINK-19154
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.12.0, 1.11.1
>         Environment: Run a stand-alone cluster that runs a single job (if you 
> are familiar with the way Ververica Platform runs Flink jobs, we use a very 
> similar approach). It runs Flink 1.11.1 straight from the official docker 
> image.
>            Reporter: Husky Zeng
>            Priority: Blocker
>             Fix For: 1.12.0, 1.11.3
>
>
> A user reported that Flink's application mode deletes HA data in case of a 
> suspended ZooKeeper connection [1]. 
> The problem seems to be that the {{ApplicationDispatcherBootstrap}} class 
> produces an exception (that the requested job can no longer be found because 
> of a lost ZooKeeper connection) which is interpreted as a job failure. Due 
> to this interpretation, the cluster is shut down with a terminal state 
> of FAILED, which causes the HA data to be cleaned up. The exact problem 
> occurs in {{JobStatusPollingUtils.getJobResult}}, which is called by 
> {{ApplicationDispatcherBootstrap.getJobResult()}}.
> The behaviour described above can be found in this log [2].
> [1] 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-metadata-deleted-by-Flink-after-ZK-connection-issues-td37937.html
> [2] https://pastebin.com/raw/uH9KDU2L



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
