[ https://issues.apache.org/jira/browse/FLINK-27245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
hjw reopened FLINK-27245:
-------------------------
In which version of Flink was this problem fixed? Thanks.

> Flink job on YARN cannot recover when ZooKeeper is in an exception state
> ------------------------------------------------------------------------
>
>                 Key: FLINK-27245
>                 URL: https://issues.apache.org/jira/browse/FLINK-27245
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.7.2
>         Environment: Flink: 1.7.2
>                      HDFS: 3.1.1
>                      ZooKeeper: 3.5.1
>                      HA settings defined in flink-conf.yaml:
>                      flink.security.enable: true
>                      fs.output.always-create-directory: false
>                      fs.overwrite-files: false
>                      high-availability.job.delay: 10 s
>                      high-availability.storageDir: hdfs:///flink/recovery
>                      high-availability.zookeeper.client.acl: creator
>                      high-availability.zookeeper.client.connection-timeout: 15000
>                      high-availability.zookeeper.client.max-retry-attempts: 3
>                      high-availability.zookeeper.client.retry-wait: 5000
>                      high-availability.zookeeper.client.session-timeout: 60000
>                      high-availability.zookeeper.path.root: /flink
>                      high-availability.zookeeper.quorum: zk01:24002,zk02:24002,zk03:24002
>                      high-availability: zookeeper
>            Reporter: hjw
>            Priority: Major
>         Attachments: Job-failed.txt, Job-recover-failed.txt,
>                      zookeeper-omm-server-a-dsj-ghficn01.2022-04-07_20-09-25.[1].log
>
>
> The Flink job cannot recover when ZooKeeper is in an exception state.
> I noticed that the data under high-availability.storageDir is deleted when the job fails, which causes recovery to fail when the job is brought up again.
>
> {code:java}
> (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:29,002 | INFO | [Suspend state waiting handler] | Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10 seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:29,004 | INFO | [Suspend state waiting handler] | Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10 seconds.
> | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:29,004 | INFO | [Suspend state waiting handler] | Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 10 seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:30,002 | INFO | [Suspend state waiting handler] | Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:30,002 | INFO | [Suspend state waiting handler] | Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:30,004 | INFO | [Suspend state waiting handler] | Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:30,004 | INFO | [Suspend state waiting handler] | Connection to Zookeeper is SUSPENDED. Wait it to be back. Already waited 11 seconds. | org.apache.flink.runtime.leaderelection.SmarterLeaderLatch (SmarterLeaderLatch.java:570)
> 2022-04-07 19:54:30,769 | INFO | [BlobServer shutdown hook] | FileSystemBlobStore cleaning up hdfs:/flink/recovery/application_1625720467511_45233. | org.apache.flink.runtime.blob.FileSystemBlobStor
> {code}
>
> {code:java}
> 2022-04-07 19:55:29,452 | INFO | [flink-akka.actor.default-dispatcher-4] | Recovered SubmittedJobGraph(1898637f2d11429bd5f5767ea1daaf79, null). | org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore (ZooKeeperSubmittedJobGraphStore.java:215)
> 2022-04-07 19:55:29,467 | ERROR | [flink-akka.actor.default-dispatcher-17] | Fatal error occurred in the cluster entrypoint.
> | org.apache.flink.runtime.entrypoint.ClusterEntrypoint (ClusterEntrypoint.java:408)
> java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
>     at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
>     at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
>     at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308)
>     at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
>     ...
> 7 common frames omitted
> Caused by: java.lang.Exception: Cannot set up the user code libraries: File does not exist: /flink/recovery/application_1625720467511_45233/blob/job_1898637f2d11429bd5f5767ea1daaf79/blob_p-7128d0ae4a06a277e3b1182c99eb616ffd45b590-c90586d4a5d4641fcc0c9e4cab31c131
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
>     at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:153)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1951)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:742)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:439)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:928)
> {code}
>
> The log above was captured when the job failed while ZooKeeper encountered the error, and I then tried to restart the JobManager via YARN and ZooKeeper.
> The error happened at 2022-04-07 19:54.
> By the way, where can I learn about the implementation and principles of Flink HA?
> Thanks.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
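A side note on the ZooKeeper client settings quoted in the issue's Environment: `max-retry-attempts: 3` and `retry-wait: 5000` bound how long the client keeps retrying after a connection loss before giving up. The sketch below is illustrative only; it assumes a simple fixed-wait retry policy, whereas the actual Curator retry policy used by Flink may apply exponential backoff, so treat the result as a rough lower bound rather than the exact behaviour.

```java
// Illustrative sketch: approximate the ZooKeeper client retry window implied
// by the flink-conf.yaml values quoted above. Assumes a fixed wait between
// attempts (an assumption, not Flink's exact retry policy).
public class ZkRetryWindow {
    public static void main(String[] args) {
        int maxRetryAttempts = 3;        // high-availability.zookeeper.client.max-retry-attempts
        int retryWaitMs = 5000;          // high-availability.zookeeper.client.retry-wait
        int sessionTimeoutMs = 60000;    // high-availability.zookeeper.client.session-timeout

        // With a fixed wait between attempts, the client stops retrying after
        // roughly maxRetryAttempts * retryWaitMs of accumulated waiting.
        int retryWindowMs = maxRetryAttempts * retryWaitMs;
        System.out.println("approx. retry window: " + retryWindowMs + " ms");

        // The ZooKeeper session itself survives up to sessionTimeoutMs, so with
        // these values the client can give up well before the session expires.
        System.out.println("retry window < session timeout: "
                + (retryWindowMs < sessionTimeoutMs));
    }
}
```

Under this simplification the client stops retrying after about 15 s, which is consistent with the log above showing the cluster entrypoint shutting down (and the BlobServer shutdown hook cleaning up the HDFS recovery directory) well within the 60 s session timeout.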