[ https://issues.apache.org/jira/browse/FLINK-18785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171299#comment-17171299 ]

Kai Chen edited comment on FLINK-18785 at 8/5/20, 7:01 AM:
-----------------------------------------------------------

I found that the Flink job exits normally if I set yarn.application-attempts=1 in flink-1.10.

was (Author: yuchuanchen):
I found that the Flink job exits normally if I set yarn.application-attempts=1 or just delete this config in flink-1.10.

> flink goes into dead lock leader election when restoring from a do-not-exist checkpoint/savepoint path
> ------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-18785
>                 URL: https://issues.apache.org/jira/browse/FLINK-18785
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.10.0, 1.10.1
>         Environment: flink on yarn
> flink-1.10.x
> jdk8
> yarn.application-attempts: 2
>            Reporter: Kai Chen
>            Priority: Major
>         Attachments: image-2020-07-31-19-04-19-241.png
>
>
> Flink gets stuck in leader election when restoring from a nonexistent checkpoint/savepoint path.
> I just ran this command:
> bin/flink run -m yarn-cluster -s "hdfs:///do/not/exist/path" examples/streaming/WindowJoin.jar
> When I visit the UI, I see this:
> !image-2020-07-31-19-04-19-241.png!
> In flink-1.9.3, the program just exits. But in 1.10.x, it gets stuck in leader election.
>
> Here is the YARN AM stack trace:
> Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1584)
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>     ... 4 more
> Caused by: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
>     at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
>     at org.apache.flink.util.function.CheckedSupplier$$Lambda$125/2030620598.get(Unknown Source)
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1582)
>     ... 6 more
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
>     at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:152)
>     at org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:381)
>     at org.apache.flink.runtime.dispatcher.Dispatcher$$Lambda$124/2142053998.get(Unknown Source)
>     at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
>     ... 8 more
> Caused by: java.io.FileNotFoundException: Cannot find checkpoint or savepoint file/directory 'hdfs:///path/do/not/exist' on file system 'hdfs'.
>     at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:243)
>     at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:110)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1152)
>     at org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:307)
>     at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:240)
>     at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:216)
>     at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:120)
>     at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:105)
>     at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:278)
>     at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:266)
>     at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
>     at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
>     at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146)
>     ... 12 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
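
For reference, the workaround described in the comment above (setting yarn.application-attempts=1 on flink-1.10) might look like the following fragment. This is only a sketch: placing the setting in conf/flink-conf.yaml is an assumption, since the comment does not say where it was set. It reflects what the commenter observed and is not a fix for the underlying issue. With a single attempt, YARN will not restart the ApplicationMaster after a failure.

```yaml
# conf/flink-conf.yaml (assumed location; the comment does not say where the value was set)
# Workaround reported in the comment: with a single application attempt,
# the job exits instead of getting stuck in leader election on flink-1.10.
# The reporter's environment had this set to 2.
yarn.application-attempts: 1
```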