Hi Siddharth, thanks for reaching out to the community. This might be a bug. Could you share your Flink and YARN logs? This way we could get a better understanding of what's going on.
Best, Matthias On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] < siddharth.x.s...@gs.com> wrote: > Hi Team, > > > > We are seeing transient failures in the jobs mostly requiring higher > resources and using flink RestartStrategies > <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#fixed-delay-restart-strategy> > [1]. Upon checking the yarn logs we have observed hdfs lease issues when > flink retry happens. The job originally fails for the first try with > PartitionNotFoundException > or NoResourceAvailableException., but on retry it seems form the yarn logs > is that the lease for the temp sink directory is not yet released by the > node from previous try. > > > > Initial Failure log message: > > > > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: > Could not allocate enough slots to run the job. Please make sure that the > cluster has enough resources. > > at > org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461) > > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > > at > org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190) > > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) > > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) > > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > > > > > > Retry failure log message: > > > > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException): > > /user/p2epda/lake/delp_prod/PROD/APPROVED/data/TECHRISK_SENTINEL/INFORMATION_REPORT/4377/temp/data/_temporary/0/_temporary/attempt__0000_r_000003_0/partMapper-r-00003.snappy.parquet > for client 10.51.63.226 already exists > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2815) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2702) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2586) > > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:736) > > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:409) > > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) > > > > > > > > I could verify that it’s the same nodes from previous try owning the > lease, and checked for multiple jobs by matching IP addresses. Ideally, we > want an internal retry to happen since there will be thousands of jobs > running at a time and hard to manually retry them. > > > > This is our current restart config: > > executionEnv.setRestartStrategy(RestartStrategies.*fixedDelayRestart*(3, > Time.*of*(10, TimeUnit.*SECONDS*))); > > > > Is it possible to resolve leases before a retry? Or is it possible to have > different sink directories (increment attempt id somewhere) for every > retry, that way we have no lease issues? Or do you have any other > suggestion on resolving this? > > > > > > [1] > https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#fixed-delay-restart-strategy > > > > > > Thanks, > > Siddharth > > > > ------------------------------ > > Your Personal Data: We may collect and process information about you that > may be subject to data protection laws. For more information about how we > use and disclose your personal data, how we protect your information, our > legal basis to use your information, your rights and who you can contact, > please refer to: www.gs.com/privacy-notices >