Hi Siddharth,
thanks for reaching out to the community. This might be a bug. Could you
share your Flink and YARN logs? This way we could get a better
understanding of what's going on.

Best,
Matthias

On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <
siddharth.x.s...@gs.com> wrote:

> Hi Team,
>
>
>
> We are seeing transient failures in jobs that mostly require higher
> resources and use Flink RestartStrategies
> <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#fixed-delay-restart-strategy>
> [1]. Upon checking the YARN logs we have observed HDFS lease issues when
> the Flink retry happens. The job originally fails on the first try with
> PartitionNotFoundException or NoResourceAvailableException, but on retry
> it appears from the YARN logs that the lease for the temp sink directory
> has not yet been released by the node from the previous try.
>
>
>
> Initial Failure log message:
>
>
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate enough slots to run the job. Please make sure that the
> cluster has enough resources.
>
>         at
> org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461)
>
>         at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>
>         at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>
>         at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>
>         at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>
>         at
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
>
>         at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>
>         at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>
>         at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>
>         at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>
>
>
>
>
> Retry failure log message:
>
>
>
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException):
>  
> /user/p2epda/lake/delp_prod/PROD/APPROVED/data/TECHRISK_SENTINEL/INFORMATION_REPORT/4377/temp/data/_temporary/0/_temporary/attempt__0000_r_000003_0/partMapper-r-00003.snappy.parquet
>  for client 10.51.63.226 already exists
>
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2815)
>
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2702)
>
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2586)
>
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:736)
>
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
>
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>
>
>
>
>
>
>
> I could verify that it is the same nodes from the previous try that still
> own the lease, and I checked multiple jobs by matching IP addresses.
> Ideally, we want the internal retry to succeed, since there will be
> thousands of jobs running at a time and it is hard to retry them manually.
>
>
>
> This is our current restart config:
>
> executionEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,
> Time.of(10, TimeUnit.SECONDS)));
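>
> For reference, the equivalent flink-conf.yaml settings from the linked docs
> page [1] should be approximately the following (shown only for illustration;
> we currently set the strategy programmatically as above):
>
> restart-strategy: fixed-delay
> restart-strategy.fixed-delay.attempts: 3
> restart-strategy.fixed-delay.delay: 10 s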
>
>
>
> Is it possible to resolve the leases before a retry? Or is it possible to
> have different sink directories (incrementing an attempt id somewhere) for
> every retry, so that the lease issue does not arise at all? Or do you have
> any other suggestion for resolving this? A rough sketch of the per-attempt
> directory idea is shown below.
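>
> The sketch below is purely illustrative of what we mean (the sink class and
> path layout are hypothetical, not something we run today; getAttemptNumber()
> is the existing Flink RuntimeContext method):
>
> import org.apache.flink.configuration.Configuration;
> import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
>
> public class AttemptScopedSink<T> extends RichSinkFunction<T> {
>
>     private final String baseDir;          // e.g. ".../temp/data"
>     private transient String attemptDir;   // resolved per restart attempt
>
>     public AttemptScopedSink(String baseDir) {
>         this.baseDir = baseDir;
>     }
>
>     @Override
>     public void open(Configuration parameters) {
>         // Each restart attempt writes under its own directory, e.g.
>         // .../temp/data/attempt-1 instead of the shared .../temp/data,
>         // so a lease still held on the old files cannot block the retry.
>         int attempt = getRuntimeContext().getAttemptNumber();
>         attemptDir = baseDir + "/attempt-" + attempt;
>     }
>
>     @Override
>     public void invoke(T value, Context context) {
>         // write the record under attemptDir (writer wiring omitted)
>     }
> }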
>
>
>
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#fixed-delay-restart-strategy
>
>
>
>
>
> Thanks,
>
> Siddharth
>
>
>
