Just for documentation purposes: I created FLINK-24147 [1] to cover this issue.
[1] https://issues.apache.org/jira/browse/FLINK-24147

On Thu, Aug 26, 2021 at 6:14 PM Matthias Pohl <matth...@ververica.com> wrote:

> I see - I should have checked my mailbox before answering. I received the
> email and was able to log in.
>
> On Thu, Aug 26, 2021 at 6:12 PM Matthias Pohl <matth...@ververica.com> wrote:
>
>> The link doesn't work, i.e. I'm redirected to a login page. It would also be
>> good to include the Flink logs and make them accessible to everyone. That
>> way, others could share their perspective as well...
>>
>> On Thu, Aug 26, 2021 at 5:40 PM Shah, Siddharth [Engineering] <siddharth.x.s...@gs.com> wrote:
>>
>>> Hi Matthias,
>>>
>>> Thank you for responding and taking the time to look at the issue.
>>>
>>> Uploaded the YARN logs here:
>>> https://lockbox.gs.com/lockbox/folders/963b0f29-85ad-4580-b420-8c66d9c07a84/
>>> and have also requested read permissions for you. Please let us know if
>>> you're not able to see the files.
>>>
>>> *From:* Matthias Pohl <matth...@ververica.com>
>>> *Sent:* Thursday, August 26, 2021 9:47 AM
>>> *To:* Shah, Siddharth [Engineering] <siddharth.x.s...@ny.email.gs.com>
>>> *Cc:* user@flink.apache.org; Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>
>>> *Subject:* Re: hdfs lease issues on flink retry
>>>
>>> Hi Siddharth,
>>> thanks for reaching out to the community. This might be a bug. Could you
>>> share your Flink and YARN logs? That would give us a better understanding
>>> of what's going on.
>>>
>>> Best,
>>> Matthias
>>>
>>> On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <siddharth.x.s...@gs.com> wrote:
>>>
>>> Hi Team,
>>>
>>> We are seeing transient failures, mostly in jobs that require higher
>>> resources, while using Flink RestartStrategies [1]. Upon checking the YARN
>>> logs, we observed HDFS lease issues when a Flink retry happens. The job
>>> originally fails on the first try with PartitionNotFoundException or
>>> NoResourceAvailableException, but on retry it appears from the YARN logs
>>> that the lease on the temp sink directory has not yet been released by the
>>> node from the previous try.
>>>
>>> Initial failure log message:
>>>
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>> Could not allocate enough slots to run the job. Please make sure that the
>>> cluster has enough resources.
>>>     at org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461)
>>>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>     at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
>>>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>
>>> Retry failure log message:
>>>
>>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException):
>>> /user/p2epda/lake/delp_prod/PROD/APPROVED/data/TECHRISK_SENTINEL/INFORMATION_REPORT/4377/temp/data/_temporary/0/_temporary/attempt__0000_r_000003_0/partMapper-r-00003.snappy.parquet
>>> for client 10.51.63.226 already exists
>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2815)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2702)
>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2586)
>>>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:736)
>>>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
>>>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>>>
>>> I could verify that it is the same nodes from the previous try that own the
>>> lease, and I checked multiple jobs by matching IP addresses. Ideally, we
>>> want the internal retry to succeed, since there will be thousands of jobs
>>> running at a time and it is hard to retry them manually.
>>>
>>> This is our current restart config:
>>>
>>> executionEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
>>>
>>> Is it possible to resolve the leases before a retry? Or is it possible to
>>> have a different sink directory for every retry (incrementing an attempt id
>>> somewhere), so that we have no lease issues? Or do you have any other
>>> suggestion for resolving this?
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#fixed-delay-restart-strategy
>>>
>>> Thanks,
>>> Siddharth
>>>
>>> ------------------------------
>>> Your Personal Data: We may collect and process information about you that
>>> may be subject to data protection laws. For more information about how we
>>> use and disclose your personal data, how we protect your information, our
>>> legal basis to use your information, your rights and who you can contact,
>>> please refer to: www.gs.com/privacy-notices
>
> --
> Matthias Pohl | Engineer
>
> Follow us @VervericaData Ververica <https://www.ververica.com/>
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink Conference
>
> Stream Processing | Event Driven | Real Time
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton Wehner
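The second workaround Siddharth asks about (a distinct sink directory per restart attempt, so a retry never collides with a file still leased from the failed attempt) can be sketched in plain Java. This is an illustrative helper only, not Flink API: Flink does not inject an attempt id into the sink path out of the box, so the submitting job would have to track the counter itself, and the names `AttemptScopedSinkPath` and `sinkPathForAttempt` are hypothetical.

```java
// Hypothetical helper: derive a unique temp sink directory per restart
// attempt, so leftover files (and any unreleased HDFS lease on them) from a
// failed attempt never collide with the paths the retry tries to create.
public class AttemptScopedSinkPath {

    /** Appends an attempt suffix to the base temp directory, e.g. ".../temp-attempt-1". */
    static String sinkPathForAttempt(String baseTempDir, int attempt) {
        // Strip a trailing slash so the suffix joins cleanly.
        String base = baseTempDir.endsWith("/")
                ? baseTempDir.substring(0, baseTempDir.length() - 1)
                : baseTempDir;
        return base + "-attempt-" + attempt;
    }

    public static void main(String[] args) {
        // Illustrative base directory; in the real job this would come from config.
        String base = "/tmp/job/temp";
        // The first run writes under ...-attempt-0; a retry writes under
        // ...-attempt-1, so the NameNode never sees a create() for a file
        // another client still holds a lease on.
        System.out.println(sinkPathForAttempt(base, 0));
        System.out.println(sinkPathForAttempt(base, 1));
    }
}
```

The trade-off of this approach is that abandoned attempt directories need a cleanup pass (or a TTL policy) so failed attempts do not accumulate orphaned temp data.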