I don't know of any side effects of your approach. But another workaround I saw was replacing the _0 suffix with something like "_" + System.currentTimeMillis().
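For illustration, a rough sketch of that variant (untested, just a sketch: TaskAttemptID.forName() parses the trailing attempt counter as an int, so the raw millisecond value is truncated here, and the %06d format is my shorthand for the zero-padding in your snippet):

    // Sketch only: derive the attempt suffix from the wall clock so that
    // each restart attempt writes into a fresh attempt directory.
    // TaskAttemptID.forName() parses the last component as an int, so the
    // millisecond value is reduced modulo 1_000_000 to keep it in int range.
    int attemptSuffix = (int) (System.currentTimeMillis() % 1_000_000);
    TaskAttemptID taskAttemptID = TaskAttemptID.forName(
            "attempt__0000_r_"
            + String.format("%06d", taskNumber + 1)
            + "_" + attemptSuffix);

Like your random prefix, this only makes collisions unlikely rather than impossible, but a timestamp taken at open() time should not repeat across quick retries.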
On Fri, Sep 17, 2021 at 8:38 PM Shah, Siddharth <siddharth.x.s...@gs.com> wrote:

> Hi Matthias,
>
> Thanks for looking into the issue and creating a ticket. I am thinking of
> having a workaround until the issue is fixed.
>
> What if I create the attempt directories with a random int by patching
> HadoopOutputFormatBase's open() method?
>
> Original:
>
>     TaskAttemptID taskAttemptID = TaskAttemptID.forName(
>             "attempt__0000_r_"
>             + String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
>             + Integer.toString(taskNumber + 1)
>             + "_0");
>
> Patched:
>
>     int attemptRandomPrefix = new Random().nextInt(999);
>     TaskAttemptID taskAttemptID = TaskAttemptID.forName(
>             "attempt__"
>             + String.format("%" + (4 - Integer.toString(attemptRandomPrefix).length()) + "s", " ").replace(" ", "0")
>             + Integer.toString(attemptRandomPrefix)
>             + "_r_"
>             + String.format("%" + (6 - Integer.toString(taskNumber + 1).length()) + "s", " ").replace(" ", "0")
>             + Integer.toString(taskNumber + 1)
>             + "_0");
>
> So basically I am creating a directory named attempt__0123_r_0001_0
> instead of attempt__0000_r_0001_0. I have tested this on a handful of our
> jobs and it seems to be working fine. I just wanted to check for any
> downside of this change that I may not be aware of.
>
> Thanks,
> Siddharth
>
> From: Matthias Pohl <matth...@ververica.com>
> Sent: Tuesday, September 07, 2021 5:06 AM
> To: Shah, Siddharth [Engineering] <siddharth.x.s...@ny.email.gs.com>
> Cc: user@flink.apache.org; Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>
> Subject: Re: hdfs lease issues on flink retry
>
> Just for documentation purposes: I created FLINK-24147 [1] to cover this
> issue.
>
> [1] https://issues.apache.org/jira/browse/FLINK-24147
>
> On Thu, Aug 26, 2021 at 6:14 PM Matthias Pohl <matth...@ververica.com> wrote:
>
> I see - I should have checked my mailbox before answering. I received the
> email and was able to log in.
>
> On Thu, Aug 26, 2021 at 6:12 PM Matthias Pohl <matth...@ververica.com> wrote:
>
> The link doesn't work, i.e. I'm redirected to a login page. It would also
> be good to include the Flink logs and make them accessible for everyone.
> That way others could share their perspective as well...
>
> On Thu, Aug 26, 2021 at 5:40 PM Shah, Siddharth [Engineering] <siddharth.x.s...@gs.com> wrote:
>
> Hi Matthias,
>
> Thank you for responding and taking the time to look at the issue.
>
> I uploaded the YARN logs here:
> https://lockbox.gs.com/lockbox/folders/963b0f29-85ad-4580-b420-8c66d9c07a84/
> and have also requested read permissions for you. Please let us know if
> you're not able to see the files.
>
> From: Matthias Pohl <matth...@ververica.com>
> Sent: Thursday, August 26, 2021 9:47 AM
> To: Shah, Siddharth [Engineering] <siddharth.x.s...@ny.email.gs.com>
> Cc: user@flink.apache.org; Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>
> Subject: Re: hdfs lease issues on flink retry
>
> Hi Siddharth,
> thanks for reaching out to the community. This might be a bug. Could you
> share your Flink and YARN logs? This way we could get a better
> understanding of what's going on.
>
> Best,
> Matthias
>
> On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <siddharth.x.s...@gs.com> wrote:
>
> Hi Team,
>
> We are seeing transient failures in jobs that mostly require higher
> resources and use Flink RestartStrategies [1]. Upon checking the YARN
> logs, we have observed HDFS lease issues when a Flink retry happens. The
> job originally fails on the first try with PartitionNotFoundException or
> NoResourceAvailableException, but on retry it appears from the YARN logs
> that the lease on the temp sink directory has not yet been released by
> the node from the previous try.
>
> Initial failure log message:
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate enough slots to run the job. Please make sure that the cluster has enough resources.
>     at org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>
> Retry failure log message:
>
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException): /user/p2epda/lake/delp_prod/PROD/APPROVED/data/TECHRISK_SENTINEL/INFORMATION_REPORT/4377/temp/data/_temporary/0/_temporary/attempt__0000_r_000003_0/partMapper-r-00003.snappy.parquet for client 10.51.63.226 already exists
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2815)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2702)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2586)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:736)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>
> I could verify that it is the same nodes from the previous try that own
> the lease, and I checked this for multiple jobs by matching IP addresses.
> Ideally, we want an internal retry to happen, since there will be
> thousands of jobs running at a time and it is hard to retry them manually.
>
> This is our current restart config:
>
>     executionEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
>
> Is it possible to resolve leases before a retry? Or is it possible to
> have different sink directories (incrementing an attempt id somewhere)
> for every retry, so that we have no lease issues? Or do you have any
> other suggestions for resolving this?
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#fixed-delay-restart-strategy
>
> Thanks,
> Siddharth