For anyone who finds this thread while looking for an answer: the fix is available in Flink 1.18, see https://issues.apache.org/jira/browse/FLINK-31498. The proposed solution is not to request new TaskManagers while some slots are already pending.
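To illustrate the idea, here is a minimal sketch of why counting pending workers stops the request loop. It is not Flink's actual code: the class name, the method names, and the one-slot-per-worker assumption are hypothetical, and the model is much simpler than the real resource manager. It only simulates periodic checks during which no pending pod ever registers, as in the scenario quoted below.

```java
public class PendingAwareRequestSketch {

    // Assumption for illustration: every TaskManager offers exactly one slot.
    static final int SLOTS_PER_WORKER = 1;

    /**
     * Returns how many new workers one check round would request.
     * When countPending is false, pending workers are ignored (the looping behaviour).
     * When countPending is true, slots of pending workers count as already covered.
     */
    public static int workersToRequest(
            int requiredSlots, int registeredWorkers, int pendingWorkers, boolean countPending) {
        int coveredSlots = registeredWorkers * SLOTS_PER_WORKER;
        if (countPending) {
            coveredSlots += pendingWorkers * SLOTS_PER_WORKER;
        }
        int missingSlots = Math.max(0, requiredSlots - coveredSlots);
        // Round up to whole workers.
        return (missingSlots + SLOTS_PER_WORKER - 1) / SLOTS_PER_WORKER;
    }

    /** Simulates a number of check rounds in which pending pods never register. */
    public static int pendingAfter(
            int rounds, int requiredSlots, int registeredWorkers, boolean countPending) {
        int pending = 0;
        for (int i = 0; i < rounds; i++) {
            pending += workersToRequest(requiredSlots, registeredWorkers, pending, countPending);
        }
        return pending;
    }

    public static void main(String[] args) {
        // 5 slots required, 4 workers registered, 3 check rounds.
        System.out.println("ignoring pending: " + pendingAfter(3, 5, 4, false));
        System.out.println("counting pending: " + pendingAfter(3, 5, 4, true));
    }
}
```

In this toy model, ignoring pending workers adds one more pending pod on every round, while counting them leaves a single pending pod no matter how many rounds run.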
On Thu, Jul 4, 2024 at 2:00 PM Alex Nitavsky <alexnitav...@gmail.com> wrote:

> Hello community,
>
> I need your help and advice to troubleshoot an unexpected issue with Flink
> version 1.17.2. I'm facing a problem related to Kubernetes (K8s) pod
> allocation.
>
> I saw strange behaviour when Flink was allocating new TM pods. Flink was
> requesting new pods in a loop every 30 seconds. The newly allocated pods
> were stuck in the Pending state due to a scheduling issue on the K8s side,
> and Flink kept repeating its request.
>
> The interesting part is that Flink recognised that the number of pending
> pods was increasing, but it did not stop requesting new TMs. Pods were
> created, but never registered.
>
> Extract of the logs (full logs for
> `*@logger_name:org.apache.flink.kubernetes.* OR
> @logger_name:org.apache.flink.runtime.resourcemanager.active.**` are
> attached):
>
> - need request 1 new workers, current worker number 4, declared worker
>   number 5
> - Requesting new worker with resource spec WorkerResourceSpec {...},
>   current pending count: 1.
> - Creating new TaskManager pod with name
>   flink-metering-evp-taskmanager-1-6 and resource <61440,6.0>.
> - Pod flink-metering-evp-taskmanager-1-6 is created.
> - need request 1 new workers, current worker number 5, declared worker
>   number 6
> - Requesting new worker with resource spec WorkerResourceSpec {...},
>   current pending count: 2.
> - Creating new TaskManager pod with name
>   flink-metering-evp-taskmanager-1-7 and resource <61440,6.0>.
> - Pod flink-metering-evp-taskmanager-1-7 is created.
> - need request 1 new workers, current worker number 6, declared worker
>   number 7
>
> I am not sure whether this is some kind of race condition in the counter
> updates or a deliberate choice to work around the scheduling issue.
>
> Kind Regards
> Oleksandr