Re: Kubernetes HA: New jobs stuck in Initializing for a long time after a certain number of existing jobs are running

Joey L Mon, 22 Nov 2021 14:10:26 -0800

Hi Matthias,

Thanks for the response. I actually found the root issue a while after
posting the question, and it is related to this JIRA ticket:
https://issues.apache.org/jira/browse/FLINK-22006


It appears to be a limit on the number of concurrent ConfigMap watches the
Kubernetes client can keep open, and adding this to my config worked.

```
containerized.master.env.KUBERNETES_MAX_CONCURRENT_REQUESTS: 300
env.java.opts.jobmanager: "-Dkubernetes.max.concurrent.requests=300"
```
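
For anyone wondering why the ceiling lands at roughly 20 jobs, here is a rough back-of-the-envelope sketch in Python. The figures are assumptions and not taken from the logs in this thread: a default client cap of 64 concurrent requests, about 3 long-lived ConfigMap watches per HA job, and a few watches reserved for cluster-level components. The helper `max_ha_jobs` is hypothetical and exists only for this illustration.

```python
def max_ha_jobs(max_concurrent_requests: int,
                watches_per_job: int,
                cluster_watches: int = 0) -> int:
    """Estimate how many jobs fit before long-lived watches use up the request cap.

    All inputs are illustrative assumptions, not values read from Flink or Kubernetes.
    """
    if watches_per_job <= 0:
        raise ValueError("watches_per_job must be positive")
    # Requests left over once cluster-level watches are accounted for.
    available = max_concurrent_requests - cluster_watches
    return max(available // watches_per_job, 0)


# Assumed default cap of 64, 3 watches per job, 3 cluster-level watches.
print(max_ha_jobs(64, 3, cluster_watches=3))
# Same assumptions with the cap raised to 300, as in the config above.
print(max_ha_jobs(300, 3, cluster_watches=3))
```

If those assumptions are roughly right, the estimate with a cap of 64 is about 20 jobs, which matches what I was seeing, and raising the cap to 300 leaves room for about 99.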

Thanks,
Joey

On Tue, 23 Nov 2021 at 00:19, Matthias Pohl <matth...@ververica.com> wrote:

> Hi Joey,
> that looks like a cluster configuration issue. The address 192.168.100.79:6123 is
> not accessible from the JobManager pod (see line 1224f in the provided JM
> logs):
>    2021-11-19 04:06:45,049 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
>    2021-11-19 04:06:45,067 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@192.168.100.79:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@192.168.100.79:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
>
> The TaskManagers are able to communicate with the JobManager pod and are
> properly registered. The JobMaster, instead, tries to connect to the
> ResourceManager (both running on the JobManager pod) but fails.
> SlotRequests are triggered but never actually fulfilled. They are put in
> the queue for pending SlotRequests. The timeout kicks in after trying to
> reach the ResourceManager for some time. That's
> the NoResourceAvailableException you are experiencing.
>
> Matthias
>
> On Fri, Nov 19, 2021 at 7:02 AM Joey L <joey54...@gmail.com> wrote:
>
>> Hi,
>>
>> I've set up a Flink 1.12.5 session cluster running on K8s with HA, and
>> came across an issue with creating new jobs once the cluster has reached 20
>> existing jobs. The first 20 jobs always get initialized and start running
>> within 5 - 10 seconds.
>>
>> Any new job submission is stuck in Initializing state for a long time (10
>> - 30 mins), and eventually it goes to Running but the tasks are stuck in
>> Scheduled state despite there being free task slots available. The
>> Scheduled jobs will eventually start running, but the delay could be up to
>> an hour. Interestingly, this issue doesn't occur once I remove the HA
>> config.
>>
>> Each task manager is configured to have 4 task slots, and I can see via
>> the Flink UI that the task managers are registered correctly. (Refer to
>> attached screenshot).
>>
>> [image: Screen Shot 2021-11-19 at 3.08.11 pm.png]
>>
>> In the logs, I can see that jobs stuck in Scheduled throw this exception
>> after 5 minutes (even though there are slots available):
>>
>> ```
>> java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
>> ```
>>
>> I've also attached the full job manager logs below.
>>
>> Any help/guidance would be appreciated.
>>
>> Thanks,
>> Joey
>>
>
