Re: Kubernetes HA: New jobs stuck in Initializing for a long time after a certain number of existing jobs are running

Matthias Pohl Mon, 22 Nov 2021 05:19:32 -0800

Hi Joey,
that looks like a cluster configuration issue. The address
192.168.100.79:6123 is not reachable from the JobManager pod (see line 1224f
in the provided JM logs):
   2021-11-19 04:06:45,049 WARN  akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
   2021-11-19 04:06:45,067 WARN  akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@192.168.100.79:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@192.168.100.79:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
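
If it helps to find every occurrence in the full log, here is a minimal
Python sketch that counts the gated associations per address. The function
name gated_addresses is made up for illustration, and the pattern assumes the
warning is worded exactly as in the excerpt above:

```python
import re
from collections import Counter

# Matches the ReliableDeliverySupervisor warning quoted above. The optional
# whitespace after "akka.tcp://" tolerates log lines that were hard-wrapped.
GATED = re.compile(
    r"Association with remote system \[akka\.tcp://\s?flink@([^\]:]+):(\d+)\] has failed"
)

def gated_addresses(log_text):
    """Count failed associations per (host, port) in JobManager log text."""
    # Collapse all runs of whitespace so wrapped log lines match as one line.
    flat = " ".join(log_text.split())
    return Counter((host, int(port)) for host, port in GATED.findall(flat))

# Usage (the file name is a placeholder):
#   with open("jobmanager.log") as f:
#       print(gated_addresses(f.read()))
```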


The TaskManagers are able to communicate with the JobManager pod and are
properly registered. The JobMaster, however, tries to connect to the
ResourceManager (both run on the JobManager pod) but fails. As a result,
SlotRequests are triggered but never fulfilled; they stay in the queue of
pending SlotRequests. After trying to reach the ResourceManager for some
time, the slot request timeout kicks in. That's the
NoResourceAvailableException you are experiencing.
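
To confirm the connectivity problem independently of Flink, you could run a
plain TCP check from inside the JobManager pod. This is only a rough sketch:
the helper name can_connect and the 3 second timeout are my own choices, and
it assumes a Python interpreter is available in the pod:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", refused connections, and timeouts.
        return False

# Usage, with the address taken from the log excerpt:
#   print(can_connect("192.168.100.79", 6123))
```

If this returns False from the JobManager pod, the problem is in the pod
networking or in the address the ResourceManager advertises, not in the job
itself.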

Matthias

On Fri, Nov 19, 2021 at 7:02 AM Joey L <joey54...@gmail.com> wrote:

> Hi,
>
> I've set up a Flink 1.12.5 session cluster running on K8s with HA, and
> came across an issue with creating new jobs once the cluster has reached 20
> existing jobs. The first 20 jobs always get initialized and start running
> within 5 - 10 seconds.
>
> Any new job submission is stuck in Initializing state for a long time (10
> - 30 mins), and eventually it goes to Running but the tasks are stuck in
> Scheduled state despite there being free task slots available. The
> Scheduled jobs will eventually start running, but the delay could be up to
> an hour. Interestingly, this issue doesn't occur once I remove the HA
> config.
>
> Each task manager is configured to have 4 task slots, and I can see via
> the Flink UI that the task managers are registered correctly. (Refer to
> attached screenshot).
>
> [image: Screen Shot 2021-11-19 at 3.08.11 pm.png]
>
> In the logs, I can see that jobs stuck in Scheduled throw this exception
> after 5 minutes (even though there are slots available):
>
> ```
> java.util.concurrent.CompletionException:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout
> ```
>
> I've also attached the full job manager logs below.
>
> Any help/guidance would be appreciated.
>
> Thanks,
> Joey
>
