Hi Joey,

that looks like a cluster configuration issue. The address 192.168.100.79:6123 is not accessible from the JobManager pod (see line 1224f in the provided JM logs):

2021-11-19 04:06:45,049 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
2021-11-19 04:06:45,067 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@192.168.100.79:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@192.168.100.79:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]

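One way to double-check this independently of Flink is a plain TCP connect from inside the JobManager pod. The following is only a minimal sketch: it assumes a Python 3 interpreter is available in the pod (the stock Flink image may not ship one), and `can_connect` is a hypothetical helper name, not part of any Flink tooling.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as err:
        # Covers "no route to host", "connection refused" and timeouts.
        print(f"connect to {host}:{port} failed: {err}")
        return False
```

Calling `can_connect("192.168.100.79", 6123)` from the JobManager pod should return `False` with a "No route to host" style error if the pod really cannot reach that address, which would match the Akka warnings above.
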
The TaskManagers are able to communicate with the JobManager pod and are properly registered. The JobMaster, however, fails to connect to the ResourceManager, even though both run on the JobManager pod. As a result, SlotRequests are triggered but never actually fulfilled; they stay in the queue of pending SlotRequests. After the JobMaster has tried to reach the ResourceManager for some time, the slot request timeout kicks in. That's the NoResourceAvailableException you are experiencing.

Matthias

On Fri, Nov 19, 2021 at 7:02 AM Joey L <joey54...@gmail.com> wrote:

> Hi,
>
> I've set up a Flink 1.12.5 session cluster running on K8s with HA, and
> came across an issue with creating new jobs once the cluster has reached 20
> existing jobs. The first 20 jobs always get initialized and start running
> within 5 - 10 seconds.
>
> Any new job submission is stuck in Initializing state for a long time (10
> - 30 mins), and eventually it goes to Running but the tasks are stuck in
> Scheduled state despite there being free task slots available. The
> Scheduled jobs will eventually start running, but the delay could be up to
> an hour. Interestingly, this issue doesn't occur once I remove the HA
> config.
>
> Each task manager is configured to have 4 task slots, and I can see via
> the Flink UI that the task managers are registered correctly. (Refer to
> attached screenshot).
>
> [image: Screen Shot 2021-11-19 at 3.08.11 pm.png]
>
> In the logs, I can see that jobs stuck in Scheduled throw this exception
> after 5 minutes (even though there are slots available):
>
> ```
> java.util.concurrent.CompletionException:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout
> ```
>
> I've also attached the full job manager logs below.
>
> Any help/guidance would be appreciated.
>
> Thanks,
> Joey
>