Hi Matthias,

Thanks for the response. I actually found the root issue a while after posting the question, and it is related to this JIRA ticket: https://issues.apache.org/jira/browse/FLINK-22006
It appears to be a limit on the number of ConfigMaps that can be watched concurrently on K8s, and adding this to my config worked:

```
containerized.master.env.KUBERNETES_MAX_CONCURRENT_REQUESTS: 300
env.java.opts.jobmanager: "-Dkubernetes.max.concurrent.requests=300"
```

Thanks,
Joey

On Tue, 23 Nov 2021 at 00:19, Matthias Pohl <matth...@ververica.com> wrote:

> Hi Joey,
> that looks like a cluster configuration issue. The address
> 192.168.100.79:6123 is not accessible from the JobManager pod (see line
> 1224f in the provided JM logs):
>
> 2021-11-19 04:06:45,049 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
> 2021-11-19 04:06:45,067 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@192.168.100.79:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@192.168.100.79:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
>
> The TaskManagers are able to communicate with the JobManager pod and are
> properly registered. The JobMaster, however, tries to connect to the
> ResourceManager (both running on the JobManager pod) but fails.
> SlotRequests are triggered but never actually fulfilled. They are put in
> the queue for pending SlotRequests. The timeout kicks in after trying to
> reach the ResourceManager for some time. That's the
> NoResourceAvailableException you are experiencing.
>
> Matthias
>
> On Fri, Nov 19, 2021 at 7:02 AM Joey L <joey54...@gmail.com> wrote:
>
>> Hi,
>>
>> I've set up a Flink 1.12.5 session cluster running on K8s with HA, and
>> came across an issue with creating new jobs once the cluster has reached
>> 20 existing jobs. The first 20 jobs always get initialized and start
>> running within 5 - 10 seconds.
>>
>> Any new job submission is stuck in Initializing state for a long time
>> (10 - 30 mins). Eventually it goes to Running, but the tasks are stuck in
>> Scheduled state despite there being free task slots available. The
>> Scheduled jobs will eventually start running, but the delay could be up
>> to an hour. Interestingly, this issue doesn't occur once I remove the HA
>> config.
>>
>> Each task manager is configured to have 4 task slots, and I can see via
>> the Flink UI that the task managers are registered correctly (refer to
>> the attached screenshot).
>>
>> [image: Screen Shot 2021-11-19 at 3.08.11 pm.png]
>>
>> In the logs, I can see that jobs stuck in Scheduled throw this exception
>> after 5 minutes (even though there are slots available):
>>
>> ```
>> java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
>> ```
>>
>> I've also attached the full job manager logs below.
>>
>> Any help/guidance would be appreciated.
>>
>> Thanks,
>> Joey