Hello, I am running a Flink job in application mode on Kubernetes. It is deployed as a FlinkDeployment whose lifecycle is managed by the flink-kubernetes-operator. The autoscaler is enabled with the following config:
  job.autoscaler.enabled: true
  job.autoscaler.metrics.window: 5m
  job.autoscaler.stabilization.interval: 1m
  job.autoscaler.target.utilization: 0.6
  job.autoscaler.target.utilization.boundary: 0.2
  pipeline.max-parallelism: 60
  jobmanager.scheduler: adaptive

During a scale-up event, the autoscaler increases the parallelism of one of the job vertices. This triggers a number of new TaskManager pods to be scheduled on the EKS cluster (the node group has an attached ASG). Only some of the requested TM pods get scheduled before the cluster runs out of resources; the remaining TM pods stay in the Pending state indefinitely and the job is stuck in a restart loop.

1. Shouldn't the adaptive scheduler reduce the vertex parallelism when the required slots/TMs are not available?

2. The pods stuck in the Pending state report the following events:

  Warning  FailedScheduling   4m55s (x287 over 23h)   default-scheduler   0/5 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's node affinity/selector, 3 Insufficient memory. preemption: 0/5 nodes are available: 1 Preemption is not helpful for scheduling, 4 No preemption victims found for incoming pod.
  Normal   NotTriggerScaleUp  3m26s (x8555 over 23h)  cluster-autoscaler  pod didn't trigger scale-up: 1 max node group size reached

The FailedScheduling warning suggests that the "default-scheduler" is being used. Why is that the case even though the adaptive scheduler is configured?

Appreciate it if you can shed some light on why this could be happening. I have included a trimmed-down sketch of the FlinkDeployment spec below for reference.

Thanks,
Chetas
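
For context, here is roughly what the FlinkDeployment manifest looks like. This is a trimmed-down sketch rather than the exact spec: the name, image, jar path, Flink version, slot count, and resource values are illustrative placeholders; only the flinkConfiguration entries quoted above are the real settings.

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-streaming-job                        # illustrative name
spec:
  image: my-registry/my-streaming-job:latest    # illustrative image
  flinkVersion: v1_17                           # placeholder; an autoscaler-capable version
  serviceAccount: flink
  flinkConfiguration:
    # Autoscaler and scheduler settings quoted above
    job.autoscaler.enabled: "true"
    job.autoscaler.metrics.window: "5m"
    job.autoscaler.stabilization.interval: "1m"
    job.autoscaler.target.utilization: "0.6"
    job.autoscaler.target.utilization.boundary: "0.2"
    pipeline.max-parallelism: "60"
    jobmanager.scheduler: adaptive
    # Illustrative value, not the real slot count
    taskmanager.numberOfTaskSlots: "2"
  jobManager:
    resource:
      cpu: 1                                    # illustrative resources
      memory: 2048m
  taskManager:
    resource:
      cpu: 2                                    # illustrative resources
      memory: 4096m
  job:
    jarURI: local:///opt/flink/usrlib/my-job.jar  # illustrative path
    parallelism: 4
    upgradeMode: last-state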