Hey!

Let me first answer your questions and then hopefully suggest an actual solution :)
1. The adaptive scheduler would not reduce the vertex's desired parallelism in this case, but it should still allow the job to start, depending on the lower/upper bound resource configuration. There have been some changes in how the Kubernetes operator sets these resource requirements: in the latest 1.8.0 release we only set the upper bound, so the job can still start with a smaller parallelism. So Flink will ultimately keep trying to schedule pods, but ideally the job would also start/run. I would look at the scheduler logs (maybe at DEBUG level) for more detail. You can look at configs like:

https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-wait-timeout

2. "default-scheduler" here refers to the Kubernetes pod scheduler, not to Flink's schedulers, so this is normal.

As for the solution to the problem: the thing to do is to make the autoscaler aware of the cluster resource limits in the first place, so that we don't scale the job too high. There has been some work on autodetecting these limits: https://issues.apache.org/jira/browse/FLINK-33771

You can set:

    kubernetes.operator.cluster.resource-view.refresh-interval: 5 min

to turn this on. Alternatively, a simpler approach would be to directly limit the parallelism of the scaling decisions via:

    job.autoscaler.vertex.max-parallelism

Cheers,
Gyula

On Mon, May 6, 2024 at 8:09 AM Chetas Joshi <chetas.jo...@gmail.com> wrote:

> Hello,
>
> I am running a Flink job in application mode on k8s. It's deployed as
> a FlinkDeployment and its life-cycle is managed by the flink-k8s-operator.
> The autoscaler is being used with the following config:
>
> job.autoscaler.enabled: true
> job.autoscaler.metrics.window: 5m
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.target.utilization: 0.6
> job.autoscaler.target.utilization.boundary: 0.2
> pipeline.max-parallelism: 60
> jobmanager.scheduler: adaptive
>
> During a scale-up event, the autoscaler increases the parallelism of one
> of the job vertices to a higher value. This triggers a bunch of new task
> managers to be scheduled on the EKS cluster (the node group has an attached
> ASG). Out of all the requested TM pods only some get scheduled, and then
> the cluster runs out of resources. The other TM pods remain in the
> "pending" state indefinitely and the job is stuck in the "restart" loop
> forever.
>
> 1. Shouldn't the adaptive scheduler reduce the vertex parallelism due to
> the slots/TMs not being available?
>
> 2. When I looked at the pods stuck in the pending state, I found them to
> be reporting the following events:
>
>   Warning  FailedScheduling   4m55s (x287 over 23h)   default-scheduler
>   0/5 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match
>   Pod's node affinity/selector, 3 Insufficient memory. preemption: 0/5
>   nodes are available: 1 Preemption is not helpful for scheduling, 4 No
>   preemption victims found for incoming pod.
>
>   Normal   NotTriggerScaleUp  3m26s (x8555 over 23h)  cluster-autoscaler
>   pod didn't trigger scale-up: 1 max node group size reached
>
> The WARN suggests that the "default-scheduler" is being used. Why is that
> the case even though the adaptive scheduler is configured to be used?
>
> Appreciate it if you can shed some light on why this could be happening.
>
> Thanks
> Chetas
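
P.S. For reference, here is a minimal sketch of how the two suggested settings could look in a FlinkDeployment's flinkConfiguration. This is illustrative only: the max-parallelism value of 20 is a placeholder, not a recommendation, and depending on your operator setup the kubernetes.operator.* option may instead belong in the operator's own configuration.

```yaml
# Sketch of the relevant flinkConfiguration entries (values are placeholders).
flinkConfiguration:
  # Option 1: let the operator build a cluster resource view (FLINK-33771)
  # so the autoscaler can take capacity limits into account:
  kubernetes.operator.cluster.resource-view.refresh-interval: "5 min"
  # Option 2: hard cap on the parallelism any single vertex can be scaled to:
  job.autoscaler.vertex.max-parallelism: "20"
```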