Hey Gyula,

Thanks for getting back.
1) Yes, some more testing revealed that the job was able to start with a
lower parallelism, i.e. lower than the upper bound set by the adaptive
scheduler.

2) I am limiting the parallelism of every job vertex by setting
pipeline.max-parallelism to a value that keeps the number of TMs in check
for the capacity available on my EKS cluster (a rough sketch of what this
looks like is below).
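In case the shape of it helps, here is a minimal sketch of that kind of
setup, assuming the keys are set via the FlinkDeployment's
spec.flinkConfiguration (the spec is trimmed, and the slots-per-TM value is
only illustrative, not a recommendation):

    apiVersion: flink.apache.org/v1beta1
    kind: FlinkDeployment
    spec:
      flinkConfiguration:
        jobmanager.scheduler: adaptive
        job.autoscaler.enabled: "true"
        # Cap per-vertex parallelism; with default slot sharing the job
        # then needs at most 60 slots.
        pipeline.max-parallelism: "60"
        # Slots per TM (assumed value): 60 slots / 4 slots per TM means
        # at most ~15 TM pods, which fits the node group.
        taskmanager.numberOfTaskSlots: "4"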
Chetas

On Sun, May 5, 2024 at 11:39 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hey!
>
> Let me first answer your questions, then hopefully provide an actual
> solution :)
>
> 1. The adaptive scheduler would not reduce the vertex desired parallelism
> in this case, but it should allow the job to start depending on the
> lower/upper bound resource config. There have been some changes in how the
> k8s operator sets these resource requirements; in the latest 1.8.0 we only
> set the upper bound so that the job can still start with a smaller
> parallelism. So Flink will ultimately keep trying to schedule pods, but
> ideally the job would also start/run. I would look at the scheduler logs
> (maybe at debug level) for more detail.
>
> You can look at configs like:
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-wait-timeout
>
> 2. "Default scheduler" here refers to the Kubernetes pod scheduler, not
> Flink's schedulers. So this is normal.
>
> As for the solution to the problem: the thing to do is to make the
> autoscaler aware of the resource limits in the first place so that we
> don't scale the job too high. There has been some work on autodetecting
> these limits: https://issues.apache.org/jira/browse/FLINK-33771
>
> You can set:
> kubernetes.operator.cluster.resource-view.refresh-interval: 5 min
>
> to turn this on. Alternatively, a simpler approach would be to directly
> limit the parallelism of the scaling decisions:
> job.autoscaler.vertex.max-parallelism
>
> Cheers,
> Gyula
>
> On Mon, May 6, 2024 at 8:09 AM Chetas Joshi <chetas.jo...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I am running a Flink job in application mode on k8s. It's deployed as
>> a FlinkDeployment and its life-cycle is managed by the flink-k8s-operator.
>> The autoscaler is being used with the following config:
>>
>> job.autoscaler.enabled: true
>> job.autoscaler.metrics.window: 5m
>> job.autoscaler.stabilization.interval: 1m
>> job.autoscaler.target.utilization: 0.6
>> job.autoscaler.target.utilization.boundary: 0.2
>> pipeline.max-parallelism: 60
>> jobmanager.scheduler: adaptive
>>
>> During a scale-up event, the autoscaler increases the parallelism of one
>> of the job vertices to a higher value. This triggers a bunch of new task
>> managers to be scheduled on the EKS cluster (the node group has an
>> attached ASG). Out of all the requested TM pods only some get scheduled,
>> and then the cluster runs out of resources. The other TM pods remain in
>> the "pending" state indefinitely and the job is stuck in the restart loop
>> forever.
>>
>> 1. Shouldn't the adaptive scheduler reduce the vertex parallelism due to
>> the slots/TMs not being available?
>> 2. When I looked at the pods stuck in the pending state, I found them to
>> be reporting the following events:
>>
>>   Warning  FailedScheduling   4m55s (x287 over 23h)   default-scheduler
>>   0/5 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match
>>   Pod's node affinity/selector, 3 Insufficient memory. preemption: 0/5
>>   nodes are available: 1 Preemption is not helpful for scheduling, 4 No
>>   preemption victims found for incoming pod.
>>
>>   Normal   NotTriggerScaleUp  3m26s (x8555 over 23h)  cluster-autoscaler
>>   pod didn't trigger scale-up: 1 max node group size reached
>>
>> The WARN suggests that the "default scheduler" is being used. Why is that
>> the case even though the adaptive scheduler is configured to be used?
>>
>> Appreciate it if you can shed some light on why this could be happening.
>>
>> Thanks
>> Chetas
>>
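For reference, a minimal sketch of how the two settings suggested above
might be applied. Assumptions: the resource-view key is an operator-level
setting (per FLINK-33771) and belongs in the operator's own configuration
rather than the job's (worth verifying against the operator docs), while
the vertex limit can go into the FlinkDeployment's flinkConfiguration; the
limit value itself is only illustrative:

    # Operator configuration (e.g. the operator's defaultConfiguration):
    # let the autoscaler refresh its view of free cluster resources.
    kubernetes.operator.cluster.resource-view.refresh-interval: 5 min

    # Per job, in the FlinkDeployment's spec.flinkConfiguration:
    # cap scaling decisions so the TM count stays within the node group
    # (illustrative value).
    job.autoscaler.vertex.max-parallelism: "24"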