Hey!

Let me first answer your questions and then hopefully offer an actual
solution :)

1. The adaptive scheduler would not reduce the vertex's desired parallelism
in this case, but it should still allow the job to start, depending on the
lower/upper bound resource config. There have been some changes in how the
k8s operator sets these resource requirements; in the latest 1.8.0 we only
set the upper bound, so the job can still start with a smaller parallelism.
So Flink will ultimately keep trying to schedule pods, but ideally the job
would also start/run in the meantime. I would look at the scheduler logs
(maybe at debug level) for more detail.

You can look at configs like:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-wait-timeout
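
As a rough sketch, something along these lines in the FlinkDeployment's
flinkConfiguration should make the adaptive scheduler give up waiting for
the full slot count and run with whatever it managed to acquire (the
timeout values below are just illustrative, tune them to your job):

jobmanager.scheduler: adaptive
# stop waiting for the full set of requested slots after 2 minutes
jobmanager.adaptive-scheduler.resource-wait-timeout: 2 min
# start once the acquired (but sufficient) resources have been stable for 30s
jobmanager.adaptive-scheduler.resource-stabilization-timeout: 30 s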

2. "default-scheduler" here refers to the Kubernetes pod scheduler, not
Flink's scheduler, so this is normal.

As for the solution to the problem: the thing to do is to make the
autoscaler aware of the resource limits in the first place, so that we don't
scale the job beyond what the cluster can accommodate. There has been some
work on autodetecting these limits, see
https://issues.apache.org/jira/browse/FLINK-33771

You can set:
kubernetes.operator.cluster.resource-view.refresh-interval: 5 min

to turn this on. Alternatively, a simpler approach would be to directly
limit the parallelism of the scaling decisions (see the sketch below):

job.autoscaler.vertex.max-parallelism
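
A minimal sketch of how either option could look in the FlinkDeployment's
flinkConfiguration (the interval and parallelism values are placeholders,
adjust them to your cluster; the resource-view option may also need to go
into the operator's own configuration rather than per-deployment):

spec:
  flinkConfiguration:
    job.autoscaler.enabled: "true"
    # let the operator periodically refresh its view of free cluster
    # resources so the autoscaler doesn't scale past what can be scheduled
    kubernetes.operator.cluster.resource-view.refresh-interval: 5 min
    # and/or hard-cap the per-vertex parallelism the autoscaler may pick
    job.autoscaler.vertex.max-parallelism: "24"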

Cheers,
Gyula

On Mon, May 6, 2024 at 8:09 AM Chetas Joshi <chetas.jo...@gmail.com> wrote:

> Hello,
>
> I am running a flink job in the application mode on k8s. It's deployed as
> a FlinkDeployment and its life-cycle is managed by the flink-k8s-operator.
> The autoscaler is being used with the following config
>
> job.autoscaler.enabled: true
> job.autoscaler.metrics.window: 5m
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.target.utilization: 0.6
> job.autoscaler.target.utilization.boundary: 0.2
> pipeline.max-parallelism: 60
> jobmanager.scheduler: adaptive
>
> During a scale-up event, the autoscaler increases the parallelism of one
> of the job vertex to a higher value. This triggers a bunch of new task
> managers to be scheduled on the EKS cluster (The node-group has an attached
> ASG). Out of all the requested TM pods only some get scheduled and then the
> cluster runs out of resources. The other TM pods remain in the "pending
> mode" indefinitely and the job is stuck in the "restart" loop forever.
>
> 1. Shouldn't the adaptive scheduler reduce the vertex parallelism due to
> the slots/TMs not being available?
> 2. When I looked at the pods stuck in the pending state, I found them to
> be reporting the following events:
>
>   Warning  FailedScheduling   4m55s (x287 over 23h)   default-scheduler
> 0/5 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match Pod's
> node affinity/selector, 3 Insufficient memory. preemption: 0/5 nodes are
> available: 1 Preemption is not helpful for scheduling, 4 No preemption
> victims found for incoming pod.
>
>   Normal   NotTriggerScaleUp  3m26s (x8555 over 23h)  cluster-autoscaler
> pod didn't trigger scale-up: 1 max node group size reached
>
> The WARN suggests that the "default scheduler" is being used. Why is that
> the case even though the adaptive scheduler is configured to be used?
>
> Appreciate it if you can shed some light on why this could be happening.
>
> Thanks
> Chetas
>
