Hi Chetas,

The operator logic itself would normally call the rescale API during the
upgrade process, not the autoscaler module. The autoscaler module sets the
correct config with the parallelism overrides, and then the operator
performs the regular upgrade cycle (as when you change something in the
spec yourself). If only the parallelism overrides change, it will use the
rescale API; otherwise, a full upgrade is triggered.
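
A rough Python sketch of that decision rule, for illustration only. The
function name, the flat dict view of the config and the return labels are
hypothetical simplifications, not the operator's actual code, which compares
the whole resource spec and may check further conditions before choosing
in-place rescaling.

```python
# Hypothetical sketch: decide between in-place rescale and full upgrade.
# Only the config key name comes from Flink; everything else is illustrative.
OVERRIDES_KEY = "pipeline.jobvertex-parallelism-overrides"


def choose_upgrade_mode(old_conf: dict[str, str], new_conf: dict[str, str]) -> str:
    """Return "none", "rescale" or "full-upgrade" for a config change."""
    changed = {
        key
        for key in old_conf.keys() | new_conf.keys()
        if old_conf.get(key) != new_conf.get(key)
    }
    if not changed:
        return "none"
    # Only the parallelism overrides differ: the in-place rescale path applies.
    if changed == {OVERRIDES_KEY}:
        return "rescale"
    # Anything else changed as well: fall back to a full upgrade.
    return "full-upgrade"


if __name__ == "__main__":
    before = {"jobmanager.scheduler": "adaptive", OVERRIDES_KEY: "a:2,b:2"}
    after_scaling = {**before, OVERRIDES_KEY: "a:1,b:1"}
    after_other_edit = {**after_scaling, "taskmanager.numberOfTaskSlots": "4"}
    print(choose_upgrade_mode(before, after_scaling))
    print(choose_upgrade_mode(before, after_other_edit))
```

So if the autoscaler's change arrives together with any other spec change,
a sketch like this (and, per the description above, the operator) would end
up on the full-upgrade path.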

Can you share the entire resource YAML and the operator logs related to the
upgrade (after the scaling was triggered)? The logs usually show why
in-place scaling wasn't used in a particular case. You can also debug
in-place scaling on its own by completely disabling the autoscaler and
manually setting pipeline.jobvertex-parallelism-overrides in the Flink
config.
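
For that manual test, pipeline.jobvertex-parallelism-overrides is a
map-typed Flink option, written as comma-separated vertexId:parallelism
pairs. In a FlinkDeployment it would go under spec.flinkConfiguration,
together with job.autoscaler.enabled: "false" to switch the autoscaler off.
The small Python helpers below are hypothetical (not part of Flink or the
operator) and only show how such a value could be built and read back; the
vertex IDs are the masked placeholders from the scaling report below and
must be replaced with the job's real vertex IDs.

```python
# Hypothetical helpers for the value of pipeline.jobvertex-parallelism-overrides,
# which uses Flink's map syntax "key1:value1,key2:value2".


def format_overrides(overrides: dict[str, int]) -> str:
    """Render {vertexId: parallelism} as a Flink map-typed config value."""
    for vertex_id, parallelism in overrides.items():
        if ":" in vertex_id or "," in vertex_id:
            raise ValueError(f"unexpected separator in vertex id {vertex_id!r}")
        if parallelism < 1:
            raise ValueError(f"parallelism must be >= 1, got {parallelism}")
    return ",".join(f"{vid}:{p}" for vid, p in overrides.items())


def parse_overrides(value: str) -> dict[str, int]:
    """Inverse of format_overrides, e.g. to check what ended up in the ConfigMap."""
    result: dict[str, int] = {}
    for entry in filter(None, value.split(",")):
        vertex_id, parallelism = entry.split(":", 1)
        result[vertex_id.strip()] = int(parallelism)
    return result


if __name__ == "__main__":
    # Masked placeholder IDs copied from the scaling report; not real vertex IDs.
    wanted = {"xxxxxxxx": 1, "yyyyyyyy": 1, "zzzzzzzz": 1, "wwwwwwwww": 1}
    value = format_overrides(wanted)
    print(f"pipeline.jobvertex-parallelism-overrides: {value}")
    assert parse_overrides(value) == wanted
```

If changing only this key (with the autoscaler disabled) still causes a full
restart, the problem is in the in-place scaling path rather than in the
autoscaler.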

Cheers,
Gyula

On Thu, May 2, 2024 at 3:49 AM Chetas Joshi <chetas.jo...@gmail.com> wrote:

> Hello,
>
> We recently upgraded the operator to 1.8.0 to leverage the new autoscaling
> features (
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.8/docs/custom-resource/autoscaler/).
> The FlinkDeployment (application cluster) is set to Flink v1_18 as well. I
> can see the following event being reported in the operator logs.
>
> o.a.f.k.o.l.AuditUtils         [INFO ][flink/devpipeline] >>> Event  |
> Info    | SCALINGREPORT   | Scaling execution enabled, begin scaling
> vertices:{ Vertex ID xxxxxxxx | Parallelism 2 -> 1 | Processing capacity
> Infinity -> Infinity | Target data rate 7.85}{ Vertex ID yyyyyyyy |
> Parallelism 2 -> 1 | Processing capacity Infinity -> Infinity | Target data
> rate 0.00}{ Vertex ID zzzzzzzz | Parallelism 2 -> 1 | Processing capacity
> Infinity -> Infinity | Target data rate 7.85}{ Vertex ID wwwwwwwww |
> Parallelism 2 -> 1 | Processing capacity 33235.72 -> 13294.29 | Target data
> rate 6.65}
>
> But the in-place autoscaling is not getting triggered. My understanding is
> that the autoscaler running within the k8s-operator should call the rescale
> API endpoint of the FlinkDeployment (devpipeline) with a parallelism
> overrides map (vertexId => parallelism), and that this should trigger a
> redeploy of the jobGraph. But that is not happening. Instead, the
> FlinkDeployment is restarted and the overrides map (vertexId =>
> parallelism) is set in the ConfigMap resource that stores the flink-config.
>
> Am I missing something? How do I debug this further?
>
> Here is the flink-config set within the k8s-operator.
>
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.target.utilization: 0.6
> job.autoscaler.target.utilization.boundary: 0.2
> pipeline.max-parallelism: 180
> jobmanager.scheduler: adaptive
>
>
> Thank you
> Chetas
>
