Hi Gyula,

Thanks for getting back to me and explaining the difference in the
responsibilities of the autoscaler and the operator.

I figured out what the issue was.
Here is what I was trying to do: the autoscaler had initially down-scaled
the flinkDeployment (2 -> 1), so pipeline.jobvertex-parallelism-overrides
was set to 1 in the flink config associated with the flinkDeployment. I
manually deleted this entry from the config map and restarted the
flinkDeployment, so the pipeline came back up with the default parallelism
of 2. I was then expecting the autoscaler to down-scale it again, but that
did not happen, because the autoscaler keeps its overrides in a separate
configMap called "autoscaler-<deployment_name>", which had not changed.
Based on the collected metrics the autoscaler did recognize that the job
needed to be down-scaled, but it saw no need to update the parallelism
override in the "autoscaler-<deployment_name>" configMap because that was
still set to the old value of 1, and hence the operator never called the
rescale api.

Since the restart of the flinkDeployment was triggered manually and not by
the operator, the config managed by the autoscaler went out of sync with
the actual flink config of the flinkDeployment, which is what caused the
issue.
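
For anyone who hits the same thing: the mismatch is easy to see by
comparing the two configMaps side by side. A rough sketch of what I had
(resource names are from my setup; the data key and its serialization in
the autoscaler state configMap are from memory and may differ between
operator versions):

  # Autoscaler state kept by the operator - still held the old override of 1
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: autoscaler-devpipeline
    namespace: flink
  data:
    parallelismOverrides: "xxxxxxxx:1,yyyyyyyy:1,zzzzzzzz:1,wwwwwwwww:1"
  ---
  # Flink config the job was actually running with after my manual edit and
  # restart (native Kubernetes typically names it flink-config-<deployment_name>);
  # pipeline.jobvertex-parallelism-overrides was gone, so default parallelism 2
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: flink-config-devpipeline
    namespace: flink
  data:
    flink-conf.yaml: |
      # (trimmed) no pipeline.jobvertex-parallelism-overrides entry here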

Cheers,
Chetas

On Wed, May 1, 2024 at 10:21 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi Chetas,
>
> The operator logic itself would normally call the rescale api during the
> upgrade process, not the autoscaler module. The autoscaler module sets the
> correct config with the parallelism overrides, and then the operator
> performs the regular upgrade cycle (as when you yourself change something
> in the spec). If only the parallelism overrides change then it will use the
> rescale api, otherwise a full upgrade is triggered.
>
> Can you share the entire resource yaml and the logs from the operator
> related to the upgrade (after the scaling was triggered)? You can usually
> see from the logs why the in-place scaling wasn't used in a particular case.
> You can debug in-place scaling itself by completely disabling the
> autoscaler and manually setting pipeline.jobvertex-parallelism-overrides in
> the flink config.
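>
> For example, something along these lines in the FlinkDeployment spec should
> exercise the in-place path on its own (untested sketch; the vertex IDs below
> are just the placeholders from your log, use the real ones from the job
> graph):
>
>   spec:
>     flinkConfiguration:
>       jobmanager.scheduler: adaptive    # needed for in-place rescaling, you already set this
>       job.autoscaler.enabled: "false"
>       pipeline.jobvertex-parallelism-overrides: "xxxxxxxx:1,yyyyyyyy:1"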
>
> Cheers,
> Gyula
>
> On Thu, May 2, 2024 at 3:49 AM Chetas Joshi <chetas.jo...@gmail.com>
> wrote:
>
>> Hello,
>>
>> We recently upgraded the operator to 1.8.0 to leverage the new
>> autoscaling features (
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.8/docs/custom-resource/autoscaler/).
>> The FlinkDeployment (application cluster) is set to flink v1_18 as well. I
>> am able to observe the following event being reported in the logs of the
>> operator.
>>
>> o.a.f.k.o.l.AuditUtils [INFO ][flink/devpipeline] >>> Event | Info |
>> SCALINGREPORT | Scaling execution enabled, begin scaling vertices:
>> { Vertex ID xxxxxxxx | Parallelism 2 -> 1 | Processing capacity Infinity -> Infinity | Target data rate 7.85}
>> { Vertex ID yyyyyyyy | Parallelism 2 -> 1 | Processing capacity Infinity -> Infinity | Target data rate 0.00}
>> { Vertex ID zzzzzzzz | Parallelism 2 -> 1 | Processing capacity Infinity -> Infinity | Target data rate 7.85}
>> { Vertex ID wwwwwwwww | Parallelism 2 -> 1 | Processing capacity 33235.72 -> 13294.29 | Target data rate 6.65}
>>
>> But the in-place autoscaling is not getting triggered. My understanding
>> is that the autoscaler running within the k8s-operator should call the
>> rescale api endpoint of the FlinkDeployment (devpipeline) with a
>> parallelism-overrides map (vertexId => parallelism), and that should
>> trigger a redeploy of the jobGraph. But that is not happening. The restart
>> of the FlinkDeployment overrides the map (vertexId => parallelism) in the
>> configMap resource that stores the flink-config.
>>
>> Am I missing something? How do I debug this further?
>>
>> Here is the flink-config set within the k8s-operator.
>>
>> job.autoscaler.stabilization.interval: 1m
>> job.autoscaler.target.utilization: 0.6
>> job.autoscaler.target.utilization.boundary: 0.2
>> pipeline.max-parallelism: 180
>> jobmanager.scheduler: adaptive
>>
>>
>> Thank you
>> Chetas
>>
>
