Hi Chetas,

The operator logic itself, not the autoscaler module, would normally call the rescale API during the upgrade process. The autoscaler module sets the correct config with the parallelism overrides, and the operator then performs the regular upgrade cycle (the same as when you change something in the spec yourself). If only the parallelism overrides change, it will use the rescale API; otherwise a full upgrade is triggered.
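
As a rough illustration of that decision, here is a simplified Python sketch. It is not the operator's actual code: the function names are made up for this example, and it compares only the flink config map, whereas the operator looks at the whole spec.

```python
# Illustrative sketch only; NOT the operator's real implementation.
# choose_upgrade_path and changed_keys are hypothetical names.

OVERRIDES_KEY = "pipeline.jobvertex-parallelism-overrides"


def changed_keys(old_conf: dict, new_conf: dict) -> set:
    """Return the config keys whose values differ between the two maps."""
    keys = set(old_conf) | set(new_conf)
    return {k for k in keys if old_conf.get(k) != new_conf.get(k)}


def choose_upgrade_path(old_conf: dict, new_conf: dict) -> str:
    """Pick the path described above from a config diff."""
    diff = changed_keys(old_conf, new_conf)
    if not diff:
        return "no-op"
    if diff == {OVERRIDES_KEY}:
        # Only the parallelism overrides changed: in-place rescale.
        return "in-place-rescale"
    # Anything else changed (possibly alongside the overrides): full upgrade.
    return "full-upgrade"


base = {
    "job.autoscaler.target.utilization": "0.6",
    "pipeline.max-parallelism": "180",
    "jobmanager.scheduler": "adaptive",
}
# Placeholder vertex ID; the vertexId:parallelism value format is assumed.
scaled = {**base, OVERRIDES_KEY: "xxxxxxxx:1"}
scaled_and_more = {**scaled, "pipeline.max-parallelism": "120"}

print(choose_upgrade_path(base, scaled))           # overrides only
print(choose_upgrade_path(base, scaled_and_more))  # overrides plus another key
```

If the logs show a full upgrade even though you expected the first case, the spec diff reported there should show which other field changed.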
Can you share the entire resource yaml and the logs from the operator related to the upgrade (after the scaling was triggered)? You can usually see from the logs why the in-place scaling wasn't used in a particular case.

You can debug in-place scaling itself by completely disabling the autoscaler and manually setting pipeline.jobvertex-parallelism-overrides in the flink config.

Cheers,
Gyula

On Thu, May 2, 2024 at 3:49 AM Chetas Joshi <chetas.jo...@gmail.com> wrote:

> Hello,
>
> We recently upgraded the operator to 1.8.0 to leverage the new autoscaling
> features (
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.8/docs/custom-resource/autoscaler/).
> The FlinkDeployment (application cluster) is set to flink v1_18 as well. I
> am able to observe the following event being reported in the logs of the
> operator.
>
> o.a.f.k.o.l.AuditUtils [INFO ][flink/devpipeline] >>> Event | Info | SCALINGREPORT | Scaling execution enabled, begin scaling vertices:
> { Vertex ID xxxxxxxx | Parallelism 2 -> 1 | Processing capacity Infinity -> Infinity | Target data rate 7.85}
> { Vertex ID yyyyyyyy | Parallelism 2 -> 1 | Processing capacity Infinity -> Infinity | Target data rate 0.00}
> { Vertex ID zzzzzzzz | Parallelism 2 -> 1 | Processing capacity Infinity -> Infinity | Target data rate 7.85}
> { Vertex ID wwwwwwwww | Parallelism 2 -> 1 | Processing capacity 33235.72 -> 13294.29 | Target data rate 6.65}
>
> But the in-place autoscaling is not getting triggered. My understanding is
> that the autoscaler running within the k8s-operator should call the rescale
> api endpoint of the FlinkDeployment (devpipeline) with a parallelism
> overrides map (vertexId => parallelism) and that should trigger a redeploy
> of the jobGraph. But that is not happening. The restart of the
> FlinkDeployment overrides the map (vertexId => parallelism) in the
> configMap resource that stores the flink-config.
>
> Am I missing something? How do I debug this further?
>
> Here is the flink-config set within the k8s-operator.
>
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.target.utilization: 0.6
> job.autoscaler.target.utilization.boundary: 0.2
> pipeline.max-parallelism: 180
> jobmanager.scheduler: adaptive
>
> Thank you
> Chetas