gyfora opened a new pull request, #726: URL: https://github.com/apache/flink-kubernetes-operator/pull/726
## What is the purpose of the change

The autoscaler currently doesn't use any GC/heap metrics as part of its scaling decisions. While the long-term goal may be to support vertical scaling (increasing TaskManager (TM) sizes), this is currently out of scope for the autoscaler.

However, it is very important to detect cases where the throughput of certain vertices, or of the entire pipeline, is critically affected by long GC pauses. In these cases the current autoscaler logic would wrongly assume a low true processing rate and scale the pipeline up too far, ramping up costs and causing further issues.

Using the improved GC metrics introduced in https://issues.apache.org/jira/browse/FLINK-33318, we should measure the GC pauses, block scaling decisions if the pipeline spends too much time garbage collecting, and notify the user that they need to increase memory.

*This feature requires Flink 1.19 or the commit backported to earlier versions*

## Brief change log

- *Introduce TM-level metrics for the autoscaler and track heap/GC usage*
- *Trigger an event and block scaling if GC is above the threshold*
- *Tests*

## Verifying this change

Unit tests + manual validation in various environments.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., are there any changes to the `CustomResourceDescriptors`: no
- Core observer or reconciler logic that is regularly executed: yes

## Documentation

- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented? docs [TODO]
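
For illustration only, the blocking rule described above could be sketched roughly as follows. This is not the code in this PR: the class `GcPressureGate`, its method names, the choice to look at the worst TaskManager, and the threshold value are all hypothetical. The input is assumed to be each TM's GC time in milliseconds per second of wall-clock time.

```java
import java.util.List;

/**
 * Hypothetical sketch (not the operator's implementation): decide whether
 * scaling should be blocked because TaskManagers spend too much time in GC.
 */
final class GcPressureGate {

    /** Illustrative threshold: maximum tolerated fraction of wall-clock time spent in GC. */
    private final double maxGcFraction;

    GcPressureGate(double maxGcFraction) {
        if (maxGcFraction <= 0.0 || maxGcFraction > 1.0) {
            throw new IllegalArgumentException("maxGcFraction must be in (0, 1]");
        }
        this.maxGcFraction = maxGcFraction;
    }

    /** Converts GC time in ms per second of wall-clock time into a fraction in [0, 1]. */
    static double gcFraction(double gcTimeMsPerSecond) {
        return gcTimeMsPerSecond / 1000.0;
    }

    /**
     * Returns true if the most GC-bound TaskManager exceeds the threshold.
     * Using the maximum across TMs is an assumption of this sketch.
     */
    boolean shouldBlockScaling(List<Double> gcTimeMsPerSecondPerTm) {
        double worst = 0.0;
        for (double value : gcTimeMsPerSecondPerTm) {
            worst = Math.max(worst, gcFraction(value));
        }
        return worst > maxGcFraction;
    }
}
```

With a threshold of `0.3`, TMs reporting 50 and 120 ms/s of GC time would not block scaling, while a TM reporting 450 ms/s would. In that case the caller would skip the scaling decision and emit an event asking the user to increase memory.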