gyfora opened a new pull request, #774: URL: https://github.com/apache/flink-kubernetes-operator/pull/774
## What is the purpose of the change

This PR aims to cover both [FLINK-34213](https://issues.apache.org/jira/browse/FLINK-34213) and [FLINK-34266](https://issues.apache.org/jira/browse/FLINK-34266).

Currently, all metric tracking for input/output data rates, busyTime, etc. is based on perSecond metrics in Flink. Depending on the frequency of the autoscaler metric collection and the exact processing pattern of the application, perSecond metrics can result in completely erroneous autoscaler metric computations.

The most important example is handling large windowed computations or other bursty loads, where data rates spike after periods of inactivity. These use cases currently break down completely with the autoscaler: jobs are generally scaled too low because incoming data is not measured correctly.

To solve this, we move from perSecond in/out metrics to accumulated counts, which allows us to measure correct input/output rates and output ratios over the entire metric window. To do this, we have to revise the metric collection logic, as some information is not exposed as a metric but is exposed directly through the job details query we already run to track topology changes.

Summary of changes:
- Collect accumulated input/output record counts and accumulated busy time from the JobDetailsInfo REST request instead of metrics
- Remove TrueProcessRate and OutputRatios from the collected metrics and move their computation to the evaluation phase (this also reduces the number of metrics kept in the store)
- Require at least 2 observations for metric evaluation, as needed for rate computation from cumulative metrics
- Improve and simplify the JobTopology structure to allow incorporating IO metrics
- TPR and some other metrics are now only evaluated for the entire metric window.
Current values will no longer be reported during evaluation (this is relevant for reported metrics)

The above logic changes required a redesign of many of the tests, especially the ones with somewhat complex logic.

Test changes:
- Move many tests from the collection phase to the evaluation phase and simplify them as much as possible into smaller unit tests
- Introduce helper classes to generate metrics for complex integration tests (MetricsCollectionAndEvaluationTest, BacklogBasedScalingTest, etc.)
- Remove some duplicated or non-functional tests

## Verifying this change

Many tests have changed; I tried to always preserve or extend the coverage.

[TODO]: Extensive manual testing still in progress

## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., are there any changes to the `CustomResourceDescriptors`: no
- Core observer or reconciler logic that is regularly executed: yes

## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented?
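
For readers unfamiliar with the approach, here is a minimal sketch of how a rate can be derived from accumulated counts over a metric window, and why at least 2 observations are needed. It is an illustration only, not the code in this PR: the class `CumulativeRateSketch` and the method `averageRatePerSecond` are hypothetical names, and the sketch assumes the counter never resets within the window.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.OptionalDouble;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical illustration; not part of the operator code base. */
public class CumulativeRateSketch {

    /**
     * Average rate (records per second) over the whole window, computed from the
     * first and last cumulative observations. Returns empty when fewer than two
     * observations exist or when no time has elapsed between them.
     */
    public static OptionalDouble averageRatePerSecond(SortedMap<Instant, Long> cumulativeCounts) {
        if (cumulativeCounts.size() < 2) {
            // A single cumulative value carries no rate information.
            return OptionalDouble.empty();
        }
        Instant first = cumulativeCounts.firstKey();
        Instant last = cumulativeCounts.lastKey();
        long elapsedMillis = Duration.between(first, last).toMillis();
        if (elapsedMillis <= 0) {
            return OptionalDouble.empty();
        }
        // Assumption: the counter is monotonic (no restart inside the window).
        long delta = cumulativeCounts.get(last) - cumulativeCounts.get(first);
        return OptionalDouble.of(delta * 1000.0 / elapsedMillis);
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2024-01-01T00:00:00Z");
        SortedMap<Instant, Long> counts = new TreeMap<>();
        // A burst of 6000 records lands between the first two samples,
        // followed by a minute of inactivity.
        counts.put(start, 0L);
        counts.put(start.plusSeconds(60), 6_000L);
        counts.put(start.plusSeconds(120), 6_000L);
        // 6000 records over 120 seconds, regardless of when the burst happened.
        System.out.println(averageRatePerSecond(counts));
    }
}
```

In this example a perSecond gauge sampled at the same instants could read close to zero at every sample and miss the burst entirely, whereas the difference of the accumulated counts still accounts for all 6000 records.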