[PR] [FLINK-34266] Compute Autoscaler metrics correctly over metric window [flink-kubernetes-operator]

via GitHub Tue, 13 Feb 2024 04:48:40 -0800


gyfora opened a new pull request, #774:
URL: https://github.com/apache/flink-kubernetes-operator/pull/774


   ## What is the purpose of the change
   
   This PR aims to cover both 
[FLINK-34213](https://issues.apache.org/jira/browse/FLINK-34213) and 
[FLINK-34266](https://issues.apache.org/jira/browse/FLINK-34266).
   
   Currently, all metric tracking for input/output data rates, busyTime, etc. 
happens based on perSecond metrics in Flink. Depending on the frequency of the 
autoscaler metric collection and the exact processing pattern of the 
application, perSecond metrics can result in completely erroneous autoscaler 
metric computations.
   
   The most important example of this is handling large windowed 
computations or other burst loads where we have spikes in data rates after 
periods of inactivity. The autoscaler currently breaks down completely for 
these use cases: the jobs are generally scaled too low because incoming data 
is not measured correctly.
   
   To solve this, we move from perSecond in/out metrics to the accumulated 
counts, which allows us to measure correct input/output rates and output ratios 
over the entire metric window. To do this, we have to revise the metric 
collection logic, as some information is not exposed as a metric but is 
available directly through the job details query we already run to track 
topology changes.
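   
   To illustrate the difference, here is a minimal sketch (not code from this 
PR). The class and method names are hypothetical, and it assumes the sampled 
perSecond values all happen to fall into quiet periods between bursts:
   
```java
import java.util.List;

public class BurstRateSketch {

    /** Averages sampled perSecond values (the approach this PR moves away from). */
    public static double rateFromPerSecondSamples(List<Double> sampledRates) {
        return sampledRates.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(Double.NaN);
    }

    /** Derives the rate from accumulated record counts at two points in time. */
    public static double rateFromAccumulatedCounts(
            long firstCount, long lastCount, long firstTsMs, long lastTsMs) {
        double elapsedSeconds = (lastTsMs - firstTsMs) / 1000.0;
        return (lastCount - firstCount) / elapsedSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical bursty job: 60,000 records arrive in short bursts during a
        // 120 second window, but every sample is taken while the job is idle.
        List<Double> samples = List.of(0.0, 0.0, 0.0);
        System.out.println(rateFromPerSecondSamples(samples)); // prints 0.0

        // The accumulated counter still reflects all records seen in the window.
        System.out.println(
                rateFromAccumulatedCounts(1_000_000L, 1_060_000L, 0L, 120_000L)); // prints 500.0
    }
}
```
   
   The sampled average reports no load at all, while the counter delta 
recovers the average rate over the window regardless of when the samples were 
taken.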
   
   Summary of changes:
   
    - Collect accumulated input/output record counts and accumulated busy time 
from the JobDetailsInfo REST request instead of metrics
    - Remove TrueProcessRate and OutputRatios from the collected metrics and 
move the computation to the evaluation phase (this reduces the amount of stored 
metrics in the store as well)
    - Require at least 2 observations for metric evaluation as needed for rate 
computation from cumulative metrics
    - Improve and simplify JobTopology structure to allow incorporating io 
metrics 
    - As TPR and some other metrics are now only evaluated for the entire 
metric window, current values will no longer be reported during evaluation 
(this is relevant for reported metrics) 
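   
   A rough sketch of what window-level evaluation from cumulative values could 
look like is shown below. This is not the code in this PR: the `Observation` 
and `WindowMetrics` types, their fields, and the formula for the true 
processing rate (records in divided by accumulated busy seconds) are 
assumptions made for illustration.
   
```java
import java.util.List;
import java.util.Optional;

public class WindowEvaluationSketch {

    /** Hypothetical snapshot of the accumulated values for one vertex. */
    public static final class Observation {
        public final long timestampMs;
        public final long numRecordsIn;
        public final long numRecordsOut;
        public final long accumulatedBusyTimeMs;

        public Observation(long timestampMs, long numRecordsIn, long numRecordsOut,
                long accumulatedBusyTimeMs) {
            this.timestampMs = timestampMs;
            this.numRecordsIn = numRecordsIn;
            this.numRecordsOut = numRecordsOut;
            this.accumulatedBusyTimeMs = accumulatedBusyTimeMs;
        }
    }

    /** Hypothetical result of evaluating one metric window. */
    public static final class WindowMetrics {
        public final double inputRatePerSec;
        public final double outputRatio;
        public final double trueProcessingRatePerSec;

        public WindowMetrics(double inputRatePerSec, double outputRatio,
                double trueProcessingRatePerSec) {
            this.inputRatePerSec = inputRatePerSec;
            this.outputRatio = outputRatio;
            this.trueProcessingRatePerSec = trueProcessingRatePerSec;
        }
    }

    /** Returns empty if fewer than 2 observations exist, since a rate needs a delta. */
    public static Optional<WindowMetrics> evaluate(List<Observation> window) {
        if (window.size() < 2) {
            return Optional.empty();
        }
        Observation first = window.get(0);
        Observation last = window.get(window.size() - 1);

        double seconds = (last.timestampMs - first.timestampMs) / 1000.0;
        if (seconds <= 0) {
            return Optional.empty();
        }
        long in = last.numRecordsIn - first.numRecordsIn;
        long out = last.numRecordsOut - first.numRecordsOut;
        long busyMs = last.accumulatedBusyTimeMs - first.accumulatedBusyTimeMs;

        double inputRate = in / seconds;
        // Ratios are undefined when nothing was consumed or the vertex was never busy.
        double outputRatio = in == 0 ? Double.NaN : (double) out / in;
        double tpr = busyMs == 0 ? Double.NaN : in / (busyMs / 1000.0);
        return Optional.of(new WindowMetrics(inputRate, outputRatio, tpr));
    }

    public static void main(String[] args) {
        List<Observation> window = List.of(
                new Observation(0L, 0L, 0L, 0L),
                new Observation(60_000L, 6_000L, 3_000L, 30_000L));
        WindowMetrics m = evaluate(window).orElseThrow();
        // prints 100.0 0.5 200.0
        System.out.println(
                m.inputRatePerSec + " " + m.outputRatio + " " + m.trueProcessingRatePerSec);
    }
}
```
   
   Only the first and last observations in the window are needed for these 
values, which is also why a single observation is not enough to evaluate.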
   
   The above logic changes required a redesign of many of the tests, especially 
the ones with somewhat complex logic.
   
   Test changes:
    - Move many tests from collection to evaluation phase and simplify as much 
as possible into smaller unit tests
    - Introduce helper classes to generate metrics for complex integration 
tests (MetricsCollectionAndEvaluationTest, BacklogBasedScalingTest, etc.)
    - Remove some duplicated or non-functional tests 
   
   ## Verifying this change
   
   Many tests, including unit tests, have changed. I tried to always preserve 
or extend the coverage.
   
   [TODO]: Extensive manual testing is still in progress
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., are there any changes to the 
`CustomResourceDescriptors`: no
     - Core observer or reconciler logic that is regularly executed: yes
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
