[PR] [FLINK-33764] Track Heap usage and GC pressure to avoid unnecessary scaling [flink-kubernetes-operator]

via GitHub Fri, 08 Dec 2023 06:00:21 -0800


gyfora opened a new pull request, #726:
URL: https://github.com/apache/flink-kubernetes-operator/pull/726


   ## What is the purpose of the change
   
   The autoscaler currently doesn't use any GC/HEAP metrics as part of the 
scaling decisions.
   While the long-term goal may be to support vertical scaling (increasing TM 
sizes), this is currently out of scope for the autoscaler.
   
   However, it is very important to detect cases where the throughput of certain 
vertices or the entire pipeline is critically affected by long GC pauses. In 
these cases the current autoscaler logic would wrongly assume a low true 
processing rate and scale the pipeline up too far, ramping up costs and causing 
further issues.
   
   Using the improved GC metrics introduced in 
https://issues.apache.org/jira/browse/FLINK-33318, we should measure the GC 
pauses and block scaling decisions if the pipeline spends too much time 
garbage collecting. In that case we should also notify the user that the 
required action is to increase memory.
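
   The check described above can be sketched roughly as follows. This is a 
minimal illustration, not the code in this PR: the class `GcPressureGate`, its 
method names, and the threshold parameter are hypothetical. It also assumes 
that GC time is reported per TM as milliseconds of GC per second of wall-clock 
time, and that the worst TM decides the outcome.

```java
import java.util.List;

/** Hypothetical sketch of a GC-pressure gate; names are illustrative only. */
final class GcPressureGate {

    private GcPressureGate() {}

    /** Converts milliseconds of GC per second into a fraction of wall-clock time. */
    public static double gcPressure(double gcTimeMsPerSecond) {
        return gcTimeMsPerSecond / 1000.0;
    }

    /**
     * Returns true if scaling should be blocked, i.e. if the TM with the
     * highest GC pressure exceeds the given threshold (assumed policy).
     */
    public static boolean shouldBlockScaling(
            List<Double> gcTimeMsPerSecondPerTm, double gcPressureThreshold) {
        double worst = 0.0;
        for (double value : gcTimeMsPerSecondPerTm) {
            worst = Math.max(worst, gcPressure(value));
        }
        return worst > gcPressureThreshold;
    }
}
```

   With a threshold of 0.3, a TM spending 400 ms per second in GC would block 
scaling, while TMs at 50 ms and 100 ms per second would not.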
   
   *This feature requires Flink 1.19, or an earlier version with the commit 
backported*
   
   ## Brief change log
   
     - *Introduce TM-level metrics for the autoscaler and track HEAP/GC usage*
     - *Trigger an event and block scaling if GC is above the threshold*
     - *Tests*
   
   ## Verifying this change
   
   Unit tests + manual validation in various envs.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., any changes to the `CustomResourceDescriptors`: 
no
     - Core observer or reconciler logic that is regularly executed: yes
   
   ## Documentation
   
     - Does this pull request introduce a new feature? yes
     - If yes, how is the feature documented? docs [TODO]
   


