1996fanrui commented on PR #726: URL: https://github.com/apache/flink-kubernetes-operator/pull/726#issuecomment-1853275312
> I feel like the solution to this problem is more complex, but the current code is definitely part of the solution. I'm not sure stopping scaling on GC pressure / heap usage is always desirable. In a lot of scenarios, scaling up might resolve GC pressure / heap usage issues. But after merging this PR, we might get stuck in a bad state. Users do not typically monitor their pipeline events that closely.
>
> If we discover heap / GC pressure, it looks like we want to let the user know, scale up to solve the issue, and block scaling to a lower parallelism. Not allowing scaling might actually make the situation worse.

That makes sense! Could we extract the common logic related to unhealthiness? It would consist of 2 parts:

- How do we check whether a job is unhealthy? For example, severe GC or high memory usage are types of unhealthiness, and we can introduce more types in the future.
- What does the autoscaler do when the job is unhealthy? In this PR, our strategy is to stop scaling; in the future we could revert to the last scaling in the history or scale up instead.

Both parts can continue to be improved in follow-ups.

About the switch: I prefer to disable it in this PR, and we can enable individual switches once we think they are ready for massive production. (At least, disabling the memory-usage check in this PR is fine for me; the reason can be found at https://github.com/apache/flink-kubernetes-operator/pull/726#discussion_r1423346604.) We don't need to introduce a new option to disable it; `Double.NaN` can be used. When the threshold is NaN, we disable the corresponding switch. WDYT?
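To make the `Double.NaN` idea concrete, here is a minimal sketch (hypothetical class and field names, not the actual operator code) of a threshold check where a NaN value simply disables the corresponding switch, so no extra "enabled" option is needed:

```java
/**
 * Minimal sketch of a NaN-as-disabled threshold switch.
 * The class and field names are illustrative only.
 */
public final class GcPressureCheck {

    /** Hypothetical threshold; Double.NaN means the check is disabled. */
    private final double gcPressureThreshold;

    public GcPressureCheck(double gcPressureThreshold) {
        this.gcPressureThreshold = gcPressureThreshold;
    }

    /** Returns true if the job should be treated as unhealthy due to GC pressure. */
    public boolean isUnhealthy(double observedGcPressure) {
        // Any comparison against NaN is false, so NaN would already behave as
        // "never unhealthy"; the explicit isNaN check just makes the intent clear.
        if (Double.isNaN(gcPressureThreshold)) {
            return false;
        }
        return observedGcPressure > gcPressureThreshold;
    }

    public static void main(String[] args) {
        GcPressureCheck disabled = new GcPressureCheck(Double.NaN);
        GcPressureCheck enabled = new GcPressureCheck(0.5);

        System.out.println(disabled.isUnhealthy(0.9)); // false: switch disabled
        System.out.println(enabled.isUnhealthy(0.9));  // true: above threshold
    }
}
```

The nice property of this sentinel is that NaN is already outside the meaningful value range for a ratio-style threshold, so the default can be "disabled" without adding a separate boolean option.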