1996fanrui commented on PR #726: URL: https://github.com/apache/flink-kubernetes-operator/pull/726#issuecomment-1853275312
> I feel like the solution to this problem is more complex, but the current code is definitely part of the solution. I'm not sure stopping scaling on GC pressure / heap usage is always desirable. In a lot of scenarios, scaling up might resolve GC pressure / heap usage issues. But after merging this PR, we might get stuck in a bad state. Users do not typically monitor their pipeline events that closely.
>
> If we discover heap / GC pressure, it looks like we want to let the user know, scale up to solve the issue, and block scaling to a lower parallelism. Not allowing scaling might actually make the situation worse.

That makes sense! Could we extract the common logic related to unhealthiness? It would consist of 2 parts:

- How do we check whether a job is unhealthy? For example, severe GC or high memory usage are types of unhealthiness, and we can introduce more types in the future.
- What does the autoscaler do when the job is unhealthy? In this PR, our strategy is to stop scaling; in the future we could revert to the last scaling in the history or scale up instead.

Both parts can continue to be improved in follow-ups.

About the switch: I prefer to disable it in this PR, and we can enable individual switches once we think they are ready for massive production. (At least, disabling the memory-usage check in this PR is fine for me; the reason can be found at https://github.com/apache/flink-kubernetes-operator/pull/726#discussion_r1423346604.) We don't need to introduce a new option to disable it; `Double.NaN` can be used. When the threshold is NaN, we disable the corresponding switch. WDYT?
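To make the `Double.NaN` idea concrete, here is a minimal sketch (hypothetical class and field names, not the actual operator code) of a threshold check where a NaN value simply disables the corresponding switch, so no extra "enabled" option is needed:

```java
/**
 * Minimal sketch of a NaN-as-disabled threshold switch.
 * The class and field names are illustrative only.
 */
public final class GcPressureCheck {

    /** Hypothetical threshold; Double.NaN means the check is disabled. */
    private final double gcPressureThreshold;

    public GcPressureCheck(double gcPressureThreshold) {
        this.gcPressureThreshold = gcPressureThreshold;
    }

    /** Returns true if the job should be treated as unhealthy due to GC pressure. */
    public boolean isUnhealthy(double observedGcPressure) {
        // Any comparison against NaN is false, so NaN would already behave as
        // "never unhealthy"; the explicit isNaN check just makes the intent clear.
        if (Double.isNaN(gcPressureThreshold)) {
            return false;
        }
        return observedGcPressure > gcPressureThreshold;
    }

    public static void main(String[] args) {
        GcPressureCheck disabled = new GcPressureCheck(Double.NaN);
        GcPressureCheck enabled = new GcPressureCheck(0.5);

        System.out.println(disabled.isUnhealthy(0.9)); // false: switch disabled
        System.out.println(enabled.isUnhealthy(0.9));  // true: above threshold
    }
}
```

The nice property of this sentinel is that NaN is already outside the meaningful value range for a ratio-style threshold, so the default can be "disabled" without adding a separate boolean option.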