[ https://issues.apache.org/jira/browse/FLINK-36018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17872402#comment-17872402 ]
Gyula Fora commented on FLINK-36018:
------------------------------------

Hey Guys,

I completely agree with what was said above. It makes sense to introduce a new config, maybe something like `job.autoscaler.scale-down.minimum-interval`, which has a more straightforward meaning than the word "lazy".

Also, we probably don't need new timestamps; we can just use the scaling history that is already present.

> Support lazy scale down to avoid frequent rescaling
> ---------------------------------------------------
>
>                 Key: FLINK-36018
>                 URL: https://issues.apache.org/jira/browse/FLINK-36018
>             Project: Flink
>          Issue Type: Improvement
>          Components: Autoscaler
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>
> {*}{color:#de350b}Core idea{color}{*}: Make scaling up sensitive to prevent lags, and make scaling down insensitive to reduce restart frequency.
> h1. Background & Motivation
> We enabled autoscaler scaling for a few Flink production jobs. It works with the Adaptive Scheduler and the Rescale API.
> Scaling results:
> * The recommended parallelism meets expectations most of the time.
> * When the source traffic increases, the autoscaler scales up the job in time to prevent lags.
> * When the source traffic decreases, the autoscaler scales down the job in time to save resources.
> * {color:#de350b}*Pain point:*{color} Each job rescales more than 20 times a day (job.autoscaler.metrics.window=15 min by default).
> As we all know, the job is unavailable for a while during a restart, for several reasons:
> * Cancel job
> * Request resources ([FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states] is optimizing it)
> * Initialize task
> * Restore state
> * Catch up lag during restart
> * etc.
> *{color:#de350b}Expectations:{color}*
> * Scaling up in time to prevent lags.
> * Lazy scaling down to reduce downtime, while ensuring resources can still be released later.
> h1. Solution:
> Introduce job.autoscaler.scale-down.lazy-period; the default value could be 30 min.
> Detailed strategies:
> * Record the start time of the first scale-down event for each vertex separately. For example:
> ** vertex1: 2024-08-09 01:35:02
> ** vertex2: 2024-08-09 01:38:02
> * Scaling down will be triggered in the following cases:
> ** Any vertex needs to scale up
> *** The job restart cannot be avoided, so trigger the scale down for other vertices as well if needed.
> *** After the scale down, clean up the start time of scale-down.
> ** The scale-down lazy period has elapsed for any vertex
> *** current time - min(start time for each vertex) > scale-down.lazy-period
> *** This means that there was no scaling up during the scale-down lazy period.
> Note1: If the recommended parallelism >= current parallelism, the start time of scale-down will be cleaned up for the current vertex.
> Note2: The recommended parallelism still comes from the latest 15-minute metrics. For example:
> * The current parallelism of vertex1 is 100, and the traffic decreases at night.
> * 2024-08-09 01:00:00, the recommended parallelism is 60.
> ** The start time of scale down is 2024-08-09 01:00:00.
> * 2024-08-09 01:15:00, the recommended parallelism is 50.
> ** Still within the scale-down lazy period.
> ** Don't update the start time of scale down.
> * 2024-08-09 01:31:00, the recommended parallelism is 40.
> ** Outside of scale-down.lazy-period, so trigger the rescale and use 40 as the recommended parallelism.
> ** The job.autoscaler.metrics.window is 15 min, so metrics from 2024-08-09 01:16:00 to 2024-08-09 01:31:00 are used.
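
For illustration only, the trigger rules in the quoted description could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the autoscaler's actual code: the class `LazyScaleDownSketch`, its method `shouldRescale`, and the use of plain string vertex ids are all hypothetical, and it keeps its own per-vertex timestamps as the description proposes rather than reading them from the scaling history.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the lazy scale-down rule; not Flink autoscaler code. */
final class LazyScaleDownSketch {

    // Corresponds to the proposed job.autoscaler.scale-down.lazy-period.
    private final Duration lazyPeriod;

    // Start time of the first pending scale-down recommendation, per vertex id.
    private final Map<String, Instant> scaleDownStart = new HashMap<>();

    LazyScaleDownSketch(Duration lazyPeriod) {
        this.lazyPeriod = lazyPeriod;
    }

    /**
     * Decides whether a rescale should be triggered at {@code now}.
     *
     * @param current current parallelism per vertex
     * @param recommended recommended parallelism per vertex, from the latest metrics window
     */
    boolean shouldRescale(
            Map<String, Integer> current, Map<String, Integer> recommended, Instant now) {
        boolean anyScaleUp = false;
        for (Map.Entry<String, Integer> entry : current.entrySet()) {
            String vertex = entry.getKey();
            int cur = entry.getValue();
            int rec = recommended.getOrDefault(vertex, cur);
            if (rec > cur) {
                anyScaleUp = true;
            }
            if (rec >= cur) {
                // Note1: no scale-down is pending for this vertex any more.
                scaleDownStart.remove(vertex);
            } else {
                // Keep the first start time; later recommendations do not reset it.
                scaleDownStart.putIfAbsent(vertex, now);
            }
        }

        // Find the earliest pending scale-down start time across all vertices.
        Instant earliest = null;
        for (Instant start : scaleDownStart.values()) {
            if (earliest == null || start.isBefore(earliest)) {
                earliest = start;
            }
        }
        boolean periodElapsed =
                earliest != null
                        && Duration.between(earliest, now).compareTo(lazyPeriod) > 0;

        // A scale-up forces a restart anyway, so pending scale-downs ride along.
        boolean trigger = anyScaleUp || periodElapsed;
        if (trigger) {
            // After rescaling, clean up the recorded start times.
            scaleDownStart.clear();
        }
        return trigger;
    }
}
```

With a lazy period of 30 minutes and the vertex1 timeline from Note2, this sketch does not trigger at 01:00:00 or 01:15:00 and triggers at 01:31:00, using whatever parallelism is recommended at that moment.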