Rui Fan created FLINK-36018:
-------------------------------

Summary: Support lazy scale down to avoid frequent rescaling
Key: FLINK-36018
URL: https://issues.apache.org/jira/browse/FLINK-36018
Project: Flink
Issue Type: Improvement
Components: Autoscaler
Reporter: Rui Fan
Assignee: Rui Fan

h1. Background & Motivation

We enabled autoscaler scaling for a few Flink production jobs. It works with the Adaptive Scheduler and the rescale API.

Scaling results:
* The recommended parallelism meets expectations most of the time.
* When the source traffic increases, the autoscaler scales up the job in time to prevent lag.
* When the source traffic decreases, the autoscaler scales down the job in time to save resources.
* {color:#de350b}*Pain point:*{color} Each job rescales more than 20 times a day (job.autoscaler.metrics.window=15 min by default).

As we all know, the job is unavailable for a while during each restart, for several reasons:
* Cancel job
* Request resources ([FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states] is optimizing it)
* Initialize task
* Restore state
* Catch up on the lag accumulated during the restart
* etc.

*{color:#de350b}Expectations:{color}*
* Scale up in time to prevent lag.
* Scale down lazily to reduce downtime, while ensuring resources are still released later.

h1. Solution:

Introduce job.autoscaler.scale-down.lazy-period; the default value could be 30 min.

Detailed strategies:
* Record the start time of the first scale-down event for each vertex separately. For example:
** vertex1: 2024-08-09 01:35:02
** vertex2: 2024-08-09 01:38:02
* Scaling down is triggered in either of the following cases:
** Any vertex needs to scale up
*** A job restart cannot be avoided, so also scale down any other vertices that need it.
*** After scaling down, clear the recorded scale-down start times.
** The scale-down lazy period has elapsed for any vertex
*** current time - min(start time for each vertex) > scale-down.lazy-period
*** This means that there was no scaling up during the scale-down lazy period.

Note1: If the recommended parallelism >= current parallelism, the scale-down start time is cleared for the current vertex.

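The strategy and Note1 above can be sketched in a few lines of Java. This is only an illustration of the proposed decision rule, not the Flink autoscaler implementation: the class name {{LazyScaleDownTracker}}, the method {{shouldRescale}}, and the use of plain string vertex IDs are all hypothetical and chosen for this example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the lazy scale-down rule; not part of the Flink autoscaler API. */
final class LazyScaleDownTracker {

    /** Stands in for job.autoscaler.scale-down.lazy-period. */
    private final Duration lazyPeriod;

    /** Start time of the first pending scale-down recommendation, per vertex. */
    private final Map<String, Instant> firstScaleDownAt = new HashMap<>();

    LazyScaleDownTracker(Duration lazyPeriod) {
        this.lazyPeriod = lazyPeriod;
    }

    /**
     * Returns true if the job should be rescaled now, using the latest
     * recommended parallelism for every vertex.
     */
    boolean shouldRescale(
            Instant now, Map<String, Integer> current, Map<String, Integer> recommended) {
        boolean anyScaleUp = false;
        for (Map.Entry<String, Integer> entry : current.entrySet()) {
            String vertex = entry.getKey();
            int cur = entry.getValue();
            int rec = recommended.getOrDefault(vertex, cur);
            if (rec > cur) {
                anyScaleUp = true;
            }
            if (rec >= cur) {
                // Note1: no scale-down wanted any more, forget the start time.
                firstScaleDownAt.remove(vertex);
            } else {
                // Keep the time of the FIRST scale-down recommendation only.
                firstScaleDownAt.putIfAbsent(vertex, now);
            }
        }

        // current time - min(start time for each vertex) > lazy period
        boolean lazyPeriodElapsed =
                firstScaleDownAt.values().stream()
                        .min(Instant::compareTo)
                        .map(start -> Duration.between(start, now).compareTo(lazyPeriod) > 0)
                        .orElse(false);

        if (anyScaleUp || lazyPeriodElapsed) {
            // The restart applies all pending scale-downs, so reset the tracking.
            firstScaleDownAt.clear();
            return true;
        }
        return false;
    }
}
```

In this sketch, a scale-up on any vertex triggers a rescale immediately and carries pending scale-downs along with it, while a scale-down on its own waits until the oldest recorded start time is more than the lazy period in the past. With a 30-minute period and a vertex at parallelism 100, recommendations of 60 at 01:00:00 and 50 at 01:15:00 do not trigger a rescale, whereas a recommendation of 40 at 01:31:00 does.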
Note2: The recommended parallelism still comes from the latest 15-minute metrics. For example:
* The current parallelism of vertex1 is 100, and the traffic decreases at night.
* 2024-08-09 01:00:00, the recommended parallelism is 60.
** The start time of scale down is 2024-08-09 01:00:00.
* 2024-08-09 01:15:00, the recommended parallelism is 50.
** Still within the scale-down lazy period.
** Don't update the start time of scale down.
* 2024-08-09 01:31:00, the recommended parallelism is 40.
** Outside of scale-down.lazy-period, so trigger a rescale and use 40 as the recommended parallelism.
** The job.autoscaler.metrics.window is 15 min, so the recommendation is based on metrics from 2024-08-09 01:16:00 to 2024-08-09 01:31:00.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)