[ https://issues.apache.org/jira/browse/FLINK-36018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rui Fan updated FLINK-36018:
----------------------------
    Description:
{*}{color:#de350b}Core idea{color}{*}: Make scaling up sensitive to prevent lags, and make scaling down insensitive to reduce restart frequency.

h1. Background & Motivation

We enabled autoscaler scaling for a few Flink production jobs. It works with the Adaptive Scheduler and the Rescale API.

Scaling results:
* The recommended parallelism meets expectations most of the time.
* When the source traffic increases, the autoscaler scales up the job in time to prevent lags.
* When the source traffic decreases, the autoscaler scales down the job in time to save resources.
* {color:#de350b}*Pain point:*{color} Each job rescales more than 20 times a day (job.autoscaler.metrics.window=15 min by default).

The job is unavailable for a while during each restart, for several reasons:
* Cancel job
* Request resources ([FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states] is optimizing it)
* Initialize task
* Restore state
* Catch up lag during restart
* etc.

*{color:#de350b}Expectations:{color}*
* Scale up in time to prevent lags.
* Scale down lazily to reduce downtime, while ensuring resources can still be released later.

h1. Solution:
* Introduce job.autoscaler.scale-down.interval; the default value could be 1 hour.
* Replace job.autoscaler.scale-up.grace-period with job.autoscaler.scale-down.interval.

Detailed strategies:
* Record the start time of the first scale-down event for each vertex separately. For example:
** vertex1: 2024-08-09 01:35:02
** vertex2: 2024-08-09 01:38:02
* Scaling down is triggered in either of the following cases:
** Any vertex needs to scale up
*** A job restart cannot be avoided, so scale down other vertices as well if needed.
*** After scaling down, clear the recorded scale-down start times.
** The scale-down lazy period has elapsed for any vertex
*** current time - min(start time for each vertex) > scale-down.lazy-period
*** This means that no scale-up happened during the scale-down lazy period.

Note1: If the recommended parallelism >= current parallelism, the scale-down start time is cleared for the current vertex.

Note2: The recommended parallelism still comes from the latest 15-minute metrics. For example:
* The current parallelism of vertex1 is 100, and the traffic decreases at night.
* 2024-08-09 01:00:00, the recommended parallelism is 60.
** The start time of scale down is 2024-08-09 01:00:00.
* 2024-08-09 01:15:00, the recommended parallelism is 50.
** Still within the scale-down lazy period.
** Don't update the start time of scale down.
* 2024-08-09 01:31:00, the recommended parallelism is 40.
** Outside of scale-down.lazy-period, so trigger a rescale and use 40 as the recommended parallelism.
** The job.autoscaler.metrics.window is 15 min, so the metrics from 2024-08-09 01:16:00 to 2024-08-09 01:31:00 are used.

  was:
{*}{color:#de350b}Core idea{color}{*}: Make scaling up sensitive to prevent lags, and make scaling down insensitive to reduce restart frequency.

h1. Background & Motivation

We enabled autoscaler scaling for a few Flink production jobs. It works with the Adaptive Scheduler and the Rescale API.

Scaling results:
* The recommended parallelism meets expectations most of the time.
* When the source traffic increases, the autoscaler scales up the job in time to prevent lags.
* When the source traffic decreases, the autoscaler scales down the job in time to save resources.
* {color:#de350b}*Pain point:*{color} Each job rescales more than 20 times a day (job.autoscaler.metrics.window=15 min by default).

The job is unavailable for a while during each restart, for several reasons:
* Cancel job
* Request resources ([FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states] is optimizing it)
* Initialize task
* Restore state
* Catch up lag during restart
* etc.

*{color:#de350b}Expectations:{color}*
* Scale up in time to prevent lags.
* Scale down lazily to reduce downtime, while ensuring resources can still be released later.

h1. Solution:

Introduce job.autoscaler.scale-down.lazy-period; the default value could be 30 min.

Detailed strategies:
* Record the start time of the first scale-down event for each vertex separately. For example:
** vertex1: 2024-08-09 01:35:02
** vertex2: 2024-08-09 01:38:02
* Scaling down is triggered in either of the following cases:
** Any vertex needs to scale up
*** A job restart cannot be avoided, so scale down other vertices as well if needed.
*** After scaling down, clear the recorded scale-down start times.
** The scale-down lazy period has elapsed for any vertex
*** current time - min(start time for each vertex) > scale-down.lazy-period
*** This means that no scale-up happened during the scale-down lazy period.

Note1: If the recommended parallelism >= current parallelism, the scale-down start time is cleared for the current vertex.

Note2: The recommended parallelism still comes from the latest 15-minute metrics. For example:
* The current parallelism of vertex1 is 100, and the traffic decreases at night.
* 2024-08-09 01:00:00, the recommended parallelism is 60.
** The start time of scale down is 2024-08-09 01:00:00.
* 2024-08-09 01:15:00, the recommended parallelism is 50.
** Still within the scale-down lazy period.
** Don't update the start time of scale down.
* 2024-08-09 01:31:00, the recommended parallelism is 40.
** Outside of scale-down.lazy-period, so trigger a rescale and use 40 as the recommended parallelism.
** The job.autoscaler.metrics.window is 15 min, so the metrics from 2024-08-09 01:16:00 to 2024-08-09 01:31:00 are used.

> Support lazy scale down to avoid frequent rescaling
> ---------------------------------------------------
>
>                 Key: FLINK-36018
>                 URL: https://issues.apache.org/jira/browse/FLINK-36018
>             Project: Flink
>          Issue Type: Improvement
>          Components: Autoscaler
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
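
The detailed strategies in the description can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the autoscaler's actual implementation: the class name LazyScaleDownSketch, the method shouldRescale, and the use of plain vertex-name strings are all hypothetical, and the lazy period is passed in by the caller rather than read from a real configuration option.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the lazy scale-down decision; not a real Flink autoscaler class. */
class LazyScaleDownSketch {
    private final Duration lazyPeriod;
    // Per-vertex time at which a lower parallelism was first recommended.
    private final Map<String, Instant> scaleDownStart = new HashMap<>();

    LazyScaleDownSketch(Duration lazyPeriod) {
        this.lazyPeriod = lazyPeriod;
    }

    /** Returns true if the job should be rescaled now with the latest recommendations. */
    boolean shouldRescale(
            Map<String, Integer> current, Map<String, Integer> recommended, Instant now) {
        boolean anyScaleUp = false;
        for (Map.Entry<String, Integer> entry : current.entrySet()) {
            String vertex = entry.getKey();
            int cur = entry.getValue();
            int rec = recommended.getOrDefault(vertex, cur);
            if (rec > cur) {
                anyScaleUp = true;
            }
            if (rec >= cur) {
                // Note1: no scale-down wanted any more, so forget the start time.
                scaleDownStart.remove(vertex);
            } else {
                // Keep the first scale-down time; later recommendations do not reset it.
                scaleDownStart.putIfAbsent(vertex, now);
            }
        }
        // current time - min(start time for each vertex) > lazy period
        boolean lazyPeriodElapsed = scaleDownStart.values().stream()
                .min(Instant::compareTo)
                .map(start -> Duration.between(start, now).compareTo(lazyPeriod) > 0)
                .orElse(false);
        if (anyScaleUp || lazyPeriodElapsed) {
            // The restart applies all pending changes, so clear the recorded start times.
            scaleDownStart.clear();
            return true;
        }
        return false;
    }
}
```

With a 30-minute lazy period and vertex1 at parallelism 100, calling shouldRescale with recommendations of 60 at 01:00:00 and 50 at 01:15:00 returns false, and the call with 40 at 01:31:00 returns true, which matches the worked example in Note2. A recommendation above the current parallelism for any vertex returns true immediately.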