[ 
https://issues.apache.org/jira/browse/FLINK-36018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Fan updated FLINK-36018:
----------------------------
    Description: 
{*}{color:#de350b}Core idea{color}{*}: Make scaling up sensitive to prevent 
lags, and make scaling down insensitive to reduce restart frequency.
h1. Background & Motivation

We enabled autoscaler scaling for a few Flink production jobs. It works with 
the Adaptive Scheduler and the Rescale API.

Scaling results:
 * The recommended parallelism meets expectations most of the time
 * When the source traffic increases, the autoscaler scales up the job in time 
to prevent lags.
 * When the source traffic decreases, the autoscaler scales down the job in 
time to save resources.
 * {color:#de350b}*Pain point:*{color} Each job rescales more than 20 times a 
day (job.autoscaler.metrics.window=15 min by default).

The job is unavailable for a while during each restart, for several reasons:
 * Cancel job
 * Request resources 
([FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states]
 is optimizing it)
 * Initialize task
 * Restore state
 * Catch up lag during restart
 * etc

*{color:#de350b}Expectations:{color}*
 * Scaling up in time to prevent lags.
 * Scale down lazily to reduce downtime, while ensuring that resources are 
still released later.

h1. Solution:
 * Introduce job.autoscaler.scale-down.interval, the default value could be 1 
hour.
 * Replace job.autoscaler.scale-up.grace-period with 
job.autoscaler.scale-down.interval

Detailed strategies:
 * Record the start time of the first scale-down event for each vertex 
separately. For example:
 ** vertex1: 2024-08-09 01:35:02
 ** vertex2: 2024-08-09 01:38:02
 * Scaling down is triggered in either of the following cases:
 ** Any vertex needs to scale up
 *** A job restart cannot be avoided, so scale down the other vertices as well 
if needed
 *** After scaling down, clear the recorded scale-down start times.
 ** The scale-down interval has elapsed for any vertex
 *** current time - min(start time for each vertex) > scale-down.interval
 *** This means that no scale-up happened during the scale-down interval
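
The trigger rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the autoscaler's actual implementation: the ScaleDownTracker class, its method names and the in-memory dictionary are hypothetical, and the delay is passed in as a plain timedelta rather than read from the job configuration.

```python
from datetime import datetime, timedelta


class ScaleDownTracker:
    """Hypothetical helper: remembers when each vertex first asked to scale down."""

    def __init__(self, delay: timedelta) -> None:
        self.delay = delay  # the configured scale-down delay
        self.first_request: dict[str, datetime] = {}  # vertex -> first request time

    def observe(self, vertex: str, current: int, recommended: int,
                now: datetime) -> None:
        """Record or clear the scale-down start time for one vertex."""
        if recommended < current:
            # Keep the earliest time; later recommendations do not reset it.
            self.first_request.setdefault(vertex, now)
        else:
            # Recommendation >= current parallelism: forget the pending request.
            self.first_request.pop(vertex, None)

    def should_rescale(self, any_scale_up: bool, now: datetime) -> bool:
        """Return True if the job should restart with the latest recommendations."""
        if any_scale_up:
            # The restart is unavoidable, so pending scale-downs ride along.
            return True
        if not self.first_request:
            return False
        # The delay has elapsed for the vertex that asked first.
        return now - min(self.first_request.values()) > self.delay

    def clear(self) -> None:
        """Forget all recorded start times after a rescale was applied."""
        self.first_request.clear()
```

A caller would invoke observe() for every vertex on each evaluation, then should_rescale(), and finally clear() once the rescale has been applied.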

Note1: If the recommended parallelism >= current parallelism, the scale-down 
start time is cleared for that vertex.

Note2: The recommended parallelism still comes from the latest 15-minute 
metrics. For example (assuming a scale-down interval of 30 min):
 * The current parallelism of vertex1 is 100, and the traffic decreases at 
night.
 * 2024-08-09 01:00:00, the recommended parallelism is 60.
 ** The start time of scale down is 2024-08-09 01:00:00.
 * 2024-08-09 01:15:00, the recommended parallelism is 50.
 ** Still within the scale-down interval.
 ** Don't update the start time of scale down.
 * 2024-08-09 01:31:00, the recommended parallelism is 40.
 ** Outside of the scale-down interval, so trigger a rescale and use 40 as the 
recommended parallelism.
 ** The job.autoscaler.metrics.window is 15 min, so the metrics from 2024-08-09 
01:16:00 to 2024-08-09 01:31:00 are used.
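
The timeline above can be replayed with a small Python sketch. It is an illustration only: the replay function is hypothetical, and the 30-minute delay is an assumption chosen to match the example (it is not the proposed 1-hour default).

```python
from datetime import datetime, timedelta

# Assumed delay for this example only; not the proposed default.
DELAY = timedelta(minutes=30)


def replay(events, current, delay=DELAY):
    """Return (time, parallelism) of the triggered scale-down, or None.

    events is a time-ordered list of (timestamp, recommended parallelism).
    """
    first_request = None
    for now, recommended in events:
        if recommended >= current:
            # No scale-down wanted any more: clear the start time.
            first_request = None
            continue
        if first_request is None:
            # First scale-down request: remember when it started.
            first_request = now
        if now - first_request > delay:
            # Delay elapsed: rescale using the latest recommendation.
            return now, recommended
    return None


events = [
    (datetime(2024, 8, 9, 1, 0, 0), 60),
    (datetime(2024, 8, 9, 1, 15, 0), 50),
    (datetime(2024, 8, 9, 1, 31, 0), 40),
]
result = replay(events, current=100)
```

Here result is the 01:31:00 event with parallelism 40. The earlier recommendations of 60 and 50 only start and keep the timer; they are never applied.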

  was:
{*}{color:#de350b}Core idea{color}{*}: Make scaling up sensitive to prevent 
lags, and make scaling down insensitive to reduce restart frequency.
h1. Background & Motivation

We enabled autoscaler scaling for a few flink production jobs. It works with 
Adaptive Scheduler and Rescale api.

Scaling results:
 * The recommended parallelism meets expectations most of the time
 * When the source traffic increases, the autoscaler scales up the job in time 
to prevent lags.
 * When the source traffic decreases, the autoscaler scales down job in time to 
save resources
 * {color:#de350b}*Pain point:*{color} Each job rescales more than 20 times a 
day (job.autoscaler.metrics.window=15 min by default).

As we all know, the job will be unavailable for a while during the restart for 
some reasons:
 * Cancel job
 * Request resources( 
[FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states]
 is optimizing it)
 * Initialize task
 * Restore state
 * Catch up lag during restart
 * etc

*{color:#de350b}Expectations:{color}*
 * Scaling up in time to prevent lags.
 * Lazy scaling down to reduce downtime and ensure resources can be released 
later.

h1. Solution:

Introduce job.autoscaler.scale-down.lazy-period, the default value could be 30 
min.

Detailed strategies:
 * Record the start time of the first scale-down event for each vertex 
separately. For example:
 ** vertex1: 2024-08-09 01:35:02
 ** vertex2: 2024-08-09 01:38:02
 * Scaling down will be triggered for some cases:
 ** Any vertex needs scale up
 *** Job restart cannot be avoided, so trigger scale down for another vertex as 
well if needed
 *** After scale down, clean up the start time of scale-down.
 ** The scale down lazy period for any vertex is coming
 *** current time - min(start time for each vertex) > scale-down.lazy-period
 *** This means that there is no scaling up during the scaling down lazy period

Note1: If the recommend parallelism >= current parallelism, the start time of 
scale-down will be cleaned up for current vertex.

Note2: The recommended parallelism still comes from the latest 15-minute 
metrics.For example:
 * The current parallelism of vertex1 is 100, the traffic is decreased at night.
 * 2024-08-09 01:00:00, the recommended parallelism is 60.

 * 
 ** The start time of scale down is 2024-08-09 01:00:00.
 * 2024-08-09 01:15:00, the recommended parallelism is 50.
 ** Still within the range of scale down lazy period.
 ** Don't update the start time of scale down.
 * 2024-08-09 01:31:00, the recommended parallelism is 40.
 ** Outside of scale-down.lazy-period, trigger rescale, and use 40 as the 
recommended parallelism.
 ** The job.autoscaler.metrics.window is 15 min, so metrics from 2024-08-09 
01:16:00 to 2024-08-09 01:31:00


> Support lazy scale down to avoid frequent rescaling
> ---------------------------------------------------
>
>                 Key: FLINK-36018
>                 URL: https://issues.apache.org/jira/browse/FLINK-36018
>             Project: Flink
>          Issue Type: Improvement
>          Components: Autoscaler
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
