[ https://issues.apache.org/jira/browse/FLINK-36531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sai Sharath Dandi updated FLINK-36531: -------------------------------------- Description: Autoscaler computes the target processing capacity as [below|https://sg.uberinternal.com/code.uber.internal/uber-code/data-flink-kubernetes-operator@release-1.9-uber/-/blob/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/utils/AutoScalerUtils.java?L47] // Target = LAG/CATCH_UP + INPUT_RATE*RESTART/CATCH_UP + INPUT_RATE/TARGET_UTIL During the scaling action, the autoscaler will restart the job from the last successful checkpoint, we need to include the number of processed records since last successful checkpoint as part of the lag as those records will be replayed after scaling. This is particularly important for jobs with long checkpoint intervals and large state as there could be a significant difference between the realtime lag and the lag from the checkpoint was: Autoscaler computes the target processing capacity as [below|https://sg.uberinternal.com/code.uber.internal/uber-code/data-flink-kubernetes-operator@release-1.9-uber/-/blob/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/utils/AutoScalerUtils.java?L47] // Target = LAG/CATCH_UP + INPUT_RATE*RESTART/CATCH_UP + INPUT_RATE/TARGET_UTIL During the scaling action, the autoscaler will start from the last successful checkpoint, we need to include the number of processed records since last successful checkpoint as part of the lag as those records will be replayed after scaling. > AutoScaler needs to consider the lag from last checkpoint > --------------------------------------------------------- > > Key: FLINK-36531 > URL: https://issues.apache.org/jira/browse/FLINK-36531 > Project: Flink > Issue Type: Improvement > Components: Autoscaler > Reporter: Sai Sharath Dandi > Priority: Major > > Autoscaler computes the target processing capacity as > [below|https://sg.uberinternal.com/code.uber.internal/uber-code/data-flink-kubernetes-operator@release-1.9-uber/-/blob/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/utils/AutoScalerUtils.java?L47] > // Target = LAG/CATCH_UP + INPUT_RATE*RESTART/CATCH_UP + > INPUT_RATE/TARGET_UTIL > > During the scaling action, the autoscaler will restart the job from the last > successful checkpoint, we need to include the number of processed records > since last successful checkpoint as part of the lag as those records will be > replayed after scaling. This is particularly important for jobs with long > checkpoint intervals and large state as there could be a significant > difference between the realtime lag and the lag from the checkpoint -- This message was sent by Atlassian Jira (v8.20.10#820010)