[jira] [Comment Edited] (FLINK-36531) AutoScaler needs to consider the lag from last checkpoint

Sai Sharath Dandi (Jira) Tue, 15 Oct 2024 18:02:09 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-36531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889879#comment-17889879
 ]


Sai Sharath Dandi edited comment on FLINK-36531 at 10/16/24 12:11 AM:
----------------------------------------------------------------------

[~heigebupahei] I have checked the FLIP and it's exactly what we're looking 
for. We're interested in the future optimization that handles the case of a 
large checkpoint interval by rescaling early rather than delaying the scaling 
until the next checkpoint. Since this will be a contribution on the scheduler 
side rather than the Autoscaler, I will close this JIRA.


was (Author: JIRAUSER298466):
[~heigebupahei] I have checked the FLIP and it's exactly what we're looking 
for. We're interested in the future optimization that handles the case of a 
large checkpoint interval by rescaling early rather than delaying the scaling 
until the next checkpoint. I will close this JIRA.

> AutoScaler needs to consider the lag from last checkpoint
> ---------------------------------------------------------
>
>                 Key: FLINK-36531
>                 URL: https://issues.apache.org/jira/browse/FLINK-36531
>             Project: Flink
>          Issue Type: Improvement
>          Components: Autoscaler
>            Reporter: Sai Sharath Dandi
>            Priority: Major
>
> Autoscaler computes the target processing capacity as 
> [below|https://sg.uberinternal.com/code.uber.internal/uber-code/data-flink-kubernetes-operator@release-1.9-uber/-/blob/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/utils/AutoScalerUtils.java?L47]
> // Target = LAG/CATCH_UP + INPUT_RATE*RESTART/CATCH_UP + INPUT_RATE/TARGET_UTIL
>  
> During a scaling action, the autoscaler restarts the job from the last 
> successful checkpoint. The lag should therefore include the number of records 
> processed since that checkpoint, because those records will be replayed after 
> scaling. This is particularly important for jobs with long checkpoint 
> intervals and large state, where the real-time lag and the lag measured from 
> the checkpoint can differ significantly.
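
The adjustment described in the issue can be illustrated with a minimal 
sketch. It is not the operator's actual code: the function and parameter 
names are hypothetical, and estimating the replayed records as 
processing rate times the time since the last checkpoint is an assumption 
made only for illustration.

```python
def replayed_records(processing_rate: float, secs_since_checkpoint: float) -> float:
    """Rough estimate (assumption) of records that will be re-read after a restart."""
    return processing_rate * secs_since_checkpoint


def target_capacity(
    lag: float,
    input_rate: float,
    restart_secs: float,
    catch_up_secs: float,
    target_util: float,
    replayed: float = 0.0,
) -> float:
    """Target = LAG/CATCH_UP + INPUT_RATE*RESTART/CATCH_UP + INPUT_RATE/TARGET_UTIL.

    `replayed` is added to the lag so that records processed since the last
    successful checkpoint are counted as work to redo after scaling.
    """
    if catch_up_secs <= 0:
        raise ValueError("catch_up_secs must be positive")
    if not 0 < target_util <= 1:
        raise ValueError("target_util must be in (0, 1]")
    effective_lag = lag + replayed
    return (
        effective_lag / catch_up_secs
        + input_rate * restart_secs / catch_up_secs
        + input_rate / target_util
    )


# Example numbers are made up: 60k records of lag, 1000 rec/s input,
# 60 s restart, 600 s catch-up window, 80% target utilization.
without_replay = target_capacity(60_000, 1000, 60, 600, 0.8)
with_replay = target_capacity(
    60_000, 1000, 60, 600, 0.8,
    replayed=replayed_records(1000, 300),  # last checkpoint 5 minutes ago
)
```

With these made-up inputs the target rises from about 1450 to about 1950 
records per second, which shows how a long checkpoint interval can widen 
the gap between the real-time lag and the lag seen after a restart.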



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
