[ https://issues.apache.org/jira/browse/FLINK-36512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895509#comment-17895509 ]
Prashant Bhardwaj commented on FLINK-36512:
-------------------------------------------

[~mapohl] Yes, I am still interested. Can you please assign it to me?

> Make rescale trigger based on failed checkpoints depend on the cause
> --------------------------------------------------------------------
>
>                 Key: FLINK-36512
>                 URL: https://issues.apache.org/jira/browse/FLINK-36512
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 2.0.0
>            Reporter: Matthias Pohl
>            Priority: Major
>
> [FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
> introduced rescale on checkpoints. The trigger logic is also initiated for
> failed checkpoints (after a counter reached a configurable limit).
> The issue here is that we might end up considering failed checkpoints which
> we actually don't want to care about (e.g. checkpoint failures due to not all
> tasks running, yet). Instead, we should start considering checkpoints only if
> the job started running to avoid unnecessary (premature) rescale decisions.
> We already have logic like that in place in the
> [CheckpointCoordinator|https://github.com/apache/flink/blob/8be94e6663d8ac6e3d74bf4cd5f540cc96c8289e/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java#L217]
> which we might want to use here as well.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)