Matthias Pohl created FLINK-36512:
-------------------------------------

             Summary: Make rescale trigger based on failed checkpoints depend on the cause
                 Key: FLINK-36512
                 URL: https://issues.apache.org/jira/browse/FLINK-36512
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 2.0.0
            Reporter: Matthias Pohl


[FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
 introduced rescaling on checkpoints. The rescale trigger logic also fires for 
failed checkpoints, once a counter of failed checkpoints reaches a 
configurable limit.

The issue is that the trigger might count failed checkpoints that we don't 
actually care about (e.g. checkpoints that fail because not all tasks are 
running yet). Instead, we should only start counting failed checkpoints once 
the job is running, to avoid unnecessary (premature) rescale decisions.

We already have logic like that in place in the 
[CheckpointFailureManager|https://github.com/apache/flink/blob/8be94e6663d8ac6e3d74bf4cd5f540cc96c8289e/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java#L217]
 used by the CheckpointCoordinator, which we might want to reuse here as well.
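
The cause-dependent counting described above could be sketched roughly as 
follows. This is only an illustration under stated assumptions, not a proposed 
patch: the class FailedCheckpointRescaleTrigger, the FailureCause enum and its 
values, and the reset-after-trigger behaviour are all hypothetical names and 
choices made for this example and are not Flink's actual API.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch: counts only failed checkpoints whose cause is relevant for rescaling. */
final class FailedCheckpointRescaleTrigger {

    /** Hypothetical failure causes (illustrative only, not Flink's real enum). */
    public enum FailureCause {
        NOT_ALL_TASKS_RUNNING,
        CHECKPOINT_EXPIRED,
        CHECKPOINT_DECLINED
    }

    /** Causes that should not count towards the rescale limit (assumption). */
    private static final Set<FailureCause> IGNORED_CAUSES =
            EnumSet.of(FailureCause.NOT_ALL_TASKS_RUNNING);

    private final int maxFailedCheckpoints;
    private int failedCheckpoints;

    public FailedCheckpointRescaleTrigger(int maxFailedCheckpoints) {
        if (maxFailedCheckpoints < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.maxFailedCheckpoints = maxFailedCheckpoints;
    }

    /**
     * Records a failed checkpoint.
     *
     * @return true if the configured limit was reached and a rescale should be triggered
     */
    public boolean onCheckpointFailed(FailureCause cause) {
        if (IGNORED_CAUSES.contains(cause)) {
            // e.g. the job is not fully running yet: do not count this failure
            return false;
        }
        failedCheckpoints++;
        if (failedCheckpoints >= maxFailedCheckpoints) {
            // assumption: the counter starts over once a rescale was triggered
            failedCheckpoints = 0;
            return true;
        }
        return false;
    }
}
```

With a limit of 2, any number of NOT_ALL_TASKS_RUNNING failures leaves the 
counter untouched, while the second CHECKPOINT_EXPIRED failure triggers the 
rescale.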



--
This message was sent by Atlassian Jira
(v8.20.10#820010)