Matthias Pohl created FLINK-36512:
-------------------------------------

             Summary: Make rescale trigger based on failed checkpoints depend on the cause
                 Key: FLINK-36512
                 URL: https://issues.apache.org/jira/browse/FLINK-36512
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 2.0.0
            Reporter: Matthias Pohl

[FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler] introduced rescaling on checkpoints. The trigger logic is also invoked for failed checkpoints, once a counter of failures reaches a configurable limit.

The problem is that this counter may include failed checkpoints that we do not actually care about (e.g. checkpoint failures caused by not all tasks running yet). Instead, we should start counting failed checkpoints only once the job is running, to avoid unnecessary (premature) rescale decisions.

Logic like that already exists in the [CheckpointFailureManager|https://github.com/apache/flink/blob/8be94e6663d8ac6e3d74bf4cd5f540cc96c8289e/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java#L217], which we might want to reuse here as well.
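
A minimal sketch of the cause-dependent counting proposed in this issue might look as follows. It is not Flink code: the class {{CauseAwareRescaleTrigger}}, the enum {{FailureCause}} and its constants, and the method names are all hypothetical and chosen only for illustration. Resetting the counter on a completed checkpoint is likewise an assumption of this sketch.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; none of these names are Flink APIs. */
class CauseAwareRescaleTrigger {

    /** Illustrative failure causes (an assumption, not Flink's own enum). */
    enum FailureCause {
        NOT_ALL_TASKS_RUNNING,
        CHECKPOINT_EXPIRED,
        CHECKPOINT_DECLINED
    }

    // Causes that say nothing about a running job, so they are not counted.
    private static final Set<FailureCause> IGNORED_CAUSES =
            EnumSet.of(FailureCause.NOT_ALL_TASKS_RUNNING);

    private final int maxFailedCheckpoints;
    private int failedCheckpoints;

    CauseAwareRescaleTrigger(int maxFailedCheckpoints) {
        if (maxFailedCheckpoints < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.maxFailedCheckpoints = maxFailedCheckpoints;
    }

    /** Returns true if this failure should trigger a rescale. */
    boolean onCheckpointFailed(FailureCause cause) {
        if (IGNORED_CAUSES.contains(cause)) {
            return false; // ignored cause: counter stays untouched
        }
        failedCheckpoints++;
        if (failedCheckpoints >= maxFailedCheckpoints) {
            failedCheckpoints = 0; // start counting afresh after triggering
            return true;
        }
        return false;
    }

    /** Assumption of this sketch: a completed checkpoint resets the counter. */
    void onCheckpointCompleted() {
        failedCheckpoints = 0;
    }

    public static void main(String[] args) {
        CauseAwareRescaleTrigger trigger = new CauseAwareRescaleTrigger(2);
        System.out.println(trigger.onCheckpointFailed(FailureCause.NOT_ALL_TASKS_RUNNING));
        System.out.println(trigger.onCheckpointFailed(FailureCause.CHECKPOINT_EXPIRED));
        System.out.println(trigger.onCheckpointFailed(FailureCause.CHECKPOINT_DECLINED));
    }
}
```

With a limit of 2, failures caused by tasks not running yet never advance the counter, so only the second counted failure requests a rescale. Whether the real implementation should filter by cause in this way, or reuse the existing filtering linked above, is left open by the issue.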