[ 
https://issues.apache.org/jira/browse/FLINK-36512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895509#comment-17895509
 ] 

Prashant Bhardwaj commented on FLINK-36512:
-------------------------------------------

[~mapohl] Yes, I am still interested. Can you please assign it to me?

> Make rescale trigger based on failed checkpoints depend on the cause
> --------------------------------------------------------------------
>
>                 Key: FLINK-36512
>                 URL: https://issues.apache.org/jira/browse/FLINK-36512
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 2.0.0
>            Reporter: Matthias Pohl
>            Priority: Major
>
> [FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
>  introduced rescale on checkpoints. The trigger logic is also initiated for 
> failed checkpoints (once a counter reaches a configurable limit).
> The issue here is that we might end up counting failed checkpoints that 
> we don't actually care about (e.g. checkpoint failures caused by not all 
> tasks running yet). Instead, we should start considering checkpoints only 
> once the job has started running, to avoid unnecessary (premature) rescale decisions.
> We already have logic like that in place in the 
> [CheckpointFailureManager|https://github.com/apache/flink/blob/8be94e6663d8ac6e3d74bf4cd5f540cc96c8289e/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java#L217]
>  which we might want to use here as well.
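
The cause-dependent counting described in the issue could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name, the Cause enum and its values, and the reset-after-trigger behaviour are all hypothetical and are not taken from the Flink code base or from FLIP-461.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; none of these names come from Flink. */
final class FailedCheckpointRescaleTriggerSketch {

    /** Hypothetical failure causes, for illustration only. */
    enum Cause { NOT_ALL_TASKS_RUNNING, CHECKPOINT_EXPIRED, CHECKPOINT_DECLINED }

    // Causes that say nothing about a running job and are therefore not counted.
    private static final Set<Cause> IGNORED_CAUSES =
            EnumSet.of(Cause.NOT_ALL_TASKS_RUNNING);

    private final int failureLimit; // stands in for the configurable limit
    private int countedFailures;

    FailedCheckpointRescaleTriggerSketch(int failureLimit) {
        if (failureLimit < 1) {
            throw new IllegalArgumentException("failureLimit must be >= 1");
        }
        this.failureLimit = failureLimit;
    }

    /**
     * Records a failed checkpoint and returns true if a rescale should be
     * triggered. Failures with an ignored cause never advance the counter.
     */
    boolean onCheckpointFailed(Cause cause) {
        if (IGNORED_CAUSES.contains(cause)) {
            return false;
        }
        countedFailures++;
        if (countedFailures >= failureLimit) {
            countedFailures = 0; // assumption: start counting afresh after a trigger
            return true;
        }
        return false;
    }
}
```

With a limit of 2, any number of NOT_ALL_TASKS_RUNNING failures leaves the counter untouched, and the trigger fires only on the second failure with a counted cause.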



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
