[ 
https://issues.apache.org/jira/browse/FLINK-37232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17923772#comment-17923772
 ] 

Matthias Pohl edited comment on FLINK-37232 at 2/5/25 7:25 AM:
---------------------------------------------------------------

master: 
[5b1d0081b47a05fdd67b2ed89e1cc85dff196c73|https://github.com/apache/flink/commit/5b1d0081b47a05fdd67b2ed89e1cc85dff196c73]
2.0: 
[1ff7433bcca8b3a2792df2e5a3b0b421bfe4ba4c|https://github.com/apache/flink/commit/1ff7433bcca8b3a2792df2e5a3b0b421bfe4ba4c]

2.0-preview-rc1 was skipped because it is only a milestone version of 2.0.


was (Author: mapohl):
master: 
[5b1d0081b47a05fdd67b2ed89e1cc85dff196c73|https://github.com/apache/flink/commit/5b1d0081b47a05fdd67b2ed89e1cc85dff196c73]
2.0: 
[1ff7433bcca8b3a2792df2e5a3b0b421bfe4ba4c|https://github.com/apache/flink/commit/1ff7433bcca8b3a2792df2e5a3b0b421bfe4ba4c]

> FLIP-472 breaks some synchronization assumption on the AdaptiveScheduler's 
> side
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-37232
>                 URL: https://issues.apache.org/jira/browse/FLINK-37232
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 2.0.0, 2.0-preview
>            Reporter: Matthias Pohl
>            Assignee: Zdenek Tison
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 2.0.0, 2.1.0
>
>
> We noticed some unexpected behavior with the AdaptiveScheduler causing a job 
> to reach the FAILED state due to a {{NoResourceAvailableException}}. The 
> cause was that a TaskManager shut down while the job was performing a 
> rescaling operation.
> [~chesnay] did a bit of digging and identified an issue with the state 
> transition shortcut that was introduced in 
> [FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states]
>  (skipping {{WaitingForResources}} when moving from {{Restarting}} to 
> {{CreatingExecutionGraph}} as part of the rescale operation).
> The underlying assumption is that determining the parallelism and triggering 
> the state transition from {{WaitingForResources}} into 
> {{CreatingExecutionGraph}} happen in a single synchronous operation. No TM 
> shutdown event can be processed in between, so the {{determineParallelism}} 
> call never fails.
> With the FLIP-472 approach, we call {{determineParallelism}} twice, 
> independently of each other:
> * When coming up with the rescale decision
> * When creating the ExecutionGraph after the job was cancelled
> Anything can happen between the two calls; for example, TM shutdown events 
> can be processed. That can cause the second {{determineParallelism}} call in 
> the {{CreatingExecutionGraph}} state transition to fail (due to resources 
> not being available), which is not properly handled in 
> {{CreatingExecutionGraph#handleExecutionGraphCreation}}.
> Right now, the expected behavior is that the {{determineParallelism}} call 
> succeeds and only the subsequent slot allocation might fail. If the slot 
> allocation fails, the scheduler transitions back to {{WaitingForResources}}.
> This issue can be resolved in two ways:
> * Handle the {{NoResourceAvailableException}} in the 
> {{CreatingExecutionGraph}} state
> * Pass the available VertexParallelism that led to the rescale decision to 
> the {{Restarting}} state and check, once the job is cancelled, whether that 
> parallelism changed. If it didn't change, we can transition to 
> {{CreatingExecutionGraph}}. If it did change in the meantime, we should 
> transition to {{WaitingForResources}} and wait for the resources in 
> another round.
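
The second option in the description can be illustrated with a small, self-contained sketch. This is not Flink code: the class {{RescaleCheckSketch}}, the {{NextState}} enum, the use of a plain integer as the parallelism, and the simplified {{determineParallelism}} signature are all assumptions made purely for illustration. The sketch only shows the idea of comparing the parallelism computed at rescale-decision time with the one computed after the job was cancelled.

```java
import java.util.Optional;

final class RescaleCheckSketch {

    /** Hypothetical stand-in for the two target states named in the issue. */
    enum NextState { CREATING_EXECUTION_GRAPH, WAITING_FOR_RESOURCES }

    /**
     * Hypothetical, simplified stand-in for determineParallelism:
     * returns empty if fewer slots than the minimum parallelism are free.
     */
    static Optional<Integer> determineParallelism(
            int freeSlots, int minParallelism, int maxParallelism) {
        if (freeSlots < minParallelism) {
            return Optional.empty();
        }
        return Optional.of(Math.min(freeSlots, maxParallelism));
    }

    /**
     * Decides the next state once cancellation has finished: take the
     * shortcut only if the parallelism that led to the rescale decision
     * is still achievable and unchanged.
     */
    static NextState afterCancellation(
            Optional<Integer> parallelismAtDecision,
            Optional<Integer> parallelismNow) {
        boolean unchanged = parallelismAtDecision.isPresent()
                && parallelismAtDecision.equals(parallelismNow);
        return unchanged
                ? NextState.CREATING_EXECUTION_GRAPH
                : NextState.WAITING_FOR_RESOURCES;
    }

    public static void main(String[] args) {
        // Rescale decision taken while 4 slots are free (min 2, max 8).
        Optional<Integer> atDecision = determineParallelism(4, 2, 8);

        // No TM was lost while the job was being cancelled.
        System.out.println(
                afterCancellation(atDecision, determineParallelism(4, 2, 8)));

        // A TM shut down in between and only 1 slot is left.
        System.out.println(
                afterCancellation(atDecision, determineParallelism(1, 2, 8)));
    }
}
```

In this sketch, a changed or unavailable parallelism never reaches the graph-creation step, so the failure described above would surface as another wait for resources rather than as a FAILED job.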



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
