[ https://issues.apache.org/jira/browse/FLINK-37232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17923772#comment-17923772 ]
Matthias Pohl edited comment on FLINK-37232 at 2/5/25 7:25 AM:
---------------------------------------------------------------

master: [5b1d0081b47a05fdd67b2ed89e1cc85dff196c73|https://github.com/apache/flink/commit/5b1d0081b47a05fdd67b2ed89e1cc85dff196c73]
2.0: [1ff7433bcca8b3a2792df2e5a3b0b421bfe4ba4c|https://github.com/apache/flink/commit/1ff7433bcca8b3a2792df2e5a3b0b421bfe4ba4c]

2.0-preview-rc1 was skipped because it's just a milestone version of 2.0.


was (Author: mapohl):
    master: [5b1d0081b47a05fdd67b2ed89e1cc85dff196c73|https://github.com/apache/flink/commit/5b1d0081b47a05fdd67b2ed89e1cc85dff196c73]
    2.0: [1ff7433bcca8b3a2792df2e5a3b0b421bfe4ba4c|https://github.com/apache/flink/commit/1ff7433bcca8b3a2792df2e5a3b0b421bfe4ba4c]

> FLIP-272 breaks some synchronization assumption on the AdaptiveScheduler's side
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-37232
>                 URL: https://issues.apache.org/jira/browse/FLINK-37232
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 2.0.0, 2.0-preview
>            Reporter: Matthias Pohl
>            Assignee: Zdenek Tison
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 2.0.0, 2.1.0
>
>
> We noticed some unexpected behavior with the AdaptiveScheduler that caused a
> job to reach the FAILED state due to a {{NoResourceAvailableException}}. The
> trigger was a TaskManager shutting down while the job was performing a
> rescaling operation.
> [~chesnay] did a bit of digging and identified an issue with the state
> transition shortcut that was introduced in
> [FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states]
> (skipping {{WaitingForResources}} when moving from {{Restarting}} to
> {{CreatingExecutionGraph}} as part of the rescale operation).
> When transitioning from {{WaitingForResources}} into
> {{CreatingExecutionGraph}}, determining the parallelism and triggering the
> state transition happen in a single synchronous operation. No TM shutdown
> event can be processed in between, so the {{determineParallelism}} call
> never fails on that path.
> With the FLIP-472 approach, we call {{determineParallelism}} twice,
> independently of each other:
> * When coming up with the rescale decision
> * When creating the ExecutionGraph after the job was cancelled
> In between the two operations, anything can happen, e.g. TM shutdown events
> can be processed. That can cause the second {{determineParallelism}} call in
> the {{CreatingExecutionGraph}} state transition to fail (due to resources not
> being available), which is not properly handled in
> {{CreatingExecutionGraph#handleExecutionGraphCreation}}.
> Right now, the expected behavior is that the {{determineParallelism}} call
> succeeds and only the subsequent slot allocation might fail. If the slot
> allocation fails, the scheduler transitions back to {{WaitingForResources}}.
> This issue can be resolved in two ways:
> * Handle the {{NoResourceAvailableException}} in the
> {{CreatingExecutionGraph}} state
> * Pass the available VertexParallelism that led to the rescale decision to
> the {{Restarting}} state and check, once the job is cancelled, whether that
> parallelism changed. If it didn't change, we can transition to
> {{CreatingExecutionGraph}}. If it changed in the meantime, we should
> transition to {{WaitingForResources}} and wait for the resources in
> another round.
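
The second option from the description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual fix and not Flink's API: the class {{RescaleCheckSketch}}, the {{NextState}} enum, the {{afterCancellation}} method and the slot-counting stand-in for {{determineParallelism}} are all hypothetical and only mimic the decision logic (remember the parallelism behind the rescale decision, recompute it once the job is cancelled, and compare).

```java
import java.util.Optional;

// Toy model only: these are NOT Flink's classes. All names are hypothetical.
class RescaleCheckSketch {

    enum NextState { CREATING_EXECUTION_GRAPH, WAITING_FOR_RESOURCES }

    /**
     * Stand-in for determineParallelism: returns the parallelism that fits into
     * the free slots, or empty if fewer than minParallelism slots are free.
     */
    static Optional<Integer> determineParallelism(
            int freeSlots, int minParallelism, int maxParallelism) {
        if (freeSlots < minParallelism) {
            return Optional.empty();
        }
        return Optional.of(Math.min(freeSlots, maxParallelism));
    }

    /**
     * Decides where to go once the job has been cancelled in the (modelled)
     * Restarting state. parallelismAtDecision is the value that led to the
     * rescale decision; freeSlotsNow reflects any TM shutdowns since then.
     */
    static NextState afterCancellation(
            int parallelismAtDecision, int freeSlotsNow,
            int minParallelism, int maxParallelism) {
        Optional<Integer> current =
                determineParallelism(freeSlotsNow, minParallelism, maxParallelism);
        boolean unchanged =
                current.isPresent() && current.get() == parallelismAtDecision;
        // Only take the shortcut if nothing changed; otherwise wait again.
        return unchanged
                ? NextState.CREATING_EXECUTION_GRAPH
                : NextState.WAITING_FOR_RESOURCES;
    }

    public static void main(String[] args) {
        // Rescale decision taken while 4 slots were free.
        int decided = determineParallelism(4, 2, 8).get();
        // No TM was lost while the job was being cancelled.
        System.out.println(afterCancellation(decided, 4, 2, 8));
        // A TM shut down in between, leaving only 1 free slot.
        System.out.println(afterCancellation(decided, 1, 2, 8));
    }
}
```

In this model the first call takes the shortcut into {{CREATING_EXECUTION_GRAPH}}, while the second falls back to {{WAITING_FOR_RESOURCES}} instead of failing the job. The same fallback applies when slots are lost but the minimum is still met, because the recomputed parallelism no longer matches the one the decision was based on.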