Matthias Pohl created FLINK-37232: ------------------------------------- Summary: FLIP-272 breaks some synchronization assumption on the AdaptiveScheduler's side Key: FLINK-37232 URL: https://issues.apache.org/jira/browse/FLINK-37232 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 2.0-preview, 2.0.0 Reporter: Matthias Pohl
We noticed some unexpected behavior with the AdaptiveScheduler causing a job to reach FAILED state due to {{NoResourceAvailableException}}. The cause was that some TaskManager shut down while the job was performing a rescaling operation. [~chesnay] did a bit of digging and identified an issue with the state transition short cut that was introduced in [FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states] (ignoring {{WaitingForResources}} when moving from {{Restarting}} to {{CreatingExecutionGraph}} as part of the rescale operation. The cause is that determining the parallelism for triggering the state transition from {{WaitingForResources}} into {{CreatingExecutionGraph}} is done in a single synchronous operation. No TM shutdown event can be processed in between. That leads to the {{determineParallelism}} call never failing. With the FLIP-472 approach, we call determineParallelism twice independently from each other: * When coming up with the rescale decision * When creating the ExecutionGraph after the job was cancelled. In between the two operations, anything can happen, i.e. also TM shutdown events can be processed. That could lead to the second {{determineParallelism}} call in the {{CreatingExecutionGraph}} state transition to fail (due to resources not being available) which is not properly handled in the {{CreatingExecutionGraph#handleExecutionGraphCreation}}. Right now, the expected behavior is that the {{determineParallelism}} call succeeds and the subsequent slot allocation might fail. If the slot allocation fails, transitioning back to {{WaitingForResources}} is performed. This behavior can be resolved in two ways: * Handle the {{NoResourceAvailableException}} in the {{CreatingExecutionGraph}} state * Pass the available VertexParallelism that lead to the rescale decision to the {{Restarting}} state and check when the job is cancelled whether that parallelism changed. If it didn't change, we could transition to the {{CreatingExecutionGraph}}. If it did change in the mean time, we should transition to {{WaitingForResources}} and try waiting for the resources in another round. -- This message was sent by Atlassian Jira (v8.20.10#820010)