Matthias Pohl created FLINK-37232:
-------------------------------------

             Summary: FLIP-272 breaks some synchronization assumption on the 
AdaptiveScheduler's side
                 Key: FLINK-37232
                 URL: https://issues.apache.org/jira/browse/FLINK-37232
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 2.0-preview, 2.0.0
            Reporter: Matthias Pohl


We noticed some unexpected behavior with the AdaptiveScheduler causing a job to 
reach FAILED state due to {{NoResourceAvailableException}}. The cause was that 
some TaskManager shut down while the job was performing a rescaling operation.

[~chesnay] did a bit of digging and identified an issue with the state 
transition short cut that was introduced in 
[FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states]
 (ignoring {{WaitingForResources}} when moving from {{Restarting}} to 
{{CreatingExecutionGraph}} as part of the rescale operation.

The cause is that determining the parallelism for triggering the state 
transition from {{WaitingForResources}} into {{CreatingExecutionGraph}} is done 
in a single synchronous operation. No TM shutdown event can be processed in 
between. That leads to the {{determineParallelism}} call never failing.

With the FLIP-472 approach, we call determineParallelism twice independently 
from each other:
* When coming up with the rescale decision
* When creating the ExecutionGraph after the job was cancelled.

In between the two operations, anything can happen, i.e. also TM shutdown 
events can be processed. That could lead to the second {{determineParallelism}} 
call in the {{CreatingExecutionGraph}} state transition to fail (due to 
resources not being available) which is not properly handled in the 
{{CreatingExecutionGraph#handleExecutionGraphCreation}}. 

Right now, the expected behavior is that the {{determineParallelism}} call 
succeeds and the subsequent slot allocation might fail. If the slot allocation 
fails, transitioning back to {{WaitingForResources}} is performed.

This behavior can be resolved in two ways:
* Handle the {{NoResourceAvailableException}} in the {{CreatingExecutionGraph}} 
state
* Pass the available VertexParallelism that lead to the rescale decision to the 
{{Restarting}} state and check when the job is cancelled whether that 
parallelism changed. If it didn't change, we could transition to the 
{{CreatingExecutionGraph}}. If it did change in the mean time, we should 
transition to {{WaitingForResources}} and try waiting for the resources in 
another round.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to