[ https://issues.apache.org/jira/browse/FLINK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225290#comment-17225290 ]
Andrey Zagrebin commented on FLINK-19927:
-----------------------------------------

True, the recent state handling logic resides in the new SchedulerNG, currently DefaultScheduler. The execution state handling in EG is partially inactive, like the problematic notifyExecutionChange in this issue.

We could reconsider how the execution tracking for reconciliation is integrated with the scheduling. I think the tracking logic could be moved from Execution#deploy and EG#notifyExecutionChange to either SchedulerNG#updateTaskExecutionState or DefaultScheduler#deployTaskSafe. The latter currently looks more natural to me. ExecutionVertexOperations.deploy could return the submission future, which would mark the deployment as complete in ExecutionDeploymentTracker, and Execution#getTerminalFuture could be used to stop the tracking. This would also be easier to unit test.

Nonetheless, this is not a quick fix. The fix that [~rmetzger] mentions in the issue description would be quick; I have already tried it:
* Stopping the tracking in EG#notifyExecutionChange without the legacy scheduling check
* Testing it in JobMasterExecutionDeploymentReconciliationTest by intercepting the tracking stop in DefaultExecutionDeploymentTracker

> ExecutionStateUpdateListener is only updated when legacy scheduling is enabled
> ------------------------------------------------------------------------------
>
> Key: FLINK-19927
> URL: https://issues.apache.org/jira/browse/FLINK-19927
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.12.0
> Reporter: Robert Metzger
> Assignee: Andrey Zagrebin
> Priority: Blocker
> Fix For: 1.12.0
>
>
> This is a finding from FLINK-19805.
> The {{ExecutionDeploymentTracker}} is never notified about executions reaching terminal state, when using the default scheduler.
> This can potentially lead to invalid execution reconciliation behavior.
> Fixing this ticket probably involves switching the statements here:
> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ExecutionGraph.java#L1688-L1692
> As part of this ticket's resolution, I suggest also introducing a test case.
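
The longer-term idea from the comment (start tracking when a task is deployed, mark the deployment as complete via the submission future, and stop tracking via the terminal future) could look roughly like the sketch below. This is only an illustration under stated assumptions: SimpleTracker, DeploymentHandle and deployAndTrack are hypothetical, simplified stand-ins and not Flink's actual ExecutionDeploymentTracker, ExecutionVertexOperations or DefaultScheduler code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these types are Flink classes.
final class DeploymentTrackingSketch {

    enum DeploymentState { PENDING, DEPLOYED }

    /** Simplified tracker: maps an execution attempt id to its deployment state. */
    static final class SimpleTracker {
        private final Map<String, DeploymentState> tracked = new ConcurrentHashMap<>();

        void startTracking(String attemptId) {
            tracked.put(attemptId, DeploymentState.PENDING);
        }

        void completeDeployment(String attemptId) {
            // Only promote attempts that are still tracked, so a late submission
            // acknowledgement cannot resurrect an already terminated attempt.
            tracked.computeIfPresent(attemptId, (id, state) -> DeploymentState.DEPLOYED);
        }

        void stopTracking(String attemptId) {
            tracked.remove(attemptId);
        }

        Map<String, DeploymentState> snapshot() {
            return new HashMap<>(tracked);
        }
    }

    /** Assumed shape of what a deploy call could hand back to the scheduler. */
    static final class DeploymentHandle {
        final CompletableFuture<Void> submissionFuture = new CompletableFuture<>();
        final CompletableFuture<Void> terminalFuture = new CompletableFuture<>();
    }

    /** Wires the tracker to both futures at deployment time, in one place. */
    static DeploymentHandle deployAndTrack(SimpleTracker tracker, String attemptId) {
        tracker.startTracking(attemptId);
        DeploymentHandle handle = new DeploymentHandle();
        // Successful submission marks the deployment as complete.
        handle.submissionFuture.thenRun(() -> tracker.completeDeployment(attemptId));
        // Reaching any terminal state (normally or exceptionally) stops the tracking.
        handle.terminalFuture.whenComplete((ignored, failure) -> tracker.stopTracking(attemptId));
        return handle;
    }
}
```

Because the tracker is driven only by the two futures, a unit test can complete them by hand and inspect the tracker's snapshot, without going through an execution graph listener.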