[ https://issues.apache.org/jira/browse/FLINK-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285937#comment-17285937 ]
Matthias edited comment on FLINK-21376 at 3/2/21, 10:04 AM:
------------------------------------------------------------

For the record: [ErrorInfo:50|https://github.com/apache/flink/blob/c77a686c195d1742c276f4a9e75899c8b85377bb/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/ErrorInfo.java#L50] and [FailureHandlingResult:55|https://github.com/apache/flink/blob/c77a686c195d1742c276f4a9e75899c8b85377bb/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/failover/flip1/FailureHandlingResult.java#L55] can be cleaned up when resolving this issue.

was (Author: mapohl):
For the record: We could remove the [if statement in Execution.processFail|https://github.com/XComp/flink/blob/8e732bfb2bddc38ec7422f482dcda4be3d296408/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1132] if this issue is resolved.

> Failed state might not provide failureCause
> -------------------------------------------
>
>                 Key: FLINK-21376
>                 URL: https://issues.apache.org/jira/browse/FLINK-21376
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.11.3, 1.12.1, 1.13.0
>            Reporter: Matthias
>            Priority: Major
>             Fix For: 1.13.0
>
>
> {{Task.executionState}} and {{Task.failureCause}} are not set atomically.
> This became an issue when implementing the exception history (FLINK-21187),
> where we relied on the invariant that a {{failureCause}} is present when the
> {{Task}} has failed.
> Adding a check for this invariant to
> [Task.notifyFinalState()|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1001]
> reveals the race condition.
> {{TaskExecutorSlotLifetimeTest}} becomes unstable when this check is added.
> The reason is that the test starts a task but does not wait for the task to
> be finished.
> The [task finalization|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L895]
> and [the cancellation of the task|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1105]
> triggered by the {{TaskManager}} shutdown compete with each
> other and can cause the {{executionState}} to be set to {{FAILED}} while
> the {{failureCause}} is still {{null}}. This is then forwarded to
> {{Execution}} through
> [Task.notifyFinalState|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L895].
> We should set {{failureCause}} in the same atomic step that sets the
> {{executionState}} to {{FAILED}}, so that no caught error is missed.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
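
For illustration only: one possible way to make the state and the failure cause change together is to publish both as a single immutable object through one compare-and-set. This is a minimal sketch under that assumption, not the actual {{Task}} code and not necessarily the fix chosen for this issue. The names {{AtomicFailureStateSketch}}, {{StateAndCause}}, {{fail}}, {{cancel}} and {{snapshot}} are hypothetical, and the enum is a reduced stand-in for Flink's {{ExecutionState}}.

```java
import java.util.Objects;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: not Flink's Task class.
final class AtomicFailureStateSketch {

    // Reduced stand-in for Flink's ExecutionState (assumption).
    enum ExecutionState { RUNNING, CANCELED, FAILED, FINISHED }

    /** Immutable pair, so a reader can never see FAILED without a cause. */
    static final class StateAndCause {
        final ExecutionState state;
        final Throwable failureCause; // non-null exactly when state == FAILED

        StateAndCause(ExecutionState state, Throwable failureCause) {
            if ((state == ExecutionState.FAILED) != (failureCause != null)) {
                throw new IllegalArgumentException(
                        "failureCause must be set exactly when state is FAILED");
            }
            this.state = state;
            this.failureCause = failureCause;
        }
    }

    private final AtomicReference<StateAndCause> current =
            new AtomicReference<>(new StateAndCause(ExecutionState.RUNNING, null));

    /** Moves to FAILED together with its cause; returns false if already terminal. */
    boolean fail(Throwable cause) {
        Objects.requireNonNull(cause, "cause");
        return transitionTo(new StateAndCause(ExecutionState.FAILED, cause));
    }

    /** Moves to CANCELED; returns false if already terminal. */
    boolean cancel() {
        return transitionTo(new StateAndCause(ExecutionState.CANCELED, null));
    }

    /** Consistent view of state and cause, read with a single volatile load. */
    StateAndCause snapshot() {
        return current.get();
    }

    private boolean transitionTo(StateAndCause target) {
        while (true) {
            StateAndCause observed = current.get();
            if (observed.state != ExecutionState.RUNNING) {
                return false; // another thread already reached a terminal state
            }
            if (current.compareAndSet(observed, target)) {
                return true; // state and cause became visible in one step
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicFailureStateSketch task = new AtomicFailureStateSketch();
        CountDownLatch start = new CountDownLatch(1);

        // Two competing paths, loosely mirroring finalization vs. cancellation.
        Thread failing = new Thread(() -> {
            awaitQuietly(start);
            task.fail(new RuntimeException("simulated task failure"));
        });
        Thread canceling = new Thread(() -> {
            awaitQuietly(start);
            task.cancel();
        });
        failing.start();
        canceling.start();
        start.countDown();
        failing.join();
        canceling.join();

        // Whichever thread won, FAILED never appears with a null cause.
        StateAndCause result = task.snapshot();
        System.out.println(result.state + " / cause=" + result.failureCause);
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Whichever of the two threads wins, a reader of {{snapshot()}} sees either {{CANCELED}} with no cause or {{FAILED}} with its cause, which is the invariant the exception history relies on. The trade-off in this sketch is one small allocation per state transition.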