XComp commented on pull request #14798: URL: https://github.com/apache/flink/pull/14798#issuecomment-778281634
The test instability is caused by the change adding the requirement that `failureCause` need to be set if the `executionState` is `FAILED`. The failing test runs a task but doesn't wait for the task to be finished. The `TaskExecutor` triggers the cancellation of the `Task` through [Task.failExternally](https://github.com/XComp/flink/blob/354eb574c98bbc5fdaedabc9f3ca7e4acfaa3746/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1129) passing in a `FlinkException: Disconnect from JobManager responsible for [...]`. This will trigger the `Task` to transition into `FAILED`. A context switch happens after the state is set but before the `failureCause` is set. Then, the `Task` finishes in the main thread, cleans up and calls [notifyFinalState](https://github.com/XComp/flink/blob/354eb574c98bbc5fdaedabc9f3ca7e4acfaa3746/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L904) as part of this which initializes a `TaskExecutionState` and run s into the `IllegalStateException` since the `failureCause` is still not set. The solution would be to do the state change and `failureCause` atomically. But I'm not sure whether this would be a performance problem. We would have to switch from [compareAndSet](https://github.com/XComp/flink/blob/354eb574c98bbc5fdaedabc9f3ca7e4acfaa3746/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1064) to a normal lock, I guess. Alternatively, we could remove the invariant again and handle the `null` in the exception history. @tillrohrmann what are your thoughts on that? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org