XComp commented on pull request #14686: URL: https://github.com/apache/flink/pull/14686#issuecomment-768170545
After talking to @chesnay about the exception history, I had to reiterate over the implementation as I realized that I tried to solve it using wrong assumptions. I try to summarize my conclusions. Let me know if I got something wrong. Processing a state change runs in the main thread, i.e. at least changing the different execution states to `FAILED` does not compete with each other. If a failure happens, all executions of the same pipeline region change their state to `CANCELING`. In the current implementation, no failures are logged for these executions due to the way how [Execution.processFail](https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L1093) is implemented. I need the failure information for the exception history. Hence, I have to start collecting the failure even when the `Execution` is in state `CANCELING` or `CANCELED`. That should be ok as this `failureInfo` is really only used for the exception history (i.e. in the `JobExceptionsHandler`). For the execution history, we have to differentiate between job failures and restartable failures: * Restartable failures are grouped by a key that is based on the set of `Executions` that need to be restarted (using their `ExecutionAttemptId` as a reference). * Job failures are grouped based on the `ExecutionGraph`. There is no unique identifier for `ExecutionGraph`, yet. I'm thinking of extending `AccessExecutionGraph` to add such an identifier. The alternative would be to use the set of `Executions` like it's done for the restartable failures. The problem with that approach is that a restartable failure affecting all `ExecutionVertices` of the `ExecutionGraph` would overlap with a job failure. The exception history is then passed along with the `ArchivedExecutionGraph` to the `JobExceptionsHandler` in something like a `ExecutionGraphInformation` as proposed by @tillrohrmann . `ExecutionGraphCache` and related classes have to adapted accordingly. The `JobExceptionsHandler` will do the translation into a REST response. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org